Methodology for news scraping and tagging
Aim and approach
The DataLab offers a searchable news platform, initially devoted to the impact of Covid-19 on food value chains and later broadened to general food value chain issues.
The platform leverages automatic extraction from the Google News engine, with tailored queries, and subsequent tagging using natural language processing (NLP). Results can be filtered by language (English, French, Spanish, Russian), commodity, country, and focus (economics or food chains).
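The faceted filtering described above can be sketched as a simple in-memory filter. This is only an illustration: the field names (`language`, `commodity`, `focus`) are assumptions, and in the real platform faceting is delegated to the Solr index rather than done in application code.

```python
# Minimal sketch of faceted filtering over tagged news records.
# Field names ("language", "commodity", "focus") are illustrative assumptions;
# the actual platform delegates faceting to Solr.

def filter_news(records, **facets):
    """Return the records matching every requested facet value."""
    return [
        r for r in records
        if all(r.get(field) == value for field, value in facets.items())
    ]

news = [
    {"title": "Wheat prices rise", "language": "en",
     "commodity": "wheat", "focus": "economics"},
    {"title": "Maize supply chains", "language": "fr",
     "commodity": "maize", "focus": "food chains"},
]

# Keep only English-language wheat articles.
print(filter_news(news, language="en", commodity="wheat"))
```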
- Queries are fired against Google APIs iteratively:
  - (formerly: covid19 keywords +) food value chain keywords + general agricultural keywords, for each of the 6 languages
  - food value chain keywords + keywords for each of the 27 top traded commodities (with synonyms), for each of the 6 languages
- The full text of each news item is retrieved (using boilerplate-removal methods) and saved in Solr together with its metadata
- Results are stored and tagged with natural language processing under countries/regions, main topics (economics, food chains, government response, food losses, civil unrest, prices, banking...), and commodities
- Results are displayed and browseable in a faceted search interface
- Data stored in the database and Solr is reused for further analysis and visualizations.
- Selected and tagged news are used for the daily news digest.
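The iterative query construction in the steps above can be sketched as follows. The keyword sets and the boolean query format are illustrative placeholders, not the actual terms or API syntax used by the Data Lab, and only two of the six languages are filled in.

```python
# Sketch of iterative query construction: value chain keywords combined with
# general agricultural keywords, then with per-commodity synonym lists,
# for each language. All keyword sets below are made-up placeholders.
LANGUAGES = ["en", "fr", "es", "ru", "ar", "zh"]  # the 6 FAO languages
VALUE_CHAIN_KW = {
    "en": '"food value chain"',
    "fr": '"chaîne de valeur alimentaire"',
}
GENERAL_AG_KW = {"en": "agriculture OR farming", "fr": "agriculture"}
COMMODITY_KW = {
    "en": {"wheat": "wheat OR grain", "maize": "maize OR corn"},
}

def build_queries(lang):
    """Yield one general query plus one query per commodity for a language."""
    if lang not in VALUE_CHAIN_KW:
        return  # no translated keywords available for this language (yet)
    base = VALUE_CHAIN_KW[lang]
    yield f"{base} AND ({GENERAL_AG_KW[lang]})"
    for synonyms in COMMODITY_KW.get(lang, {}).values():
        yield f"{base} AND ({synonyms})"

for query in build_queries("en"):
    print(query)
```

In practice each generated string would be submitted to the Google News engine, and the loop repeated across all six languages.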
Querying and tagging strategy
- Querying: selected keywords are used for queries on Google News (translations of these keywords into the 6 official FAO languages are used to run multilingual queries)
- Tagging: human-selected and machine-generated keywords, with synonyms and translations, clustered under topics, plus standard classifications, are used in post-processing for tagging.
See "Keywords, Tags, Classifications and standards" and the slides on "FAO’s Data Lab approach to topic- and classification-based indexing of articles".
- Identification of trends and topics: tagging by new topics leads to additional statistics and keyword filters (e.g. increasing prices, protectionism…).
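The keyword-based post-processing tagging can be sketched as looking up curated keywords (and their synonyms) in the article text and collecting the topics they belong to. The topic/keyword table below is a made-up miniature, not the Data Lab's actual keyword clusters, which also include translations in all six FAO languages.

```python
import re

# Miniature topic -> keywords table (illustrative only; the real clusters
# include synonyms and translations in all 6 FAO languages).
TOPIC_KEYWORDS = {
    "prices": ["price", "prices", "inflation"],
    "government response": ["subsidy", "export ban", "tariff"],
    "food losses": ["spoilage", "post-harvest loss", "waste"],
}

def tag_article(text):
    """Return the set of topics whose keywords occur in the text."""
    lowered = text.lower()
    return {
        topic
        for topic, keywords in TOPIC_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(kw) + r"\b", lowered)
               for kw in keywords)
    }

print(tag_article("Wheat prices surged after the export ban was announced."))
```

The resulting topic tags would then be stored alongside each article in Solr, where they drive the faceted search and the statistics mentioned above.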