Food and Agriculture Organization of the United Nations
    FAO Data Lab

    Methodology for news scraping and tagging

    Aim and approach

    The Data Lab offers a searchable news platform, initially devoted to the impact of COVID-19 on food value chains and later broadened to food value chain issues in general.
    The platform extracts news automatically from the Google News engine using tailored queries, then tags the results using natural language processing (NLP). Results can be filtered by language (English, French, Spanish, Russian), commodity, country, and focus (economics or food chains).

    Process

    • Queries are sent iteratively to Google APIs, combining:
      • (formerly: COVID-19 keywords +) food value chain keywords + general agricultural keywords, for each of the 6 languages
      • food value chain keywords + keywords for each of the 27 top traded commodities (with synonyms), for each of the 6 languages
    • The full text of each news item is retrieved (using boilerplate removal methods) and saved in Solr with its metadata
    • Results are stored and tagged, using natural language processing, by country/region, main topic (economic, food chains, government response, food losses, civil unrest, prices, banking...), and commodity
    • Results are displayed and browsable in a faceted search
    • Data stored in the database and in Solr is reused for further analysis and visualizations
    • Selected and tagged news items are used for the daily news digest.
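
    The query and tagging steps above can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: the keyword lists, the query syntax, and the function names build_queries and tag_commodities are all hypothetical, and simple whole-word synonym matching stands in for the NLP tagging described above.

```python
import re
from itertools import product

# Hypothetical keyword lists; the platform's real lists are not given in this document.
VALUE_CHAIN_KEYWORDS = {
    "en": ["food supply chain", "food value chain"],
    "fr": ["chaîne d'approvisionnement alimentaire"],
}
COMMODITY_SYNONYMS = {
    "en": {"maize": ["maize", "corn"], "wheat": ["wheat"]},
    "fr": {"maize": ["maïs"], "wheat": ["blé"]},
}


def build_queries(language):
    """Return one query string per (value chain keyword, commodity) pair."""
    queries = []
    for keyword, (_commodity, synonyms) in product(
        VALUE_CHAIN_KEYWORDS[language], COMMODITY_SYNONYMS[language].items()
    ):
        # Synonyms of one commodity are alternatives within the same query.
        alternatives = " OR ".join(f'"{s}"' for s in synonyms)
        queries.append(f'"{keyword}" ({alternatives})')
    return queries


def tag_commodities(text, language):
    """Return sorted commodity tags whose synonyms occur in the text as whole words."""
    tags = set()
    for commodity, synonyms in COMMODITY_SYNONYMS[language].items():
        pattern = r"\b(?:" + "|".join(re.escape(s) for s in synonyms) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.add(commodity)
    return sorted(tags)


if __name__ == "__main__":
    for query in build_queries("en"):
        print(query)
    print(tag_commodities("Corn prices rise as wheat exports slow", "en"))
```

    Keeping synonyms grouped under one canonical commodity name means the same table can drive both the query generation and the tagging, so a news item found through "corn" is still filed under "maize".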

    Querying and tagging strategy

    Future developments

    • Identification of trends and topics, and tagging by new topics, leading to additional statistics and keyword filters (e.g. increasing prices, protectionism…)