Food and Agriculture Organization of the United Nations
    FAO Data Lab

    Keywords, classifications and standards

    This methodology is currently used for querying and tagging articles from Google News and for tagging articles linked from tweets, but it can be reused to query and tag other textual resources from other services.

    See the slides on "FAO’s Data Lab approach to topic- and classification-based indexing of articles".

    1. Keywords for the queries

    Keywords for the queries are grouped into sub-queries, which are translated into FAO's six languages, combined and automatically sent to Google News. The sub-queries are linked with AND, while the keywords inside each sub-query are linked with OR. At least one keyword from each sub-query therefore has to be present, which ensures that all components of the “question” are covered: value chains, food/agriculture and, initially, Covid-19.

    For the moment, these keywords are manually selected (supervised NLP) and translated with Google Translate.

    The three general sub-queries are (in English):

    SUB-QUERY ABOUT VALUE CHAINS:
    "~supply OR 'value chain' OR market OR trade OR ~transport OR import OR export OR distribution OR customs OR borders OR ~shortage OR ~retail OR vessels OR ~trucks"

    (Formerly used) SUB-QUERY ABOUT COVID-19:
    "coronavirus OR covid OR pandemic OR lockdown"

    SUB-QUERY ABOUT FOOD AND AGRICULTURE:
    "~agriculture OR ~hunger OR food OR ~vegetable OR cereals OR ~fruit OR meat OR bread"

    The queries by commodity combine the general value chain sub-query above with a sub-query for each commodity, which contains synonyms identifying the commodity, repeated in the six languages. The names and synonyms of the commodities used for the commodity sub-queries come from standard classifications (CPC2.1 Expanded, FAO Commodity List, HS, ICC, WCA crop list) plus translations from Yandex and synonyms from WordNet.
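
    The AND/OR composition described in this section can be sketched as follows. This is a minimal illustration, not the Data Lab's actual code: the function name build_query, the use of parentheses and the way the final string is written are assumptions, and the MAIZE list is a hypothetical commodity sub-query rather than one taken from the classifications. How the query is sent to Google News is not covered.

```python
# Keyword groups copied from the English sub-queries in this section.
VALUE_CHAIN = [
    "~supply", "'value chain'", "market", "trade", "~transport", "import",
    "export", "distribution", "customs", "borders", "~shortage", "~retail",
    "vessels", "~trucks",
]
FOOD_AGRICULTURE = [
    "~agriculture", "~hunger", "food", "~vegetable", "cereals", "~fruit",
    "meat", "bread",
]
# Hypothetical commodity sub-query: a few synonyms and translations for maize.
MAIZE = ["maize", "corn", "maïs", "maíz"]


def build_query(sub_queries):
    """OR the keywords inside each sub-query, then AND the sub-queries."""
    groups = ["(" + " OR ".join(keywords) + ")" for keywords in sub_queries]
    return " AND ".join(groups)


general_query = build_query([VALUE_CHAIN, FOOD_AGRICULTURE])
commodity_query = build_query([VALUE_CHAIN, MAIZE])
print(general_query)
print(commodity_query)
```

    Keeping each sub-query as a plain list makes it straightforward to swap in a translated list per language, or to add back a third sub-query such as the formerly used Covid-19 one.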

    2. Keywords for tagging

    Once news articles have been found in Google News using the queries above, the full text and metadata of all selected articles are stored and tagged in Solr according to topic, commodity and country/region.

    The methodology for tagging leverages different approaches: one for topic tagging and one for geographic and commodity tagging.

    Topic tagging

    The methodology for topic tagging leverages a chain of approaches:

    A. Definition of keywords per topic:

    1. an initial manually defined list of key concepts for each topic;
    2. "bags" of words, either manually proposed or automatically extracted from the texts (e.g. with LDA techniques);
    3. lexical resources of "word senses" (initially WordNet, in the near future also AGROVOC), used both to associate the words with the key concepts (and therefore the topics) through similarity calculations and to find translations, variants and synonyms;
    4. translation services (currently the Google Cloud Translate API) to translate all identified keywords into the n languages considered.

    B. Indexing of full text against the keywords > topics:

    1. Loop through the articles: tokenize the full text, remove stopwords, lemmatize and normalize the words.
    2. Find occurrences of the lemmatized keywords in the lemmatized text and calculate a score based on the number of keywords, the length of the text and the breadth of the topic.
    3. Index the articles against keywords and topics in Solr.
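
    Steps 1 and 2 above can be sketched as follows. The exact scoring formula is not given in this document, so the one used here (keyword density multiplied by the share of the topic's keywords that were found) is only an assumption that combines the three stated factors. The stopword list, the toy lemmatize function (a stand-in for a real lemmatizer), the sample article and the topic keywords are all hypothetical, and only single-word keywords are handled. Step 3, indexing into Solr, is left out.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a full one per language.
STOPWORDS = {"the", "of", "and", "in", "to", "a", "are", "is", "for", "have"}


def lemmatize(token):
    """Toy stand-in for a real lemmatizer: strips a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stopwords and lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]


def score_topic(text, topic_keywords):
    """Assumed score: (hits / text length) * (distinct hits / topic breadth)."""
    tokens = preprocess(text)
    lemmas = {lemmatize(k.lower()) for k in topic_keywords}
    if not tokens or not lemmas:
        return 0.0
    hits = [t for t in tokens if t in lemmas]
    density = len(hits) / len(tokens)        # number of keywords vs. length of text
    coverage = len(set(hits)) / len(lemmas)  # normalizes for breadth of topic
    return density * coverage


article = "Wheat exports fell as borders closed and markets reacted."
topics = {
    "trade": ["export", "import", "market", "borders"],
    "nutrition": ["diet", "vitamin"],
}
scores = {name: score_topic(article, kws) for name, kws in topics.items()}
print(scores)
```

    Lemmatizing both the keywords and the text with the same function is what lets "borders" in the keyword list match "borders" in the article after both are reduced to "border".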

    Geo-tagging and geographical classification

    The indexing process is the same as for topic tagging, but the keywords are generated from standard classifications and linked authority files:

    • M49 classification for codes and country names
    • Translations and synonyms from Wikidata and AGROVOC plus demonyms from Wikidata
    • Names of major cities and provinces/regions mapped to countries (from Wikidata)

    Articles are then tagged against the standard code and label used in FAO (M49) for harmonization.
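
    The mapping from surface forms to a harmonized M49 code can be sketched as follows. The GAZETTEER structure and the geo_tag function are hypothetical, and the entries are a small hand-picked excerpt (country name, a translation, a demonym, a city or region) rather than data actually loaded from M49, Wikidata or AGROVOC. Only single-word forms are matched in this sketch.

```python
import re

# Illustrative excerpt: M49 code -> (FAO label, surface forms found in text).
GAZETTEER = {
    "380": ("Italy", ["Italy", "Italia", "Italian", "Rome", "Lombardy"]),
    "404": ("Kenya", ["Kenya", "Kenyan", "Nairobi"]),
}


def build_lookup(gazetteer):
    """Invert the gazetteer so every lowercased surface form points to its code."""
    lookup = {}
    for code, (_label, forms) in gazetteer.items():
        for form in forms:
            lookup[form.lower()] = code
    return lookup


def geo_tag(text, gazetteer):
    """Return the sorted (M49 code, label) pairs whose surface forms occur in the text."""
    lookup = build_lookup(gazetteer)
    tokens = re.findall(r"\w+", text.lower())
    codes = sorted({lookup[t] for t in tokens if t in lookup})
    return [(code, gazetteer[code][0]) for code in codes]


tags = geo_tag("Kenyan tea exporters met buyers in Rome.", GAZETTEER)
print(tags)
```

    Because every name, demonym and city resolves to the same code, an article mentioning "Rome" and one mentioning "Italian" end up under the same country tag. The same lookup pattern would apply to commodity names mapped to a single classification code.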

    Commodity tagging and classification

    The indexing process is the same, but the keywords are generated from standard classifications, lexical resources and translation services:

    • Articles are tagged against commodities using commodity names from different mapped classifications as synonyms (CPC2.1 Expanded, FAO Commodity List, HS, ICC, WCA crop list) plus synonyms from WordNet and translations from the Google Cloud Translate API.
    • Articles are then tagged against the standard code and label used in FAO (CPC 2.1 Expanded) for harmonization.