Panel session on methodologies to develop and assess content of thesauri

FAO/Pep Bonet

During the AGROVOC Editorial Community Meeting 2021, a panel session with partner organizations was dedicated to methodologies for developing and assessing the content of thesauri and ontologies/KOS. The panel included representatives from EuroVoc (Anikó Gerencsér, Publications Office of the European Union), NALT (Jennifer Woodward-Greene, USDA ARS National Agricultural Library), the Crop Ontology project of the Generation Challenge Programme - CGIAR (Elizabeth Arnaud, CGIAR), the UNESCO Thesaurus (Bettina Dietinger and Meron Ewketu, UNESCO) and AGROVOC (Imma Subirats, FAO), and was chaired by Daniel Martini, KTBL.

Key questions addressed were:

  • How to use cost-efficiency to plan vocabularies to grow in a sustainable way while providing a satisfactory service to their users?
  • How to detect new needs in broad but also specialized communities?
  • How to assess the growth of a vocabulary like AGROVOC in a sustainable way?
  • How to move from quantitative indicators to qualitative information?
  • What lessons can be learned from other experiences?

Beforehand, a survey was conducted among participants about the kinds of digital resources they provide, including links, subjects and language coverage, and about their content development processes and workflows.

Some key results of the survey included:

  • Panelists exclusively use semantic web technologies to develop their knowledge resources. All of them provide publication metadata repositories, some of them data set repositories as well. See Figure 1.

[Chart: Google Forms responses to the question "What kind of digital knowledge resources do you provide?" (5 responses).]
Figure 1. Different digital knowledge services provided by panelist organizations. 

Source: FAO, 2021

  • All participants cover a broad range of topics, most of which are in line with AGROVOC topics. Topics covered included agriculture, food preparation and production, nutrition and health, economics, social sciences, statistics, environment and ecology, geosciences, policy and society, education and culture, cultural heritage, law, food security, food safety and biodiversity. 
  • Regarding languages, English and Spanish receive the most attention among the participants, followed by Arabic and French. Other languages depend on the mandate of the thesaurus. 
  • Teams managing thesauri and ontologies include technical roles as well as content development and librarian roles. Workflows follow the concept life cycle: proposal, discussion and validation, approval and translation, and finally revision and eventual deprecation (see Figure 2).

Figure 2. Development approaches used for thesauri and ontologies/KOS.

Source: FAO, 2021

The workflow was the first aspect picked up during the panel discussion. In terms of resources, core teams usually consist of 5 to 10 people, supported either by external contributors or by employees from other organizational units, such as translators or scientists. Several organizations have working groups on content validation, and several use VocBench (AGROVOC, EuroVoc, UNESCO Thesaurus), although other tools and approaches are also used or are being evaluated in projects for future use. Some have external translations done professionally.

Anikó Gerencsér, team leader at the Publications Office of the European Union.

Anikó Gerencsér coordinates the maintenance of taxonomies, thesauri, ontologies and authority lists and their publication on the EU Vocabularies website. She introduced the EuroVoc multilingual thesaurus. The Publications Office of the EU currently has seven editors and four technical team members, covering a range of EU Vocabularies knowledge resources. Two of these editors work on EuroVoc, which has two releases per year. EuroVoc covers the 24 official EU languages plus three candidate country languages, and has about 7339 concepts in 21 domains, with 127 microthesauri. The scope includes topics related to EU activities, law and procurement. Development is supported by a content working group that meets three times a year. Proposals are collected in English, via web forms or e-mail. Proposed concepts are placed in a “Candidates” scheme in VocBench for review by the content working group and are validated by the editorial team before export and publication. Translations are provided by the translation service of the European Commission and imported into VocBench.

 

Meron Ewketu, UNESCO Thesaurus and the UNESDOC Database
Bettina Dietinger, UNESCO Thesaurus and the UNESDOC Database

Meron Ewketu from UNESCO explained that the UNESCO Thesaurus is managed by the Knowledge Sharing and Open Access Unit at UNESCO. The UNESCO Thesaurus currently has about 4500 concepts. The editorial team consists mainly of librarians, including metadata and collection development librarians. New concepts are validated in a regular meeting, comparable to the working group approach of EuroVoc. Proposals come from internal and external communities and users; candidate terms are also identified by monitoring UNESCO publications and through Google Analytics. New concepts (about 10-15 per year) are discussed in the editorial group, then validated and processed through the VocBench administration tool. Bettina Dietinger, who works on the UNESCO Thesaurus and the UNESDOC Database with Ms Ewketu, was also present.

 

Elizabeth Arnaud, Leader of the Crop Ontology project of the Generation Challenge Programme, CGIAR

Elizabeth Arnaud emphasized that unified data management is an important aspect for the different CGIAR research centres. This includes leveraging ontologies as well as thesauri. All products are licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), to keep the products in the public domain, even if some of the elements are used by the private sector, for example the food industry. Ongoing work includes mappings between AgrOntology and AGROVOC. All data managers at CGIAR use AGROVOC as a source of keywords, and in the last three months CGIAR has created a task group to look at contributing content to AGROVOC. With regard to the ontology work, templates and guidelines are key components of the workflow, as are quality assurance and good governance. Curators are trained in the use of these guidelines and templates, apply them, and consult with others before validating any submission to the Crop Ontology. The curator in charge of a specific crop is nominated by the institution that holds the mandate for that crop. Curators work in close collaboration with the technical coordinator, who checks the quality and validates content for publication, which is done on GitHub. The overall workflow is coordinated by the authority coordinator. The creation of an advisory group is planned; it would validate the development of new domains, including new crop-related technologies such as drones and satellites.

 
Jennifer Woodward-Greene, USDA ARS National Agricultural Library Indexing and Informatics Branch Chief

Jennifer Woodward-Greene noted that NALT is currently working on automating workflows, with ongoing projects to improve that process. The 2021 National Agricultural Library Thesaurus (NALT) contains over 265 498 terms, including 153 006 descriptors in English and Spanish. Currently, the thesaurus receives suggestions for additions via the library's annotation and indexing pipeline. This includes work on having machines take over part of the sourcing of new concepts for the thesaurus, and on using machine learning to classify documents. Most concept suggestions, however, come from the NALT team, following a new workflow.

 

Imma Subirats-Coll, Senior information management officer for FAO, leading AGROVOC, AGRIS and AGORA programmes

 

Imma Subirats-Coll thanked the panelists for their interesting contributions. With growing interest and more new concepts being suggested, there is a need to prioritize. Building synergies with fellow vocabularies is key, also for understanding which tools can be used to assess progress. She also asked the panelists whether their organizations analyse user statistics, and which languages and content areas are used most. Such analysis could start with quantitative elements, without yet going into qualitative ones.

 

With regard to usage statistics, panelists noted that they use standard web access statistics, partly broken down by country. EuroVoc has some statistics on website users and on the languages accessed by country, but these are not always retrieved systematically. The Publications Office of the EU does produce a monthly overview covering all of the resources it publishes, which is mainly about user access. For NALT the situation is similar. Considerable capacity is needed to dig into detailed statistics. AGROVOC is beginning to look not only at who is using the data, but also at which concepts are most requested. AGROVOC is not doing automatic indexing because the metadata production is not coming from that side; however, AGROVOC has a very distributed network of data providers.