Embrapa: AGROVOC and GTermos

FAO/Giorgio Cosulich de Pecine

A use case by Embrapa

 Brazil: facts and figures

  • Population: 211,639,714 inhabitants (2020)
  • Total area: 8,510,295.914 km²
  • Six biomes: rainforest (Amazônia); savannahs (Cerrado); semi-arid zone (Caatinga); tropical forests (Mata Atlântica); grasslands (Pampa); swamps (Pantanal)

Brazilian Agriculture

  • 235,918.76 thousand hectares of agricultural land
  • 55,384.06 thousand hectares of arable land
  • 7,982 thousand hectares of land under permanent crops
  • 172,552.7 thousand hectares of land under permanent meadows and pastures
  • Major producer of: soybean, sugar, meat (chicken, beef and pork), maize, coffee, tobacco, orange juice, fruits, cotton, sugarcane and bioproducts

Brazil speaks Portuguese

  • Around 270 million Portuguese (PT) speakers on four continents: in Angola, Brazil, Cabo Verde, Guinea-Bissau, Mozambique, Portugal, Sao Tome and Principe, and Timor-Leste
  • Orthographic Agreement signed by these nations in 1990 to establish a single official orthography for the Portuguese (PT) language: an important step, but the language remains syntactically, semantically and lexically diverse

Brazilian Portuguese (PT/BR)

  • Huge linguistic variety due to the country’s size, cultural diversity and cultural exchange
  • Different words for the same object, e.g. mandioca/aipim/macaxeira (manioc/cassava) (see Figure 1)
  • Same word for different objects, e.g.:

- colher (noun, talher/spoon) and colher (verb, apanhar/to harvest)

- molho (noun, caldo/sauce) and molho (first person singular, simple present of the verb molhar/to wet)

  • Neologisms and foreign loanwords

Figure 1. Example of different words for ‘Manihot esculenta’ used in Brazil and its conceptual structure within Agrotermos. Source: Banco Multimídia da Embrapa, 2017; Photo: Ronaldo Rosa, 2021; Map: Zimmerman, 2014; Graph: Agrotermos, 2021; Illustration: GTermos, Suzi Carneiro, 2021

 

Figure 2. Essential facts about Embrapa’s organizational structure. Source: GTermos, 2021

 

Currently, the main research, development and innovation themes steering Embrapa’s project portfolios are: agrochemicals; agro-ecological zoning; agroecology; Amazônia; animal health; aquaculture; automation, digital and precision agriculture; bioeconomy; biological control; biological nitrogen fixation; biological supplies; biotechnology; Brazilian soils; climate change; coffee; corporate innovation; cultivar market; drought endurance (semi-arid region); energy, biomass technology and chemistry; environmental services; fibers and biomass for industry; fishing and aquaculture; food loss and waste; food safety, nutrition and health; forest; fruit farming (temperate and tropical); genetic resources; geotechnologies; grains; horticulture; Integrated Crop, Livestock, Forest Systems (ICLFS); irrigated agriculture; low-carbon agriculture; Matopiba; meat; milk; nanotechnology; nutrients for agriculture; pastures; plant health; quarantine pests; rural wastewater treatment; social innovation in agriculture; territorial intelligence, management and monitoring; transgenics; and weeds.

For more information, please see https://www.embrapa.br/en/web/portal/about-us

GTermos

Embrapa's Permanent Commission for Controlled Vocabularies, Agriterminologies and Agrisemantics (GTermos) is committed to building, sharing, disseminating and managing knowledge and practices related to the semantics and semiotics of agricultural data and information, and to their application in information and knowledge management processes at Embrapa. Our goal is to expand their potential for use in both internal and external environments, in alignment with global trends and initiatives. GTermos has been a permanent working group at Embrapa since May 2018.

Methodological and technological approaches and tools used by GTermos:

  • Corpus linguistics
  • Knowledge mapping, organization and representation 
  • Data, Information and Knowledge visualization
  • KOS engineering
  • Terminological mappings and matchings and semasiological/onomasiological analysis and alignments
  • Semantic interoperability
  • Training and use of Word Embeddings
  • Open and linked data
  • Conceptual space
  • Knowledge Graph

Agrotermos

GTermos conceived, built, implemented and manages Agrotermos, a controlled vocabulary and conceptual space for agricultural knowledge. Using Information Engineering, NLP methodologies and tools, Corpus Linguistics and semantic modeling, Agrotermos is being prepared to grow from a terminological resource into a conceptual space for Brazilian agricultural knowledge.

Agrotermos is Embrapa’s platform for organizing, qualifying, and offering terminology data and semantic applications produced within Embrapa. More than a controlled vocabulary, Agrotermos is a conceptual space for knowledge representation of agriculture and related areas. You can find access here.

The curation and management of Agrotermos rely on conceptual (semantic) and terminological re-engineering and enrichment processes. In this context, adding new terms to Agrotermos typically involves building scientific landscapes, extracting terms from textual corpora (Corpus Linguistics) and having domain specialists validate the resulting concepts and terms (see Figure 3). In this way, concepts and terms of specific subdomains within Brazilian agriculture are incorporated into Agrotermos module by module.

 

Figure 3. Scientific landscape produced by VOSviewer for the topic 'pasture'. Source: InfoPasto Project/GTermos, 2019
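
The term-extraction step described above can be illustrated with a minimal sketch. The source does not state which extraction method GTermos uses, so the approach below (counting word n-grams and discarding those already in the vocabulary) is only an assumption, and the names `STOPWORDS`, `tokenize` and `candidate_terms` are hypothetical. The output is a list of candidates that would still need validation by domain specialists.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stopword list for Portuguese.
STOPWORDS = {"a", "o", "as", "os", "de", "da", "do", "das", "dos",
             "e", "em", "no", "na", "para", "com"}


def tokenize(text):
    """Lowercase the text and return its word tokens (accented letters kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def candidate_terms(corpus, known_terms, max_len=3, min_freq=2):
    """Count word n-grams in the corpus and return the frequent ones
    that are not yet part of the vocabulary, as (term, frequency) pairs."""
    counts = Counter()
    for document in corpus:
        tokens = tokenize(document)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                ngram = tokens[i:i + n]
                # Skip n-grams that start or end with a stopword.
                if ngram[0] in STOPWORDS or ngram[-1] in STOPWORDS:
                    continue
                counts[" ".join(ngram)] += 1
    known = {term.lower() for term in known_terms}
    return [(term, freq) for term, freq in counts.most_common()
            if freq >= min_freq and term not in known]


if __name__ == "__main__":
    corpus = [
        "Manejo de pastagem degradada no Cerrado",
        "Recuperação de pastagem degradada com adubação",
    ]
    # "pastagem" is assumed to be in the vocabulary already.
    for term, freq in candidate_terms(corpus, {"pastagem"}):
        print(term, freq)
```

A frequency threshold is the simplest possible filter; a real pipeline would likely add part-of-speech patterns or statistical termhood measures before handing candidates to specialists.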

 

The use case: Agrotermos and AGROVOC

Agrotermos was built by bringing together Portuguese-language terminologies present in national and international agricultural thesauri. It is composed mainly of two Portuguese-language thesauri, one of which is AGROVOC. These thesauri supply not only Agrotermos’ content but also its structure, which is assembled from the relationships between terms.

All new inputs from these thesauri are combined into the wider, intertwined terminological and semantic resource that is Agrotermos. Terms are never duplicated: every addition or update is indexed without overwriting or repeating existing entries, and the source of each term remains identifiable (see Figure 4).

 

Figure 4. Graphic resource depicting all connections of the term ‘sistema agrosilvopastoril’ in Agrotermos’ structure and its origin (AGROVOC). Source: GTermos, 2021.

Every month, Agrotermos harvests and indexes AGROVOC’s terms and concepts. Currently, Agrotermos is composed of approximately 245 000 terms, of which 41 337 were incorporated from AGROVOC.  
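
The merge behaviour described above (no repeated terms, nothing overwritten, source always identifiable) can be sketched as follows. This is an illustrative in-memory model, not Agrotermos' actual implementation: the `Entry` class, the `harvest` function, the whitespace/case normalization rule and the source name "other-thesaurus" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One indexed term, with its provenance and related terms."""
    label: str
    sources: set = field(default_factory=set)
    related: set = field(default_factory=set)


def normalize(label):
    """Assumed matching rule: case-insensitive, whitespace-collapsed."""
    return " ".join(label.lower().split())


def harvest(index, source, records):
    """Merge (label, related_labels) records from one source into the index.

    Existing entries are kept as they are and only gain a source and
    relationships; new labels create new entries. Returns the number of
    newly added terms.
    """
    added = 0
    for label, related in records:
        key = normalize(label)
        entry = index.get(key)
        if entry is None:
            entry = Entry(label=label)
            index[key] = entry
            added += 1
        entry.sources.add(source)
        entry.related.update(normalize(r) for r in related)
    return added


if __name__ == "__main__":
    index = {}
    harvest(index, "AGROVOC",
            [("sistema agrosilvopastoril", ["sistemas agroflorestais"])])
    harvest(index, "other-thesaurus",
            [("Sistema  Agrosilvopastoril", ["pastagens"]), ("pastagens", [])])
    for entry in index.values():
        print(entry.label, sorted(entry.sources), sorted(entry.related))
```

Keeping a set of sources per entry, rather than one source field, is what lets a term shared by several thesauri appear once while still showing every origin.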

This whole infrastructure is then offered to Embrapa’s repositories (e.g. the geoinformation repository GeoInfo and the project repository Quaesta, among others) in the form of a web service. This is currently the main use of Agrotermos, and hence of AGROVOC, in Embrapa. As part of Agrotermos, AGROVOC is therefore also involved in the following underlying applications within the company:

  1. Matching of terms: an automatic process that compares (‘matches’) any text or list of terms with the contents of Agrotermos and produces a conceptual, semantic representation of the input, depicted as a reflection of Agrotermos’ structure. The matching process also reveals which terms in the text or list are already part of Agrotermos; the remaining terms form lists of candidates for later inclusion in our semantic structure or in AGROVOC.
  2. Quaesta: Embrapa’s project search tool applies principles of Artificial Intelligence (AI) and its interfaces with Natural Language Processing (NLP). In this tool, Agrotermos (and hence AGROVOC) is used as a qualified information resource, and the textual contents of the projects are indexed using its terms and their relationships. Agrotermos thus serves as a specialized ontological structure for agricultural content, improving the search engine and expanding the conceptual coverage of the search.
  3. Morphosyntactic similarity analysis: we have recently started using Agrotermos for specific textual similarity analysis tasks. An algorithm analyzes the morphosyntactic similarity of texts from Embrapa's research projects in order to find similar projects based on their textual content. Here, Agrotermos (and AGROVOC) is used to expand the terms and certain relationships found in the analyzed texts, which helps the algorithm by conveying semantic characteristics inherited from Agrotermos’ conceptual structure.
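
To make the first and third applications more concrete, here is a minimal sketch of term matching followed by term expansion. It is not Embrapa's algorithm: the source does not describe the matching or the morphosyntactic analysis in detail, so a plain whole-word lookup and a Jaccard overlap of expanded term sets are swapped in as stand-ins. The tiny vocabulary reuses word pairs mentioned in this document (mandioca/aipim/macaxeira, capina/monda); every function and variable name is hypothetical.

```python
import re

# Hypothetical mini-vocabulary built from groups of equivalent terms.
SYNONYM_GROUPS = [
    {"mandioca", "aipim", "macaxeira"},
    {"capina", "monda"},
]
# Map every term to the other terms of its group.
VOCABULARY = {term: group - {term}
              for group in SYNONYM_GROUPS for term in group}


def match_terms(text, vocabulary):
    """Return the vocabulary terms that occur in the text as whole words."""
    lowered = text.lower()
    found = set()
    for term in vocabulary:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(term)
    return found


def expand(terms, vocabulary):
    """Add the related terms of every matched term."""
    expanded = set(terms)
    for term in terms:
        expanded |= vocabulary.get(term, set())
    return expanded


def similarity(text_a, text_b, vocabulary):
    """Jaccard overlap between the expanded term sets of two texts."""
    a = expand(match_terms(text_a, vocabulary), vocabulary)
    b = expand(match_terms(text_b, vocabulary), vocabulary)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(match_terms("Cultivo de mandioca e capina manual", VOCABULARY))
    print(similarity("Produção de mandioca", "Variedades de aipim", VOCABULARY))
```

Without the expansion step the two example project titles share no term at all; with it, both resolve to the same group, which is the effect the vocabulary's conceptual structure is meant to provide.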

Furthermore, Embrapa’s information professionals have always used AGROVOC’s main search interface as a reference and for indexing the company’s products and information in its repositories, such as BDPA (Base de Dados da Pesquisa Agropecuária).

In a wider effort to bring the company closer to semantic web environments, Embrapa contacted FAO in 2010 and has since followed AGROVOC’s development and evolution, first as an observer within the Agrisemantics Working Group at the Research Data Alliance and more recently as an active curator of AGROVOC’s Brazilian Portuguese terms and concepts, engaged in the editorial community and its discussions. AGROVOC’s conceptual and terminological uptake, experience and expertise in knowledge representation and conceptual alignment are by now indispensable references for Agrotermos.

Benefits of using AGROVOC

1. AGROVOC became a theoretical, conceptual and operational reference for the creation of Embrapa’s own controlled vocabulary/semantic structure, Agrotermos, which was created in 2014. 

2. AGROVOC is part of Agrotermos: Out of Agrotermos’ current 245 000 terms, 41 337 were directly incorporated from AGROVOC. All updates and new PT/BR uploads we provide to AGROVOC are automatically incorporated into Agrotermos. 

3. The curatorship of AGROVOC’s Brazilian Portuguese terms and concepts greatly contributes to our understanding of collections of concepts, terms, definitions and relationships, and of other semantic web technologies.

4. The collaboration with AGROVOC, in the curation of its Brazilian Portuguese terms and concepts, is an invaluable opportunity for the enrichment of both vocabularies, and allows us to disseminate the vast and diverse Brazilian agricultural scientific production.

The following practical examples provide a quick sample of our daily tasks and challenges in the Brazilian Portuguese curatorship in AGROVOC.

ENGLISH              | PT/PT                  | PT/BR                  | DIFFERENCES DUE TO       | OBSERVATIONS
---------------------+------------------------+------------------------+--------------------------+----------------------------------
Reproduction control | Controlo da reprodução | Controle da reprodução | Orthography              | “Controle” instead of “controlo”
Ammonia              | Amónia                 | Amônia                 | Orthography              | “Amônia” instead of “amónia”
Weeding              | Monda                  | Capina                 | Other term used in PT/BR | -
Food shortages       | Penúria alimentar      | Escassez alimentar     | Other term used in PT/BR | -
Bumble bees          | Abelhão                | Mamangava; mamangaba   | Other term used in PT/BR | Brazilian indigenous term

Table 1. Practical examples of the Brazilian Portuguese curatorship in AGROVOC. Source: GTermos, 2021.

 

 

GTermos team meeting in 2019. © Francisca Rasche

 

GTermos team

Permanent Commission for Controlled Vocabularies, Agri-terminologies and Agri-semantics at Embrapa

Ivo Pierozzi Júnior (technical coordinator)

Biologist, PhD in Ecology, researcher at Embrapa Informática Agropecuária

Bibiana Teixeira de Almeida

BA in Language and Literature Studies, Translation specialist, analyst at Embrapa Territorial

Francisca Rasche

Librarian, MA in Information Science, analyst at Embrapa Florestas

Maria de Cléofas Faggion Alencar

Librarian, PhD in Education, analyst at Embrapa Meio Ambiente

Viviane de Oliveira Solano

Librarian, MA in Information Science, analyst at Embrapa Pantanal

Leandro Henrique Mendonça de Oliveira

Computer scientist, PhD in Computer Science and Computational Mathematics, analyst at the Secretariat of Research and Development

Milena Ambrosio Telles

BA in Language and Literature Studies, PhD in Information Science, analyst at the Secretariat of Research and Development

Rochelle Alvorcem

Librarian, MA in Information Science, analyst at Embrapa Uva e Vinho

Vera Viana dos Santos Brandão

Librarian, specialist in Information Units Management, analyst at Embrapa Territorial

Patrícia Rocha Bello Bertin (institutional coordinator)

Biologist, PhD in Information Management, researcher at the Secretariat of Institutional Development

 

References

[1] IBGE. População. Available at: https://www.ibge.gov.br/estatisticas/sociais/populacao.html. Accessed on: June 12, 2020. 

[2] FAOSTAT. Selected Indicators - Brazil. Available at: http://www.fao.org/faostat/en/#country/21. Accessed on: June 13, 2020.

[3] Map image source: Wikimedia Commons. File:BlankMap-World-Microstates.svg. Content source: File:Mapa_da_CPLP.png, CC BY-SA 4.0, by Cristiano Tomás. Available at: https://commons.wikimedia.org/w/index.php?curid=77196210. Accessed on: June 13, 2020.

[4] Map image source: File:BlankMap-World-Microstates.svg. Content source: File:Mapa_da_CPLP.png, CC BY-SA 4.0, by Cristiano Tomás. Available at: https://commons.wikimedia.org/w/index.php?curid=77196210.

[5] ZIMMERMAN, A. 'Sotaques do Brasil' desvenda as diferentes formas de falar do brasileiro. Globo.com - Jornal Hoje, September 2, 2014. Available at: http://g1.globo.com/jornal-hoje/noticia/2014/08/sotaques-do-brasil-desv… . Accessed on: September 8, 2021.