As Appendix 1 shows, there is overlap as well as complementarity in the terms and the information for each term given in the three systems. A sampling analysis showed that both FAO Term and FAO Glossary have many terms not in AGROVOC. These terms are often quite specific or not in a central area of the FAO subject domain, yet needed.
Examples from FAO Term:
linear low density polyethylene
rhizobial bacteria
single-disk furrow-opener
Examples from FAO Glossary:
mortgage
land economy
abundance
aquatic ecosystems
FAO Glossary terms are often dependent on the context of the glossary of which they are part.
Examples:
abundance (of fish in a fishing ground)
disturbance (real estate law)
To deal with this problem one must create a form of the term that is qualified for proper communication in the integrated database. Each glossary can still use the term exactly as it appears now.
Definitions given in the three systems overlap to some extent (sometimes a whole set of definitions from a given source is entered in two or all three systems). To determine whether the definitions in two systems are the same, one must first strip source indications (and store them separately) since source citations are not uniform across systems.
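The comparison described above can be sketched as follows. This is a minimal illustration, not a description of any existing FAO tool: the three patterns for source indications (a trailing bracketed citation, a trailing "(Source: ...)" and a trailing "Source: ..." label) are assumptions, since the actual citation conventions of the three systems would first have to be catalogued. The function names are likewise hypothetical.

```python
import re

# Hypothetical patterns for trailing source indications; the real
# conventions of each system would have to be catalogued first.
SOURCE_PATTERNS = [
    re.compile(r"\s*\[(?P<src>[^\[\]]+)\]\s*$"),                     # "... text [Source]"
    re.compile(r"\s*\(\s*Source:\s*(?P<src>[^()]+)\)\s*$", re.IGNORECASE),
    re.compile(r"\s*Source:\s*(?P<src>.+)$", re.IGNORECASE),
]

def split_source(definition):
    """Return (definition text without source indication, source or None)."""
    text = definition.strip()
    for pattern in SOURCE_PATTERNS:
        match = pattern.search(text)
        if match:
            # The source is returned so that it can be stored separately.
            return text[:match.start()].strip(), match.group("src").strip()
    return text, None

def comparison_key(definition):
    """Normalize the stripped text so trivial differences do not block a match."""
    text, _ = split_source(definition)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text.rstrip(".;")

def same_definition(a, b):
    """True if two definitions agree once source indications are removed."""
    return comparison_key(a) == comparison_key(b)
```

Only exact agreement after normalization is detected here; definitions that were lightly reworded in one system would still need a similarity measure or manual checking.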
When integrating the three vocabularies one must also pay attention to format (see also Section 2.4.2). FAO Term uses within-sentence capitalization (most terms start lower case), singular, and proper hyphenation. (This is the format recommended in this report.) AGROVOC starts all descriptors with a capital and uses plural. FAO Glossary is inconsistent. Hyphenation is not consistent across the three vocabularies. All of these format variations will create problems in term matching. An intelligent algorithm to deal with these problems must be found.
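As a starting point for such an algorithm, one could derive a match key that ignores capitalization, hyphenation, and number. The sketch below is an assumption-laden illustration rather than the algorithm of Appendix 8: the singularization rules are deliberately crude, and the key is meant only for finding candidate matches, never for display.

```python
def singularize(word):
    """Very rough English singularization; irregular plurals need a lexicon."""
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith(("ches", "shes", "sses", "xes", "oes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word

def match_key(term):
    """Key used only for matching; the original form of the term is kept elsewhere."""
    # Lower-case and treat hyphens as word breaks.
    words = term.strip().lower().replace("-", " ").split()
    if not words:
        return ""
    # Singularize the last word only (usually the head noun in English terms).
    words[-1] = singularize(words[-1])
    # Joining without spaces makes open, hyphenated, and closed forms agree.
    return "".join(words)
```

With this key, "Root vegetables" and "root vegetable" fall together, as do "single-disk furrow-opener" and "Single disk furrow openers". Because the key is lossy, every match it produces is a candidate that may still need checking.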
Gaps in vocabulary must be determined by identifying the existing and future uses of KOS, considering the priorities set by FAO management, as suggested in Recommendation 1, and then checking whether terminology supporting these uses (as identified, for example, from key documents) is present in AGROVOC. On a very specific level, gaps can be identified - and simultaneously filled - by mapping between many different vocabularies following the methodology detailed in Section E. The selection of these vocabularies should be done by FAO staff who know the work of the agency well, guided by FAO priorities and the use cases.
Section G introduces the list of KOS in the FAO domain that are accessible through the Web, which is given in Appendix 2, and discusses the criteria I used to assign priority for mapping to some of these KOS.
AGROVOC shares with many thesauri three major problems:
It does not have an overall well-structured hierarchy.
It uses an impoverished set of conceptual relationships (BT, NT, RT). The underlying specific relationships are not mapped consistently to these broad relationship types; for example, partOf is sometimes mapped to BT, sometimes to RT.
The UF relationship from a descriptor to a non-descriptor sometimes is based on synonymy and sometimes on a conceptual relationship, such as includesSpecific (or NT in the broad relationship types).
These structural problems lead to many inconsistencies and difficulties in figuring out the structure, as illustrated in the following examples (giving only enough of the NT and RT cross-references present in AGROVOC to illustrate a point).
Example 1: Vegetable crops, Vegetables, Vegetable product
With some effort, the user may be able to figure out that this "semantic field" involves several hierarchies or families of concepts:
(1) The biological taxonomy of plants
(2) Crops
Vegetable crops
The broad term Vegetable crops brings together plant taxa that produce vegetables; these plant taxa are listed as RTs rather than NTs, presumably because BT/NT between biological taxa are reserved for biological taxonomy.
(3) Plant products RT Processed plant products [Why not NT?]
Vegetables RT Vegetable products
- Root vegetables
Vegetables is a broad term that includes the specific part used as a vegetable as NT
(4) The Processed products hierarchy given below
|
Processed products
|
Vegetable products is a broad term that includes processed food products produced from vegetables
The records for an individual plant and the associated vegetable look like this:
Daucus carota | Carrots
Solanum tuberosum | Potatoes
The plant taxon and the vegetable derived from it are related through RT, but there are many other RT relationships as well, so the specificity of this relationship is lost. Furthermore, it would seem more logical to have Potatoes BT Root vegetables, which then leads to its BT Vegetables.
To index a document on or search for canned carrots, one would use the combination of Carrots with Canned vegetables, but that is not so easy for the user to figure out.
Below are a few sample cross-references for Vegetable crops and Vegetables.
Vegetable crops | Vegetables
UF Salad crops | UF Fresh vegetables
One can easily see that RT is used for a variety of relationships and that UF is used for more specific concepts.
Each species listed as an RT under Vegetable crops should have a corresponding vegetable, but this is not the case as can be seen from the following examples, none of which has an RT to a specific vegetable. In the fourth example, the name of the vegetable, African breadfruit, is given as a UF with the species.
Luffa acutangula
BT Luffa
RT Cucurbit vegetables
RT Vegetable crops
Bauhinia variegata
BT Bauhinia
RT Drug plants
RT Dye plants
RT Vegetable crops
Coleus rotundifolius
UF Coleus parviflorus
UF Coleus tuberosus
BT Coleus
RT Ornamental foliage plants
RT Root vegetables
RT Vegetable crops
Treculia africana
UF African breadfruit
UF Okwa
BT Treculia
RT Fruit vegetables
RT Starch crops
RT Vegetable crops
So this whole area should be restructured to make it consistent and easy to grasp.
Example 2: Milk and Milk products
This is another murky area:
Milk | Milk products
Note the inconsistency in the treatment of milk components (shown in bold). Also note the different meaning of NT: Milk by origin, milk-like substance (Colostrum) and milk component. Furthermore, Milk is best considered a body part and should have the appropriate BT.
The following example illustrates still another use of UF: Drip fertigation is a combination of Trickle irrigation and Fertigation.
Trickle irrigation
UF Drip fertigation
UF Drip irrigation
UF Microirrigation
BT Localized irrigation
Fertigation
SN Application of fertilizers in irrigation water
UF Drip fertigation
UF Fertirrigation
BT Fertilizer application
BT Irrigation
RT Irrigation water
RT Liquid fertilizers
RT Wastewater irrigation
In
Bovine spongiform encephalopathy
ST mad cow disease
ST BSE
it is not clear that mad cow disease is a synonym and BSE is an abbreviation. Even more confusing is the case where the abbreviation belongs to a synonym and not to the main term.
These examples are not isolated. These types of inconsistencies and unclear structure are pervasive. Further examples are found in the JoDI paper (Appendix 12). On the other hand, AGROVOC is a gold mine of good information on concepts and terms; the task is to recast this information in a more precise and more easily grasped form.
The main part of this analysis is found in the presentation for the AOS Workshop in Beijing, Appendix 13. This section gives a few additional rules:
Disease RT Organism can be converted to Disease <causedBy> Organism
Note: would need to check how often it should be Disease <afflicts> Organism
Taxon BT Taxon can be converted to Taxon <isa> Taxon
X BT Vegetables can be converted to X <isa> vegetable
If there is also X RT PartBasedVegetableType, this can be refined to X <isa> PartBasedVegetableType
A different analysis or modeling of these relationships is as follows, where Vegetable is a concept that, in AGROVOC, has a BT Vegetables:
Vegetable RT Taxon and Vegetable RT PartBasedVegetableType (such as root vegetable) can be converted to (using the synonym Taxon for Organism) [Taxon, AnatomicalPart] <usedAs> vegetable (vegetable being a value of the entity type Use), and similarly with other uses of plants
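Rules of this kind lend themselves to a simple table-driven implementation. The sketch below is only an illustration under stated assumptions: it presumes that each concept already carries an entity type (AGROVOC marks living organisms; a marker for diseases is assumed here), the `Rule` structure and `refine` function are hypothetical names, and "Disease D" and "Organism O" in the usage note are placeholders rather than AGROVOC records.

```python
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    source_type: Optional[str]   # required type of the first concept, None = any
    relation: str                # broad relationship found in the thesaurus
    target_type: Optional[str]   # required type of the second concept, None = any
    target_term: Optional[str]   # required second concept, None = any
    refined: str                 # refined relationship type to propose

# The first three rules listed above, in table form.
RULES = [
    Rule("Disease", "RT", "Taxon", None, "causedBy"),
    Rule("Taxon", "BT", "Taxon", None, "isa"),
    Rule(None, "BT", None, "Vegetables", "isa"),
]

def refine(triple, concept_type):
    """Propose a refined relationship for (source, relation, target), or None."""
    source, relation, target = triple
    for rule in RULES:
        if rule.relation != relation:
            continue
        if rule.source_type and concept_type.get(source) != rule.source_type:
            continue
        if rule.target_type and concept_type.get(target) != rule.target_type:
            continue
        if rule.target_term and target != rule.target_term:
            continue
        return (source, rule.refined, target)
    return None   # no rule applies; leave the relationship for an editor
```

For example, with `concept_type = {"Luffa acutangula": "Taxon", "Luffa": "Taxon"}`, the call `refine(("Luffa acutangula", "BT", "Luffa"), concept_type)` proposes `("Luffa acutangula", "isa", "Luffa")`, while `("Luffa acutangula", "RT", "Vegetable crops")` matches no rule and is left for manual treatment. The output should be treated as a proposal: as the note on Disease RT Organism indicates, some instances will need the other reading (<afflicts>).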
The following is a first attempt at a conceptual schema for the food and agriculture domain. It builds on the relationship types that emerged from an analysis of AGROVOC and a conceptual schema for foods. The food-specific applications of more general relationship types are given in 10-point type. Appendix 5 gives an extensive discussion of entity types and relationship types in the food domain; it contains some additional relationship types that should be integrated.
Entity types
Food product, recipe, standard
Organism (species/variety/cultivar of plant or animal); also called Taxon
Growth stage (maturity)
Environment (with subtypes GeographicalAreaByTemperatureZone, GeographicalAreaByHeight, SoilType)
Agricultural treatment
Season
AnatomicalPart (PartOfPlant, PartOfAnimal)
Composite entity (precombined descriptor) PlantPart, represented as [Taxon, AnatomicalPart] or through a pair of binary relationships as indicated below; for example, grape leaf, lotus root, apple
AnatomicalTypeOfFruit (values: pome fruit, stone fruit, berry, etc.)
Cut no. (from permanent plants)
Grade, quality
Substance, material
Physical state
Physical form
Process (incl. storage and handling)
Agricultural Procedure
Sequence number of process
Temperature
Time (duration)
Equipment
Container
Place/stage of processing or point in distribution chain (e.g., farm, manufacturing plant, retail store, restaurant, home)
Use, diet
Use (such as fruit, vegetable, ornamental; need to develop a taxonomy)
Diet
Consumer group
Purpose or effect (e.g., nutrition, preservation, texture, packing)
Meal Type
Amount
Property
Place (geographic location)
Calendar time
Money
Relationship types
Table starts on the next page
| Isa | Inverse relationship |
| X <includesSpecific> Y | Y <isa> X |
| | Taxon <isa> Taxon |
| | Food product <is a> Food product |
| | Food product <is one of> [Food product list] |
| X <inheritsTo> Y | Y <inheritsFrom> X |
| Holonymy / meronymy (the generic whole-part relationship) | |
| X <containsSubstance> Y | Y <substanceContainedIn> X |
| FoodProduct <containsSubstance> [Substance, amount in total, amount in solids, label claim (yes/no)] | |
| X <hasIngredient> Y | Y <ingredientOf> X |
| FoodProduct <has ingredient> [Food product, rank, total ingredient in total product, ingredient solids in product solids [purpose list]] | |
| FoodProduct <may have ingredient> [Food product, rank, total ingredient in total product, ingredient solids in product solids [purpose list]] | |
| X <madeFrom> Y | Y <usedToMake> X |
| Container <usesStructuralStrengthMaterial> Substance | |
| Container <usesCoatingMaterial> Substance | |
| FoodProduct <madeFrom> FoodProduct | |
| FoodProduct <comes from source> [Food source, environment, agricultural treatment, growth stage] | |
| FoodProduct <comes from part> [Anatomical part, growth stage, cut, grade] | |
| FoodProduct <isExtractedSubstance> [Extracted substance, extracting substance, process, temperature, duration, sequence no.] | |
| FoodProduct <hadRemovedSubstance> [Extracted substance, etc.] | |
| X <yieldsPortion> Y | Y <portionOf> X |
| X <spatiallyIncludes> Y | Y <spatiallyIncludedIn> X |
| X <hasComponent> Y | Y <componentOf> X |
| FoodProduct <containsDish> FoodProduct | |
| X <includesSubprocess> Y | Y <subprocessOf> X |
| X <hasMember> Y | Y <memberOf> X |
| Further relationship examples | |
| X <causes> Y | Y <causedBy> X |
| X <instrumentFor> Y | Y <performedByInstrument> X |
| X <processFor> Y | Y <usesProcess> X |
| X <appliedTo> Y | Y <underwentProcess> X |
| FoodProduct <underwentProcess> [Process, equipment, temperature, duration, place/stage, sequence no., [purpose list]] | |
| FoodProduct <isForSpecialUse> [Use/diet, [country list]] | |
| FoodProduct <madeFor> [Consumer, [country list]] | |
| FoodProduct <usuallyConsumedFor> [Meal type, [country list]] | |
| [Taxon, AnatomicalPart] <usedFor> [purpose, priority [country list]] | |
| Alternatively, three binary relationships with entity type TaxonPart | |
| TaxonPart <isa> AnatomicalPart | |
| TaxonPart <partOf> Taxon | |
| TaxonPart <usedAs> Use | |
| TaxonPart <isa> AnatomicalTypeOfFruit | |
| Substance <usedFor> [purpose, priority, food product] | |
| X <beneficialFor> Y | Y <benefitsFrom> X |
| X <treatmentFor> Y | Y <treatedWith> X |
| X <harmfulFor> Y | Y <harmedBy> X |
| Substance <harmfulFor> [harmful effect, strength, food product] | |
| X <hasPest> Y | Y <afflicts> X |
| X <growsIn> Y | Y <growthEnvironmentFor> X |
| X <hasProperty> Y | Y <propertyOf> X |
| X <hasPhase> Y | Y <phaseOf> X |
| FoodProduct <hasState> Physical state | |
| X <hasForm> Y | Y <isFormOf> X |
| FoodProduct <hasForm> Physical form | |
| Container <hasForm> Physical form | |
| X <hasSymptom> Y | Y <indicates> X |
| X <similarTo> Y | Y <similarTo> X |
| Food product <is analog of> Food product | |
| X <oppositeTo> Y | Y <oppositeTo> X |
| X <ingests> Y | Y <ingestedBy> X |
| FoodProduct <packed in> Container | |
| X <has price> MoneyAmount | |
| Substance <measuredIn> Unit of measurement | |
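One practical consequence of listing each relationship type together with its inverse is that only one direction needs to be stored; the other can be derived. The following is a minimal sketch of such a registry, using a handful of the pairs from the table. The names `INVERSE` and `invert` are illustrative assumptions, not part of any existing system.

```python
# A few pairs copied from the table; a full registry would list them all.
INVERSE = {
    "includesSpecific": "isa",
    "containsSubstance": "substanceContainedIn",
    "hasIngredient": "ingredientOf",
    "madeFrom": "usedToMake",
    "causes": "causedBy",
    "hasPest": "afflicts",
    "similarTo": "similarTo",      # symmetric relationship
    "oppositeTo": "oppositeTo",    # symmetric relationship
}
# Make the lookup work in both directions.
INVERSE.update({inverse: name for name, inverse in list(INVERSE.items())})

def invert(statement):
    """Return the inverse of a (subject, relationship, object) statement."""
    subject, relationship, obj = statement
    return (obj, INVERSE[relationship], subject)
```

For instance, `invert(("X", "hasIngredient", "Y"))` yields `("Y", "ingredientOf", "X")`. The n-ary relationships in the table (those with bracketed argument lists) would need a richer record structure than a triple and are not covered by this sketch.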
The basic approach to matching concepts is to match on any of the English terms given for the concept with subsequent manual editing. The procedure described in the following minimizes the need for manual checking and edits.
This section also touches on procedures for developing a well-structured hierarchy and for refining relationships, as suggested in Recommendation 3, because both processes are interrelated and are most efficiently performed together.
Note on multilinguality. This section makes the vastly oversimplified assumption that the mapping between terms in different languages is accomplished by the existing translations. But different languages reflect different cultures and their differing category schemes. Eventually this problem needs to be addressed, but it is beyond the scope of this report.
Note on resource requirements
One theme of this section is that the best approach in terms of the resulting product is to integrate mapping and intellectual editing into one process. But intellectual editing of an integrated KOS of AGROVOC-plus size requires considerable resources: five person-years at a minimum. However, much can be accomplished with automated integration (far better than nothing). One can then edit incrementally as needs for KOS in specialized areas arise (from special projects in FAO or elsewhere; one possible source of such editing is the various committees that deal with glossaries in special domains).
The mapping goes through the following steps:
Standardize capitalization to within-sentence form (see algorithm in Appendix 8) and derive the singular form for purposes of matching (see Section 2.4.2).
The matching algorithm should recognize British and American spellings as the same. Create synonym sets (groups of terms that are linked through synonym relationships taken from all KOS to be matched; technically speaking, form groups based on transitive closure of synonym relationships from all KOS).
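Forming groups by transitive closure of synonym relationships is a standard connected-components computation. The sketch below uses a union-find structure; the function name and the input format (a flat list of synonym pairs pooled from all KOS) are assumptions made for illustration.

```python
from collections import defaultdict

def synonym_sets(pairs):
    """Group terms into sets by the transitive closure of synonym pairs."""
    parent = {}

    def find(term):
        # Follow parent links to the representative of the term's group.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]   # path halving
            term = parent[term]
        return term

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a               # merge the two groups

    groups = defaultdict(set)
    for term in parent:
        groups[find(term)].add(term)
    return list(groups.values())
```

Given the pairs (Trickle irrigation, Drip irrigation), (Microirrigation, Drip irrigation) and (Fertigation, Fertirrigation), the result is two sets. The input must contain true synonym relationships only: a UF that expresses a conceptual relationship, such as Drip fertigation under both Trickle irrigation and Fertigation, would wrongly merge the two sets.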
Some criteria
Place name matches have high confidence. No need to check. (AGROVOC marks geographical areas.)
Taxa matched on scientific name have high confidence. No need to check. (AGROVOC marks living organisms.)
These two criteria will remove a large number of terms from the terms to be checked manually. For the other terms, additional criteria can be used:
If the descriptors are the same in two schemes, the degree of confidence is higher than if the match is based on synonym relationships. But same term does not guarantee same meaning.
The agreement in the information given for the terms in two schemes contributes to confidence:
If two schemes give exactly the same definition for a term, the meaning can be assumed to be the same.
Agreement in synonyms and/or translations
Agreements in broader, narrower, and related concepts (in that order; concepts may be expressed by different terms in the two schemes)
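These criteria could be combined into a rough confidence score that decides which candidate matches go to an editor. The sketch below is purely illustrative: the point values and the threshold of 70 are invented for the example and would need calibration against manually checked samples, the `Concept` record is a hypothetical structure, and agreement in broader, narrower, and related concepts is approximated by overlap of term sets even though, as noted above, the same concept may be expressed by different terms in the two schemes.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    descriptor: str                # for taxa, assumed to be the scientific name
    kind: str = "other"            # "place", "taxon", or "other" (assumed marker)
    definition: str = ""
    synonyms: set = field(default_factory=set)   # synonyms and/or translations
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)

def overlap(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_confidence(c1, c2, threshold=70.0):
    """Score a candidate match; return (points out of 100, needs_manual_check)."""
    same_descriptor = c1.descriptor.lower() == c2.descriptor.lower()
    # Place names and taxa matched on scientific name: no check needed.
    if same_descriptor and c1.kind == c2.kind and c1.kind in ("place", "taxon"):
        return 100.0, False
    # A descriptor match counts for more than a match through synonyms.
    score = 40.0 if same_descriptor else 20.0
    if c1.definition and c1.definition == c2.definition:
        score += 30.0
    score += 15.0 * overlap(c1.synonyms, c2.synonyms)
    # Broader, narrower, related: weighted in that order.
    score += 8.0 * overlap(c1.broader, c2.broader)
    score += 5.0 * overlap(c1.narrower, c2.narrower)
    score += 2.0 * overlap(c1.related, c2.related)
    return score, score < threshold
```

The function assumes the pair was already proposed by term or synonym matching. With the illustrative weights, identical descriptors with identical definitions reach the threshold and skip manual checking, whereas a synonym-only match with no other agreement scores 20 and is queued for an editor.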
(1) Computer-assisted refinement of the relationships in AGROVOC. The JoDI paper (Appendix 12) and my presentation at the AOS Workshop in Beijing in April 2004 have introduced the rules-as-you-go approach to refining relationships. A few additional rules are given in Section C2. As few as 10 rules that are already known will cover a large number of relationship instances. These rules should be applied now to pave the way for resolution of the more difficult semantic problems and so that obvious errors are detected while editors work on the mapping anyhow. Differentiated relationships are not only needed for AI applications; they also support facet analysis and hierarchy construction discussed in the next steps.
(2) Extract further relationships from descriptor texts and from definitions and scope notes found in any of the contributing KOS.
(3) Preferably also do automated semantic factoring subject to later edit.
Use the software to arrange concepts from all sources into a hierarchy following a skeleton, such as the AGRIS categorization scheme, and using BT/NT relationships to create detailed hierarchies. (The FAO Glossary gives for each term the AGRIS category.)
Detecting mappings between schemes that were not detected by term matching (i.e., the discovery that term A from KOS 1 and term B from KOS 2 actually have the same meaning) is easiest in the context of a hierarchy, and a hierarchy should be constructed in any event (Recommendation 3). The advantage of combining these two steps is efficiency - dealing with a concept once both for checking the terms used to express it and for determining its place in the hierarchy.
Dealing with hierarchy development and all the mappings represented by cross-references all at once may be overwhelming. The following variation may be easier to implement. First develop a skeleton hierarchy (working from the system-generated draft and consulting sources such as textbooks) using just the terms (ignoring, for the moment, the many relationships found in sources but introducing relationships that naturally surface in the editor's mind). Then "enrich" the skeleton hierarchy by adding in all the information (definitions, relationships) from many sources, and edit these in a second pass.
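The system-generated draft hierarchy mentioned above could be produced along the following lines. This is a minimal sketch under simplifying assumptions: each concept has one skeleton category and at most one BT (AGROVOC concepts can have several, as Fertigation shows), cycles are not checked, and the category label "Plant products" and the BT assignments in the usage note are illustrative, not taken from the AGRIS scheme.

```python
from collections import defaultdict

def build_hierarchy(category_of, broader_of):
    """Map each node to its children.

    category_of: concept -> skeleton category
    broader_of:  concept -> broader term (missing if none is known)
    """
    children = defaultdict(list)
    for concept in sorted(category_of):
        parent = broader_of.get(concept)
        if parent in category_of:
            children[parent].append(concept)                  # under its BT
        else:
            children[category_of[concept]].append(concept)    # top of category
    return children

def outline(children, node, depth=0):
    """Return the subtree under node as indented lines for an editor to review."""
    lines = ["  " * depth + node]
    for child in children.get(node, []):
        lines.extend(outline(children, child, depth + 1))
    return lines
```

If Vegetables, Root vegetables, Carrots, and Potatoes are all assigned to a category "Plant products", with Root vegetables BT Vegetables and both Carrots and Potatoes BT Root vegetables, then `outline(children, "Plant products")` lists Vegetables, Root vegetables, and the two specific vegetables at successive indentation levels. Concepts with no usable BT surface directly under their category, which makes them easy for the editor to spot and place.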
The high-level structure needs to take account of the two halves of the FAO subject domain - the primary production of foods and other agricultural (in the broadest sense) raw materials on the one hand and food and nutrition on the other. There are, however, a number of facets that apply to both. On the conceptual level, this suggests a scheme along the lines of the proposal presented below.
There are already several top-level schemes; the two most prominent are given in Appendix 14, The FAO Subject Tree and the AGRIS Category Scheme. Other schemes include the subject categories used for FAO Term and the list of individual subject glossaries included in FAO Glossary; these were not available for this report but should be consulted in the further development of the high-level structure.
Both have two problems:
(1) The arrangements are alphabetical and thus do not communicate a meaningful structure to the user.
(2) They focus on arrangement of resources and not on conceptual structure.
The proposal below focuses on conceptual structure. From this perspective, the branches of agriculture, broadly defined, form a facet, along with other facets. Processes, such as anti-pest measures or fertilization, apply across different branches of agriculture. For arrangement it may well be useful to group by branches of agriculture first, but on a Web site one can offer other arrangements, based on other facets, as well.
The proposal below in no way claims to be complete. It is merely meant to illustrate an idea as the starting point of further discussion.
agriculture (broadly defined)
· processes in agriculture
· earth science and environment (soil, water, climate, pollution)
· veterinary medicine and nutrition
· demographic characteristics of plants and animals
food and nutrition
· food uses
subjects and facets applicable to both
· physical sciences, chemistry, biology
· communication, education, extension, advisory work
· engineering and technology
· biological taxonomy
Appendix 2 lists a number of sources, divided into general coverage sources and specialized sources.
The sources are labeled as follows:
G/D = Glossary, dictionary
T = Thesaurus, classification
N = Nomenclature
DB = Database
O = Other (handbook etc.)
+ KOS maintained, sponsored, or used by FAO
* Otherwise consider as a priority source of terms for AGROVOC
# Site to link to in an Agricultural Ontology Server for more detailed information
The sources marked by * were selected based on germaneness to FAO's work, authority of the originating organization, richness of information, and, where appropriate, size. FAO personnel are more knowledgeable about the areas in which AGROVOC is weak and are therefore in a better position to assign source priority based on that criterion.
These sources can be harvested for additional concepts, terms in multiple languages, definitions, and relationships. This requires
Clearance of copyrights, where applicable.
Scripts to transform the sources into the input format of the KOS management software used.
Adding to the FAO KOS database via KOS management software (see Section E above)
Editing terms, definitions, relationships that are new to integrate the source into the FAO KOS database.
There are two approaches that can be used together as appropriate:
(1) a system for integrated access to various independently maintained KOS (corresponding to solution (S2) in Recommendation 2);
(2) A Web-accessible collaborative integrated multi-KOS database (corresponding to solution (S2) in Recommendation 2).
Such a system would need a plug-in for every KOS accessed to translate queries going out and data coming in in response. The Z39.50 standard will be useful in simple cases. The major problem is integrating "on the fly" information on a concept or term from different sources. In many cases, the system will simply provide separate records. Such a system is described in some detail in the two documents given in Appendix 11. An abstract follows.
Dagobert Soergel. SemWeb: integrated access to distributed ontological resources
Abstract
We propose to develop a system, dubbed SemWeb, that would revolutionize the way people - from experts to students - interact with conceptual structures and terminology and the way they share such knowledge. We aim at the synergistic exploitation of existing lexical and ontological knowledge bases (ontologies/classifications, thesauri, dictionaries) and their vast intellectual capital through integrated access, allowing a user to consult multiple sources with one search that returns one integrated answer that visualizes concept relationships for ease of understanding. SemWeb is intended for a wide variety of users and uses - including education, information retrieval, knowledge-based systems and natural language processing - and to bridge disciplines, languages, and cultures. The same environment will support collaborative development and maintenance of ontologies and lexica. We will do research on difficult issues that need to be addressed in the system; for example, we will study how ontological and lexical knowledge is used in different disciplines and we will work on defining measures and methods for the evaluation of ontologies, lexica, and their representations and for correlating and integrating ontologies. We will also study the use and impact of the prototype through pilot application and user studies, particularly the impact on learning by students.
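The plug-in arrangement for distributed access can be pictured as one adapter class per external KOS behind a common interface. The sketch below is a hypothetical design, not the SemWeb architecture: the class and function names are invented, and an in-memory stand-in replaces real network access (for example via Z39.50) so that the example is self-contained.

```python
from abc import ABC, abstractmethod

class KOSPlugin(ABC):
    """Hypothetical adapter interface: one subclass per external KOS."""
    name: str

    @abstractmethod
    def lookup(self, term):
        """Translate the query, call the source, return records as dicts."""

class InMemoryPlugin(KOSPlugin):
    """Stand-in for a real source so the sketch runs without a network."""

    def __init__(self, name, records):
        self.name = name
        self._records = records

    def lookup(self, term):
        # Tag every record with its source so the origin is never lost.
        return [dict(record, source=self.name) for record in self._records
                if record["term"].lower() == term.lower()]

def federated_lookup(term, plugins):
    """Query every plug-in; keep the records separate, grouped by source."""
    return {plugin.name: plugin.lookup(term) for plugin in plugins}
```

Returning the records grouped by source mirrors the simple case described above, in which the system provides separate records. Integrating them on the fly into a single concept record is the hard problem and is not attempted here.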
Perhaps the easiest approach is a central Web-accessible integrated multi-KOS database for the collaborative development and maintenance of KOS. FAO needs to develop such a system for the development and maintenance of its own KOS in any event and could open this system to others who either want to develop their own KOS (and could draw on the resources available) or who want to use the new version of AGROVOC (provided the concepts of interest to them are represented) and are willing (and certified) to contribute to AGROVOC as external collaborators. The thesaurus data model from the JoDI article (Appendix 12) supports this approach. As mentioned under Recommendation 2, FAO might consider entering into a system development collaboration with the Harvard Business School (HBS) thesaurus effort; the data schema for a thesaurus and ontology system under Oracle developed at HBS is found in Appendix 10 (HBS internal document, for internal use of the FAO thesaurus and ontology group only).
This system should allow access to individual KOS in such a way that access through a given URL would bring the user to an interface that accesses a specific KOS and is customized by and for the organization that maintains that KOS. Such an interface could provide the option of accessing other KOS in the system, of accessing other KOS as described in H1, or of displaying, for any descriptor, the corresponding descriptor from any other KOS.
This approach is easiest from a technical and KOS maintenance point of view but requires organizational arrangements that may not always be feasible.
The two approaches clearly complement each other. Whatever important KOS cannot be incorporated into a central KOS system can be made accessible through distributed access. The two approaches can also share code. The Web access module for the central KOS database should be written in such a way that it can be used for the display of data obtained through distributed access as well. Approach 2 will need some import modules for KOS to be imported into the central database. What is learned from writing these import modules, and possibly some of the code, can be used for writing plug-ins for communicating with external KOS in Approach 1.
The major conclusion of the business case document is validated in this report: The present fractured approach to developing and maintaining KOS leads to a number of undesirable consequences:
There is duplication of effort; the same problems of terminology (including translation) and concept definition and concept relationships are dealt with independently in several units.
Terminology and thesaurus work is not always done in the most expert and efficient manner.
The schemes developed show many inconsistencies.
Unique knowledge generated in one place is not, or not fully, used in other places. For example, the careful definitions prepared in the FAO Glossaries are not fully used by translators.
Consequently, the business case for a unified approach to developing and maintaining KOS is overwhelming. This is demonstrated in more detail in the analysis under Recommendation 2. One drawback is that individual units will be obliged or at least encouraged to work together on the definition of concepts in an attempt to harmonize their definition. However, if this attempt is not successful, the KOS Distribution System (KDS) described under Recommendation 2 is capable of storing different views, so individual units do not need to abandon whatever autonomy they have now. Another drawback is that resources for developing and maintaining the system for KOS development and maintenance need to be centralized in one place which, even though overall efficiency is increased, may meet with resistance within the organization.
If anything, the business case is not made strongly enough. It focuses on the use of KOS for information retrieval and translation. However, there are many more uses of KOS, including learning and reader assistance and intelligent information processing (see Appendix 4). All those uses must be considered to maximize the return on investment for KOS projects. This requires, among other things, a more differentiated set of relationships. This report includes efficient methods to achieve this. Furthermore, the KDS makes it possible to marshal resources from outside FAO for this task.
This is covered by other suggestions, especially Recommendation 2.
Examples showing the effects of integration of FAO Term, AGROVOC and the FAO Glossary are given in Appendix 1.
A mock-up of some screens of an interface for accessing KOS data is given in Appendix 9.