Part 1. Major recommendations

Major recommendations. Overview

1 Recommendation 1. Build an inventory of KOS uses and KOS, now and future

For FAO, for organizations with which FAO collaborates or which FAO wishes to support, and for the food and agriculture domain in general, assemble and maintain an inventory (registry) (1) of KOS use cases and of functions that should be served by information on concepts and terms and (2) of KOS and KOS efforts. To realize the full benefit KOS can provide for the organization and to allow for a complete cost-benefit analysis of KOS activities, this inventory should cover a wide spectrum of existing and imaginative new KOS uses (Appendix 4). Developing this inventory requires thorough knowledge of the organization as well as imagination and vision.

2 Recommendation 2. Integrate information management for all FAO KOS

Integrate information management for all FAO KOS, beginning with AGROVOC, FAO Term, and FAO Glossary, into one FAO KOS database, called the FAO KOS Distribution System (KDS), to be used by all groups that create and maintain KOS. The FAO KDS should also provide the environment for collaborative development and refinement of KOS as needed by Semantic Web and other artificial intelligence applications and for a crosswalk between major agricultural KOS (Recommendations 3 and 4, Section H).

3 Recommendation 3. Incrementally build a rich ontology of the FAO domain

This recommendation responds to the need for rich ontologies for improved information access and intelligent information processing in the food and agriculture domain. Starting with the concepts in AGROVOC, FAO Term, and FAO Glossary (and possibly the NAL, CABI, and CAAS thesauri) develop a well-structured, meaningfully arranged classification of the FAO domain (using facets where appropriate) in collaboration with groups both inside and outside FAO. Apply the rules-as-you-go approach to refine relationships, starting now with a few rules that apply to many relationships so that collaborating groups can focus on the semantically more difficult relationships.

4 Recommendation 4. Create a crosswalk between major KOS in the FAO domain

Within financial constraints, collaborate with other institutions, in particular NAL, CABI, and CAAS, to create a crosswalk between major KOS in the FAO domain, evolving incrementally into a system modeled in functionality, but not in implementation, after the Unified Medical Language System of the US National Library of Medicine as the basis for the Agricultural Ontology Service. This should be linked seamlessly to a database of the taxonomy of living things and to a geographic name server.

5 Recommendation 5. Use powerful KOS management software

Use KOS management software (KMS) that supports many KOS, can handle complex concept and term relationships, and, through full exploitation of the knowledge available in existing KOS and through intelligent processing, makes the process of creating and maintaining KOS as efficient as possible.

Major recommendations. Detail

1 Recommendation 1. Build an inventory of KOS uses and KOS, now and future

KOS uses include existing uses as well as new uses or applications for KOS (existing functions not now supported but that would benefit from using a KOS and future functions that would benefit). KOS include existing KOS and their maintenance, KOS in the process of development, and new KOS planned or suggested.

1.1 Rationale

(1) Even within FAO itself (not to mention other national and international organizations) there are many isolated efforts in developing and maintaining KOS, often by staff who are not experts in KOS development. This leads to duplication of effort, inefficiency, redundancy, and inconsistency. An inventory of such efforts, and of implemented and potential use cases for such efforts, is needed for better planning.

(2) An inventory of KOS and KOS development efforts is needed so that all available resources can be used to answer questions, be it through a distributed system or an integrated database.

(3) An inventory of use cases provides the data needed to set priorities in KOS development efforts and supports full exploitation of KOS, and thus increases return on investment.

(4) An inventory of use cases provides the basis for at least a "seat of the pants" estimate of the return on investment and thereby provides a basis for a complete cost-benefit analysis of KOS activities and decisions on resource allocation.

(5) This will provide a starting point for a KOS inventory needed as the backbone of the Agricultural Ontology Service.

1.2 Implementation ideas

Develop a simple database application with a Web interface for easy collaborative data entry and easy search. An outline of fields and a few sample records are given in Appendix 3. Note: The US Food and Drug Administration (FDA) will develop such an inventory and may be willing to share their code. I could be the liaison for this.
Import the existing list of KOS products/projects and the KOS records in Appendix 2 (especially the KOS produced within FAO) through reformatting.
Have members of the working group each enter some use cases.
Use the inventory for dealing with requests for vocabulary development: Encourage users, both from within and outside FAO, to enter new requirements for KOS through the system as a new use case. Alternatively, when a vocabulary for a new application within FAO is requested, the staff member receiving and processing the request enters a new use case. These new use cases can be run as queries against the inventory of available KOS and/or an editor of FAO could match the need with a suitable KOS or determine that an existing KOS needs to be adapted or a new KOS developed.
Use the templates as a guide to enter data that are easily available to the person entering a use case or a KOS. Other data may need to be entered later, possibly by editors with specialized knowledge.
The templates need to be amended as they are used.
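
Running a new use case as a query against the inventory could be sketched as follows. This is a minimal illustration, not the FDA code mentioned above: the record fields, the matching rule (a KOS is a candidate if it shares a subject domain with the use case, and is usable as is only if it covers every required domain, language, and relationship type), and the sample records are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KosRecord:
    """One KOS in the inventory (field names are illustrative)."""
    title: str
    domains: set[str]
    languages: set[str]
    relationship_types: set[str] = field(default_factory=set)


@dataclass
class UseCase:
    """KOS requirements stated by a use case (field names are illustrative)."""
    title: str
    domains: set[str]
    languages: set[str]
    relationship_types: set[str] = field(default_factory=set)


def match_use_case(use_case, inventory):
    """Split the inventory into KOS usable as is and KOS that would need adapting."""
    usable, adaptable = [], []
    for kos in inventory:
        if not (kos.domains & use_case.domains):
            continue  # no overlap in subject domain: not a candidate
        complete = (
            use_case.domains <= kos.domains
            and use_case.languages <= kos.languages
            and use_case.relationship_types <= kos.relationship_types
        )
        (usable if complete else adaptable).append(kos.title)
    return usable, adaptable


# Hypothetical inventory entries and use case, for illustration only.
inventory = [
    KosRecord("Thesaurus A", {"fisheries", "forestry"}, {"en", "fr", "es"}, {"broader"}),
    KosRecord("Glossary B", {"fisheries"}, {"en"}),
    KosRecord("List C", {"nutrition"}, {"en"}),
]
request = UseCase("Index fisheries reports", {"fisheries"}, {"en", "fr"}, {"broader"})
usable, adaptable = match_use_case(request, inventory)
```

If both lists come back empty, the use case points to a KOS that still needs to be developed; an editor would review the result in any case.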

1.3 Template for KOS use cases

Number and title

Relationship to agency mission: priority supported

Activity supported / savings or benefit offered by the KOS

Internal activity (e.g., routing of drug applications) versus
External activity (e.g., public information about drugs)
Both linked to beneficiary group

Beneficiary/user group

How many people? How many potential instances per person per month?

How many instances of the activity per month

Now
Planned (may increase due to ease of use through thesaurus support, marketing, etc.)

Benefits per instance

Savings in time and/or money
Quality improvement (how much?)

System supporting this activity

KOS requirements: Subject domains, specificity, languages, types of relationships

KOS that are or can be used as is, that need to be adapted, or that need to be developed

Other functional requirements: What needs to happen to make the potential benefits real. Costs and responsibility for each

System-side. For example

Install automatic query term expansion for free-text searching on a Web site
Use KOS for indexing with a controlled vocabulary: human, computer-assisted, automatic
User side. Train users in applying the KOS
Reengineering work processes

Estimated cost for KOS application

Estimated benefits

Time frame

Comments
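
The quantitative fields of this template (instances per month, benefit per instance, estimated cost) are enough for the rough return-on-investment estimate mentioned in the rationale. The sketch below shows one possible way to combine them. The formula, the function names, and all figures are assumptions chosen for illustration, not data from any FAO use case.

```python
def monthly_benefit(instances_per_month, minutes_saved_per_instance, cost_per_hour):
    """Value of the time saved per month, in the currency of cost_per_hour."""
    return instances_per_month * minutes_saved_per_instance / 60 * cost_per_hour


def payback_months(application_cost, benefit_per_month):
    """Months until the estimated benefit covers the cost of the KOS application."""
    if benefit_per_month <= 0:
        return None  # no measurable benefit, so the cost is never recovered
    return application_cost / benefit_per_month


# Invented figures: 400 searches a month, 3 minutes saved per search,
# staff time valued at 50 per hour, KOS application costing 12000.
benefit = monthly_benefit(400, 3, 50)
months = payback_months(12000, benefit)
```

With these invented figures the benefit is 1000 per month and the cost is recovered after 12 months. Quality improvements would have to be estimated separately, since they are not captured by time savings.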

1.4 Template for KOS projects

Number and title

Relationship to agency mission: priority supported

Related KOS use cases

Scope and size

Unit and person responsible

Collaboration / coordination (actual and possible)

Development versus maintenance

Any gaps in domain coverage

Data model (entity and relationship types) (existing and needed)

Software used, file structure

Publication data / Location / URL

Development person hours / maintenance person hours per month

Estimated cost for development and for maintenance

Time frame

Comments

2 Recommendation 2. Integrate information management for all FAO KOS

2.1 General rationale (also for Recommendation 3)

There is considerable overlap both in terms and in definitions and relationships between terms. Having one database will reduce effort in maintenance and reduce system requirements.
On the other hand, individual systems contain a considerable amount of unique information so that users now need to consult multiple systems.
The proposed system will provide users and editors with a complete, unified, reconciled view of all the information (or the subset they require). Reconciliation of information will not be easy on the fly.
Information will be better and more widely applied. For example, from multiple definitions, one new and better definition can be constructed or, if several definitions are needed, each can be improved and differences can be explicated. For example, the definitions given in the glossaries would become available to translators using FAO Term without any extra effort (which would normally not be undertaken), and this will improve translation quality. In other words, the information in the glossaries, which is uniquely valuable due to the consensus definitions that clarify concepts, would be put to use more widely and have more impact.
The interests and requirements of all participating groups can be served by an integrated system that is hospitable to differing viewpoints and, when necessary, to different categorical views. KDS would allow for the "cohabitation" of multiple views, even to the point of having reverse hierarchical relationships in different perspectives (with different source codes).
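
One way to picture this cohabitation of views is to tag every hierarchical relationship with a source code, so that two sources can order the same pair of concepts in opposite directions without conflict. The sketch below is only an illustration under that assumption; the class, its methods, and the placeholder concepts "X" and "Y" and sources "A" and "B" are hypothetical.

```python
from collections import defaultdict


class HierarchyStore:
    """Hierarchical relationships kept separately for each source code."""

    def __init__(self):
        self._broader = defaultdict(set)  # (source, concept) -> broader concepts

    def add(self, source, narrower, broader):
        self._broader[(source, narrower)].add(broader)

    def broader(self, concept, source):
        """Broader concepts of `concept` as seen from one source's perspective."""
        return sorted(self._broader.get((source, concept), ()))

    def reversed_pairs(self):
        """Pairs that different sources order in opposite directions."""
        directed = defaultdict(set)  # (narrower, broader) -> sources asserting it
        for (source, narrower), broaders in self._broader.items():
            for broader in broaders:
                directed[(narrower, broader)].add(source)
        found = []
        for (narrower, broader), sources in directed.items():
            # report each reversed pair once, in alphabetical order of the pair
            if narrower < broader and (broader, narrower) in directed:
                opposite = directed[(broader, narrower)]
                found.append((narrower, broader, sorted(sources), sorted(opposite)))
        return sorted(found)


store = HierarchyStore()
store.add("A", narrower="X", broader="Y")  # source A places X under Y
store.add("B", narrower="Y", broader="X")  # source B places Y under X
```

Each perspective stays intact when queried by source, and `reversed_pairs()` gives editors a list of the places where the views differ.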

While these objectives can be achieved, to some extent, by a federated solution, such a solution will have higher costs and will likely not achieve the objectives as well.

The data in this database are produced by many units throughout FAO, for example the various groups responsible for maintaining specialized glossaries. Each unit should retain ownership over and control of its data, and data should not be changed without agreement of the unit.

2.2 Present situation

Presently, the three major KOS are maintained as follows:

AGROVOC is maintained in a MySQL database with a simple interface for adding and editing concepts and terms. This database is ported to ORACLE for Web access.

FAO Term is maintained in TRADOS MultiTerm, which tightly integrates with the TRADOS translation environment used by translators. The MultiTerm database is ported to ORACLE for Web access. (See Appendix 15 for a description of the workflow.)

FAO Glossary is maintained as an ORACLE database with a convenient interface for adding and editing terms. However, many of the glossaries are maintained by their owners as word processing documents, which are then parsed (requiring sophisticated procedures) to read the data into ORACLE. This is done only at long intervals following the publication cycle of updated versions of the printed glossaries (for example, every two years). The data structure and the interface to this database allow for many types of information, many of which are at present not populated.

Within the scope of this report it was not possible to examine all the KOS maintained in FAO. Recommendation 1 addresses the issue of creating an inventory of these KOS.

2.3 Detailed consideration of alternatives

It is assumed that two criteria must be met by any solution to be implemented:

(a) Users should have one-stop access to all information about a concept or term.

(b) The present owners of a KOS need to retain control over the content of that KOS.

There are four overall solutions that can be considered (each with many ways of implementation):

(S1) Maintain the status quo of multiple systems with independent access.

(S2) Develop an interface that accesses different systems and integrates information from several KOS on the fly.

(S3) Develop a unified system that provides a joint home for different lexical knowledge bases, leaving control with the owners, and that provides users with integrated access to information from all KOS within FAO.

(S4) Implement a unified system under central control.

Solution (S1) fails to meet the necessary criterion (a), and solution (S4) fails to meet the necessary criterion (b). This leaves solutions (S2) and (S3) for closer examination. The criteria set forth below are suggested for this examination.

First this report will elaborate on the solutions.

Solution (S2) can be characterized and implemented as follows:

Existing systems operate as they do now. Have an overall access format that is simply a list of all the types of data available from any of the underlying systems, with duplicates removed. This tells users what data are available; internally the format indicates which type of information is available where.

Data can be communicated from one system to another through intermediation of this format but the content may be based on different definitions for a type of data, for example, different definitions of relationships between concepts.

If a user requests certain types of data, the access system obtains these data from the appropriate underlying systems and combines them into one display without resolving semantic ambiguities and inconsistencies that might exist.

It is possible within this solution to start the semantic integration of different KOS and reflect the results in each KOS; the editors of each KOS have access to all the data in the other KOS to facilitate this process. That will make the on-the-fly integration of data for presentation to the user easier.

This solution has two variants:

(S2.1) Each system has its own set of KOS data entry and editing screens; access to data from other KOS through the integrated end-user interface

(S2.2) Data entry and editing screens are coordinated with built-in access to all KOS (a step towards (S3))
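
The on-the-fly combination that characterizes Solution (S2) can be sketched as follows. The sketch assumes that each underlying system can be queried through a lookup function returning a record of data types and values; the function name, the record layout, and the sample data are hypothetical. Values from different systems are shown side by side, labelled by system, and nothing is reconciled.

```python
def combined_display(term, systems):
    """Collect what each underlying system holds for a term.

    `systems` maps a system name to a lookup function that returns a dict
    of data type -> value, or None when the system does not know the term.
    """
    display = {}
    for name, lookup in systems.items():
        record = lookup(term)
        if record is None:
            continue
        for data_type, value in record.items():
            # keep every value with its origin; inconsistencies stay visible
            display.setdefault(data_type, []).append((name, value))
    return display


# Two stand-in systems holding invented records for the same term.
system_a = {"silage": {"definition": "definition held in system A", "broader": "feed"}}
system_b = {"silage": {"definition": "definition held in system B"}}
systems = {"System A": system_a.get, "System B": system_b.get}

report = combined_display("silage", systems)
```

The user sees two definitions for the same term and has to judge whether they differ in substance, which is the limitation of (S2) noted above.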

Solution (S3) can be characterized and implemented as follows. (See also the detailed suggestions below)

A common semantic model that provides standard types of data (entity types and relationship types, data elements) with definitions that are used by all contributors/editors of KOS data.

All contributors/editors have access to all data so that duplication of effort in creating and maintaining data is avoided. The system supports online communication among contributors/editors to enable continuous thoughtful integration of data (see below).

The overall system has non-redundant or at least consistent data storage.

Responsibility for cleaning up existing data and adding new data and editorial control is distributed across multiple contributors/editors according to one or more of the following criteria: subject domain, user group, and type of data. Each unit maintains complete control of its data.

Data on concepts and terms are thoughtfully integrated across all points of origination. This increases data quality and interoperability, for example by clarifying a concept and its definition from multiple perspectives.

If a user requests data about a concept or term, she receives a unified and consistent report.

Criteria for evaluating alternatives for dealing with FAO KOS

(c1) User access to the KOS

(c1.1) Ease of access

(c1.2) Quality of information integration

(c1.3) Response time

(c2) Interoperability of KOS as they are applied in information systems

(c3) Maintenance of the KOS

(c3.1) Enable use of the information in all KOS while editing any one KOS

Good performance on (c3.1) promotes (c2) Interoperability

(c3.2) Using suggestions (for new concepts, terms, definitions, etc.) from users and indexers of one KOS for other KOS. Suggestions made by users of one system will often be useful for the maintenance of other KOS

(c3.3) Transferability of KOS editing skills from one KOS to another

(c4) Implementation and maintenance of the software system

(c4.1) First implementation

(c4.2) Maintenance

(c4.3) Storage space

(c5) Integration with applications

(c6) Supporting new KOS or automating existing KOS that are manually maintained. Adding these KOS to the system

The following table compares solutions (S2) and (S3) using these criteria

Criterion	Solution (S2) Integrated access to separate systems	Solution (S3) Unified system for decentralized but coordinated KOS maintenance
(c1) User access to the KOS
(c1.1) Ease of access	No difference between solutions
(c1.2) Quality of information integration	This is likely to be limited since integration on the fly is difficult. This will work as well as Solution (S3) only if the information in the different KOS has been edited to avoid all unnecessary differences.	Good because integration is done beforehand in the system.
(c1.3) Response time	May be slow because of accessing multiple systems and then processing the information gathered. Likely to deteriorate as new systems are added.	Good because only a single database with pre-integrated information is accessed. Proper database design will maintain performance even in a large database.
(c2) Interoperability of KOS	If unnecessary differences are edited out and common standards are followed, (S2) will work ok. However, there is the danger that the systems will diverge again unless maintenance is tightly coupled, as described in (S3). (S2.2) would be better here than (S2.1)	Built-in to the extent that the owners of the different systems can agree on common concepts and terms. Remaining differences should be clearly articulated, which is somewhat easier in a unified framework.
(c3) Maintenance of the KOS
(c3.1) Enable use of the information in all KOS while editing any one KOS	Cumbersome in (S2.1), involving copying from the end-user interface to use data from a KOS other than that being edited. Easier in (S2.2)	Built in
(c3.2) Using suggestions made for any KOS for maintenance of all KOS	Requires continuous exchange of data	Unified suggestion list built into the system
(c3.3) Transferability of KOS editing skills from one KOS to another	Requires learning content rules specific to the other KOS. For (S2.1): requires learning a new interface. For (S2.2): interface always the same.	Requires learning content rules specific to the other KOS. Interface always the same.
(c4) Implementation and maintenance of the software system
(c4.1) First implementation	Very similar. Both need to deal with idiosyncrasies of individual systems. Both can build on the existing code base. Needs to deal with access to multiple databases.	Needs to access just one database, which is easier to implement.
(c4.2) Maintenance	Any changes in participating systems require work. Different systems may do same functions differently.	All changes and improvements made benefit all KOS.
(c4.3) Storage space	Much redundant storage	Each piece of information stored once
(c5) Integration with applications	Difficult since either all applications must deal with accessing multiple systems or a common API that provides integrated access to all systems must be developed. While developing such an API can build on the work done for the integrated end-user interface, it is still a major piece of work	Easy to access information from all KOS at once. If data are needed in a format specific to the application, a data export module must be created
(c6) Supporting new KOS	Need to develop a whole new application (unless the new KOS is created and maintained within one of the existing KOS systems, which amounts to Solution (S3) on a smaller scale). All mechanisms for common access must be updated.	May need to add some new functionality, but can simply add new data in most cases.

Table 1. Comparison of KOS implementation solutions

This analysis shows that both Solution (S2) and Solution (S3) can be designed to meet the requirements set forth, but (S3) does so more elegantly, at lower cost, and most likely with higher performance. This is why (S3) is recommended.

2.4 Notes on implementation

2.4.1 Integration of ORACLE databases

Since all these KOS exist as ORACLE databases, storing all their data in one database (without intellectual integration) should not be hard. The new database consists of a set of tables drawn from the existing databases. Some of these tables will consist of the union of the data fields (columns) from two or more existing tables that are very similar. Other tables will simply be taken as is from an existing database. The comparative table of data fields from the three existing databases in Appendix 7 should assist in this schema integration. There may also be some entirely new tables that deal with data ownership and display destination.
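
The union of data fields from two similar tables, together with a new column recording data ownership, might look like the following sketch. It uses Python's built-in sqlite3 module as a stand-in for ORACLE so that the example is self-contained. The table and column names are hypothetical and are not those listed in Appendix 7, and the sketch assumes that no source table already has a column named owner.

```python
import sqlite3


def merge_tables(conn, sources, target):
    """Create `target` with the union of the source tables' columns.

    `sources` is a list of (table_name, owner_label) pairs. Columns a source
    table lacks are filled with NULL; `owner` records where each row came from.
    """
    columns, per_table = [], {}
    for table, _owner in sources:
        cols = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        per_table[table] = cols
        for col in cols:
            if col not in columns:
                columns.append(col)
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{target}" (owner TEXT, {col_defs})')
    for table, owner in sources:
        select = ", ".join(
            f'"{c}"' if c in per_table[table] else "NULL" for c in columns
        )
        conn.execute(
            f'INSERT INTO "{target}" SELECT ?, {select} FROM "{table}"', (owner,)
        )
    return columns


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thes_term (term TEXT, lang TEXT, scope_note TEXT)")
conn.execute("CREATE TABLE gloss_term (term TEXT, lang TEXT, definition TEXT)")
conn.execute("INSERT INTO thes_term VALUES ('silage', 'en', 'a scope note')")
conn.execute("INSERT INTO gloss_term VALUES ('silage', 'en', 'a definition')")

merged_columns = merge_tables(
    conn, [("thes_term", "thesaurus"), ("gloss_term", "glossary")], "kds_term"
)
rows = conn.execute(
    "SELECT owner, definition FROM kds_term ORDER BY owner"
).fetchall()
```

The merged table keeps one row per source record, so no intellectual integration takes place at this stage; the owner column is what later allows each unit to retain control of its data.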

The AGROVOC interface can be ported from MySQL to ORACLE and combined with the functionality of the FAO Glossary interface (which should be easily adapted to the new database structure). This functionality can then be used for editing FAO Term, with the possible addition of some features. Finally, software must be written to port all FAO Term data (all the data in the combined database that are potentially useful to translators) to MultiTerm so that these data are available in the TRADOS translation environment.

Another feature to be added is a notification system: Whenever one team adds a term, the other teams must be notified so they can consider adding the term as well. In the case of the FAO Glossary, there should be an editor that can forward the notification to the appropriate unit(s) (possibly none).
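
A minimal sketch of such a notification system is given below. The class, its method names, and the glossary unit labels are assumptions for illustration; the forwarding step stands for the decision of the FAO Glossary editor, who may forward a term to several units or to none.

```python
class TermNotifier:
    """Queues notices about new terms for the other KOS teams."""

    def __init__(self, teams, glossary_units):
        self.team_inbox = {team: [] for team in teams}
        self.unit_inbox = {unit: [] for unit in glossary_units}

    def term_added(self, term, by_team):
        """Notify every team except the one that added the term."""
        notified = []
        for team, inbox in self.team_inbox.items():
            if team != by_team:
                inbox.append((term, by_team))
                notified.append(team)
        return notified

    def forward(self, term, units):
        """Glossary editor forwards a notice to the chosen units (possibly none)."""
        for unit in units:
            self.unit_inbox[unit].append(term)


notifier = TermNotifier(
    teams=["AGROVOC", "FAO Term", "FAO Glossary"],
    glossary_units=["unit 1", "unit 2"],  # hypothetical unit labels
)
notified = notifier.term_added("agroforestry", by_team="AGROVOC")
notifier.forward("agroforestry", ["unit 1"])
```

Each team then reviews its inbox and decides whether to add the term, so that control over content stays with the owners.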

The database should be set up in such a way that each concept can be accessed through a URI.
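
As a sketch of what URI access could mean in practice, the functions below build a URI from a concept identifier and recover the identifier from a URI. The base address and the identifier format are placeholders, since the report does not prescribe a URI scheme.

```python
from urllib.parse import quote, unquote

# Placeholder base address; the real one would be chosen by FAO.
BASE_URI = "https://kds.example.org/concept/"


def concept_uri(concept_id):
    """Build the URI for a concept, escaping characters that are unsafe in a path."""
    return BASE_URI + quote(str(concept_id), safe="")


def concept_id_from_uri(uri):
    """Recover the concept identifier from a URI built by concept_uri."""
    if not uri.startswith(BASE_URI):
        raise ValueError(f"not a concept URI: {uri}")
    return unquote(uri[len(BASE_URI):])


uri = concept_uri("c 12/3")  # invented identifier with awkward characters
```

Keeping the URI tied to a stable concept identifier rather than to a term means the address survives changes to the preferred term.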

Note: The Harvard Business School Thesaurus Project has developed an ORACLE database schema (Appendix 10) and is developing data entry and user interface screens based on that schema. They would be amenable to talks about collaboration in developing this application. This schema would have to be augmented to take account of multiple languages.

2.4.2 Harmonization of term formats

All systems should predominantly follow standard dictionary practice for the form of terms. This means:

As a rule, terms should be normalized to singular (possibly with some exceptions in AGROVOC).
Terms should appear in the database as they would inside a sentence. All terms except proper nouns would thus start with lower case. Term comparison in the system should be case-sensitive to distinguish terms like turkey and Turkey. (For searching the default should be case-insensitive with an option for case-sensitive). An algorithm to convert AGROVOC to initial lower case with some intelligence is shown in Appendix 8.
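
The conventions above can be illustrated with a short sketch. This is not the algorithm of Appendix 8; it is a simplified stand-in that assumes a list of known proper nouns is available and that treats a fully capitalized first word as an acronym to be left alone. All function names are hypothetical.

```python
def to_dictionary_form(term, proper_nouns=frozenset()):
    """Lower-case the first letter unless the term is a proper noun or starts with an acronym."""
    if term in proper_nouns:
        return term
    first_word = term.split(" ", 1)[0]
    if len(first_word) > 1 and first_word.isupper():
        return term  # acronym-like first word, keep as is
    return term[:1].lower() + term[1:]


def same_term(a, b):
    """Term comparison inside the system is case-sensitive."""
    return a == b


def search(query, terms, case_sensitive=False):
    """Searching ignores case by default, with an option to respect it."""
    if case_sensitive:
        return [t for t in terms if t == query]
    return [t for t in terms if t.casefold() == query.casefold()]


stored = ["turkey", "Turkey"]
```

For example, `to_dictionary_form("Rice")` gives `"rice"`, while `to_dictionary_form("Turkey", {"Turkey"})` leaves the country name unchanged. A default search for `"turkey"` finds both stored terms, and a case-sensitive search finds only the bird.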

2.4.3 Arrangements for Web access

For now, Web access should probably be arranged as follows:

Each unit that produces a glossary has on its Web site a link to its own glossary
FAO Term and the overall FAO Glossary are combined
AGROVOC retains its present access.

The existing interfaces for Web access should be easily modified to work with the new database.

2.5 Further development

The KDS common database suggested here will provide immediate access to all data through a common mechanism. By putting the knowledge available now into a format that can be easily augmented and refined, KDS also provides the framework for

incrementally introducing intellectual integration of all FAO KOS and other KOS added to the system over time and incrementally refining relationships, culminating in a rich ontology for the FAO domain;
extracting specialized KOS;
developing new KOS in an overall framework;
providing a platform through which all KOS editors can communicate;
accepting inputs from users for incremental augmentation of the KOS database;
efficiently building a crosswalk between various systems;
incrementally improving the software system for KOS management.

3 Recommendation 3. Incrementally build a rich ontology of the FAO domain

A well-structured hierarchy is essential to support good indexing and query formulation and other user interactions with the KOS. It is equally important to support reasoning through hierarchical inheritance in artificial intelligence and Semantic Web applications. Both applications also require a rich set of differentiated relationships as detailed in the JoDI paper (Appendix 12). Such relationships are also needed for extracting specialized KOS. For example, to extract a KOS on rice, we need relationships from rice to organisms that are pests of rice to be sure to include such organisms.
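
The rice example can be made concrete with a small sketch: starting from a seed concept, follow only the selected relationship types and collect every concept reached. The relationship type names (hasPest, hasNarrower) and the sample triples are assumptions for illustration and are not taken from AGROVOC.

```python
from collections import deque


def extract_kos(seed, relations, follow):
    """Collect the seed and all concepts reachable through the chosen relationship types.

    `relations` is an iterable of (subject, relationship_type, object) triples;
    `follow` is the set of relationship types to traverse.
    """
    outgoing = {}
    for subject, rel_type, obj in relations:
        if rel_type in follow:
            outgoing.setdefault(subject, []).append(obj)
    selected = {seed}
    queue = deque([seed])
    while queue:
        concept = queue.popleft()
        for target in outgoing.get(concept, []):
            if target not in selected:
                selected.add(target)
                queue.append(target)
    return selected


# Illustrative triples with hypothetical relationship type names.
relations = [
    ("rice", "hasNarrower", "upland rice"),
    ("rice", "hasPest", "rice weevil"),
    ("wheat", "hasPest", "Hessian fly"),
]
with_pests = extract_kos("rice", relations, {"hasNarrower", "hasPest"})
hierarchy_only = extract_kos("rice", relations, {"hasNarrower"})
```

Without a differentiated pest relationship, the extraction based on hierarchy alone misses the rice weevil, which is the gap the paragraph above describes.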

Constructing such a hierarchy requires standard procedures of thesaurus development: semantic factoring and facet analysis followed by hierarchy building in each facet. Much of this process can be supported by computer. This should be organized as a collaboration of many groups inside and outside FAO supported by the FAO KDS described in Section 2 with Web data access and Web data entry. For example, the Thai AGROVOC group is eager to develop a rich ontology of Thai food, horticultural, and forestry plants. There is also a group in the US working on expert systems for processing documents in the domain of food and agriculture, and they are working on ontologies to support their AI programs.

Procedural details are discussed in Section E.

4 Recommendation 4. Create a crosswalk between major KOS in the FAO domain

Within financial constraints, collaborate with other institutions, in particular NAL, CABI, and CAAS, to create a crosswalk between major KOS in the FAO domain. This might evolve incrementally into a system modeled in functionality, but not in implementation, after the Unified Medical Language System of the US National Library of Medicine as the basis for the Agricultural Ontology Service. This should be linked seamlessly to an existing database of the taxonomy of living things.

5 Recommendation 5. Use powerful KOS management software

There are many requirements KOS management software (KMS) should meet; they are detailed in Appendix 6. A few requirements that are essential for FAO but absent from most KMS are highlighted here:

can handle an integrated database of many sources;
uses synonym relationships from many sources to compile "synonym sets" across sources and thus establish mappings between sources;
supports an extensible set of relationships, preferably with integrity constraints;
supports the creation, maintenance, and display of a meaningful hierarchical arrangement;
supports the rules-as-you-go approach for refining relationships.
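
Compiling synonym sets across sources can be sketched with a union-find structure: every synonym relationship from any source merges two terms into one set, and the resulting sets show which terms of one source correspond to which terms of another. The sketch assumes that identical term strings in different sources denote the same term, which a real KMS would need to verify because of homonyms. The source names are placeholders.

```python
def build_synonym_sets(synonym_pairs):
    """Merge (source, term_a, term_b) synonym relationships into synonym sets.

    Returns a list of dicts, one per synonym set, mapping each source to the
    terms it contributes to that set.
    """
    parent = {}
    sources_of = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for source, term_a, term_b in synonym_pairs:
        root_a, root_b = find(term_a), find(term_b)
        parent[root_a] = root_b
        for term in (term_a, term_b):
            sources_of.setdefault(term, set()).add(source)

    members = {}
    for term in list(parent):
        members.setdefault(find(term), []).append(term)

    synsets = []
    for terms in sorted(sorted(group) for group in members.values()):
        by_source = {}
        for term in terms:
            for source in sorted(sources_of[term]):
                by_source.setdefault(source, []).append(term)
        synsets.append(by_source)
    return synsets


pairs = [
    ("Source A", "maize", "corn"),
    ("Source B", "maize", "Zea mays"),
    ("Source B", "cassava", "manioc"),
]
synsets = build_synonym_sets(pairs)
```

Here the shared term maize links the two sources, so corn in Source A is mapped to Zea mays in Source B even though neither source states that relationship directly.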