

1 From thesauri to rich ontologies


1.1 The problem

Empowering end users in searching collections of ever increasing magnitudes with performance far exceeding plain free-text searching (as used in many Web search engines), and developing systems that not only find but also process information for action, require considerably more powerful - and complex - knowledge organization systems (KOS) than the classification schemes and thesauri that currently exist. Such systems must serve the following functions, among others:

All of these functions require semantic relations that are more expressive and nuanced than the few rudimentary categories and relationships found in traditional thesauri and classifications.

A typical scenario in information retrieval illustrates some of the shortcomings of current free-text search engines such as Google. A farmer is interested in finding out about rice and starts a search by entering the string 'rice'. The results returned in response to the query immediately indicate several problems. First, because the system performs the search based on the actual text string entered rather than on an interpretation of the meaning of the string, many irrelevant results are retrieved. This occurs because the query term itself is ambiguous: rice can refer to the grain, to the university in Houston, or to the name of an author, among others. Second, there are millions of results with no apparently meaningful arrangement. To find something of possible relevance, the user may need to click and scan page after page of the retrieved results. Finally, the user is stuck with the results that have been retrieved; to find other related resources, such as resources on rice cultivation, the user must start from the beginning again and formulate a different query, even though the new query corresponds to concepts related to the original query. The problem becomes evident: the biggest challenge in information retrieval is concept identification in a specific domain of interest.

In contrast, in a semantics-driven information retrieval system, the system would recognize, i.e. "understand", that the string 'rice' was ambiguous; it would then request clarification from the user as to which of the possible meanings was intended. Only then, after the user disambiguated the term, would the system execute the search. The system would then retrieve only those resources that had been semantically marked up (through manual or automatic indexing) with the concept of rice, no matter what words or even languages are used in the resources to refer to rice. Moreover, because the system is semantically rich, it not only presents results that are based on understanding the user's request, it also offers related concepts the user might not have thought of initially. Based on a <hasPest> relation, the system could display such concepts as rice weevil and rice moth. Searching on these latter concepts could in turn lead to concepts on pesticides used on rice, and so on. The system could retrieve not only information directly pertinent to the user's query but also help the user explore and clarify the information need and find useful related information. In this scenario, a KOS has two functions: assisting the user with exploring the topic of the query, and supporting intelligent automatic indexing (metadata assignment) through statistical and syntactic-semantic analysis and "understanding" of text; both functions require a KOS with a rich and precisely defined semantic structure.
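The scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept identifiers, the `Concept` class, the function names, and the French and Spanish lexicalizations are hypothetical and do not come from any existing KOS.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A concept with its lexicalizations and typed relations (illustrative only)."""
    label: str
    lexicalizations: set[str]
    relations: dict[str, list[str]] = field(default_factory=dict)


# Toy KOS; identifiers and contents are assumptions made for this sketch.
KOS = {
    "rice_grain": Concept("rice (cereal grain)", {"rice", "riz", "arroz"},
                          {"hasPest": ["rice_weevil", "rice_moth"]}),
    "rice_university": Concept("Rice University (Houston)", {"rice", "rice university"}),
    "rice_author": Concept("Rice (author name)", {"rice"}),
    "rice_weevil": Concept("rice weevil", {"rice weevil"}),
    "rice_moth": Concept("rice moth", {"rice moth"}),
}


def candidate_concepts(query: str) -> list[str]:
    """Return the ids of all concepts that have the query string as a lexicalization."""
    q = query.strip().lower()
    return sorted(cid for cid, c in KOS.items() if q in c.lexicalizations)


def interpret(query: str) -> tuple[str, list[str]]:
    """Ask for clarification when the query is ambiguous, otherwise search directly."""
    ids = candidate_concepts(query)
    return ("clarify" if len(ids) > 1 else "search", ids)


def related_concepts(concept_id: str, relation: str) -> list[str]:
    """Follow one typed relation to suggest concepts the user may not have thought of."""
    return KOS[concept_id].relations.get(relation, [])
```

Here `interpret("rice")` asks for clarification because three concepts share the string, while `interpret("arroz")` resolves to a single concept regardless of the language of the term. Once the user has chosen the grain, `related_concepts("rice_grain", "hasPest")` offers the pest concepts for further exploration.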

To accomplish these and other more sophisticated tasks, the new KOS must marry the conceptual structure of full-fledged ontologies - well-structured hierarchies of concepts connected through a rich network of detailed relations that support concept retrieval and reasoning - with the terminological richness of good thesauri. While existing KOSs, both large and small, do not provide the full set of precise concept relations needed for reasoning, they represent much intellectual capital. This paper explores the question of how this intellectual capital can be put to use in constructing full-fledged KOSs.

1.2 The relationship of traditional KOS to ontologies

Reengineering thesauri, classification schemes, etc., into ontologies means building on the information contained in them and refining that information as needed. Compare the relationships given in the ERIC Thesaurus (http://www.ericfacility.net/extra/pub/thesbrowse.cfm; ERIC = Educational Resources Information Center) with those given in a hypothetical ontology, as shown in Table 1.

Table 1: Statements and rules of a hypothetical ontology versus the information given in the ERIC thesaurus (broader term (BT), related term (RT))

Statements:

| ERIC Thesaurus | Hypothetical ontology |
|---|---|
| reading instruction BT instruction | reading instruction <isa> instruction |
| reading instruction RT reading | reading instruction <hasDomain> reading |
| reading instruction RT learning standards | reading instruction <governedBy> learning standards |
| reading ability BT ability | reading ability <isa> ability |
| reading ability RT reading | reading ability <hasDomain> reading |
| reading ability RT perception | reading ability <supportedBy> perception |

Rule 1

Instruction in a domain should consider ability in that domain:

X <shouldConsider> Y
IF X <isa> instruction AND X <hasDomain> W
AND Y <isa> ability AND Y <hasDomain> W

yields: The designer of reading instruction should also consider reading ability.

Rule 2

X <shouldConsider> Z
IF X <shouldConsider> Y
AND Y <supportedBy> Z

yields: The designer of reading instruction should also consider perception.

The inferences given rely on the detailed semantic relationships given in the ontology. But the ERIC thesaurus gives only some poorly defined broader term (BT) and related term (RT) relationships. These relationships are not differentiated enough to support inference.
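To make the role of the differentiated relationships concrete, the two rules of Table 1 can be run over the ontology statements with a naive forward-chaining loop. The following Python sketch is illustrative only: the triple representation and the function names are assumptions, not part of the ERIC thesaurus or of any particular ontology language.

```python
# Ontology statements from Table 1 as (subject, relation, object) triples.
facts = {
    ("reading instruction", "isa", "instruction"),
    ("reading instruction", "hasDomain", "reading"),
    ("reading instruction", "governedBy", "learning standards"),
    ("reading ability", "isa", "ability"),
    ("reading ability", "hasDomain", "reading"),
    ("reading ability", "supportedBy", "perception"),
}


def objects(facts, subject, relation):
    """All objects o such that (subject, relation, o) is a known statement."""
    return {o for s, r, o in facts if s == subject and r == relation}


def rule_1(facts):
    """X shouldConsider Y if X isa instruction, Y isa ability, and they share a domain W."""
    instructions = {s for s, r, o in facts if r == "isa" and o == "instruction"}
    abilities = {s for s, r, o in facts if r == "isa" and o == "ability"}
    return {
        (x, "shouldConsider", y)
        for x in instructions
        for y in abilities
        if objects(facts, x, "hasDomain") & objects(facts, y, "hasDomain")
    }


def rule_2(facts):
    """X shouldConsider Z if X shouldConsider Y and Y supportedBy Z."""
    return {
        (x, "shouldConsider", z)
        for x, r, y in facts if r == "shouldConsider"
        for z in objects(facts, y, "supportedBy")
    }


def infer(facts):
    """Apply both rules repeatedly until no new statement is derived."""
    known = set(facts)
    while True:
        new = (rule_1(known) | rule_2(known)) - known
        if not new:
            return known
        known |= new


inferred = infer(facts) - facts
```

With these statements, `inferred` holds the two conclusions given in Table 1: reading instruction <shouldConsider> reading ability, and reading instruction <shouldConsider> perception. Neither rule can fire on the ERIC side, because BT and RT do not say which related term is a domain and which is a supporting ability.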

For another example, consider the hypothetical ontological relationships and rules we could formulate with these relationships in an example taken from the AGROVOC thesaurus (described in detail in section 2) in Table 2.

Table 2: AGROVOC relationships compared with more differentiated relationships of a hypothetical ontology (narrower term (NT), broader term (BT))

| AGROVOC (undifferentiated hierarchical relationships) | Hypothetical ontology (differentiated relationships) |
|---|---|
| milk NT cow milk | milk <includesSpecific> cow milk |
| milk NT milk fat | milk <containsSubstance> milk fat |
| cow NT cow milk | cow <hasComponent> cow milk* |
| Cheddar cheese BT cow milk | Cheddar cheese <madeFrom> cow milk |

Rule 1

Part X <mayContainSubstance> Substance Y
IF Animal W <hasComponent> Part X
AND Animal W <ingests> Substance Y

Rule 2

Food Z <containsSubstance> Substance Y
IF Food Z <madeFrom> Part X
AND Part X <containsSubstance> Substance Y

* In the context of food and nutrition, it makes eminent sense to consider milk and egg as parts of an animal, since their nutritional value and safety depend on the nature of the animal and the feed it ingests, just as do those of skeletal meat and organ meat. This is an example of the careful definition of relationships.

From the statements and rules given in the ontology, a system could infer that Cheddar cheese <containsSubstance> milk fat and, if cows on a given farm are fed mercury-contaminated feed, that Cheddar cheese made from milk from these cows <mayContainSubstance> mercury. But the present AGROVOC Thesaurus (described in detail in section 2) gives only narrower term/broader term (NT/BT) relationships without differentiation.
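The two inferences can be traced in a short Python sketch. It rests on assumptions that go beyond Table 2 and are labeled in the code: the mercury scenario is simplified to a single statement about cows; an inheritance step lets cow milk inherit <containsSubstance> milk fat from milk through <includesSpecific>; and a "may" variant of Rule 2 carries <mayContainSubstance> from the part to the food. The helper names are hypothetical.

```python
# Statements from Table 2, plus one scenario fact (a simplification of
# "cows on a given farm are fed mercury-contaminated feed").
facts = {
    ("milk", "includesSpecific", "cow milk"),
    ("milk", "containsSubstance", "milk fat"),
    ("cow", "hasComponent", "cow milk"),
    ("Cheddar cheese", "madeFrom", "cow milk"),
    ("cow", "ingests", "mercury"),
}


def shared_subject(facts, rel_a, rel_b, derived):
    """(W rel_a X) and (W rel_b Y)  =>  (X derived Y)."""
    return {(x, derived, y)
            for w1, ra, x in facts if ra == rel_a
            for w2, rb, y in facts if rb == rel_b and w1 == w2}


def chain(facts, rel_a, rel_b, derived):
    """(Z rel_a X) and (X rel_b Y)  =>  (Z derived Y)."""
    return {(z, derived, y)
            for z, ra, x1 in facts if ra == rel_a
            for x2, rb, y in facts if rb == rel_b and x1 == x2}


def infer(facts):
    """Apply all rules repeatedly until nothing new is derived."""
    known = set(facts)
    while True:
        new = (
            # Assumed inheritance step: a specific kind contains what the general kind contains.
            shared_subject(known, "includesSpecific", "containsSubstance", "containsSubstance")
            # Rule 1 of Table 2.
            | shared_subject(known, "hasComponent", "ingests", "mayContainSubstance")
            # Rule 2 of Table 2.
            | chain(known, "madeFrom", "containsSubstance", "containsSubstance")
            # Assumed "may" variant of Rule 2.
            | chain(known, "madeFrom", "mayContainSubstance", "mayContainSubstance")
        ) - known
        if not new:
            return known
        known |= new


inferred = infer(facts) - facts
```

Under these assumptions, `inferred` contains both Cheddar cheese <containsSubstance> milk fat and Cheddar cheese <mayContainSubstance> mercury, along with the intermediate statements about cow milk.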

The limitations of existing KOS can be summarized as follows:

To overcome these limitations and enable more powerful searching and intelligent information processing, especially as such capabilities can be made more widely available through the Web, traditional KOSs must be reengineered into KOSs that contain domain concepts linked through a rich network of well-defined relationships and a rich set of terms identifying these concepts. A concept can be represented by many different terms (words or phrases) in multiple languages. This paper refers to terms as lexicalizations of a concept. One term can identify several concepts (homonymy) and one concept can have multiple synonymous terms. A concept is conveyed by all its lexicalizations, the domain it occurs in, and by its relationships to other concepts. In addition, valid rules and constraints need to be specified to provide additional generalizations over sets of related concepts and to support inference. These systems must also be converted to machine-processable formats based on Web technologies like XML which tag the vocabularies in a standardized way.

In contrast to traditional KOSs, ontologies provide conceptual abstraction and differentiated relationships. Ontologies specifically separate concepts from lexicalizations and thereby better reflect the structure of human understanding of a domain. In ontologies, the semantics are developed through ensuring that each concept within the domain is uniquely and precisely defined and by specifying elaborated relationships among the concepts. The relationships in an ontology are explicitly named and developed with specification of rules and constraints so that they reflect the context of the domain for which the knowledge is modeled.

Given their more precise and unambiguous semantics, ontologies allow further knowledge to be inferred from the knowledge explicitly represented in the ontology. The new (implicit) knowledge could be derived by applying generalization or transitivity rules, the level of applicability of which is limited in a poorly defined KOS like a traditional thesaurus. This added knowledge in the ontology makes it powerful when employed for intelligent information processing. Although there is a huge cost involved in moving from thesauri to ontologies, there is an expectation that the added power of consistency, precision, and completeness will be worth the investment even though reliable numbers on the return on investment (ROI) of ontology development are hard to come by.
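A small Python sketch may illustrate why transitivity is safe over a well-defined relationship but of limited use over an undifferentiated one. The <isa> chain below is hypothetical (the concept "animal product" is an assumption made for illustration); the NT pairs are those of Table 2, with Cheddar cheese BT cow milk read as cow milk NT Cheddar cheese.

```python
def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until nothing changes."""
    closure = set(pairs)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


# Hypothetical, precisely defined <isa> statements (subject, broader concept).
isa = {("cow milk", "milk"), ("milk", "animal product")}

# Undifferentiated NT statements from Table 2 (broader term, narrower term).
nt = {
    ("milk", "cow milk"),
    ("milk", "milk fat"),
    ("cow", "cow milk"),
    ("cow milk", "Cheddar cheese"),
}

isa_closure = transitive_closure(isa)
nt_closure = transitive_closure(nt)
```

The <isa> closure adds the sound statement that cow milk is an animal product. The NT closure, by contrast, makes Cheddar cheese a narrower term of cow, a statement with no clear meaning, because NT mixes kind-of, part-of, and made-from relationships.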

1.3 Potential benefits of future generation KOSs

For emerging KOSs to satisfy user needs, they must improve both information organization and retrieval in a way that was not possible with traditional KOSs. The following potential benefits are expected from such systems:

To be effective tools for information categorization, integration, and retrieval, ontologies should be multilingual, domain-specific, and cross-disciplinary at the same time. For maximum application potential, they should be developed in a non-proprietary, application-independent, and machine-processable format to ensure interoperability among different systems.

1.4 The process of reengineering: The rules-as-you-go approach

Reengineering a thesaurus into an ontology entails refining thesaurus relationships, a laborious process. The steps in the process are:

1. Define the ontology structure

2. Fill in values from one or more legacy KOS to the extent possible

3. Edit manually using an ontology editor:

   a. make existing information more precise
   b. add new information

Step 1 is addressed in section 3, which gives an overall conceptual model at a high level of abstraction, and in section 4, which begins defining a set of relationship types for the food and agriculture domain by examining the relationships found in AGROVOC.

Step 3 is the most laborious. We have plans to streamline this process by implementing intelligent conversion using a "rules as you go" approach. The idea is as follows: The KOS editor watches out for patterns; based on these patterns the editor formulates rules that can be applied immediately to all subsequent similar cases as illustrated in the following:

1. An editor has determined that
cow NT cow milk should become cow <hasComponent> cow milk

2. She recognizes that this is an example of the general pattern
animal <hasComponent> milk (or, even more general animal <hasComponent> body part)

3. Given this pattern, the system can derive automatically that
goat NT goat milk should become goat <hasComponent> goat milk,
since goat is an animal and goat milk ends with the word milk and thus can be seen to be a type of milk.

To automate this approach even more, we plan to build an inventory of patterns such as animal <hasComponent> body part, augmented by an ontology that specifies the concepts of type animal (cow, goat, sheep, horse, chicken, etc.) and the concepts of type body part (skeletal meat part, liver, bone, milk, egg, etc.). This information would be drawn from AGROVOC itself and other sources, such as Langual, UMLS, and even WordNet. The system can then detect the applicability of these patterns, at least once it has seen one example transformed by an ontology editor. The ontology editors will add to the pattern inventory incrementally.
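The "rules as you go" idea can be sketched as follows. This is a minimal illustration, not the planned system: the `Pattern` class, the type inventory, and the last-word heuristic for typing multi-word terms (goat milk is typed by its final word, milk) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pattern:
    """A refinement pattern generalized from one editor decision."""
    thesaurus_rel: str   # e.g. "NT"
    subject_type: str    # e.g. "animal"
    object_type: str     # e.g. "body part"
    ontology_rel: str    # e.g. "hasComponent"


# Hypothetical type inventory; in the plan above it would be drawn from
# AGROVOC and other sources.
TYPES = {
    "cow": "animal", "goat": "animal", "sheep": "animal",
    "milk": "body part", "liver": "body part", "egg": "body part",
}


def entity_type(term: str) -> Optional[str]:
    """Type of a term; multi-word terms fall back to the type of their last word."""
    return TYPES.get(term) or TYPES.get(term.split()[-1])


def generalize(subject, thesaurus_rel, obj, ontology_rel) -> Optional[Pattern]:
    """Turn one editor decision into a pattern, if both terms can be typed."""
    s_type, o_type = entity_type(subject), entity_type(obj)
    if s_type is None or o_type is None:
        return None
    return Pattern(thesaurus_rel, s_type, o_type, ontology_rel)


def refine(subject, thesaurus_rel, obj, patterns):
    """Return the refined statement if a known pattern applies, else None."""
    for p in patterns:
        if (p.thesaurus_rel == thesaurus_rel
                and entity_type(subject) == p.subject_type
                and entity_type(obj) == p.object_type):
            return (subject, p.ontology_rel, obj)
    return None  # no pattern applies; the case is left to the human editor


# The editor decides once: cow NT cow milk becomes cow <hasComponent> cow milk.
patterns = [generalize("cow", "NT", "cow milk", "hasComponent")]
```

After this single decision, `refine("goat", "NT", "goat milk", patterns)` yields goat <hasComponent> goat milk, while `refine("milk", "NT", "milk fat", patterns)` returns `None` and is left for the editor. A purely type-based match of this kind would also accept goat NT sheep milk, so proposed refinements would still call for human review.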

These patterns are a special type of constraint. Other constraints can be formulated and used to limit the options presented to the human editor as thesaurus relationships are refined. The bases for such constraints are the thesaurus relationships, on the one hand, and the entity types of the concepts involved, on the other. Table 3 shows some examples of constraints based on thesaurus relationships.

Table 3: Some relationship constraints

| Thesaurus relationship | Possible ontology relationships |
|---|---|
| NT/BT | <hasMember>/<memberOf> |
| | <includesSpecific>/<isa> |
| | <hasComponent>/<componentOf> |
| | <spatiallyIncludes>/<spatiallyIncludedIn> |
| | etc. |
| RT | <similarTo> |
| | <growsIn>/<EnvironmentForGrowing> |
| | <treatmentFor>/<treatedWith>? |
| | <hasMember>/<memberOf> |
| | etc. |

Note that the RT relationship often transforms into relationships that are not symmetric.

Note further that in a well-constructed thesaurus, an RT should not resolve into an <isa> relationship. In practice, however, the RT relationship has been used to express this relationship. This can be taken as further evidence of the weak definition of relationships in many thesauri.

This inventory will constrain the available choices when manually refining a thesaurus relationship to a more specific ontology relationship. Of course, an authorized ontology editor can override such constraints and thereby update the relationship table. Whenever a relationship is added or refined, the inverse relationship is automatically added or refined as well.
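A constraint table of this kind, together with the automatic handling of inverses, might look as follows in Python. The sketch encodes only the pairs listed in Table 3; treating <similarTo> as its own inverse is an assumption, and the function names and the `override` flag are hypothetical. Updating the relationship table after an override is left out.

```python
# (relationship, inverse) pairs from Table 3.
HIERARCHICAL = [
    ("hasMember", "memberOf"),
    ("includesSpecific", "isa"),
    ("hasComponent", "componentOf"),
    ("spatiallyIncludes", "spatiallyIncludedIn"),
]
ASSOCIATIVE = [
    ("similarTo", "similarTo"),  # assumed symmetric; Table 3 lists no inverse
    ("growsIn", "EnvironmentForGrowing"),
    ("treatmentFor", "treatedWith"),
    ("hasMember", "memberOf"),
]

# Lookup of each relationship's inverse, in both directions.
INVERSE = {}
for fwd, inv in HIERARCHICAL + ASSOCIATIVE:
    INVERSE[fwd] = inv
    INVERSE[inv] = fwd


def allowed_refinements(thesaurus_rel):
    """Ontology relationships offered to the editor for a thesaurus relationship."""
    if thesaurus_rel == "NT":
        return [fwd for fwd, _ in HIERARCHICAL]
    if thesaurus_rel == "BT":
        return [inv for _, inv in HIERARCHICAL]
    if thesaurus_rel == "RT":
        # RT does not fix a direction, so both members of each pair are offered.
        return list(dict.fromkeys(r for pair in ASSOCIATIVE for r in pair))
    raise ValueError(f"unknown thesaurus relationship: {thesaurus_rel!r}")


def refine(ontology, subject, thesaurus_rel, obj, chosen, override=False):
    """Add the chosen relationship and, when known, its inverse."""
    if not override and chosen not in allowed_refinements(thesaurus_rel):
        raise ValueError(f"{chosen!r} is not an allowed refinement of {thesaurus_rel}")
    ontology.add((subject, chosen, obj))
    inverse = INVERSE.get(chosen)
    if inverse is not None:
        ontology.add((obj, inverse, subject))
```

Refining cow NT cow milk to `hasComponent` adds both cow <hasComponent> cow milk and cow milk <componentOf> cow. Refining an RT into `isa` is rejected unless an authorized editor passes `override=True`, in line with the note on RT and <isa> above.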

