

3 Conceptual model: Combining thesauri and ontologies


This section introduces a conceptual model that provides the structure needed to create precise semantics and so facilitate the transition from traditional thesauri to ontologies. Figure 1 shows the high-level conceptual model we propose. Its chief characteristic is a clear separation of the concept level, the term or lexicalization level, and the string level. Present thesauri give a more or less muddled representation of information about concepts and information about terms; the proposed structure keeps concept information and term information apart. This model owes much to the structure of the Unified Medical Language System (UMLS).

Figure 1. Conceptual model for combining thesauri and ontologies

3.1 The basic model

The following is just the broad outline of the model. Many more types of information could be added. In any event, we consider the model extensible. On the other hand, not all applications will use all features of the model. For example, our model provides for relationships between notes (for example, as hypertext links). This is not possible in all environments but very useful in some. Our intent is to present a framework that can be used for the simplest thesaurus or the most complex and rich ontology in a format that communicates equally to thesaurus and ontology editors with a background in information science, artificial intelligence, or linguistics.

Each concept, term, and string can be assigned an identifier, preferably a Uniform Resource Identifier (URI); for concepts, UMLS uses Concept Unique Identifiers (CUI), while the Topic Map Standard uses unique subject identifiers. Unique concept identifiers allow unambiguous reference to concepts, whereas terms are often ambiguous. Concepts can furthermore be assigned notations (such as class numbers in the Dewey Decimal Classification; notations are also called term numbers); notations can be used to maintain a logical, meaningful sequence in hierarchical displays.

Concepts take center stage in our proposed thesaurus/ontology information model; accordingly, relationships between concepts are central. Concepts are arranged in hierarchies and have additional relationships to other concepts in the network; a hierarchy can be defined on any weak ordering relationship including isa, part-whole, spatial containment, etc. (the relationship must be transitive and not symmetric, and must have an inverse relationship; for example, <componentOf> is the inverse of <hasComponent>). There are many other relationship types, such as <causes>; a scheme of relationship types needs to be defined for the domain of the respective thesaurus. One source for finding relationship types is the detailed analysis of concept relationships present in the thesaurus that is to be reengineered into a richer ontology (see section 4). Each concept should be assigned to an entity type or facet, such as process, function, substance, living organism (see, for example, the semantic types in the UMLS Semantic Network); the type of a concept constrains its participation in relationships.
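The two requirements on a hierarchical relationship (transitivity and an inverse) can be sketched in a few lines of Python. This is a minimal illustration, not part of the proposed model: the class `ConceptNetwork`, the `INVERSE` table, and the tractor/engine/piston concepts are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical relationship scheme: each hierarchical type names its inverse.
INVERSE = {"hasComponent": "componentOf", "componentOf": "hasComponent"}


class ConceptNetwork:
    def __init__(self):
        # edges[rel][source] -> set of direct targets
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, source, rel, target):
        """Store a statement and, if the type has an inverse, the inverse too."""
        self.edges[rel][source].add(target)
        inverse = INVERSE.get(rel)
        if inverse is not None:
            self.edges[inverse][target].add(source)

    def closure(self, source, rel):
        """All concepts reachable from source via rel (uses transitivity)."""
        seen, stack = set(), [source]
        while stack:
            node = stack.pop()
            for nxt in self.edges[rel][node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Illustrative concepts only; they are not taken from AGROVOC.
net = ConceptNetwork()
net.add("tractor", "hasComponent", "engine")
net.add("engine", "hasComponent", "piston")

components_of_tractor = net.closure("tractor", "hasComponent")
wholes_of_piston = net.closure("piston", "componentOf")
```

Storing the inverse statement at insertion time means a hierarchy can be walked in either direction without a separate lookup; a real system might instead derive inverses at query time.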

A concept is designated or represented by one or more lexicalizations or terms in one or more languages; this is the linkage between the concept level and the term level. For examples see Table 4.

Table 4. Concepts, terms, strings (concept and term numbers are fictitious and used only for illustration)

| Concept ID | Term ID | Strings manifesting the term |
|---|---|---|
| AGROVOC:C316301 | AGROVOC:T657210 | bovine spongiform encephalopathy, BSE |
| AGROVOC:C316301 | AGROVOC:T657211 | mad cow disease, Mad Cow Disease, MCD |
| AGROVOC:C316301 | AGROVOC:T734567 | encéphalopathie spongiforme bovine, ESB |
| AGROVOC:C316301 | AGROVOC:T734566 | maladie de la vache folle, MVF |
| AGROVOC:C316301 | AGROVOC:T700345 | encefalopatía espongiforme bovina, EEB |
| AGROVOC:C316301 | AGROVOC:T700346 | enfermedad de la vaca loca, EVL |
| AGROVOC:C014593 | AGROVOC:T187953 | plow, plows, plough, ploughs, plogh [a frequent misspelling] |
| AGROVOC:C014593 | AGROVOC:T498001 | charrue |
| AGROVOC:C014593 | AGROVOC:T498002 | matériel de labour |

If a term is a homonym (designates more than one concept), several disambiguated terms are introduced. The homonym is linked to each of the disambiguated terms, and each disambiguated term is linked to the corresponding concept. Two terms designating the same concept are called synonyms. Alternatively, if one does not agree that concepts per se exist, one can simply view "concept" as a convenient shorthand for an equivalence class of terms that are linked by the <hasSynonym> relationship, such as the synsets in WordNet. A knowledge organization system (KOS) may select a preferred term as the term used to represent the concept, or it may make that choice dependent on the audience (for example, veterinarians versus farmers).
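The view of a concept as an equivalence class of terms can be made concrete with a small union-find routine. The sketch below is only an illustration under the assumption that <hasSynonym> is treated as symmetric and transitive; the function name `equivalence_classes` and the sample terms are chosen for the example.

```python
def equivalence_classes(items, pairs):
    """Group items into classes given symmetric, transitive links (union-find)."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the class representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    classes = {}
    for item in items:
        classes.setdefault(find(item), set()).add(item)
    # Sort for a deterministic result.
    return sorted(classes.values(), key=lambda c: sorted(c))


terms = ["bovine spongiform encephalopathy", "mad cow disease", "plow"]
has_synonym = [("bovine spongiform encephalopathy", "mad cow disease")]

concepts_as_classes = equivalence_classes(terms, has_synonym)
```

Each resulting set plays the role of one "concept"; a term with no synonym forms a class of its own.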

Terms can be connected through many relationships such as <hasSynonym> (with <hasScientificName> as a special case), <hasAntonym>, <hasCognate> (term in a different language from the same root), and <hasTranslation>. One might think that the synonym and translation relationships are not needed, since all terms linked to the same concept would be synonyms or translations. However, two terms may be linked to the same concept yet be used in different contexts, i.e. they are not strict synonyms. If several English terms and several French terms are linked to a concept, it is not true that just any of the French terms is a good translation for a given English term (see the examples in Table 4). Another example of a term-specific relationship is <hasAntonym>. For example, big and small designate opposite concepts but are not antonyms. (The antonym pairs are big versus little and large versus small; see WordNet.)

Finally, a term is manifested in one or more strings, as shown in Table 4. Strings can be connected through relationships such as <hasCaseVariant>, <hasSpellingVariant>, <hasAbbreviationOrAcronym>, and <pluralOf>/<singularOf>, which are all subordinate to the broader relationship <hasStringVariant>. A term can be seen as a convenient shorthand for an equivalence class of strings that are linked by the <hasStringVariant> relationship. A KOS may select a preferred variant as the string used to represent the term, or it may make that choice dependent on the audience (as in British versus US spellings). A string, especially an acronym, may belong to several terms, in which case it needs to be disambiguated.
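As a rough sketch of how the three levels fit together, the following Python fragment encodes a few rows of Table 4 and builds an index from strings back to terms and concepts. The class names, the language codes, and the function `build_string_index` are assumptions made for this illustration; the identifiers are the fictitious ones from Table 4.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str
    language: str  # language code is an assumption; Table 4 does not list one
    strings: list  # strings manifesting the term


@dataclass
class Concept:
    concept_id: str
    terms: list = field(default_factory=list)


bse = Concept("AGROVOC:C316301", [
    Term("AGROVOC:T657210", "en", ["bovine spongiform encephalopathy", "BSE"]),
    Term("AGROVOC:T657211", "en", ["mad cow disease", "Mad Cow Disease", "MCD"]),
    Term("AGROVOC:T734566", "fr", ["maladie de la vache folle", "MVF"]),
])


def build_string_index(concepts):
    """Map each string to the (concept ID, term ID) pairs it manifests."""
    index = {}
    for concept in concepts:
        for term in concept.terms:
            for s in term.strings:
                index.setdefault(s, []).append((concept.concept_id, term.term_id))
    return index


index = build_string_index([bse])
mcd_hits = index["MCD"]
```

A string that belongs to several terms would show up in this index with more than one pair, which is the point at which disambiguation is needed.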

In addition, a concept, a lexicalization/term, a string, or a relationship type can have several types of notes (definitions, usage notes, comments, images, etc.) in different languages (in the case of multilingual thesauri). Just like concepts and terms, notes can be related to each other through relationships such as <hasTranslation>, <hasSimplifiedVersion>, <hasOtherDefinition>, or any other type of hyperlink. Many other pieces of information about terms can be added, for example, case frames for verbs (in case the verb has a case frame different from the case frame for the corresponding action concept), register (see below), or whether the term is the preferred term for the concept. Administrative data can be accommodated as well.

Relationship types themselves can form a relationship hierarchy (i.e. a relationship of relationships), in which more generic relationships are higher in the hierarchy than more specific relationships; for example, <componentOf> is a specific kind of <partOf> relationship.
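One practical consequence of a relationship hierarchy is that a statement made with a specific type also answers queries for its broader types. The following sketch illustrates this under assumptions: the table `BROADER_TYPE`, the type <ingredientOf>, and the piston/engine statement are hypothetical.

```python
# Hypothetical relationship-type hierarchy: specific type -> broader type.
BROADER_TYPE = {"componentOf": "partOf", "ingredientOf": "partOf"}


def generalizations(rel):
    """Return rel followed by every broader relationship type above it."""
    chain = [rel]
    while chain[-1] in BROADER_TYPE:
        chain.append(BROADER_TYPE[chain[-1]])
    return chain


def holds(statements, source, rel, target):
    """True if a stored statement between source and target specializes rel."""
    return any(
        s == source and t == target and rel in generalizations(r)
        for s, r, t in statements
    )


statements = [("piston", "componentOf", "engine")]

is_part = holds(statements, "piston", "partOf", "engine")
is_ingredient = holds(statements, "piston", "ingredientOf", "engine")
```

The query for <partOf> succeeds because <componentOf> specializes it, while the sibling type <ingredientOf> does not match.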

Why define concepts, terms, and strings as separate entity types?

First, each of these entity types takes different types of information. Conceptual relationships and other information are associated with concepts. Linguistic information, such as part of speech, how a term combines with other terms into sentences, usage, or etymology, is associated with terms. Information such as the fact that a string is an acronym is associated with strings. Usage information may sometimes be associated with strings; for example, lay people may commonly use a slang abbreviation while professionals use the full string. Definitions are primarily associated with concepts but may also be associated with terms.

Second, this distinction avoids confusion. In a standard thesaurus like AGROVOC, for each concept that is to be used in indexing and searching, a preferred term, and for that term a preferred string, is selected; this string is the descriptor. Non-descriptors are linked only to descriptors, not among themselves. As a result, BSE, mad cow disease, and MCD [which we made up for illustration] are all linked to bovine spongiform encephalopathy as synonyms (or, in some thesauri, as synonyms and as abbreviations). But the information that BSE belongs with bovine spongiform encephalopathy and MCD with mad cow disease is lost. Furthermore, if decisions on terms are made (for example, omitting mad cow disease as a non-scientific name), these decisions should apply to all term variants, in the example MCD as well.
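The loss of information in the flat arrangement, and the way a term-level decision carries over to all string variants, can be illustrated as follows. This is a simplified sketch: the dictionaries and the function `drop_term` are assumptions for the example, and the term numbers are the fictitious ones from Table 4.

```python
DESCRIPTOR = "bovine spongiform encephalopathy"

# Flat thesaurus: every non-descriptor points only at the descriptor,
# so nothing records that MCD abbreviates "mad cow disease".
flat_use = {"BSE": DESCRIPTOR, "mad cow disease": DESCRIPTOR, "MCD": DESCRIPTOR}

# Three-level model: strings are grouped under the term they manifest.
terms = {
    "T657210": ["bovine spongiform encephalopathy", "BSE"],
    "T657211": ["mad cow disease", "MCD"],
}


def drop_term(terms, term_id):
    """Remove a term; all of its string variants go with it."""
    return {tid: strings for tid, strings in terms.items() if tid != term_id}


remaining = drop_term(terms, "T657211")
surviving_strings = [s for strings in remaining.values() for s in strings]
```

In the flat mapping, removing "mad cow disease" would leave MCD behind unless an editor happened to remember the connection; in the grouped structure the variant disappears together with its term.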

3.2 Model extensions

As was mentioned above, many more types of information could be added to concepts, terms, strings, notes, and relationships. For example, we might specify an audience (general lay public, K-12 students by grade level, university students, experts), a subject domain, a scope (as in Topic Maps), or a specially selected subset of concepts and terms to be used for a given application, or all concepts and terms taken from a given source.

Scopes could be defined in many ways. For example, one might define a scope as the conceptual system embedded and expressed in a language (whereas the link from terms and notes to language simply refers to the surface form). Consider the conceptual system underlying Dyirbal (an Australian indigenous language); one of its noun classifiers includes women, fire, and dangerous things (Lakoff 1987). A native speaker of English would find this classifier and the corresponding <isa> relationships very curious. Thus one would introduce the category and the <isa> relationships with a scope of the Dyirbal conceptual system. (Incidentally, the grouping makes sense in context: fire is dangerous; fire is sometimes started by, or at any rate related to, the sun; and the gender of the noun for sun is female.) Many such problems, though more subtle, occur in thesauri for international use.

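A subvocabulary can be extracted using any type of information about concepts, terms, strings, and relationships that is available in the thesaurus. Thus one could extract as a subvocabulary, for example, all concepts and terms for a given audience, subject domain, scope, or source (the criteria introduced above).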

Each of these subvocabularies provides a specific view on the entire KOS for a given purpose. In online implementations such subvocabularies can be created on the fly or defined as views for certain user groups. But subvocabularies can also be printed or exported (for example, a subvocabulary extracted as the personal KOS of a researcher who maintains an information retrieval system on his or her own computer).
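Creating such a view on the fly amounts to filtering the KOS by some attribute. The sketch below assumes hypothetical term records with `audience` and `source` attributes; neither the record layout nor the function `extract_subvocabulary` comes from the model itself.

```python
# Hypothetical term records; "audience" and "source" are illustrative attributes.
records = [
    {"term": "bovine spongiform encephalopathy", "concept": "AGROVOC:C316301",
     "audience": {"experts"}, "source": "AGROVOC"},
    {"term": "mad cow disease", "concept": "AGROVOC:C316301",
     "audience": {"general public"}, "source": "AGROVOC"},
    {"term": "plow", "concept": "AGROVOC:C014593",
     "audience": {"general public", "experts"}, "source": "other"},
]


def extract_subvocabulary(records, predicate):
    """Return the records for which predicate is true (a view, not a copy of the KOS)."""
    return [r for r in records if predicate(r)]


lay_view = extract_subvocabulary(records, lambda r: "general public" in r["audience"])
lay_terms = [r["term"] for r in lay_view]

agrovoc_view = extract_subvocabulary(records, lambda r: r["source"] == "AGROVOC")
```

Because the selection criterion is passed in as a function, the same routine serves for views by audience, by source, or by any other recorded attribute.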

3.3 Limitations

The separation into the concept layer and the term layer is appealing for its simplicity and elegance but it is somewhat of an oversimplification. Terms, particularly terms in different languages, rarely mean exactly the same thing. So the question arises as to when to map two terms to the same concept - and possibly explain shades of meaning and associations in the definition of each term that complements the definition of the concept - and when to create two closely related concepts, possibly under one broader concept. Our model permits any type of relationship between terms. Thus it is possible to introduce conceptually motivated relationships between terms that more accurately reflect the reality of language than the mapping of terms to "concepts". These two representations of conceptual information can coexist within the same system.

3.4 Implementation

All relationships from all layers (concept, term, string) can be stored in the same format within a database. The type of each element should be explicitly given to enable integrity constraints (so that the relationship <hasSpellingVariant> is not allowed between two concepts, for example). A concept can be identified by a URI or other number (cleanest solution) or by its preferred term in the base language of the thesaurus (the term being typed as preferred). Likewise, a term can be identified by a URI or other number (cleanest solution) or by its preferred string (the string being typed as preferred). The same holds for strings. The main difference from implementations in most existing thesaurus management software is that relationships between non-descriptors are allowed. Thoughts for an XML/RDF schema for KOS data are presented in the Appendix.
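A uniform statement store with typed elements and an integrity check might look like the following sketch. It is one possible arrangement under stated assumptions, not the schema from the Appendix: the class `KosStore`, the `SIGNATURES` table, the relationship names other than <hasSynonym> and <hasSpellingVariant>, and the string identifiers are all hypothetical.

```python
# Hypothetical signatures: relationship type -> (source layer, target layer).
SIGNATURES = {
    "hasBroaderConcept": ("concept", "concept"),
    "hasLexicalization": ("concept", "term"),
    "hasSynonym": ("term", "term"),
    "hasSpellingVariant": ("string", "string"),
}


class KosStore:
    """All layers share one statement table; element types are explicit."""

    def __init__(self):
        self.layer_of = {}    # element ID -> "concept" | "term" | "string"
        self.statements = []  # (source ID, relationship type, target ID)

    def add_element(self, element_id, layer):
        self.layer_of[element_id] = layer

    def relate(self, source, rel, target):
        """Store a statement only if both ends have the layers rel requires."""
        expected = SIGNATURES[rel]
        actual = (self.layer_of[source], self.layer_of[target])
        if actual != expected:
            raise ValueError(f"<{rel}> needs {expected}, got {actual}")
        self.statements.append((source, rel, target))


store = KosStore()
store.add_element("AGROVOC:C014593", "concept")
store.add_element("AGROVOC:C316301", "concept")
store.add_element("S:plow", "string")      # made-up string identifiers
store.add_element("S:plough", "string")

store.relate("S:plow", "hasSpellingVariant", "S:plough")  # string to string: accepted
```

Attempting `store.relate("AGROVOC:C014593", "hasSpellingVariant", "AGROVOC:C316301")` raises `ValueError`, which is the kind of constraint the explicit typing is meant to enable.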

3.5 Related approaches

The proposed conceptual model integrates well with Web standardization efforts such as RDFS. The proposed structure covers all aspects of the RDFS-compatible Thesaurus Interchange Format proposed by Matthews et al. (2002), which will appear as a W3C note. That proposal is being developed in the context of the SWAD-Europe project. The Appendix presents another approach for representing ontology and thesaurus data in XML/RDFS.

