3 Conceptual model: Combining thesauri and ontologies

This section introduces a conceptual model that provides the structure needed to create precise semantics and thereby facilitate the transition from traditional thesauri to ontologies. Figure 1 shows the high-level conceptual model we propose. Its chief characteristic is a clear separation of the concept level, the term or lexicalization level, and the string level. Present thesauri give a more or less muddled representation of information about concepts and information about terms; the proposed structure keeps the two apart. This model owes much to the structure of the UMLS.

Figure 1. Conceptual model for combining thesauri and ontologies

3.1 The basic model

The following is just the broad outline of the model. Many more types of information could be added. In any event, we consider the model extensible. On the other hand, not all applications will use all features of the model. For example, our model provides for relationships between notes (for example, as hypertext links). This is not possible in all environments but very useful in some. Our intent is to present a framework that can be used for the simplest thesaurus or the most complex and rich ontology in a format that communicates equally to thesaurus and ontology editors with a background in information science, artificial intelligence, or linguistics.

A concept encapsulates meaning.
A concept can be represented or designated by one or more linguistic expressions, namely terms or lexicalizations which can be single words or multi-word phrases (or composite words in agglutinative languages).
A term, in turn, can take variant forms (singular/plural, variations in case, spelling variants, abbreviations, acronyms); so just as a concept can have many lexical representations, a term can have many string manifestations.

Each concept, term, and string can be assigned an identifier, preferably a Uniform Resource Identifier (URI); for concepts, UMLS uses Concept Unique Identifiers (CUI), while the Topic Map Standard uses unique subject identifiers. Using unique concept identifiers allows for unambiguous reference to concepts, as opposed to often ambiguous terms. Concepts can furthermore be assigned notations (such as class numbers in the Dewey Decimal Classification; notations are also called term numbers); notations can be used to maintain a logical, meaningful sequence in hierarchical displays.

Concepts take center stage in our proposed thesaurus/ontology information model; accordingly, relationships between concepts are central. Concepts are arranged in hierarchies and have additional relationships to other concepts in the network; a hierarchy can be defined on any weak ordering relationship including isa, part-whole, spatial containment, etc. (the relationship must be transitive and not symmetric, and it must have an inverse relationship; for example, <componentOf> is the inverse relationship of <hasComponent>). There are many other relationship types, such as <causes>; a scheme of relationship types needs to be defined for the domain of the respective thesaurus. One source for finding relationship types is the detailed analysis of concept relationships present in the thesaurus that is to be reengineered into a richer ontology (see section 4). Each concept should be assigned to an entity type or facet, such as process, function, substance, living organism (see, for example, the semantic types in the UMLS Semantic Network); the type of a concept constrains its participation in relationships.
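The requirements on a hierarchical relationship (transitive, not symmetric, paired with an inverse) can be illustrated with a minimal Python sketch. It is not part of the model itself: the concept names and the functions build_index and closure are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Direct <componentOf> assertions as (part, whole) pairs.
# The concept names are made up for illustration.
COMPONENT_OF = [
    ("plowshare", "plow"),
    ("moldboard", "plow"),
    ("share point", "plowshare"),
]

def build_index(pairs):
    """Index a relationship and its inverse from one set of assertions."""
    forward, inverse = defaultdict(set), defaultdict(set)
    for part, whole in pairs:
        forward[part].add(whole)    # part <componentOf> whole
        inverse[whole].add(part)    # whole <hasComponent> part
    return forward, inverse

def closure(index, start):
    """All nodes reachable from start, i.e. the transitive closure."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in index.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

component_of, has_component = build_index(COMPONENT_OF)
wholes_of_share_point = closure(component_of, "share point")
parts_of_plow = closure(has_component, "plow")
```

Storing only one direction and deriving the inverse keeps the two relationships consistent by construction; a real system would also have to reject assertions that introduce cycles.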

A concept is designated or represented by one or more lexicalizations or terms in one or more languages; this is the linkage between the concept level and the term level. For examples see Table 4.

Table 4. Concepts, terms, strings (concept and term numbers are fictitious and used only for illustration)
Concept ID	Term ID	Strings manifesting the term
AGROVOC:C316301	AGROVOC:T657210	bovine spongiform encephalopathy, BSE
AGROVOC:C316301	AGROVOC:T657211	mad cow disease, Mad Cow Disease, MCD
AGROVOC:C316301	AGROVOC:T734567	encephalopathie spongiforme bovine, ESB
AGROVOC:C316301	AGROVOC:T734566	maladie de la vache folle, MVF
AGROVOC:C316301	AGROVOC:T700345	encefalopatia espongiforme bovina, EEB
AGROVOC:C316301	AGROVOC:T700346	enfermedad de la vaca loca, EVL
AGROVOC:C014593	AGROVOC:T187953	plow, plows, plough, ploughs, plogh [a frequent misspelling]
AGROVOC:C014593	AGROVOC:T498001	charrue
AGROVOC:C014593	AGROVOC:T498002	materiel de labour
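As a rough sketch of how the three levels might be held in memory, the Python fragment below encodes a few rows of Table 4. The class names, the language tags, and the helper term_of are assumptions made for illustration; the identifiers are the fictitious ones from the table.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Term:
    term_id: str
    language: str                       # language tag: an assumption, not in Table 4
    strings: List[str] = field(default_factory=list)

@dataclass
class Concept:
    concept_id: str
    terms: List[Term] = field(default_factory=list)

# Three of the rows of Table 4 (fictitious identifiers).
bse = Concept("AGROVOC:C316301", [
    Term("AGROVOC:T657210", "en", ["bovine spongiform encephalopathy", "BSE"]),
    Term("AGROVOC:T657211", "en", ["mad cow disease", "Mad Cow Disease", "MCD"]),
    Term("AGROVOC:T734566", "fr", ["maladie de la vache folle", "MVF"]),
])

def term_of(concept: Concept, string: str) -> Optional[Term]:
    """Return the term that a string manifests, or None if it is not listed."""
    for term in concept.terms:
        if string in term.strings:
            return term
    return None
```

Because each string is stored under its term, the question of which term an abbreviation belongs to can be answered directly, for example term_of(bse, "MCD").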

If a term is a homonym (designates more than one concept), several disambiguated terms are introduced. The homonym is linked to each of the disambiguated terms, and each disambiguated term is linked to the corresponding concept. Two terms designating the same concept are called synonyms. Alternatively, if one does not agree that concepts per se exist, one can simply view "concept" as a convenient shorthand for an equivalence class of terms that are linked by the <hasSynonym> relationship, such as the synsets in WordNet. A KOS may select a preferred term as the term used to represent the concept or it may make that choice dependent on the audience (for example, veterinarians versus farmers).

Terms can be connected through many relationships such as <hasSynonym> (with <hasScientificName> as a special case), <hasAntonym>, <hasCognate> (term in a different language from the same root), and <hasTranslation>. One might think that the synonym and translation relationships are not needed since all terms linked to the same concept would be synonyms or translations. However, two terms may be linked to the same concept yet be used in different contexts, i.e., they are not strict synonyms. Likewise, if a concept has several English terms and several French terms linked to it, not just any of the French terms is a good translation for a given English term (see the examples in Table 4). Another example of a term-specific relationship is <hasAntonym>. For example, big and small designate opposite concepts but are not antonyms. (The antonym pairs are big versus little and large versus small; see WordNet.)

Finally, a term is manifested in one or more strings, as shown in Table 4. Strings can be connected through relationships such as <hasCaseVariant>, <hasSpellingVariant>, <hasAbbreviationOrAcronym>, <pluralOf>/<singularOf>, which are all subordinate to a broader relationship <hasStringVariant>. A term can be seen as a convenient shorthand for an equivalence class of strings that are linked by the <hasStringVariant> relationship. A KOS may select a preferred variant as the string used to represent the term or it may make that choice dependent on the audience (as in British versus US spellings). A string, especially an acronym, may belong to several terms, in which case it needs to be disambiguated.
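The view of a term as an equivalence class of strings can be sketched with a small union-find routine in Python. The strings come from Table 4; the particular variant links and the function name equivalence_classes are assumptions made for illustration.

```python
def equivalence_classes(items, links):
    """Group items into classes connected by symmetric variant links."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the class representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)       # merge the two classes

    classes = {}
    for item in items:
        classes.setdefault(find(item), set()).add(item)
    return list(classes.values())

STRINGS = ["plow", "plows", "plough", "ploughs", "charrue"]

# Each pair stands for some sub-relationship of <hasStringVariant>
# (plural or spelling variant); the exact pairs are assumed.
VARIANT_LINKS = [
    ("plows", "plow"),
    ("plough", "plow"),
    ("ploughs", "plough"),
]

terms = equivalence_classes(STRINGS, VARIANT_LINKS)
```

Here the four English strings collapse into one class, while charrue, which has no variant link to any of them, forms a class of its own.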

In addition, a concept, a lexicalization/term, a string, or a relationship type can have several types of notes (definitions, usage notes, comments, images, etc.) in different languages (in the case of multilingual thesauri). Just like concepts and terms, notes can be related to each other through relationships such as <hasTranslation>, <hasSimplifiedVersion>, <hasOtherDefinition>, or any other type of hyperlink. Many other pieces of information about terms can be added, for example, case frames for verbs (in case the verb has a case frame different from the case frame for the corresponding action concept), register (see below), or whether the term is the preferred term for the concept. Administrative data can be accommodated as well.

Relationship types themselves can form a relationship hierarchy (i.e., a hierarchy of relationships), in which more generic relationships are further up in the hierarchy than more specific relationships; for example, <componentOf> is a specific kind of <partOf> relationship.
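One practical consequence of a relationship hierarchy is that a query on a generic relationship can also return assertions made with its more specific kinds. The Python sketch below illustrates this under stated assumptions: the relationship names are taken from the text, while the table layout, the example facts, and the functions subsumed_by and query are hypothetical.

```python
# Each relationship type points to its immediately broader type.
BROADER = {
    "componentOf": "partOf",
    "hasCaseVariant": "hasStringVariant",
    "hasSpellingVariant": "hasStringVariant",
    "hasAbbreviationOrAcronym": "hasStringVariant",
}

def subsumed_by(rel, general):
    """True if rel equals general or lies below it in the hierarchy."""
    while rel is not None:
        if rel == general:
            return True
        rel = BROADER.get(rel)          # climb one level; None at the top
    return False

# Example assertions (subject, relationship, object); made up for illustration.
FACTS = [
    ("plowshare", "componentOf", "plow"),
    ("plough", "hasSpellingVariant", "plow"),
]

def query(general):
    """All facts whose relationship is a kind of the given relationship."""
    return [fact for fact in FACTS if subsumed_by(fact[1], general)]

part_facts = query("partOf")
variant_facts = query("hasStringVariant")
```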

Why define concepts, terms, and strings as separate entity types?

First, each of these entity types takes different types of information. Conceptual relationships and other conceptual information are associated with concepts. Linguistic information, such as part of speech, how a term combines with other terms into sentences, usage, or etymology, is associated with terms. Information such as the fact that a string is an acronym is associated with strings. Usage information may sometimes be associated with strings as well; for example, lay people may commonly use a slang abbreviation while professionals use the full string. Definitions are primarily associated with concepts but may also be associated with terms.

Second, this distinction avoids confusion. In a standard thesaurus like AGROVOC, for each concept that is to be used in indexing and searching, a preferred term, and for that term a preferred string, is selected; this string is the descriptor. Non-descriptors are linked only to descriptors, not among themselves. As a result, BSE, mad cow disease, and MCD [which we made up for illustration] are all linked to bovine spongiform encephalopathy as synonyms (or, in some thesauri, as synonym and as abbreviations). But the information that BSE belongs with bovine spongiform encephalopathy and MCD with mad cow disease is lost. Furthermore, if decisions on terms are made (for example, omitting mad cow disease as a non-scientific name), these decisions should apply to all term variants, in the example MCD as well.
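The contrast can be made concrete with a small Python sketch that uses the English strings of Table 4. The dictionary layout and the function drop_term are assumptions for illustration, not a description of how AGROVOC is actually stored.

```python
DESCRIPTOR = "bovine spongiform encephalopathy"

# Flat thesaurus: every non-descriptor points only at the descriptor,
# so nothing records that MCD abbreviates "mad cow disease".
FLAT_USE = {
    "BSE": DESCRIPTOR,
    "mad cow disease": DESCRIPTOR,
    "MCD": DESCRIPTOR,
}

# Layered model: each string points at its term (fictitious term numbers).
STRING_TO_TERM = {
    "bovine spongiform encephalopathy": "T657210",
    "BSE": "T657210",
    "mad cow disease": "T657211",
    "MCD": "T657211",
}

def drop_term(string_to_term, term_id):
    """Remove a term; the decision carries over to all of its strings."""
    return {s: t for s, t in string_to_term.items() if t != term_id}

# Omitting the non-scientific term also removes its abbreviation.
remaining = drop_term(STRING_TO_TERM, "T657211")
```

In the flat table the same decision would require someone to know, from outside the data, that MCD has to be removed along with mad cow disease.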

3.2 Model extensions

As was mentioned above, many more types of information could be added to concepts, terms, strings, notes, and relationships. For example, we might specify an audience (general lay public, K-12 students by grade level, university students, experts), a subject domain, a scope (as in Topic Maps), or a specially selected subset of concepts and terms to be used for a given application, or all concepts and terms taken from a given source.

Scopes could be defined in many ways. For example, one might define a scope as the conceptual system embedded and expressed in a language (whereas the link from terms and notes to language simply refers to the surface form). Consider the conceptual system underlying Dyirbal (an Australian indigenous language); one of its noun classifiers includes women, fire, and dangerous things (Lakoff 1987). A native speaker of English would find this classifier and the corresponding <isa> relationships very curious. Thus one would introduce the category and the <isa> relationships with a scope of the Dyirbal conceptual system. (By the way, the grouping of these concepts makes sense in context: fire is dangerous; fire is sometimes started by or anyway related to the sun; the gender of the noun for sun is female.) Many such problems, though more subtle, occur in thesauri for international use.

A subvocabulary can be extracted using any type of information about concepts, terms, strings, and relationships that is available in the thesaurus. Thus one could extract as a subvocabulary

a subset that was selected for a given application;
all strings that are acronyms (for an acronym reference);
all scientific names;
all entries for taxonomic entities for the entire range of living things or for a given large taxon such as insects;
terms suitable for a given audience.

Each of these subvocabularies provides a specific view on the entire KOS for a given purpose. In online implementations such subvocabularies can be created on the fly or defined as views for certain user groups. But subvocabularies can also be printed or exported (for example, a subvocabulary extracted as the personal KOS of a researcher who maintains an information retrieval system on his or her own computer).
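Extracting a subvocabulary on the fly amounts to filtering entries by whatever attributes the KOS records. The following Python sketch shows the idea; the record layout, the audience tags assigned to each string, and the function subvocabulary are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StringEntry:
    text: str
    term_id: str                # fictitious term number, as in Table 4
    is_acronym: bool = False
    audience: str = "general"   # hypothetical audience tag

ENTRIES = [
    StringEntry("bovine spongiform encephalopathy", "T657210", audience="expert"),
    StringEntry("BSE", "T657210", is_acronym=True, audience="expert"),
    StringEntry("mad cow disease", "T657211"),
    StringEntry("MCD", "T657211", is_acronym=True),
]

def subvocabulary(entries, predicate):
    """Return the entries that satisfy the predicate, in original order."""
    return [entry for entry in entries if predicate(entry)]

# A view for an acronym reference and a view for a lay audience.
acronyms = subvocabulary(ENTRIES, lambda e: e.is_acronym)
lay_strings = subvocabulary(ENTRIES, lambda e: e.audience == "general")
```

Because a view is just a predicate, it can be stored as a named definition for a user group or evaluated once and exported.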

3.3 Limitations

The separation into the concept layer and the term layer is appealing for its simplicity and elegance but it is somewhat of an oversimplification. Terms, particularly terms in different languages, rarely mean exactly the same thing. So the question arises as to when to map two terms to the same concept - and possibly explain shades of meaning and associations in the definition of each term that complements the definition of the concept - and when to create two closely related concepts, possibly under one broader concept. Our model permits any type of relationship between terms. Thus it is possible to introduce conceptually motivated relationships between terms that more accurately reflect the reality of language than the mapping of terms to "concepts". These two representations of conceptual information can coexist within the same system.

3.4 Implementation

All relationships from all layers (concept, term, string) can be stored in the same format within a database. The type of each element should be explicitly given to enable integrity constraints (so that the relationship <hasSpellingVariant> is not allowed between two concepts, for example). A concept can be identified by a URI or other number (cleanest solution) or by its preferred term in the base language of the thesaurus (the term being typed as preferred). Likewise, a term can be identified by a URI or other number (cleanest solution) or by its preferred string (the string being typed as preferred). The same holds for strings. The main difference from implementations in most existing thesaurus management software is that relationships between non-descriptors are allowed. Thoughts for an XML/RDF schema for KOS data are presented in the Appendix.
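A minimal Python sketch of such a typed store follows. It assumes that every element carries an explicit type and that each relationship type declares the element types it may connect; the tables ELEMENT_TYPE and SIGNATURE, the exception class, and the function add_relationship are hypothetical, and the identifiers are the fictitious ones from Table 4.

```python
# Explicit type of each element in the store.
ELEMENT_TYPE = {
    "AGROVOC:C316301": "concept",
    "AGROVOC:C014593": "concept",
    "AGROVOC:T657210": "term",
    "AGROVOC:T657211": "term",
    "plow": "string",
    "plough": "string",
}

# Allowed (subject type, object type) for each relationship type.
SIGNATURE = {
    "causes": ("concept", "concept"),
    "hasSynonym": ("term", "term"),
    "hasSpellingVariant": ("string", "string"),
}

class IntegrityError(ValueError):
    """Raised when a relationship connects elements of the wrong type."""

def add_relationship(store, subject, rel, obj):
    """Append (subject, rel, obj) to the store after checking element types."""
    expected = SIGNATURE[rel]
    actual = (ELEMENT_TYPE[subject], ELEMENT_TYPE[obj])
    if actual != expected:
        raise IntegrityError(f"<{rel}> expects {expected}, got {actual}")
    store.append((subject, rel, obj))

# All layers share one store and one row format.
store = []
add_relationship(store, "plough", "hasSpellingVariant", "plow")
add_relationship(store, "AGROVOC:T657211", "hasSynonym", "AGROVOC:T657210")
```

An attempt to assert <hasSpellingVariant> between the two concept identifiers raises IntegrityError and leaves the store unchanged.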

3.5 Related approaches

The proposed conceptual model integrates well with standardization approaches regarding Web technologies like RDFS. The proposed structure covers all aspects of the RDFS-compatible Thesaurus Interchange Format proposed by Matthews et al. (2002), which will appear as a W3C Note. That proposal is being developed in the context of the SWAD-Europe project. The Appendix presents another approach for representing ontology and thesaurus data in XML/RDFS.