

Appendix 4: Directory-Level Metadata


This Appendix is intended to provide background material for task 2.5.

1 Introduction

It is commonly agreed that in this "information age" data are themselves a resource, and an abundant one at that. As with many natural resources, a potential user must first know that the data exist, and then where they are to be found, their condition, and whatever else is needed to determine whether they can be used for the task at hand. This knowledge about existing data, and about how they might be accessed and used, has become increasingly important as funding for systematic data collection activities has been reduced in many jurisdictions.

Metadata are "data about data", describing such things as the location, sources, content, quality and condition of existing data. Metadatabase systems are systems specifically designed to manage metadata, i.e. to provide facilities for input, update, retrieval and reporting of data about data. Such systems are used at a variety of levels. Within a single institution or organisation, they are used to organise and maintain its own data holdings, in order to protect and maximise the investment made in organising and structuring the data. On a broader level, they provide a mechanism through which data producers can ensure that potential users are made aware of existing data and how they might be obtained. The systems also vary in the types of data which are described, for example, books, reports, maps, digital files, etc. Thus there is a wide range of relevant work, from bibliographic systems to the handling of digital imagery.

2 Sample Metadata Developments

2.1 NASA Global Change Master Directory

The Global Change Master Directory (GCMD) of NASA is a comprehensive directory of material relevant to global change research. It uses a Directory Interchange Format (DIF), which has been widely used for the exchange of metadata. The content of the DIF is organised into a nested structure using prescribed "labels" to identify the beginning and end of information fields. The term "DIF" is also used to refer to an entry in the directory, i.e. the description of a dataset is itself referred to as a DIF.

The standard GCMD DIF allows for some 43 fields (many of which are fairly specific to remote sensing data) which can be grouped roughly as follows:

identification: (addresses, contact names and other information about the agencies and people responsible for the dataset)

spatial reference: (geographic scope and location of the data, map projections used and the like)

distribution: (information on conditions and methods of access to the data)

metadata reference: (information on data content, resolution, scientific purpose, etc.)

As can be seen from the examples given below, these are the types of information common to many metadatabase developments, though the actual fields will vary.

A limited number of DIF fields are considered mandatory, and many permit narrative descriptions or comments. Some fields have a controlled vocabulary, e.g. prescribed lists of allowable keywords. Again, both of these features are common to other metadatabase developments.

2.2 United Nations Environment Program

Within UNEP, a metadatabase was developed as part of the GRID programme. A "dataset" is defined as a collection of data and accompanying documentation maintained at a single source. A collection of data consists of one or more "data members" (there is no maximum) which relate to a specific theme or to a geographic region in terms of physical area covered. Information held about a dataset includes:

identification: (name of dataset, and institution holding it)

contact references: (names and address details of contacts, access conditions)

geographic coverage: (general location, e.g. continent or country, and/or latitude-longitude bounding rectangle)

general description: (subject keywords and free text summary of dataset contents)

A "member" is regarded as equivalent to a data file, a paper report, a map or other unit of data, and is always a component of a particular dataset. Thus, members are the lowest-level, "concrete" data entities that could actually be requested by a potential user. Information held about a member will vary with the type of member. For instance, metadata items for a raster data file will include resolution and number of rows and columns, while those for a vector data file will include geo-referencing details, and so on.
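The dataset and member relationship described above can be sketched as a simple data structure. This is only an illustration in Python: the class names, field names and sample values are assumptions made for the example and are not taken from the UNEP/GRID metadatabase itself.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """Lowest-level unit a user could request (data file, report, map)."""
    name: str
    kind: str  # e.g. "raster", "vector", "report"
    # Items that depend on the kind of member, e.g. rows and columns
    # for a raster file, or geo-referencing details for a vector file.
    details: dict = field(default_factory=dict)


@dataclass
class Dataset:
    """A collection of one or more members held at a single source."""
    name: str
    institution: str
    keywords: list = field(default_factory=list)
    members: list = field(default_factory=list)

    def add_member(self, member: Member) -> None:
        self.members.append(member)

    def is_valid(self) -> bool:
        # The definition requires at least one member; there is no maximum.
        return len(self.members) >= 1


# Hypothetical example values.
soils = Dataset(name="Example soils dataset", institution="Example centre")
soils.add_member(
    Member("soil_grid", "raster", {"rows": 180, "columns": 360})
)
```

Keeping the type-specific items in a free-form `details` mapping is one way to let the information held vary with the type of member; a real system might instead define a separate record layout for each type.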

A separate section of the metadatabase is maintained containing information on "institutions", that is, the centres that hold data. This includes information on the overall scope and nature of scientific programmes and the information management capacity of the centre. This could provide a useful model for a GTOS Data Centres Directory (see Section 11).

2.3 Federal Geographic Data Committee (FGDC)

In the USA, the FGDC developed a standard for digital spatial metadata through a consultative process over a two-year period starting in 1992, and it is now mandatory for all US Federal agencies. The standard provides a common set of terminology and definitions for the documentation of the data. It establishes the names and definitions of data elements and groups of data elements to be used for these purposes, and information about the allowable values for the data elements. The standard also defines which data elements are mandatory, mandatory under certain conditions, and optional (i.e. included at the discretion of the data provider).

The standard is quite extensive, specifying the structure and expected content of some 220 data elements. These can be roughly grouped as follows:
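The three levels of obligation (mandatory, mandatory under certain conditions, optional) lend themselves to a simple automated check. The following Python sketch shows one possible way to express it. The element names and the condition used are hypothetical and chosen only for illustration; they are not taken from the FGDC standard.

```python
from enum import Enum


class Obligation(Enum):
    MANDATORY = "mandatory"
    CONDITIONAL = "mandatory under certain conditions"
    OPTIONAL = "optional"


# Hypothetical element names and conditions, chosen only for illustration;
# they are not taken from the FGDC standard itself.
RULES = {
    "title": (Obligation.MANDATORY, None),
    "distribution_mode": (Obligation.MANDATORY, None),
    # Required only when the record says the data are distributed online.
    "online_link": (
        Obligation.CONDITIONAL,
        lambda record: record.get("distribution_mode") == "online",
    ),
    "comments": (Obligation.OPTIONAL, None),
}


def missing_elements(record):
    """Return the names of required elements that are absent or empty."""
    missing = []
    for name, (obligation, condition) in RULES.items():
        if record.get(name):
            continue  # element is present and non-empty
        if obligation is Obligation.MANDATORY:
            missing.append(name)
        elif obligation is Obligation.CONDITIONAL and condition(record):
            missing.append(name)
    return missing


record = {"title": "Example dataset", "distribution_mode": "online"}
print(missing_elements(record))  # ['online_link']
```

Optional elements are never reported as missing, which matches the idea that they are included at the discretion of the data provider.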

Identification: (basic information about the data set, including the dataset name, geographic area covered, currency, and rules for obtaining or using the data)

Data Quality: (information which assists in assessing the usefulness of the information for the user's purpose, including the positional and attribute accuracy, completeness, consistency, information sources and methods used to process the data)

Spatial Data Organisation: (spatial representation methods, e.g. the method used to represent spatial positions directly (such as raster or vector) and indirectly (such as street addresses or county codes), the number of spatial objects in the data set and so on)

Spatial Reference: (description of the reference frame for, and means of encoding, co-ordinates in the data set, including map projection parameters, grid co-ordinate systems and resolution, and the horizontal and vertical datum)

Entity and Attribute Information: (information about the content of the data set, including the entity types and their attributes and the allowable attribute value domains)

Distribution: (information about obtaining the data set, including contact addresses, available formats and media, online access, and fees for the data)

Metadata Reference: (information on the source of the metadata entry and its most recent update)

There is a systematic mapping of the fields of this standard to those of the GCMD DIF.

Details of information to be reported and tasks to be performed are in the Spatial Data Transfer Standard (Federal Information Processing Standard 173).

2.4 The Australia New Zealand Land Information Council (ANZLIC)

ANZLIC has developed a definition of the appropriate elements for a national land and geographic data directory system through a consultative process which began in 1995. The approach is deliberately less ambitious than that of the FGDC described above but is, as far as possible, consistent with the FGDC guidelines.

The core elements of the definition are grouped into 9 categories:

Dataset: (title, custodian, jurisdiction)

Description: (abstract, search word(s), geographic extent with name(s), or geographic extent defined by polygon(s))

Data Currency: (beginning date, ending date)

Dataset Status: (progress, maintenance, update frequency)

Access: (stored data format, available format type(s), access constraints)

Data Quality: (lineage, positional accuracy, attribute accuracy, logical consistency, completeness)

Contact Information: (name, title, organisation, address, contact numbers)

Metadata Date: (date metadata was prepared or updated)

Additional Metadata: (optional information)

Note that the JDIMP metadata pilot project is being led through the Australian Oceanographic Data Centre and uses the above categories as a starting point.

3 Issues

3.1 Level of agreement on metadata content

The preceding section has given some samples of activities in the construction of metadatabases. Interest in this area has been increasing in the recent past and a number of data standards are emerging, but no single one has been generally adopted. As the samples show, there are several well-documented examples which have common elements.

There appears to be general consensus that the way forward is to have some form of metadata standards harmonisation which will at least facilitate movement between metadatabases. This is what GTOS will have to put in place to facilitate information exchange and access across Data Centres. The important principle is to include the information needed to determine if the dataset is potentially useful for the user’s purposes.

3.2 Standard Terminology and Keywords

The use of clearly stated and well-defined terminology in constructing a metadatabase is essential. Open-ended text searching to determine dataset content is a very hit-and-miss method. For environmental data, existing keyword lists serve reasonably well at the broad category levels but frequently present problems as the need to add detail arises. GTOS may need to extend or modify existing vocabularies that are established or under development, e.g. by the EEA, CIESIN or UNEP. Any such lists should also be open-ended, i.e. allow for additions and evolution, but existing metadata must be kept consistent with any changes made.
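A controlled but open-ended keyword list of the kind described here might be handled as in the following Python sketch. The class, the function and the sample terms are hypothetical and are not drawn from any of the vocabularies named above. The sketch checks submitted keywords against the list, allows the list to grow, and applies a renamed term to existing records so that they stay consistent with the change.

```python
class Vocabulary:
    """A controlled keyword list that can still grow over time."""

    def __init__(self, terms):
        self._terms = {self._norm(t) for t in terms}

    @staticmethod
    def _norm(term):
        # Compare terms without regard to case or extra spaces.
        return " ".join(term.lower().split())

    def add(self, term):
        self._terms.add(self._norm(term))

    def check(self, keywords):
        """Split keywords into (accepted, rejected) lists."""
        accepted, rejected = [], []
        for kw in keywords:
            if self._norm(kw) in self._terms:
                accepted.append(kw)
            else:
                rejected.append(kw)
        return accepted, rejected


def rename_keyword(records, old, new):
    """Apply a vocabulary change to existing records, in place."""
    for rec in records:
        rec["keywords"] = [new if k == old else k for k in rec["keywords"]]


# Hypothetical terms, for illustration only.
vocab = Vocabulary(["land cover", "soils"])
print(vocab.check(["Land  Cover", "hydrology"]))
# (['Land  Cover'], ['hydrology'])
```

Rejected keywords need not simply be refused. They could be passed to whoever maintains the list as candidates for addition, which is one way of keeping the vocabulary open-ended.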

3.3 Geographic Location

An important item of metadata describes the geographic area covered. The use of geographic names for this presents problems in the standardisation of the names of countries and regions, and also in the way in which the metadata has to be handled. For example, a metadata entry which describes a dataset as relating to "East Africa" could be of interest to someone searching for information about "Africa", or for information about "Kenya". Consideration should be given to using an existing standard for country names and abbreviations, such as that from ISO or the numeric codes used by FAO. This would allow for consistency in searching.
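The "East Africa" example implies that a search by name must take account of the nesting of regions. The following Python sketch shows one way this could be done. It assumes a small, hand-built table of parent regions; in practice the keys would more likely be standard country and region codes than free-text names.

```python
# Hypothetical, deliberately tiny table: each region maps to the
# region that contains it (None for the top level).
PARENT = {
    "Kenya": "East Africa",
    "Tanzania": "East Africa",
    "East Africa": "Africa",
    "Africa": None,
}


def ancestors(region):
    """Return the chain of regions that contain the given region."""
    chain = []
    current = PARENT.get(region)
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


def is_relevant(entry_region, query_region):
    """True if the entry's region equals, contains, or lies within the query."""
    return (
        entry_region == query_region
        or query_region in ancestors(entry_region)
        or entry_region in ancestors(query_region)
    )


print(is_relevant("East Africa", "Africa"))  # True
print(is_relevant("East Africa", "Kenya"))   # True
print(is_relevant("Kenya", "Tanzania"))      # False
```

Whether an entry for a larger region should really be returned for a search on a smaller one is a design choice. The sketch returns it, following the example in the text, but a real system might rank such matches lower.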

Another commonly used method is the specification of a "bounding rectangle" or point location using, for example, latitude and longitude.
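A search against bounding rectangles reduces to a test for overlap between two rectangles. The following Python sketch illustrates this under two stated assumptions: co-ordinates are in decimal degrees, and no rectangle crosses the 180° meridian. The class name and the co-ordinate values are illustrative only. A point location is treated as a rectangle of zero width and height.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Latitude-longitude rectangle in decimal degrees.

    Assumes west <= east and south <= north, i.e. the rectangle
    does not cross the 180-degree meridian.
    """
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other):
        # Two rectangles overlap when they overlap on both axes.
        return (
            self.west <= other.east
            and other.west <= self.east
            and self.south <= other.north
            and other.south <= self.north
        )


# Illustrative co-ordinates only, not taken from any real dataset.
dataset_area = BoundingBox(west=30.0, south=-5.0, east=42.0, north=6.0)
query_area = BoundingBox(west=40.0, south=0.0, east=50.0, north=10.0)
point = BoundingBox(west=36.8, south=-1.3, east=36.8, north=-1.3)

print(dataset_area.intersects(query_area))  # True
print(dataset_area.intersects(point))       # True
```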

3.4 Metadata Output and Exchange

The Directory Interchange Format mentioned above defines a formal syntax that helps to ensure that the metadata is as complete and unambiguous as possible, and this syntax is generally being adopted for exchange of metadata at all levels. The output/exchange files are in ASCII format consisting, for each dataset, of each attribute or field name followed by the value, with a colon separating the two. Although it is unattractive as printed output, it can be read by software without ambiguity. This format can be considered a de facto exchange standard, and can be employed to import existing metadatabases from participating Data Centres with little difficulty.
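The "field name, colon, value" layout described above is straightforward to read by program. The following Python sketch parses a simplified file of this kind. It is not a parser for the actual DIF syntax: it ignores the nested structure of a real DIF, it assumes that a blank line separates one dataset from the next, and the field names in the sample are hypothetical.

```python
def parse_records(text):
    """Parse 'name: value' lines into a list of records.

    Each record maps a field name to a list of values, so that
    repeated fields (e.g. several keywords) are all kept.
    Assumption: a blank line separates one dataset from the next.
    """
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"no colon in line: {line!r}")
        current.setdefault(name.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records


# Hypothetical field names, for illustration only.
sample = (
    "Title: Example dataset\n"
    "Keyword: soils\n"
    "Keyword: land cover\n"
    "\n"
    "Title: Second dataset\n"
)
records = parse_records(sample)
print(len(records))           # 2
print(records[0]["Keyword"])  # ['soils', 'land cover']
```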

3.5 Availability of Tools

A variety of software packages have been developed and used for the management of metadatabases. Currently, in countries where access to the Internet is readily available, there is substantial development activity in providing metadatabase access tools based on Internet protocols. These developments tend to concentrate on facilitating metadata entry and on developing search interfaces (including implementation of spatial queries) to existing metadatabases.

3.6 Level of Effort

The construction and maintenance of a metadatabase involves significant effort, often seen as unglamorous extra work beyond the initial data assembly. It is important that the burden of supplying metadata not be too onerous.

Experience has shown that the greater the number of mandatory fields to be entered, the greater the likelihood that no entry will be made at all. The rather simplified ANZLIC structure reflects this thinking. A balance must be established between the need to have sufficient information for a user to identify potentially valuable datasets, and the need for ease in entering and maintaining the metadata. In this regard it is important to distinguish between the "directory level" metadata referred to here, which must be kept simple, and the "dataset level" metadata (sometimes called co-data), which must be available with the dataset. The dataset level metadata must contain all the information needed to use the dataset correctly: calibrations, offsets, origins, assumptions, classification systems, terminology sets, taxonomies, geo-referencing schemes, instrument parameters, legends, coding schemes, etc., as appropriate to the dataset.

