

4. Metadata and their Importance to Information Technology


Metadata contain information needed to understand and effectively use the data. This includes documentation of the data set contents, context, quality, structure and accessibility.

Metadata are receiving increasing attention from the scientific community. Ecologists, scientific societies, and state and federal agencies recognize the importance of high-quality, well-documented and securely archived data for addressing long-term and broad-scale environmental questions. Ecological and environmental data (e.g. those collected at field stations, marine laboratories, national parks and natural reserves) represent a significant institutional, regional, national and international resource, and are essential to understanding and monitoring the health of a dynamically changing environment. Comprehensive metadata are the key that can ‘unlock’ these resources, thereby allowing their broad and long-term use.

4.1 Benefits and Costs Associated with Metadata

The rows and columns of numeric and textual observations contained within a data set are frequently referred to as raw data. Raw data are usually considered of value if they can be used within the scientific framework of the study that generated the data. Interpreting and using raw data to investigate a study’s underlying theoretical or conceptual model(s) requires an understanding of the types of variables measured. The measurement units, the data quality, the conditions under which the variables were measured and other relevant facts are all needed and are provided in metadata. Information is then generated from the combination of raw data and metadata.

Information content can be lost through the degradation of the raw data or the metadata. Such loss is unavoidable and has been referred to as information entropy.

Although metadata loss and degradation can occur throughout the period of data collection and analysis, the rate of loss frequently increases after project results have been published or the study has been terminated.

Benefits

At least three major benefits can be obtained from investing adequate time and money into metadata development.

i. Data entropy is delayed and, correspondingly, data set longevity is increased

As a consequence of data complexity, time and funding constraints, and information entropy, the life span of a typical ecological data set may be very short, possibly lasting only from data set conception to publication. Even data that are properly archived and maintained often become useless because the relevant metadata are missing or unavailable. Development and maintenance of comprehensive metadata can counteract this natural tendency of data to degrade in information content through time.

ii. Data reuse by the originator and data sharing with others is facilitated

With the rare exception of extremely simple data sets that are analysed immediately after collection, even the data collectors themselves need some form of metadata for subsequent analysis and processing. Furthermore, scientists require highly detailed instructions or documentation in order to interpret and analyse unfamiliar research data and complicated experimental designs accurately.

iii. Well-documented data may be used to expand the scale of ecological inquiry and support valid comparisons in space and time

For example, short-term investigations may evolve or be integrated into long-term studies. Metadata will then be essential for maintaining historical records of such long-term data sets. This is because inconsistencies in documenting data and changes in personnel, methods and instrumentation are likely to occur during ongoing long-term projects. Furthermore, metadata are critical for combining physical, chemical and biological data sets containing different parameters but sharing common spatial or temporal domains. Comprehensive metadata can therefore enable data sets (which were designed for a single purpose) to be used repeatedly for other objectives and over long periods.

Costs

High costs, mainly in terms of staff time, can be associated with developing and maintaining metadata. For relatively simple, short-term experiments the size of, and effort needed to create, the metadata may exceed those of the data file itself. This is not unusual in disciplines such as chemistry and physics, where understanding the experimental conditions is critical for reproducing data; in such cases, metadata may be scrutinized to the same extent as, or even more than, the data themselves. High costs are also associated with editing and publishing data and metadata (in paper or electronic formats). Furthermore, the costs of developing and distributing metadata are rarely included in project budgets, and the long-term stewardship and maintenance of data and metadata represent real cost burdens that are seldom calculated. It is also difficult to anticipate the number of potential secondary users who may require comprehensive metadata for a particular aspect of the data set.

4.2 Metadata Content ‘Standards’ Relevant for Ecology

All ecological data have a spatial or geographic component. The spatial component of the data may range from being central to being relatively unimportant to the success of a project. Geospatial data, for example, are explicitly associated with multiple geographical locations. In such cases, both environmental attributes associated with each sampling point and the specific location of the points are of scientific interest. So far most metadata standardization efforts have focused on data with a strong geospatial component. In contrast, non-geospatial data might include data from laboratory experiments or other ecological data collected at a limited number of locations.

Geospatial metadata

Significant effort has been put into developing geospatial metadata standards during the past decade. Recently the Federal Geographic Data Committee (1994, 1998) completed the Content Standards for Digital Geospatial Metadata. The Content Standards contain more than 200 metadata fields that are categorized into seven classes of metadata descriptions. Efforts are underway to add extensions to the Content Standards, creating metadata supersets appropriate to biological, cultural, demographic and other types of data. For more information on emerging metadata standards in Europe contact MEGRIN (http://www.megrin.org/), an organization that represents and is owned by a number of European National Mapping Agencies. MEGRIN also maintains a Geographic Data Description Directory on the World Wide Web that provides information on digital map data of European countries.

Non-geospatial ecological metadata

Metadata standards for non-geospatial ecological data do not currently exist in any accepted format beyond those used for individual studies, projects or organizations. Ecological studies often require large amounts of variable data related to the chemical and physical attributes of the environment, as well as information on the individual organisms, populations, communities and ecosystems that make up the biotic part of the environment. It is unlikely that a single metadata standard, no matter how comprehensive, could be developed to cover all types of ecological data. As a result, a generic set of non-geospatial metadata descriptors has recently been introduced for the ecological sciences. This list of metadata descriptors was proposed as a template for more refined project-specific metadata procedures. Five classes of metadata descriptors were defined (a simple illustrative sketch follows the list below).

i. Data set descriptors: Basic information about the data set (e.g. data set title, associated scientists, abstract and keywords).

ii. Research origin descriptors: All the relevant metadata that describe the research that generated the data set (i.e. hypotheses, site characteristics, and experimental design and methods).

iii. Data set status and accessibility descriptors: The status of the data set and associated metadata, as well as information related to data set accessibility.

iv. Data structural descriptors: All attributes related to the physical structure of the data file.

v. Supplementary descriptors: All other related information that may facilitate secondary usage, publishing and auditing of the data sets.
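The five classes can be pictured as a single nested record. The sketch below is a minimal, hypothetical illustration in Python; the class and field names are examples chosen here, not elements of any published standard.

```python
# A minimal, illustrative sketch of the five classes of non-geospatial metadata
# descriptors outlined above, expressed as Python dataclasses. Field names are
# hypothetical examples only, not prescribed metadata elements.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataSetDescriptors:
    title: str
    investigators: List[str]
    abstract: str
    keywords: List[str] = field(default_factory=list)


@dataclass
class ResearchOriginDescriptors:
    hypotheses: str
    site_characteristics: str
    experimental_design: str
    methods: str


@dataclass
class StatusAndAccessibility:
    status: str             # e.g. 'ongoing' or 'completed'
    latest_update: str
    access_conditions: str  # how and under what terms the data can be obtained


@dataclass
class StructuralDescriptors:
    file_format: str        # e.g. 'comma-delimited ASCII'
    variables: List[str]
    units: List[str]


@dataclass
class SupplementaryDescriptors:
    usage_history: List[str] = field(default_factory=list)
    related_publications: List[str] = field(default_factory=list)


@dataclass
class EcologicalMetadata:
    dataset: DataSetDescriptors
    origin: ResearchOriginDescriptors
    accessibility: StatusAndAccessibility
    structure: StructuralDescriptors
    supplementary: SupplementaryDescriptors
```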

These metadata descriptors were formulated to answer five basic questions that might arise when an ecologist attempts to identify and use a specific data set:

i. What relevant data exists?

ii. Why was the data collected and is it suitable for a particular use?

iii. How can the data be obtained?

iv. How was the data organized and structured?

v. What additional information is available that would facilitate data use and interpretation?

It can be especially difficult to identify and document all the supplemental information that may be required for specific data uses. For this reason, it may be beneficial to design metadata that can also serve as a vehicle for user feedback and data anomaly reporting. A ‘data set usage history’, for example, may add value to data sets and facilitate their long-term use.

Metadata standards

Information scientists have developed several generic metadata 'standards' to facilitate cataloguing and discovery of electronic resources. Some examples are the Dublin Core, NASA’s Directory Interchange Format and the Government Information Locator Service format.
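As an illustration of how lightweight these cataloguing formats can be, a Dublin Core description of a data set is essentially a short list of named elements. The sketch below, in Python, shows a hypothetical record using a subset of the Dublin Core element names; the values are invented for illustration only.

```python
# A hypothetical Dublin Core-style record for an ecological data set, expressed
# as a simple mapping from Dublin Core element names to values. Only the element
# names follow the Dublin Core set; the values are invented.
dublin_core_record = {
    "title": "Stream water chemistry, example catchment",
    "creator": "Example Field Station",
    "subject": "water quality; nitrate; long-term monitoring",
    "description": "Weekly stream chemistry samples (illustrative entry).",
    "date": "1995/2000",
    "type": "Dataset",
    "format": "text/csv",
    "identifier": "example-catchment-chemistry-v1",
    "coverage": "Example catchment, 1995-2000",
}

# Such a flat, named-element record is easy to index for cataloguing and
# discovery of electronic resources:
for element, value in dublin_core_record.items():
    print(f"{element}: {value}")
```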

4.3 Software and Resources

Guidelines for metadata structure and supporting technology (i.e. user-friendly software for metadata generation and management) are being discussed and developed by numerous organizations. One particularly promising approach to metadata management is embodied in Web-based metadata search and data retrieval systems. Mercury, developed at the Oak Ridge National Laboratory Distributed Active Archive Center, is an example of such a Web-based system. Mercury supports searches of metadata to identify data of interest and then delivers the data to the user. To make data and metadata available, data providers make them ‘visible’ in an area on their computer; Mercury periodically harvests the metadata and automatically constructs an index and a relational database that subsequently reside at a central facility. Web-based metadata management programmes like Mercury have several benefits, including control of ‘data visibility’ by the scientist, high levels of inherent automation and computer platform independence.
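The harvest-and-index pattern described above can be pictured with a short sketch. The code below is a simplified, hypothetical illustration written in Python (it is not the Mercury software itself): it scans a ‘visible’ directory of plain-text metadata files and loads selected fields into a small relational index that could then be searched centrally. The directory name, file layout and indexed fields are all assumptions made for the example.

```python
# Simplified, hypothetical sketch of a harvest-and-index workflow (not the
# actual Mercury software): metadata files that a data provider has made
# 'visible' in a directory are read, and selected fields are loaded into a
# small relational index that can be searched at a central facility.
import sqlite3
from pathlib import Path

VISIBLE_DIR = Path("visible_metadata")   # directory the provider exposes (assumed)
INDEX_DB = "metadata_index.db"           # central catalogue database (assumed)


def parse_record(text: str) -> dict:
    """Parse simple 'field: value' lines into a dictionary."""
    record = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            record[key.strip().lower()] = value.strip()
    return record


def harvest() -> None:
    """Harvest visible metadata files and rebuild the searchable catalogue."""
    conn = sqlite3.connect(INDEX_DB)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS catalogue "
        "(source TEXT PRIMARY KEY, title TEXT, keywords TEXT)"
    )
    for path in VISIBLE_DIR.glob("*.txt"):
        record = parse_record(path.read_text())
        conn.execute(
            "INSERT OR REPLACE INTO catalogue VALUES (?, ?, ?)",
            (str(path), record.get("title", ""), record.get("keywords", "")),
        )
    conn.commit()
    conn.close()


if __name__ == "__main__":
    harvest()
```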

Metadata generation tools

When selecting a metadata generation tool it is important to consider whether the software meets the specified objectives (especially, metadata completeness), and whether it conforms to industry-wide or discipline-specific guidelines. In some cases, it may be necessary to use more than one metadata generation tool.

For example, an institution’s spatial data may be incorporated into a GIS vendor-supplied metadata programme that conforms to Federal Geographic Data Committee (1994, 1998) standards and is well integrated into the GIS environment. Their water quality data, in contrast, may be incorporated into a specific metadata programme that meets other requirements established by a state or federal funding agency.

Some of the most important metadata attributes (e.g. natural history observations) are often recorded and maintained in unstructured formats (often in the form of paper notes). These attributes may be critical for correct data interpretation and analysis. Field notes and other unstructured metadata can either be archived in paper files or converted into digital format (e.g. scanning, transcription to text or word processing files). These types of unstructured metadata may be suitable for exchange with expert colleagues but are inadequate for electronic data set publication and sharing with the broader scientific community. Although existing or proposed metadata generation tools may fill most of a project’s metadata needs, consideration should be given to how maps, field notes and other unstructured data will be archived and managed, as well as referenced in the metadata.

Metadata structure

Increasing amounts of supplementary metadata are being added to meet the demands of secondary data users. The utility of these metadata can be improved by giving them structure. A highly structured and fully searchable metadata record would typically be managed within a sophisticated database management system (DBMS). Minimal metadata structure should not, however, be confused with low content.

Increased metadata structure can be beneficial for several reasons. Structured metadata act as a checklist, providing a memory-aid for the information needed to facilitate subsequent data processing and interpretation. Increased structure also facilitates the development of searchable catalogues and database interfaces, potentially making the data available to a larger number of users and to a wider range of processing software. High levels of structure may be good practice or, in some cases, may be required for specific projects (e.g. those requiring periodic data audits). However, highly structured metadata may be excessive where low levels of secondary usage are anticipated. The benefits of incorporating metadata into a highly structured DBMS format should therefore be weighed against software, programming, development and maintenance costs.

The choice of metadata media and structure will often be dictated by the availability of metadata generation tools, trained personnel, time and funding constraints, and projected rates of metadata usage. When specific metadata tools are inadequate or unavailable, metadata may be incorporated into word processing files (or free-flowing ASCII text), analytical programmes (e.g. Statistical Analysis System programmes), or more structured DBMS programmes. Satisfying high levels of demand for metadata may necessitate making the metadata DBMS accessible via the World Wide Web.

During the mid-1990s, a number of organizations with moderate to large holdings of data began implementing metadata schemes. Format descriptions or more sophisticated DBMS software that increased metadata structure and often directly linked data and metadata were introduced. The primary objective of these efforts was to initiate the standardization of the data content and structure so as to facilitate search and retrieval.

Several tools are available for implementing World Wide Web-based metadata applications, including the Hypertext Markup Language (HTML) and the eXtensible Markup Language (XML). XML is a standardized text format that represents a subset of the Standard Generalized Markup Language (SGML; ISO standard 8879). XML is currently the most useful representation language for documenting the content and semantics of Web-based resources. It was specifically designed for transmitting structured data to Web applications, and its utility is further increased by its relative ease of expansion, its flexible structure that supports arbitrary nesting, and its potential for automated validation.
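As a simple illustration of the nesting that XML supports, the sketch below uses Python's standard library to build a small metadata record with hierarchical elements. The element names and values are hypothetical and do not follow any particular metadata schema.

```python
# Hypothetical sketch: building a small, hierarchically nested metadata record
# in XML with Python's standard library. The element names and values are
# invented for illustration and do not follow any particular metadata standard.
import xml.etree.ElementTree as ET

dataset = ET.Element("dataset")

general = ET.SubElement(dataset, "general")
ET.SubElement(general, "title").text = "Example stream chemistry data"
ET.SubElement(general, "abstract").text = "Illustrative abstract text."

# Nested structure: each variable carries its own name and units.
structure = ET.SubElement(dataset, "structure")
for name, units in [("nitrate", "mg/l"), ("temperature", "degrees C")]:
    variable = ET.SubElement(structure, "variable")
    ET.SubElement(variable, "name").text = name
    ET.SubElement(variable, "units").text = units

# Serialize to text; a schema (e.g. a DTD or XML Schema) could be used to
# validate records like this automatically.
print(ET.tostring(dataset, encoding="unicode"))
```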

4.4 Metadata Implementation

Objectives for metadata implementation include facilitating the identification and acquisition of data for a specific theme, time period and/or geographical location. Metadata are also needed to help determine the data's suitability for specific objectives, analysis, modelling and processing. Three major issues warrant consideration during metadata planning and implementation: desired data longevity, projected rate of use, and sharing of responsibility.

All data should be accompanied by some form of metadata (even if minimal). The level of metadata provided will determine the extent and time that the data can be reused by the original investigator(s), scientists, resource managers, decision-makers and other potential users. Metadata development and maintenance can be a costly enterprise. It may therefore be worthwhile attempting to match the metadata content and structure to the needs of the anticipated users. Dedicating project resources to metadata design and implementation costs money and personnel effort and can result in fewer publications in the short term. The rewards are the production of high quality data and metadata that can be ‘mined’ for many years or even decades. The balance of short-term costs against potential long-term benefits is an issue warranting considerable thought and discussion by data collectors, data users, institutions, and funding agencies.

The first key step in metadata implementation is to assess the site or project needs. The objectives for the data need to be identified (e.g. the desired longevity of the project or data, or the potential reuse needs of the data). Guidelines and procedures for data sharing and data ownership need to be established, the available infrastructure (e.g. hardware, personnel and funds) needs to be assessed, and finally metadata activities need to be prioritized and categorized.

After data categories have been prioritized, it is necessary to adopt an existing metadata standard (e.g. the geospatial metadata standard, FGDC 1994, 1998) or to identify a set of minimal or optimal metadata descriptors that meet the perceived needs. It is also recommended that a pilot project using one or more relatively ‘simple’ data sets be carried out, and that project successes and difficulties be used to re-evaluate site needs and objectives. Formal metadata standards and supporting software can then be developed. While these standards are being established, the metadata descriptors can be used to develop metadata for individual scientists, laboratories and projects. Small groups of scientists focused on a specific research objective, such as synthesizing data on a particular topic, may benefit significantly from efforts to implement metadata.

Metadata implementation ‘keys to success’

Both basic and applied ecological research depend upon the availability of data, including the ability to locate and use them. If a greater effort is made to develop high quality data sets and accompanying metadata, individual scientists and organizations can focus their valuable time on analysis. Comprehensive metadata will also allow individual scientists and organizations to reuse data originally intended for other applications. Flexible metadata generation and management tools that support entry, search and retrieval are essential for facilitating metadata implementation, and there is a significant need for research and development in this area.

