DATA AND INFORMATION SYSTEMS AND SERVICES

Data assembly issues
Guidelines for data and information systems and services
In situ data for satellite-derived products
Priority needs and recommendations

Data assembly issues

Overview

To meet its objectives, TCO requires a variety of data from in situ and satellite platforms. This will include new data sets obtained in a globally coordinated manner and new data assembled from measurements originally made by different groups in various regions. In the latter case, the task of producing quality products is complicated by differences in the data acquisition procedures. Thus, a major challenge for TCO will be to process and harmonize these diverse data sets into consistent products. This section addresses issues related to data management and methodologies associated with data harmonization.

In general, the data harmonization issues include:

identifying data needs;
evaluating data availability and gaps;
registering data;
processing generic data;
establishing a database.

Issues

Identifying data needs

Point and gridded data sets are of interest to TCO (see Table 1). The parameters selected will depend on the stage of TCO development and the specific topics being addressed. Likely parameters will include carbon stocks and changes in above- and below-ground components (or a finer breakdown into litterfall, branch, bole, coarse roots, and fine roots). This activity would also include identifying units of measure, conversion factors, grid cell size selection, and reference year. In the following discussion, the focus is placed on these types of measurements.

Evaluating data availability and gaps

To help identify gaps in the coverage between needed and available data, sites can be plotted geographically as well as in terms of environmental conditions (biome, vegetation, climate, soils, land use, others). The sites and their data can then be arranged within a multidimensional set of environmental parameter axes (see Terrestrial measurements, page 18). Similarly, global gridded data sets of biome/land cover, climate, and soils can be used to define the global distribution of the environmental space. Comparing the two sets of axes helps identify gaps in the available coverage of in situ data. The analysis can then be used to search for additional data to fill the gaps.
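
The comparison of the two sets of axes can be sketched in a few lines of Python. This is a minimal illustration only: it assumes a two-dimensional environmental space (mean annual temperature and annual precipitation), and the function names, bin widths, and sample values are hypothetical rather than part of any TCO specification.

```python
from collections import Counter

def env_cell(temp_c, precip_mm, temp_step=5.0, precip_step=500.0):
    """Map a (temperature, precipitation) pair to a cell of the environmental space.

    The bin widths are illustrative assumptions, not TCO-defined values.
    """
    return (int(temp_c // temp_step), int(precip_mm // precip_step))

def coverage_gaps(site_conditions, grid_conditions, min_sites=1):
    """Return cells present in the global grid that hold fewer than min_sites sites.

    The result maps each gap cell to the number of grid cells falling in it,
    ordered so that the most extensive unsampled conditions come first.
    """
    sites_per_cell = Counter(env_cell(t, p) for t, p in site_conditions)
    grid_per_cell = Counter(env_cell(t, p) for t, p in grid_conditions)
    gaps = {cell: n for cell, n in grid_per_cell.items()
            if sites_per_cell[cell] < min_sites}
    return dict(sorted(gaps.items(), key=lambda item: -item[1]))

# Hypothetical example: two in situ sites against seven global grid cells.
sites = [(25.0, 2200.0), (8.0, 700.0)]
grid = [(25.0, 2200.0), (26.0, 2300.0), (-3.0, 300.0), (-4.0, 400.0),
        (-2.0, 450.0), (8.0, 700.0), (22.0, 100.0)]
print(coverage_gaps(sites, grid))  # {(-1, 0): 3, (4, 0): 1}
```

In this example the cold, dry cell and the warm, dry cell contain grid cells but no sites, so they would be the first candidates in a search for additional data. A real analysis would use more axes (biome, soils, land use) and area-weighted grid cells.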

Registering data

The identified data resources need to be registered in a database for convenience of access. Mercury, developed by the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC; http://mercury.ornl.gov/), is a candidate system that could be used for this purpose. Mercury allows searching by spatial or temporal coverage, observation methods, and environmental characteristics associated with the sites. Other data handling approaches are described in a previous TCO report (FAO, 2002b).

For scientists and organizations to contribute data to programmes such as TCO, they need to understand and appreciate the value of these programmes. In addition, they may need incentives to provide sufficiently detailed documentation so that others may use the data appropriately. In general, the incentives should result in data contributors getting recognition similar to that received from published papers (Olson et al., 1999; Ecological Society of America: http://esa.sdsc.edu/esapubs/Journals_main.htm/).

Regarding documentation, definite metadata standards are lacking but a general guideline is the "20-year rule", which is that data set documentation is adequate if, 20 years from now, a scientist not familiar with the original study could effectively use the data in an appropriate application. Fortunately, there is a gradual convergence in metadata formats, e.g. Earth Observing System (EOS) DAACs and ISLSCP.

Previous experience from the design, management, and processing of ecological data (Michener and Brunt, 2000) is relevant to many of the TCO issues. Data processing can be improved by providing guidelines to scientists for creating data sets, such as the 'best practices for new data sets' (Cook et al., 2001; www.daac.ornl.gov/DAAC/PI/bestprac.html/). The guidelines provide suggestions and examples that emphasise:

descriptive file names;
consistent and stable file formats;
definitions of parameters;
consistent data organization;
basic quality assurance;
descriptive data set titles;
documentation.

Generic data processing

Generic issues include scaling (both up and down), gridding of polygon data, gap filling, statistical design, uncertainty estimates, and propagation of errors. These issues have been addressed in specific situations (e.g. Baldocchi et al., 2001; Falge et al., 2001; Gu et al., 2000). However, for TCO to obtain an understanding of the compounding of these uncertainties in the global estimate of carbon, an expert panel may be needed to provide recommendations as to specific methods and an overall strategy.

Establishing a database

As a supplement to standard quality assurance/quality control (QA/QC) procedures applied to individual data sets, additional checks can be applied to the larger collections of integrated data sets. After the data have been assembled from diverse sources, the patterns within the collection can be reviewed to identify potential inconsistencies and used to further check individual values. For example, frequency distributions can be examined for the larger collections. Often these procedures can identify errors (wrong units of measure, transcription errors) or identify special conditions (results from fertilizer or irrigation trials).
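
As an illustration of such a collection-level check, the sketch below flags values that sit far from the bulk of the frequency distribution. It is one possible approach, not a TCO procedure: it assumes strictly positive values, works on a log10 scale because unit errors typically appear as order-of-magnitude shifts, and uses a robust z-score based on the median absolute deviation with an illustrative threshold of 3.5. The function name and sample values are hypothetical.

```python
import math
import statistics

def flag_suspect_values(values, threshold=3.5):
    """Return (index, value) pairs that lie far from the collection median.

    Values must be strictly positive. The score is the robust z-score
    0.6745 * (x - median) / MAD computed on log10-transformed values.
    """
    logs = [math.log10(v) for v in values]
    med = statistics.median(logs)
    mad = statistics.median([abs(x - med) for x in logs])
    if mad == 0:
        return []  # no spread in the collection, nothing can be ranked
    flagged = []
    for i, x in enumerate(logs):
        robust_z = 0.6745 * (x - med) / mad
        if abs(robust_z) > threshold:
            flagged.append((i, values[i]))
    return flagged

# Hypothetical NPP collection in g m-2 yr-1; one entry was left in t ha-1 yr-1.
npp = [620, 710, 540, 880, 450, 6.4, 790, 930, 580]
print(flag_suspect_values(npp))  # [(5, 6.4)]
```

A flagged value is only a candidate for review. It may be a unit or transcription error, or it may reflect a special condition such as a fertilizer or irrigation trial, so the decision should rest on the documentation.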

Activities typically associated with accessing data sets (for data product assembly originating from several sources) include:

review metadata;
read data using format specified in the metadata, verify that it is correct;
verify correct data transfer;
review QA/QC results based on the documentation in the metadata;
request or assemble documentation;
compile/update site data (lat/long, etc.);
review data (x-y plots, range checks, etc.);
convert to standard units of measure;
harmonize, e.g. express LAI as half of two-sided leaf area;
perform quality assessment/outlier analysis, such as: range checks, x-y plots; comparisons to model ensemble (e.g. comparing NPP to a simple model such as the Miami model);
reformat to ASCII, tabular form;
perform gap filling - multiple methods;
provide data set to the contributor for review.
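
Three of the steps above (conversion to standard units, range checks, and comparison with a simple model) can be sketched as follows. The Miami model equations are those of Lieth (1975), giving NPP in g dry matter m-2 yr-1 from mean annual temperature and annual precipitation. Everything else is an assumption made for illustration: the record layout, the unit labels, the valid range, and the ratio used to flag disagreement with the model.

```python
import math

# Factors converting reported NPP to the standard unit g m-2 yr-1 (dry matter).
TO_G_M2_YR = {"g/m2/yr": 1.0, "kg/m2/yr": 1000.0, "t/ha/yr": 100.0}

def to_standard_units(value, unit):
    """Convert an NPP value to g m-2 yr-1; raises KeyError for unknown units."""
    return value * TO_G_M2_YR[unit]

def miami_npp(mean_temp_c, annual_precip_mm):
    """Miami model: NPP is the smaller of a temperature and a precipitation limit."""
    npp_temp = 3000.0 / (1.0 + math.exp(1.315 - 0.119 * mean_temp_c))
    npp_precip = 3000.0 * (1.0 - math.exp(-0.000664 * annual_precip_mm))
    return min(npp_temp, npp_precip)

def review_record(record, valid_range=(0.0, 4000.0), max_ratio=3.0):
    """Return the harmonized NPP and a list of quality flags for one record.

    valid_range and max_ratio are illustrative thresholds, not agreed standards.
    """
    npp = to_standard_units(record["npp"], record["unit"])
    flags = []
    if not valid_range[0] <= npp <= valid_range[1]:
        flags.append("out_of_range")
    expected = miami_npp(record["temp_c"], record["precip_mm"])
    if expected > 0:
        ratio = npp / expected
        if ratio > max_ratio or ratio < 1.0 / max_ratio:
            flags.append("differs_from_miami")
    return npp, flags

# Hypothetical records: the second carries a mislabelled unit.
good = {"npp": 6.4, "unit": "t/ha/yr", "temp_c": 10.0, "precip_mm": 800.0}
bad = {"npp": 640.0, "unit": "kg/m2/yr", "temp_c": 10.0, "precip_mm": 800.0}
print(review_record(good))
print(review_record(bad))
```

Flags of this kind would normally be stored with the record and returned to the contributor for review rather than used to discard data automatically.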

Estimates of pools and fluxes of carbon will often be derived from traditional measurements of commercial forest products and crop and pasture yields. There is a considerable literature available for estimating tree biomass from tree diameters using site-based allometric equations or estimating plant biomass from grain harvest. However, a critical task is assembling and reviewing these transformation formulas so that the conversions are done appropriately and accurately (see also chapter 3).
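
A minimal sketch of one such transformation is given below, assuming the widely used log-linear allometric form ln(B) = a + b ln(D), where B is above-ground dry biomass (kg) and D is stem diameter (cm). The coefficients a and b are site- and species-specific and must come from the literature; the values in the example are placeholders, and the carbon fraction of 0.5 is a common approximation adopted here as an assumption.

```python
import math

def tree_biomass_kg(dbh_cm, a, b):
    """Above-ground dry biomass (kg) of one tree from ln(B) = a + b * ln(D)."""
    return math.exp(a + b * math.log(dbh_cm))

def plot_carbon_t_per_ha(dbh_list_cm, plot_area_m2, a, b, carbon_fraction=0.5):
    """Carbon stock (t C ha-1) of a plot from the diameters of its trees.

    carbon_fraction converts dry biomass to carbon; 0.5 is an assumed default.
    """
    biomass_kg = sum(tree_biomass_kg(d, a, b) for d in dbh_list_cm)
    carbon_kg = biomass_kg * carbon_fraction
    # kg per plot -> tonnes per plot -> tonnes per hectare (10 000 m2)
    return carbon_kg / 1000.0 * (10000.0 / plot_area_m2)

# Placeholder coefficients and a hypothetical 500 m2 plot with four trees.
print(plot_carbon_t_per_ha([12.0, 18.5, 25.0, 31.0], 500.0, a=-2.0, b=2.4))
```

Keeping the equation form, coefficients, their source, and the valid diameter range together with each converted value makes it possible to review and, if necessary, redo the conversion later.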

Recommendations

There is a need to compile specific, harmonized in situ data sets from disparate sources. In many cases, this will require specific meetings and projects to identify the most effective approaches. The following list of priority topics and potential meetings is recommended:

define axes of an analytical framework for an analysis of the 'observation space';
design a validation framework (through a workshop);
support regional campaigns and projects aiming to integrate several components of the carbon budget (through a symposium);
prepare field manual(s) suitable for TCO and similar applications. Possible subject areas include micrometeorological instrumentation, soil and plant chemistry, biomass and NPP, LAI, eddy flux, and trace gas measurements;
address scale, uncertainty, preparation of gridded products from inventories, and geostatistical methods in relation to TCO needs (using a team of experts);
convert forest and crop inventories and yield data into carbon estimates, including approaches to using data from different countries derived using various measurement protocols (workshop);
develop and use common definitions;
focus on lessons from existing projects on how to merge data, how to build "pools of information" based on in situ data and scale to regional, global levels (could be a series of workshops);
update global estimates of carbon pools and fluxes from inventories and other in situ data including evaluation of uncertainty, error analysis, etc. (through regional workshops).

In addressing the above topics, TCO should continue to coordinate and collaborate with related carbon cycle projects or programmes. This will allow many of the common data and processing issues to be addressed in a concerted way. International organizations (CEOS, FAO, GTOS, IGBP, UNEP) should be approached to organize and support meetings as appropriate.

Accomplishing the overall TCO goal of data synthesis for globally consistent data products is an ambitious task. A useful step will be conducting one or more prototype exercises to develop the TCO approach and demonstrate its effectiveness. The prototype results will help to refine the requests for additional data and to have the logistics in place to be able to process the larger data collections.

Data fusion techniques would be used to merge data and to build "pools of in situ information" for scaling to regional and global levels. Several regional studies are already developing complete carbon budgets for large regions including the Amazon, Australia, Canada, Siberia, and western Europe.

Biomass and primary productivity are among the topics appropriate for prototype efforts. Estimates of global biomass and productivity have been previously compiled based on in situ measurements, beginning with Lieth (1975). Updates of his estimates by Ajtay et al. (1979) and Olson et al. (1983) were made by assigning average values to biome types and multiplying these by the biome area.
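
The biome-average approach described above amounts to a weighted sum, sketched below. The biome names, densities, and areas in the example are placeholders chosen only to show the arithmetic; they are not the values published by Lieth, Ajtay et al., or Olson et al.

```python
def global_total_pg(biome_table):
    """Multiply mean carbon density by area for each biome and sum the results.

    biome_table maps a biome name to (density in kg C m-2, area in 10^6 km2).
    Because 1 kg m-2 over 10^6 km2 equals 10^12 kg, i.e. 1 Pg, the product of
    the two numbers is already expressed in Pg C.
    """
    contributions = {name: density * area
                     for name, (density, area) in biome_table.items()}
    return sum(contributions.values()), contributions

# Placeholder values for illustration only.
table = {
    "biome A": (10.0, 2.0),
    "biome B": (1.0, 5.0),
}
total, parts = global_total_pg(table)
print(total)  # 25.0
print(parts)  # {'biome A': 20.0, 'biome B': 5.0}
```

The simplicity of the calculation also shows its main weakness: a single mean value hides the variability within each biome, which is why the total carries no measure of uncertainty.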

TCO can rapidly improve the current estimates of global biomass and soil carbon by using more sophisticated ways of aggregating and analyzing the existing data. These products could also incorporate uncertainty measures (levels of accuracy) to provide better estimates for global values. This would result in an early baseline to compare with the TCO products to be produced in 2005.

In addition to a prototype synthesis exercise, TCO can build on existing validation activities, including:

LAI - NASA-EOS/VALERI;
NEP/NPP - flux towers;
GT-NET NPP demonstration project (~25 sites).

Researchers synthesizing a thematic data set (such as LAI or NPP) from the literature or using data from various sources will face the challenge of using measurements that were often collected with various sampling methods and changing instruments, expressed in differing units of measure, and assigned different names.

The goal of harmonization is to make sure, as far as possible, that the analysis and interpretation will be unbiased in terms of comparability, completeness, and representativeness of the data. The underlying assumption is that the patterns and signals within the data will be stronger than differences in methods or the lack of an a priori sampling design.

Often the differences between methods appropriate for different plant forms must be considered instead of specifying a standard method for all data. Harmonization often requires good judgement when documentation is inadequate. However, as larger sets of data are compiled with associated environmental driver data, there are a variety of approaches that can be used to identify potential outliers. Researchers associated with the Long-Term Ecological Research (LTER) network have developed standard methods for soil sampling and decomposition studies, and are in the process of documenting standard methods for measuring NPP.

To improve the compilation of consistent new global data sets, TCO should either develop a field manual or provide suitable references (preferably in electronic format through the web page) that are available for all to review and use. The manual should describe the most desirable (and alternative) approaches for experimental design, sampling frequency, and for field and laboratory measurement methods.

While uniformity of in situ data acquisition methods may not be an achievable goal (at least in the near term), a readily accessible and promoted field manual would lead to significant progress. Alternatively, TCO could require all participating in situ groups to submit a brief description of procedures employed to a coordination panel for review/suggestions. As a guideline, TCO should be pro-active and identify the most desirable methods, but also be realistic and allow for alternatives.

Guidelines for data and information systems and services

IGOS Partners defined a set of Data and Information Systems and Services (DISS) principles to be followed in the implementation of the individual themes (Appendix 4). Ultimately, national governments and agencies are the main executors or sponsors of data collection initiatives and the associated tasks of establishing, maintaining, validating, describing and archiving data. However, from the international perspective of a globally consistent, systematic observation programme, the key issues among the IGOS-P DISS guidelines are:

commitment to data management systems and services at national and international level;
longevity of data - commitment to long-term provision;
accessibility - ideally full and open;
archiving - original data must be preserved, archived and easily accessible;
documentation - data should be accompanied by documentation to a set standard and with sufficient metadata to allow new users to understand and use data effectively.

Commitment to data management systems and services

International programmes are usually able to sign up to IGOS-type principles but are not in a position to actually implement these for national data sources. Space agencies and international bodies are generally receptive to open data access for TCO-type initiatives. However, at the national government level this is more difficult as there may be conflicting interests between international and national needs.

A national response depends on governments (ministry-level) of individual countries, which could have requirements opposite to international needs and policy. IGOS-P is an unlikely facilitator of the commitment process since its members do not include important holders of in situ data. Such data is typically controlled by national agricultural, forestry, and natural resource departments or other contributing agencies.

Success is more likely to be achieved through the definition of procedures that are consistent with policies at a ministry level (i.e. data holder) rather than a government level. However, given the multitude of national agencies with relevant data this approach will not be generally applicable. The chances of success will increase if key individuals are identified who can respond to requests for data. An alternative/complementary approach (to government or ministry-level contacts) may therefore be working with national scientific teams to deal with issues of access but also parameter definitions, documentation, etc.

The recommended approach is also to focus on a manifesto emphasizing the current state and the urgent need to address the degradation of networks in terms of data quality and quantity and the willingness/ability to continue the current observations. The following steps could be taken:

1. TCO and its data requirements should be brought to the attention of important decision makers. Letters should be sent with an explanation of the initiative, its relation to the Conventions, and the related responsibilities under the Conventions. These letters should include a brief but clear summary of TCO objectives and activities. There is also an urgent need to raise the profile of GTOS/TCO so that individuals will respond positively to TCO initiatives. The letters would be sent to:

national governments (national points of contact for UNFCCC);

international organizations;

key influential national agencies/institutions related to IGOS-P and TCO.

2. IGOS-P and its members need to work towards increasing TCO visibility at key international forums, particularly putting an item on the agenda for Rio+10.
3. In the longer term there is a need to ensure high visibility (beyond the science community), e.g. develop an annual communication plan.

Longevity of data - commitment to long-term provision

The underlying requirement behind TCO is to expand the current knowledge of the carbon cycle. In obtaining a commitment to long-term data provision, there is a need to clearly specify the exact nature of the commitment. For example, TCO requires more than just the provision of highly summarized data which is already available under national reporting. Furthermore, the target audience for the various TCO products should be clearly identified at the national level.

Achieving a long-term commitment will depend on making the country-level request appealing to scientists as well as to political and social audiences. This means highlighting:

additional value of many data products required for carbon, e.g. NPP/NEP are key data for a country's resource management;
contribution of TCO to national level UNFCCC reporting;
annual variability of carbon sources and sinks - hence the need to establish a long-term system;
value of a harmonized system which would allow detailed inter-country comparison.

Accessibility - fully open sharing and exchange of data for all users in a timely fashion

Complete access to data is a highly desirable goal, and one that can be realized for many data types and sources. However, experience suggests that it is unlikely to be achieved for national inventory-type data that may be considered to have economic or otherwise strategic value. In this case, access to derived data products is more likely. To be of value, the derived products must have clear documentation and multiple levels of data access (processed data sets, original data). Arrangements are needed to ensure that incentives for data exchange are put in place (also refer to Chapter 5). The involvement of experts from the inventory agencies is a key pre-requisite in this process. In general, issues of data availability and security and the mechanisms for data conservation will require attention.

Archiving - original data must be preserved, archived and easily accessible

Derived products will be more easily obtained than national inventories (see above).

To be of value to TCO the data should be above a certain minimum quality standard. In practice, a single standard is unlikely to be feasible because of (among other reasons):

subjectivity in definitions of 'quality';
the value of data below a fixed standard depends on the availability of other sources and therefore has both a geographical and a temporal aspect;
in terms of accuracy, models used within TCO may have different input data requirements;
impact on the output product error is not the same for all input data types.

One approach to dealing with these factors would be to select desirable and minimum levels of standard for TCO, using existing mechanisms or consensus practices wherever possible. The involvement of modellers and experts on the data products is essential.

TCO's initial target requirement is an error budget lower than 30 percent with higher accuracy at the regional level. To define accuracy requirements for input data products the error budget should be analysed at different levels of modelling using sensitivity tests and then worked back to define input data quality requirements. Since this has not yet been done comprehensively, the priority should be to provide high quality documentation for input data including, where possible: algorithm description, processing methods and assumptions, and accuracy definition and evaluation. This description should follow a common format (Data assembly issues, page 37).
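
As a first-order illustration of working back from an output error budget to input requirements, the sketch below assumes that the output is a product of independent inputs, so that relative errors combine in quadrature. This is a simplifying assumption, not a description of any TCO model, and it does not replace the sensitivity tests called for above. The function names and example values are hypothetical.

```python
import math

def combined_relative_error(relative_errors):
    """Relative error of a product of independent inputs (root sum of squares)."""
    return math.sqrt(sum(e * e for e in relative_errors))

def allowable_error(target, other_errors):
    """Largest relative error one further input may have within the target budget.

    Returns 0.0 when the other inputs already use up the whole budget.
    """
    remaining = target ** 2 - sum(e * e for e in other_errors)
    return math.sqrt(remaining) if remaining > 0 else 0.0

# Hypothetical case: two inputs with 15 and 20 percent relative error.
print(combined_relative_error([0.15, 0.20]))    # about 0.25
print(allowable_error(0.30, [0.15, 0.20]))      # about 0.166
```

In this hypothetical case a third input could carry a relative error of roughly 17 percent before the 30 percent budget is exceeded. Correlated errors or strongly non-linear models would require a full sensitivity analysis instead.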

The maintenance of, and access to, data both require data documentation (metadata) and an adequate search method. Two options are available for the location of data products:

Central archive: likely to have both funding and political problems given the international scope and volumes of data involved;
GTOS-TCO structure: in cooperation with IGBP core projects and comprising a dedicated office to coordinate the data system. The office will be responsible for the generation and management of metadata, interacting with data holders and providers and handling user feedback (see below).

Other data management items

The following IGOS-P DISS items (Appendix 4) were also briefly discussed.

Archive maintenance

All data should be archived and maintained to allow products to be recalculated. However, it is recognized that this creates a significant overhead and data should therefore have a reasonable expectation of expiry. Nevertheless, the underlying principle of data preservation and the '20-year rule' (National Research Council, 1991) should be respected.

Accessibility

Recommendations on data structure and computer directory nomenclature are required, particularly if access is to be made from web map servers.

Internationally-agreed standards

These need to be adopted or developed as necessary.

Monitoring of information access and system performance

The DISS office should have this responsibility but will need funding.

Internal consistency

This is an aspect of metadatabase/template construction. The responsibility for such a task should lie with the data provider, although TCO should adopt the needed guidelines. In addition, it is important to make clear that reanalysis products are provided to the programme free of charge; reanalysis should be considered as part of the product provision.

Feedback

Feedback is important in assessing whether the data are of scientific and policy use. This should be a two-way process between TCO and the data providers. It will also show the data providers the benefits of contributing their data.

The mechanisms should include:

simple acknowledgement of use (all data users);
statements back to contributing bodies especially individual countries describing what the data were used for and the impact, on progress (value), of the data provision (citations generated, joint work, information on scientific papers, national reports, policy implications, etc.);
circulation of newsletters containing the results and statistics demonstrating the success of the programme.

In situ data for satellite-derived products

For TCO, in situ data play an important role in the preparation of products derived from satellite measurements, in three ways:

to validate biophysical parameter products (e.g. biomass, LAI);
to be used as an input in advanced products generation (e.g. land cover maps);
to support instrument calibration as a precursor to the above two uses.

The types of in situ data and the locations where these should be obtained will differ for different applications. Precise instrument calibration is an indispensable first step; otherwise, there can be no confidence in the quality of subsequent satellite-derived products. Correct characterization of sensors and spectral radiance is also needed when data from different sensors are to be effectively combined (e.g. in a time series). Responsibility for instrument calibration lies with the space agencies and considerable improvements in this domain have been made over the last few years.

Calibration is achieved by using in situ measurements at well characterized and (usually) homogeneous test sites. Supporting data from aircraft flights may also be used. If the in situ and aircraft measurements can be related to the standards used to establish the pre-launch instrument calibration, then sensor performance can be very well characterized. However, to maintain confidence in the data, such in situ measurements and underpinning flights need to be repeated throughout the lifetime of the spaceborne mission.

The 20th Conférence Générale des Poids et Mesures (CGPM, the international body responsible for the international system of units, SI) concluded that "those responsible for studies of Earth resources, the environment, human well-being and related issues need to ensure that measurements made within their programmes are in terms of well-characterized SI units so that they are reliable in the long-term, are comparable world-wide and are linked to other areas of science and technology through the world's measurement system established and maintained under the Convention du Mètre." There is of course considerable cost to the space agencies in maintaining their instrument calibration efforts to these demanding standards.

If properly calibrated data are available, then algorithm theoretical basis documents (ATBDs) describing the relationship between biophysical products and measurements of spectral radiance can include details on the errors or uncertainty budgets. In situ biophysical measurements are used to develop the inversion algorithms yielding biophysical parameter fields from satellite data, and they may subsequently be employed to confirm the expected uncertainty budgets described in the ATBDs.

The performance of algorithms used to create the biophysical parameters should be periodically validated to increase user confidence. More importantly, the in situ data are needed as a check on the impact of imperfect radiometric corrections, particularly for passive optical satellite measurements. To support these uses, the in situ measurement protocols need to be standardized and rigorous intercomparisons of the algorithms need to be performed.

While the institutional responsibility for the instrument calibration work rests with the space agencies, the responsibility for the validation of the biophysical products is less well defined (and presently evolving). A critical factor has been the acceptance by space agencies of the responsibility for products derived from satellite data, not just the raw data themselves.

This acceptance was demonstrated by the establishment and support of the CEOS Working Group on Calibration and Validation (WGCV, www.wgcvceos.org/), and more recently its subgroup on Land Products Validation (LPV, http://modarch.gsfc.nasa.gov/MODIS/LAND/VAL/CEOS_WGCV/). LPV has developed an action plan that focuses initially on LAI and fire products but also includes the development of protocols for land cover and other products. With these mechanisms in place, effective product validation procedures may be developed through international collaboration. Importantly, these mechanisms need to include cross-validation among products generated by various space agencies for comparable biophysical variables.

To implement these procedures, cooperation at the national level will be critical, involving sites in various countries and data sharing with space agencies. TCO and similar programmes can assist this process by providing the context and justification for the data collection and sharing efforts.

For advanced products, notably land cover, the validation process depends less on rigorous instrument calibration (although it will also benefit) but is far more demanding in terms of in situ observations. For the validation of each land cover map produced, data need to be collected at the same time at a number of sites using specific statistical criteria. Such sites are not permanent, but are intended to yield information from a period concurrent with the satellite observations used to create the land cover maps.

Depending on the application, the collected data may be used both in the generation of the product and its validation. The collection of in situ observations inevitably involves institutional arrangements among projects, agencies, or countries. Since the observations need to be concurrent with the dates of the data used to create the map, and because the sites are selected according to various statistical rules, there is little opportunity to re-use sites; they are almost always product-specific. This means that the costs associated with a land cover map validation must be taken into account at the start of such a mapping exercise. There are strong advantages in data collection and sharing at the heart of collaborative ventures such as the CEOS LPV.

Priority needs and recommendations

TCO should strongly support the calibration of satellite data and urge agencies to continue to improve the calibration and accuracy of their instruments;
TCO should endorse and promote initiatives aimed at testing and validating satellite-derived products, such as those carried out under CEOS LPV.