

3. Data Quality Assurance


The importance of data quality to ecological research cannot be overstated. Data contamination occurs when a process or phenomenon other than the one of interest affects a variable's value. Prevention through quality control is the first step in eliminating data contamination and is far preferable to 'cure'. Prevention is primarily a data management issue, not a statistical one: many of the quality problems encountered stem from how data sets are constructed and managed.

3.1 Preventing Data Contamination

Sources of data contamination due to data entry errors can be eliminated or greatly reduced by using quality control techniques. One very effective strategy is to have the data independently keyed in by two technicians and then computer-verified for agreement. This practice is commonplace in professional data entry services and in some service industries.
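A minimal Python sketch of the computer-verification step is given below; the file names, key field and column layout are assumptions made for illustration and are not part of the original procedure.

import csv

def verify_double_entry(path_a, path_b, key_field):
    """Compare two independently keyed CSV files and report disagreements.

    path_a, path_b : files produced by the two data-entry technicians
    key_field      : column that uniquely identifies each record
    """
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a = {row[key_field]: row for row in csv.DictReader(fa)}
        rows_b = {row[key_field]: row for row in csv.DictReader(fb)}

    mismatches = []
    for key in sorted(set(rows_a) | set(rows_b)):
        a, b = rows_a.get(key), rows_b.get(key)
        if a is None or b is None:
            mismatches.append((key, "record present in only one file"))
        else:
            for field in a:
                if a[field] != b.get(field):
                    mismatches.append((key, f"{field}: '{a[field]}' vs '{b.get(field)}'"))
    return mismatches

# Hypothetical usage: both technicians keyed the same field sheets.
for key, detail in verify_double_entry("entry_tech1.csv", "entry_tech2.csv", "sample_id"):
    print(key, detail)

Any disagreements flagged in this way can then be resolved against the original data sheets.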

3.2 Illegal Data Filters

Illegal data are variable values, or combinations of values, that are literally impossible given the actual phenomenon observed. A simple and widely used technique for detecting this type of contamination is an illegal data filter. This is a computer programme that checks a data set against a 'laundry list' of variable value constraints and then creates an output listing the identity and details of each violation. The filter programme can be updated and/or enhanced to detect new types of illegal data that were not anticipated earlier in the study.
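The sketch below shows one minimal way such a filter might be written in Python; the variable names, constraints and limits are purely illustrative assumptions, not part of the original text.

def illegal_data_filter(records, constraints):
    """Check each record against a 'laundry list' of constraints and
    report the identity and details of every violation.

    records     : iterable of dicts, one per observation
    constraints : list of (description, predicate) pairs; the predicate
                  returns True when the record is legal
    """
    violations = []
    for i, rec in enumerate(records):
        for description, is_legal in constraints:
            if not is_legal(rec):
                violations.append((i, rec.get("sample_id"), description))
    return violations

# Hypothetical constraints for a water-quality data set (illustrative only).
constraints = [
    ("pH must lie between 0 and 14",        lambda r: 0 <= r["pH"] <= 14),
    ("dissolved oxygen cannot be negative", lambda r: r["DO_mg_l"] >= 0),
    ("sample depth cannot exceed station depth",
     lambda r: r["sample_depth_m"] <= r["station_depth_m"]),
]

records = [
    {"sample_id": "S1", "pH": 7.2,  "DO_mg_l": 8.1,  "sample_depth_m": 2.0, "station_depth_m": 5.0},
    {"sample_id": "S2", "pH": 15.3, "DO_mg_l": -0.4, "sample_depth_m": 6.0, "station_depth_m": 5.0},
]

for index, sample_id, problem in illegal_data_filter(records, constraints):
    print(f"record {index} ({sample_id}): {problem}")

New constraints can simply be appended to the list as unanticipated kinds of illegal data are discovered, mirroring the updating of the filter programme described above.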

3.3 Outlier Detection

An outlier is an unusually extreme value for a variable, given the statistical model in use. What is meant by 'unusually extreme' is a matter of opinion, but the operative word here is 'unusual'. In fact, some extreme values are to be expected in any data set.

Outlier detection is part of the process of checking the assumptions of the statistical model (a process that should be integral to any formal data analysis).

Elimination of outliers should not be a goal of data quality assurance. Many ecological phenomena naturally produce extreme values, and to eliminate these values simply because they are extreme is equivalent to pretending the phenomenon is ‘well-behaved’ when it is not.

3.4 Checking Test Assumptions with Normal Probability Plots

Before outliers can be detected in contaminated data, the distribution of uncontaminated data sets of a given size must be known. This is usually achieved by assuming that uncontaminated measurements follow a given probability distribution, most often the normal (or Gaussian) distribution. Most outlier tests also assume that the measurements of interest (the 'errors' in a regression or ANOVA model) follow a normal distribution. An old means of checking normality, one that has gained renewed popularity in the computer age, is the normal probability plot.

An idealized histogram of normally distributed data produces a bell-shaped curve. If, at each potential variable value (Y), the cumulative area under the bell curve to the left of that value is calculated, and these cumulative areas are plotted against Y, a sigmoid curve is obtained. These cumulative areas are in fact the values provided in normal probability tables. A normal probability plot essentially re-spaces the vertical axis so that points following this particular sigmoid curve, when plotted against the re-spaced axis, fall on a straight line. If sample points deviate substantially from a straight line when plotted in this way, they come from a non-normal distribution.
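As a simple illustration, the following Python sketch draws a normal probability plot using the probplot routine from scipy together with matplotlib; the simulated sample is an assumption standing in for real field measurements.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated measurements standing in for real field data (illustrative only).
rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=50)

# probplot orders the sample and plots it against the quantiles expected
# under a normal distribution; points near the straight line are consistent
# with normality, systematic curvature suggests a non-normal distribution.
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Normal probability plot")
plt.show()

Points lying close to the fitted straight line are consistent with normality; systematic curvature, or outlying points in the tails, suggest a non-normal distribution or possible contamination.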

3.5 A Formal Outlier Test: Grubbs’ Test

One of the oldest and most widely used procedures for detecting contamination in a sample is Grubbs' test. This test assumes that, once contamination is removed, the data will follow a normal distribution. The test is very sensitive to this assumption and should therefore not be used if the 'cleaned' data are known not to follow a normal distribution.
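A minimal sketch of a two-sided Grubbs' test for a single outlier is given below, assuming the usual t-distribution-based critical value; the sample values are illustrative only.

import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier.

    Assumes the uncontaminated data are normally distributed; returns the
    most extreme value, the test statistic G, the critical value and a flag.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    g = np.max(np.abs(x - mean)) / sd                    # test statistic
    suspect = x[np.argmax(np.abs(x - mean))]
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)     # t critical value
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return suspect, g, g_crit, g > g_crit

# Illustrative sample with one suspiciously large value.
data = [4.2, 4.5, 3.9, 4.1, 4.4, 4.0, 9.8]
suspect, g, g_crit, is_outlier = grubbs_test(data)
print(f"suspect={suspect}, G={g:.2f}, critical={g_crit:.2f}, outlier={is_outlier}")

If the test flags a value, the observation should be investigated rather than deleted automatically, in line with the caution in section 3.3.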

3.6 Diagnostic Measures for Leverage Points and Outliers

Leverage points and outliers are influential data in multiple linear regression. In simple linear regression, formal diagnostic measures are almost unnecessary, since leverage points and outliers can usually be detected by eye in a plot of the response variable against the regressor. In multiple linear regression this is no longer true. In this case, diagnostic checks using leverage values and studentized residuals can help a data analyst find influential observations that are well hidden in scatter plots and other simple analysis tools.
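The sketch below, assuming the statsmodels library and a simulated data set, shows how leverage values and externally studentized residuals might be extracted and screened; the thresholds 2p/n and 2 are conventional rules of thumb, not part of the original text.

import numpy as np
import statsmodels.api as sm

# Simulated data standing in for a real multiple regression (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 2.0 + X @ np.array([1.5, -0.8, 0.4]) + rng.normal(scale=0.5, size=40)
X[5] = [4.0, 4.0, 4.0]          # plant a high-leverage point
y[12] += 5.0                    # plant an outlying response

results = sm.OLS(y, sm.add_constant(X)).fit()
influence = results.get_influence()

leverage = influence.hat_matrix_diag               # diagonal of the hat matrix
student = influence.resid_studentized_external     # externally studentized residuals

n, p = X.shape[0], X.shape[1] + 1
for i in range(n):
    # common rules of thumb: leverage > 2p/n, |studentized residual| > 2
    if leverage[i] > 2 * p / n or abs(student[i]) > 2:
        print(f"observation {i}: leverage={leverage[i]:.2f}, "
              f"studentized residual={student[i]:.2f}")

Observations flagged in this way warrant closer inspection; as with univariate outliers, being flagged is not in itself grounds for deletion.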

Taken together, these techniques address both the prevention and the detection of contamination, in individual samples and in regression models.

Grubbs' test can be adapted to meet the requirements of repeated small samples, as would often be the case in water quality studies. This can be achieved by using a pooled variance estimator over several samples. Very large data sets are becoming increasingly commonplace and will require either new quality assurance methods or the creative adaptation of existing ones.
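The short Python sketch below illustrates only the pooled variance estimator mentioned above, pooling the within-sample variances of several small samples to obtain a more stable scale estimate; the monthly values are hypothetical, and a full adaptation of Grubbs' test would use this pooled estimate in place of the single-sample standard deviation.

import numpy as np

def pooled_std(samples):
    """Pooled standard deviation over several small samples.

    samples : list of 1-D arrays, e.g. repeated monthly water-quality samples.
    Pooling the within-sample variances gives a more stable scale estimate
    than any single small sample can provide.
    """
    ss = sum((len(s) - 1) * np.var(s, ddof=1) for s in samples)
    df = sum(len(s) - 1 for s in samples)
    return np.sqrt(ss / df)

# Hypothetical repeated small samples (e.g. monthly pH readings).
monthly = [np.array([6.9, 7.1, 7.0]),
           np.array([7.3, 7.2, 7.4]),
           np.array([6.8, 7.0, 6.9])]
print(f"pooled SD = {pooled_std(monthly):.3f}")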

