^{1} With contributions from J. L. Teixeira, Instituto Superior de Agronomia, Lisbon, Portugal.
COMPLETING A DATA SET
Quite often data sets containing a weather variable Y_{i} observed at a given station are incomplete due to short interruptions in observations. Interruptions can be due to a large number of causes, the most frequent being the breakage or malfunction of instruments during a certain time period. When data are missing, it may be appropriate to complete these data sets from observations X_{i} from another nearby and reliable station. However, to use portions of data set X_{i} to replace data set Y_{i}, both data sets X_{i} and Y_{i} must be homogeneous. In other words, they need to represent the same conditions. The procedure for completing data sets is applied after the test for homogeneity and any needed correction for nonhomogeneity has been performed. The substitution procedure proposed herein consists of using an appropriate regression analysis.
The procedure for substituting nearby data into an incomplete data set can be summarized as follows:
1. Select a nearby weather station for which the data set length covers all periods for which data are missing.
2. Characterize the data sets from the nearby station, X_{i}, and from the station having missing data, Y_{i}, by computing the mean \bar{x} and the standard deviation s_{x} for the data set X_{i}:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i} (4-1)
s_{x} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_{i} - \bar{x})^{2}} (4-2)
and the mean \bar{y} and standard deviation s_{y} for data set Y_{i}:
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_{i} (4-3)
s_{y} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_{i} - \bar{y})^{2}} (4-4)
for the periods when the data in both data sets are present, where x_{i} and y_{i} are individual observations from data sets X_{i} and Y_{i}, and n is the number of observations in each set.
3. Perform a regression of y on x for the periods when the data in both data sets are present:
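\hat{y}_{i} = a + b x_{i} (4-5)
with
b = \frac{cov_{xy}}{s_{x}^{2}} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}} (4-6)
a = \bar{y} - b \bar{x} (4-7)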
where a and b are empirical regression constants, and cov_{xy} is the covariance between X_{i} and Y_{i}. Plot all points x_{i} and y_{i} and the regression line for the range of observed values. If deviations from the regression line increase as y increases then substitution is not recommended because this indicates that the two sites have a different behaviour relative to the particular weather variable, and they may not be homogeneous. Another nearby station should be selected.
4. Compute the correlation coefficient r:
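r = \frac{cov_{xy}}{s_{x} s_{y}} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2} \sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}} (4-8)
Both a high r^{2} (r^{2} ≥ 0.7) and a value for b within the range 0.7 ≤ b ≤ 1.3 indicate good conditions and perhaps sufficient homogeneity for replacing missing data in the incomplete data series. These parameters r^{2} and b can be used as criteria for selecting the best nearby station.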
5. Compute the data for the missing periods k = n + 1, n + 2,..., m using the regression equation characterized by the parameters a and b (equations 4-6 and 4-7), thus
\hat{y}_{k} = a + b x_{k} (4-9)
6. The complete data set with dimension m will now be
Y_{j} = y_{i} (j = i = 1,..., n) (4-10)
Y_{j} = \hat{y}_{k} (j = k = n + 1, n + 2,..., m)
Note that estimates \hat{y}_{k} taken from the regression equations are useful for predicting evapotranspiration. However, they cannot be treated as random variables^{ (2)}.
^{2} To create random values, one can add to \hat{y}_{k} (equation 4-9) the residuals ε_{k} synthetically generated from a population N(0, s_{y, x}). The residuals are created using tables of random numbers. In that case the estimates Y_{j} can be treated as random variables.
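The substitution procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`fit_line`, `r_squared`, `fill_missing`) and the sample values are hypothetical, gaps are marked with `None`, and gaps may fall anywhere in the series rather than only at positions n + 1 to m.

```python
from statistics import mean


def fit_line(x, y):
    """Return intercept a and slope b of the regression of y on x (equations 4-5 to 4-7)."""
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def r_squared(x, y):
    """Coefficient of determination r^2 (square of equation 4-8)."""
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def fill_missing(x, y):
    """Fill gaps (None) in y using the regression on the complete reference series x."""
    # Fit only on the periods where both series have data (steps 2 to 4).
    paired = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs = [xi for xi, _ in paired]
    ys = [yi for _, yi in paired]
    a, b = fit_line(xs, ys)
    r2 = r_squared(xs, ys)
    # Replace each gap by the regression estimate (equation 4-9).
    filled = [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]
    return filled, a, b, r2


# Illustrative values only (not taken from the source).
x_ref = [1.0, 2.0, 3.0, 4.0, 5.0]
y_gap = [3.0, 5.0, None, 9.0, 11.0]
filled, a, b, r2 = fill_missing(x_ref, y_gap)
# Screening criteria of step 4: r^2 >= 0.7 and 0.7 <= b <= 1.3.
suitable = r2 >= 0.7 and 0.7 <= b <= 1.3
```

In this made-up example the fit is perfect but the slope is 2, so the screening of step 4 would flag the reference station as unsuitable even though the gap can be computed.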
ANALYSIS OF THE HOMOGENEITY OF DATA SERIES
Weather data collected at a given weather station during a period of several years may not be homogeneous, i.e., the data set representing a particular weather variable may present a sudden change in its mean and variance in relation to the original values. This phenomenon may occur due to several causes, some of which are related to changes in instrumentation and observation practices, and others which relate to modification of the environmental conditions of the site, such as rapid urbanization or, on the contrary, perhaps development of irrigation in the area.
Changes relative to data collection may be caused by:
· change in type of sensor or instrument;
· change in the observer and/or change in the timing of observations;
· "sleeping" data collector;
· deterioration of sensors, such as with some types of pyranometers and RH sensors, or malfunctioning of mechanical parts, such as with a tipping bucket rain gauge, or by an intermittently broken or shorted wire;
· aging of bearings on anemometers;
· use of incorrect calibration coefficients;
· variation in power supply or electronic behaviour of instruments;
· growth of trees or planting of tall crops or construction of buildings or fences near a raingauge, anemometer, or evaporation pan;
· change in the location of the weather station, or in the types of shelters for housing temperature and humidity sensors;
· change in the watering, type or maintenance of vegetation in the vicinity of the weather station;
· significant change in the watering or type of vegetation of the region surrounding the weather station.
These changes cause observations made prior to the change to belong to a statistically different population than data collected after the change. It is therefore necessary to apply appropriate techniques to evaluate whether a given data set can be considered to be homogeneous and, if not, to introduce the appropriate corrections. Correction requires identifying which sub-series is to be corrected, and that identification requires local information.
Procedures indicated herein are simple but are well proven in practice. They rely upon the statistical comparison of two data sets, one considered homogeneous and constituted by the observations X_{i}, the other being the one under analysis and constituted by the observations Y_{i} of the same weather variable (T_{max}, T_{min}, u_{2}, RH_{max}, etc.). Both sets X_{i} and Y_{i} should be collected at two stations that are in the same climatic region, i.e., X_{i} and Y_{i} should present the same trends in time despite the space variability when short time scales (daily, weekly or ten-day) are utilized.
The reference observations X_{i} are selected from a weather station for which the data set can be considered to be homogeneous.^{ (3)} The X_{i} data set should have the same time length of observations as the set of observations Y_{i}.
^{3} When, for a given climatic region, there is no information concerning the homogeneity of data, then the average of observations of the same variable from all stations (excluding the one in the analysis) can be used to constitute the homogeneous data set.
Method of Cumulative Residuals
When relating two weather data sets from two weather stations, where the first is considered to be homogeneous, the data set of the second station can be considered to be homogeneous if the cumulative residuals of the second data set from a regression line based on the first data set are not biased. The bias hypothesis can be tested for a given probability p. This is done by verifying whether the residuals can be contained within an ellipse that has axes α and β. The magnitudes of α and β depend on the size of the data set, on the standard deviation of the sample being tested and on the probability p used to test the hypothesis^{ (4)}.
^{4} This test utilizes results from residuals from the linear regression of Y on X. The residuals should follow a normal distribution with mean zero and standard deviation s_{y, x}, i.e. the error ε_{i} ∈ N(0, s_{y, x}). The residuals from the regression should be considered to be independent random variables (i.e., they should exhibit homoscedasticity).
The procedure for analysing the homogeneity of a weather data set Y_{i} collected in a given weather station environment can be summarized as follows:
1. Select a reference weather station inside the same climatic region that is known to have a homogeneous data set X_{i} of the same weather variable. As an alternative, construct a "regional" homogeneous data set by averaging the observations at several weather stations in the same region.
2. Organize both data sets x_{i} and y_{i} in chronological order i = 1, 2,..., n, where the starting time and time increment are identical for both data sets.
3. For both data sets, compute the mean and standard deviation (equations 4-1 to 4-4) for the homogeneous variable (x_{i}) and for the variable to be tested (y_{i}).
FIGURE 4.1. Regression between two sets of weather data, with the X data set being homogeneous. The example shows that the homoscedasticity condition was satisfied.
4. Calculate the regression line between the two variables y_{i} and x_{i} and the associated correlation coefficient (equations 4-5 to 4-8). The regression equation for the full sets is expressed as
\hat{y}_{i} = a_{f} + b_{f} x_{i} (4-11)
where the subscript f refers to the full set. Whenever possible, plot x_{i}, y_{i} and the regression line to visually verify whether the homoscedasticity hypothesis^{ (5)} can be accepted (see Figure 4.1)^{ (6)}.
^{5} The homoscedasticity hypothesis is accepted when the residuals ε_{i} of the dependent variable to the regression line (equation 4-5) can be considered to be independent random variables. This can be visually assessed when the deviations of y_{i} to the regression estimates \hat{y}_{i} are within the same range for all x_{i}, i.e., when these deviations are not increasing (or decreasing) with increasing values of x_{i}.
^{6} Data in this example were provided by J. L. Teixeira (personal communication, 1995).
5. Compute the residuals of the observed y_{i} values to the regression line (equation 4-5), the standard deviation s_{y, x} of the residuals and the corresponding cumulative residual E_{i}:
ε_{i} = y_{i} - \hat{y}_{i} (4-12)
s_{y, x} = s_{y} \sqrt{1 - r^{2}} (4-13)
E_{i} = ε_{i} + \sum_{j=1}^{i-1} ε_{j} (4-14)
6. Select a probability p for accepting the hypothesis of homogeneity. The value p = 80% is commonly utilized. Then compute the equation of the ellipse having axes
α = n/2 (4-15)
β = \frac{n}{\sqrt{n - 1}} z_{p} s_{y, x} (4-16)
where:
n size of the sample under analysis
z_{p} standard normal variate for the probability p (usually p = 80% for non-exceedance): Table 4.1
s_{y, x} standard deviation of the residuals of y (equation 4-13)
The parametric equation of the ellipse is then
X = α cos (θ) (4-17)
Y = β sin (θ)
with θ [rad] varying from 0 to 2π.
TABLE 4.1. Value of the standard normal variate z_{p} for selected probabilities p of non-exceedance
p (%) | z_{p} | p (%) | z_{p}
50 | 0.00 | 80 | 0.84
60 | 0.25 | 85 | 1.04
70 | 0.52 | 90 | 1.28
75 | 0.67 | 95 | 1.64
Note: given the symmetry of the normal distribution, the values for p < 50% correspond to (100 - p) but with the opposite sign. For example, p = 20% is associated with z = -z_{80} = -0.84.
It can therefore be concluded, at the level of probability p, that there is no bias in the distribution of residuals, i.e., the data set y_{i} is considered to be homogeneous, when the computed values for E_{i} fall inside the ellipse (equation 4-17).
7. Plot the cumulative residuals E_{i} against time using the time scale (interval) of the variable under analysis (Figure 4.2).
8. Draw the ellipse on the same plot and verify whether the E_{i} all lie inside the ellipse. If they do, then the hypothesis of homogeneity is accepted at the p level of confidence (Figure 4.4).
FIGURE 4.2. Plot of cumulative residuals against time and associated ellipse for the probability p = 80%, with results indicating that data set Y is not homogeneous (relative to data set X).
9. If the hypothesis of homogeneity cannot be accepted (this is the case in Figure 4.2), then one can select the break point where it appears that E_{i} ceases to increase (or to decrease) and begins to decrease (or to increase), for example at i = 16 in Figure 4.2. This break point is termed k = i.
10. The data set is now divided into two subsets, the first from 1 to k, the second from k + 1 to n. Then, new regression equations are computed between Y and X for both subsets. If we presume that the second subset is homogeneous but that the first is not, then we have
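\hat{y}_{i} = a_{h} + b_{h} x_{i} (i = k + 1, k + 2,..., n) (4-18)
and
\hat{y}_{i} = a_{nh} + b_{nh} x_{i} (i = 1, 2,..., k) (4-19)
where the subscripts h and nh identify the regression coefficients of the homogeneous and the non homogeneous subsets, respectively (see Figure 4.3).
11. Compute the differences between the two regression lines
Δ\hat{y}_{i} = (a_{h} - a_{nh}) + (b_{h} - b_{nh}) x_{i} (4-20)
for the non homogeneous set (i = 1, 2,..., k)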
FIGURE 4.3. The regression lines for the two subsets obtained from the data sets of Figures 4.1 and 4.2. Selection was made after definition of the break point in Figure 4.2.
FIGURE 4.4. Plot of cumulative residuals against time and the associated ellipse for p = 80% after correction of variable y.
12. Correct the non homogeneous subset portion of the data set
y_{c, i} = y_{i} + Δ\hat{y}_{i} (4-21)
where the subscript c identifies the corrected values. Thus, the corrected, homogeneous full set for weather variable Y is composed of
Y_{i} = y_{c, i} for i = 1, 2,..., k (4-22)
Y_{i} = y_{i} for i = k + 1, k + 2,..., n
A similar procedure would be utilized if it was presumed that the second sub-set requires correction, rather than the first sub-set.
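Steps 10 to 12 can be sketched as follows for the case where the second subset is presumed homogeneous. This is a minimal illustration, not the authors' implementation: the names `fit_line` and `correct_first_subset` and the sample values are hypothetical, and the break point k is assumed to have been chosen already from the plot of cumulative residuals.

```python
from statistics import mean


def fit_line(x, y):
    """Return intercept a and slope b of the regression of y on x."""
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_mean - b * x_mean, b


def correct_first_subset(x, y, k):
    """Correct observations 1..k of y, presuming observations k+1..n are homogeneous."""
    a_nh, b_nh = fit_line(x[:k], y[:k])   # non homogeneous subset
    a_h, b_h = fit_line(x[k:], y[k:])     # homogeneous subset
    corrected = []
    for xi, yi in zip(x[:k], y[:k]):
        # Difference between the two regression lines at x_i.
        delta = (a_h - a_nh) + (b_h - b_nh) * xi
        corrected.append(yi + delta)
    # Corrected first subset followed by the unchanged second subset.
    return corrected + list(y[k:])


# Illustrative values only: the first three observations sit 4 units too low.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [-1.0, 1.0, 3.0, 9.0, 11.0, 13.0]
y_corrected = correct_first_subset(x, y, k=3)
```

To correct the second subset instead, the roles of the two slices would simply be exchanged.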
Note that the variables Y_{i} are still considered to be random variables even though the mean and the variance have been modified by the correction. To confirm the results of the correction of data set Y for homogeneity, the homogeneity test methodology can be applied again to the corrected variable Y to provide evidence of homogeneity in the graph of residuals. This has been done in Figure 4.4.
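The test itself (steps 4 to 8) can be sketched numerically instead of graphically. The sketch below is hedged: the function name `cumulative_residuals_test` and the sample series are hypothetical, z_{p} is taken from Python's `statistics.NormalDist` rather than from Table 4.1, s_{y, x} is computed as s_{y} times the square root of (1 - r^{2}), and the ellipse is assumed to be centred at i = n/2 on the time axis and at zero on the residual axis.

```python
import math
from statistics import NormalDist, mean, stdev


def cumulative_residuals_test(x, y, p=0.80):
    """Return (E, alpha, beta, homogeneous) for series y tested against reference x."""
    n = len(x)
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    r2 = sxy ** 2 / (sxx * syy)
    # Standard deviation of the residuals.
    s_yx = stdev(y) * math.sqrt(max(0.0, 1.0 - r2))
    # Cumulative residuals E_i.
    E, running = [], 0.0
    for xi, yi in zip(x, y):
        running += yi - (a + b * xi)
        E.append(running)
    z_p = NormalDist().inv_cdf(p)            # about 0.84 for p = 0.80
    alpha = n / 2
    beta = n / math.sqrt(n - 1) * z_p * s_yx
    if beta == 0.0:
        # Perfect fit: every residual is zero, nothing to test.
        return E, alpha, beta, True
    # Assumption: point (i, E_i) is inside when it satisfies the ellipse inequality.
    inside = [
        ((i - alpha) / alpha) ** 2 + (e / beta) ** 2 <= 1.0 + 1e-9
        for i, e in enumerate(E, start=1)
    ]
    return E, alpha, beta, all(inside)


# Illustrative series: small alternating noise (no shift) ...
x1 = [float(i) for i in range(1, 11)]
y1 = [xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x1)]
_, _, _, ok1 = cumulative_residuals_test(x1, y1)

# ... and a series with a sudden shift of +4 after the tenth observation.
x2 = [1.0, 2.0] * 10
y2 = [xi + (0.0 if i < 10 else 4.0) for i, xi in enumerate(x2)]
E2, alpha2, beta2, ok2 = cumulative_residuals_test(x2, y2)
```

In the second series the cumulative residual reaches its extreme at the tenth observation, which is the kind of turning point step 9 uses to select k.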
In this example, it was presumed that the latter sub-set (k + 1 to n) was the correct (representative) data set, or the data set displaying the desired attributes. It was therefore presumed that prior to time k, the readings were biased by instrument calibration, different location of the station or the instrument within the station, change in type or manufacturer of the instrument, or change in general environment of the station. It appears in Figure 4.3 that the data prior to i = k were biased downward by approximately 100 mm of annual precipitation.
Double-Mass Technique
The double-mass technique is also useful for assessing homogeneity in a weather parameter. As with the method of cumulative residuals discussed in the last section, the double-mass technique requires data sets from two weather stations, where X_{i} (i = 1, 2,..., n) is a chronological data set for a given weather variable observed for a certain time length at a "reference" station, and which is considered to be homogeneous, and where Y_{i} is a data set of the same variable, with the same time length, observed at another station and for which homogeneity needs to be analysed.
In the double-mass technique, starting with the first observed pair of values X_{1} and Y_{1}, cumulative data sets are created by progressively summing values of X_{i} and Y_{i} to verify whether the long term trends in variation of X_{i} and Y_{i} are the same. Thus the following cumulative variables are obtained
x_{i} = X_{i} + \sum_{j=1}^{i-1} X_{j} (4-23)
and
y_{i} = Y_{i} + \sum_{j=1}^{i-1} Y_{j} (4-24)
with i = 1,..., n and j = 1,..., i - 1.
FIGURE 4.5. Double mass analysis applied to two series of precipitation when data from station Y are not homogeneous
These variables x_{i} and y_{i} are still considered to be random variables and are characterized by the mean and the standard deviation (equations 4-1 to 4-4). The y_{i} and x_{i} variables can be related through linear regression (equations 4-5 to 4-8). However, the double mass technique is typically applied as a graphical procedure.
The graphical application of the double-mass analysis is done by plotting all coordinate points x_{i} and y_{i}. The plot is then visually analysed to determine whether successive points of x_{i} and y_{i} follow a unique straight line, indicating the homogeneity of the data set Y_{i} relative to data set X_{i}. If there appears to be a break (or more than one break) in the plot of y_{i} to x_{i}, then there is a visual indication that the data series Y_{i} (or perhaps X_{i}) is not homogeneous (Figure 4.5). The break at coordinates x_{k} and y_{k} can be used to separate two subsets (i = 1, 2,..., k) and (k + 1, k + 2,..., n). One of the subsets is to be corrected. The appropriate one to correct needs to be identified by consulting the records of the weather station, when available.
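The cumulative variables of the double-mass technique are simple running sums, so they can be prepared for plotting as sketched below. The helper names `double_mass` and `slopes_around` and the sample values are hypothetical; comparing the mean slope before and after a candidate break point is only a rough numerical companion to the visual check, not part of the procedure described in the text.

```python
from itertools import accumulate


def double_mass(X, Y):
    """Cumulative sums x_i and y_i of the observations X_i and Y_i."""
    return list(accumulate(X)), list(accumulate(Y))


def slopes_around(x_cum, y_cum, k):
    """Mean slope of the double-mass curve up to point k and after it (k is 1-based)."""
    before = y_cum[k - 1] / x_cum[k - 1]
    after = (y_cum[-1] - y_cum[k - 1]) / (x_cum[-1] - x_cum[k - 1])
    return before, after


# Illustrative values only: station Y doubles its readings after the second period.
X = [10.0, 10.0, 10.0, 10.0]
Y = [5.0, 5.0, 10.0, 10.0]
x_cum, y_cum = double_mass(X, Y)
before, after = slopes_around(x_cum, y_cum, k=2)
```

A clear difference between the two slopes suggests a break near k, which should then be confirmed against the station records.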
FIGURE 4.6. Residuals of double mass to the straight line (equation 4-26) indicating the non homogeneity of the residuals of the series of precipitation of station Y.
Often, visual interpretation of the double-mass plot is difficult. Thus the following numerical regression procedure is recommended:
1. Compute the regression line through the origin for the full set of data x_{i} and y_{i}
\hat{y}_{i} = b x_{i} (4-25)
with
b = \frac{\sum_{i=1}^{n} x_{i} y_{i}}{\sum_{i=1}^{n} x_{i}^{2}}
2. Compute the residuals to the regression line
ε_{i} = y_{i} - b x_{i} (4-26)
3. Analyse the distribution of residuals. If the residuals plot as independent, random variables, then the set can be considered to be homogeneous. However, if the distribution of residuals is biased, with a change in trend at i = k, then the homogeneity hypothesis is rejected. The bias can be visually assessed by plotting (ε_{i}, i). The example in Figure 4.6 shows that residuals follow a trend of decreasing ε_{i} until i = k (= 16). Following that, the trend is to increase. This plot demonstrates a bias indicating that the data set Y is not homogeneous.
4. The break point at i = k defines two subsets (i = 1, 2,..., k) and (i = k +1, k+2,..., n). Using local information on data collection, the user must decide which subset requires correction.
5. When the first subset is homogeneous the following correction procedure can be applied:
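a) compute the two regression lines, the first through the origin
\hat{y}_{i} = b_{h} x_{i} (i = 1, 2,..., k) (4-27)
and
\hat{y}_{i} = a_{nh} + b_{nh} x_{i} (i = k + 1, k + 2,..., n) (4-28)
where subscripts h and nh identify respectively the homogeneous and non homogeneous subsets.
b) compute the differences between both regression lines for i = k + 1, k + 2,..., n
Δ\hat{y}_{i} = (b_{h} - b_{nh}) x_{i} - a_{nh} (4-29)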
6. When the second subset is homogeneous:
a) compute the regression line for the homogeneous subset (i = k +1, k + 2,..., n) after correcting the coordinates (x_{i}, y_{i}) using the coordinates of the break point (x_{k}, y_{k}), i.e. moving the origin of coordinates from (0, 0) to (x_{k}, y_{k}). This regression is therefore
y_{i} - y_{k} = b_{h} (x_{i} - x_{k}) (4-30)
thus
\hat{y}_{i} = (y_{k} - b_{h} x_{k}) + b_{h} x_{i} (4-31)
b) compute the regression line for the non homogeneous subset forced to the origin
\hat{y}_{i} = b_{nh} x_{i} (4-32)
c) compute the differences between the regression lines (4-31) and (4-32)
Δ\hat{y}_{i} = (y_{k} - b_{h} x_{k}) + (b_{h} - b_{nh}) x_{i} (4-33)
7. For both cases, correct the variables y_{i} corresponding to the non homogeneous subset as
y_{c, i} = y_{i} + Δ\hat{y}_{i} (4-34)
with Δ\hat{y}_{i} given by equations (4-29) or (4-33).
FIGURE 4.7. Double mass after correction of data set Y (case of Figure 4.3)
8. Compute the corrected estimates of the weather variables Y_{i} by solving equation (4-24) for Y_{i}.
Figure 4.7 illustrates the double mass after correction of subset Y in Figure 4.3, where the cumulative sums now follow a straight line.
Figure 4.8 is a plot of the corresponding residuals, which now follow a normal distribution. Similar verification can be easily made by the user. This procedure can be easily applied using a spreadsheet computation and graphical packages that are currently available.
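As an alternative to a spreadsheet, the numerical double-mass procedure can be sketched in Python for the case where the first subset is homogeneous (step 5, then steps 7 and 8). All names (`slope_through_origin`, `fit_line`, `correct_second_subset`, `decumulate`) and the sample values are hypothetical, and the break point k is assumed to have been identified beforehand from the residual plot.

```python
from statistics import mean


def slope_through_origin(x, y):
    """Slope b of the least-squares line through the origin."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def fit_line(x, y):
    """Intercept a and slope b of the ordinary regression of y on x."""
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_mean - b * x_mean, b


def correct_second_subset(x_cum, y_cum, k):
    """Correct cumulative y for i = k+1..n, presuming i = 1..k is homogeneous."""
    b_h = slope_through_origin(x_cum[:k], y_cum[:k])   # through the origin
    a_nh, b_nh = fit_line(x_cum[k:], y_cum[k:])        # with intercept
    corrected = list(y_cum[:k])
    for xi, yi in zip(x_cum[k:], y_cum[k:]):
        delta = (b_h - b_nh) * xi - a_nh   # difference between the two lines
        corrected.append(yi + delta)       # corrected cumulative value
    return corrected


def decumulate(y_cum):
    """Recover the individual values Y_i from the cumulative sums y_i."""
    return [y_cum[0]] + [later - earlier for earlier, later in zip(y_cum, y_cum[1:])]


# Illustrative cumulative series: Y doubles its readings after the third period.
x_cum = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
y_cum = [5.0, 10.0, 15.0, 25.0, 35.0, 45.0]
y_cum_corrected = correct_second_subset(x_cum, y_cum, k=3)
Y_corrected = decumulate(y_cum_corrected)
```

After correction the cumulative points of this made-up example fall on a single line of slope 0.5 through the origin, which mirrors what Figure 4.7 is described as showing for the real data.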
FIGURE 4.8. Residuals of the double mass after correction of data set Y (compare to Figure 4.4)
NOTATION IN STATISTICAL ANALYSIS
a | regression coefficient
b | regression coefficient
cov_{xy} | covariance of variables x and y
E_{i} | cumulative residuals
i | number of order of variable x_{i} in the sample
j, k | number of a variable in a subset
n | size of the sample
p | probability
p (x) | probability distribution density function
r | correlation coefficient
r^{2} | coefficient of determination
s_{x} | estimate of the standard deviation of the variable x
s_{x}^{2} | estimate of the variance of the variable x
s_{y} | estimate of the standard deviation of the variable y
s_{y}^{2} | estimate of the variance of the variable y
s_{y, x} | standard deviation of the residuals of y estimated from the regression
X | random variable
X_{i} | value of a variable in a data set
x_{i} | random variable
x_{p} | estimated value for the variable x with probability of non-exceedance p
\bar{x} | estimate of the mean, or mean of a sample of the random variable x_{i}
Y | transformed variable from X
Y_{i} | value of a variable in a data set
y_{i} | random variable
\hat{y}_{i} | value of y_{i} estimated from the regression
\bar{y} | estimate of the mean, or mean of a sample of the random variable y_{i}
Z | standard normal variable
z_{p} | value of the standard normal variable for the probability p
ε_{i} | residuals of y estimated from the regression
μ | mean of a population
σ | standard deviation of a population