2. Sampling design

There are two general sampling approaches: subjective sampling and probability sampling. Subjective sampling attempts to use professional judgment to select sample units believed to be representative of the entire population. These units are often convenient to measure, which reduces cost. Although data gathered in this way accurately describe conditions on the sampled sites, they may not accurately characterize the entire population. Supporters of subjective sampling trust the ability of experts to select a representative sample and argue that this approach is good enough for practical purposes. In some simple situations, this may be true. But what if a user of the data does not have the same confidence in the experts? Expensive data can become worthless because the sampling design is not defensible under scientific criticism. Also, convenient sampling sites are often near roads, which are frequently associated with unique landforms, land uses, management histories, and landscape patterns. Are such sites truly representative of the entire population? The answer is debatable. It is far easier to discredit the accuracy of population estimates from a subjective sample than to prove otherwise.

Probability sampling replaces subjective judgments with objective rules based on known probabilities of selection for each member of a population. For example, assume a 1-million-ha forest is treated as a population of 10 m x 10 m plots. There would be 100 million such plots in the population. If one of those plots were selected at random, its probability of selection would be 1/100 000 000. If a simple random sample of 1 000 plots were selected to estimate conditions in the entire 1-million-ha population, then each member of that population has a probability of selection of approximately 1 000/100 000 000 = 1/100 000, and each plot measured in the sample could be seen as representing 99 999 other unmeasured plots. The important lesson is that probability sampling is an objective method with precise rules and a mathematical foundation for estimating population attributes from a sample. The probability that an expert will select any one potential sample plot is unknown, so the mathematics of probability sampling cannot be applied to a subjective sample in a scientifically defensible way. Thus, we recommend probability rather than subjective sampling, and further recommend equal-probability sampling, in which each possible sampling unit location has an equal probability of selection for the sample.
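The arithmetic behind equal-probability selection can be summarized in a short sketch. The figures below reproduce the hypothetical 1-million-ha example above; they are illustrative values, not results from an actual inventory.

```python
# Inclusion probability and expansion factor for an equal-probability sample.
# Hypothetical example: a 1-million-ha forest divided into 10 m x 10 m plots.
forest_area_ha = 1_000_000
plot_area_ha = 0.01                           # a 10 m x 10 m plot is 100 m2 = 0.01 ha

population_size = int(forest_area_ha / plot_area_ha)   # 100 000 000 possible plots
sample_size = 1_000

inclusion_prob = sample_size / population_size          # about 1/100 000 per plot
expansion_factor = population_size / sample_size        # plots "represented" by each
                                                        # measured plot

print(population_size)     # 100000000
print(inclusion_prob)      # 1e-05
print(expansion_factor)    # 100000.0, i.e. 1 measured plot + 99 999 unmeasured plots
```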

Selecting a probability sample design

Many of the difficulties associated with selecting a sampling design arise from two factors: first, sampling units are distributed in space and observations of them may be spatially correlated, and second, different sampling designs have different costs. Spatial dependence among observations of the variables of interest strongly influences the selection of a sample design. Ecological, climatic, and soil factors and forestry management practices cause observations from plots that are near each other to be, on average, more similar than observations from plots that are farther apart. The result is that, in a strict sense, creation of a completely optimal sampling design is impossible because the numerous measured and derived NFA variables vary quite differently in space. Thus, because the optimal sampling design would differ from variable to variable, optimization may require focusing on minimizing the standard error of a single important variable such as wood volume, or of a weighted function of the standard errors for a small number of variables. One partial solution is to minimize the effects of spatial correlation by establishing sampling locations as far apart as possible; plots that differ more from each other bring more information to the sample. In forest sampling, this often suggests hexagonal sample designs, which maximize the minimum distance between neighbouring plots for a given sampling intensity. The primary sampling costs are attributed to traveling to and from the sampling unit location and measuring the unit. These costs, in turn, depend on the structure of the landscape and forests, the measurements to be taken, and topographic, economic, and transportation conditions.
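As a rough illustration of why wide spacing helps, the sketch below evaluates an exponential correlogram at a few inter-plot distances. The 600 m effective range is an assumed value chosen only for illustration; real ranges must be estimated from pilot data for each variable.

```python
import numpy as np

def exp_correlation(distance_m, effective_range_m=600.0):
    """Exponential correlogram: correlation ~ exp(-3 d / range).
    The 600 m effective range is an assumed, illustrative value."""
    return np.exp(-3.0 * np.asarray(distance_m, dtype=float) / effective_range_m)

for d in (50, 100, 500, 2_000):
    print(d, round(float(exp_correlation(d)), 3))
# Nearby plots (50-100 m apart) are strongly correlated and add little new
# information; plots 2 km apart are essentially uncorrelated.
```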

A common starting point in selecting a sample design is knowledge of the acceptable upper bounds for the standard errors of the estimates and an upper bound for cost. Optimizing the sample design, given the sampling frame and plot configuration, involves selecting a procedure for spatially distributing the sampling unit locations in such a way that standard errors are minimized while not exceeding the total allowable costs. Sometimes this will not be possible, and compromises may be necessary.
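One common way to reconcile a precision target with a budget, assuming simple random sampling and ignoring the finite population correction, is sketched below. The standard deviation, target standard error, and cost figures are invented planning values, not recommendations.

```python
import math

# Assumed planning figures (illustrative only).
std_dev = 120.0          # anticipated standard deviation of plot volume, m3/ha
target_se = 5.0          # acceptable standard error of the mean, m3/ha
cost_per_plot = 250.0    # travel plus measurement cost per plot
budget = 200_000.0       # total allowable field cost

# Required sample size for simple random sampling (no finite population correction):
# SE = s / sqrt(n)  =>  n = (s / SE)^2
n_required = math.ceil((std_dev / target_se) ** 2)
total_cost = n_required * cost_per_plot

print(n_required, total_cost)        # 576 plots, 144000.0
if total_cost > budget:
    print("Budget exceeded: relax the precision target or reduce unit costs.")
```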

Simple random sampling

A simple random sample places sample plots randomly within the sampled population (Figure 1a). By chance, there are spatial clusters and voids in the plot distribution; however, this remains a valid probability sample. The geographic coordinates for each sample plot in a random sample may be selected with a random number generator, with the allowable coordinates restricted to the sampled population. Otherwise, no consideration is given to safety, difficulty of measuring the plot, or travel to and from the sample plot location. This is the least risky equal-probability sample design, but it is also usually the least efficient with respect to both cost and the precision of estimates, partially because of spatial correlation among observations.
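A minimal sketch of drawing a simple random sample of plot locations follows, assuming the sampled population can be approximated by a rectangular extent; a real implementation would also reject coordinates falling outside the population boundary.

```python
import numpy as np

def simple_random_sample(n_plots, x_min, x_max, y_min, y_max, seed=None):
    """Draw plot coordinates uniformly at random within a rectangular extent."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(x_min, x_max, n_plots)
    y = rng.uniform(y_min, y_max, n_plots)
    return np.column_stack((x, y))

# Example: 1 000 plots in a hypothetical 100 km x 100 km area (coordinates in metres).
plots = simple_random_sample(1_000, 0, 100_000, 0, 100_000, seed=42)
print(plots.shape)   # (1000, 2)
```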

Systematic sampling

A systematic sample uses a fixed grid or array to assign plots in a regular pattern (Figure 1b). The advantage of systematic sampling is that it maximizes the distance between plots, which minimizes spatial correlation among observations and increases statistical efficiency. In addition, a systematic sample, which is clearly seen to be representative in some sense, can be very convincing to decision makers who have no experience with sampling. Systematic samples may be based on rectangular grids or hexagonal arrays. For example, a sample plot could be established at each intersection of a 2 x 2 km grid. A random number is used to select the starting point and orientation of this grid, but no other random numbers are required. This sampling frame is common in forestry. The greatest risk is that the orientation of the grid may, by chance, coincide with or run parallel to natural or man-made features such as roads or gravel ridges left by melting glaciers. For very large geographic areas, orienting gridlines along lines of longitude should be avoided: because these north-south gridlines converge, sample plot locations would be closer together at higher latitudes than at lower latitudes. Sample designs based on hexagonal arrays alleviate this problem (White et al. 1992).
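A sketch of an aligned systematic design with 2 km x 2 km spacing and a random start is given below. Grid rotation is omitted for brevity, and the rectangular extent stands in for the real population boundary; both are simplifying assumptions.

```python
import numpy as np

def systematic_grid(spacing_m, x_min, x_max, y_min, y_max, seed=None):
    """Aligned systematic sample: one plot at every grid intersection,
    with a single random offset shared by the whole grid."""
    rng = np.random.default_rng(seed)
    x_offset = rng.uniform(0, spacing_m)
    y_offset = rng.uniform(0, spacing_m)
    xs = np.arange(x_min + x_offset, x_max, spacing_m)
    ys = np.arange(y_min + y_offset, y_max, spacing_m)
    xx, yy = np.meshgrid(xs, ys)
    return np.column_stack((xx.ravel(), yy.ravel()))

grid_plots = systematic_grid(2_000, 0, 100_000, 0, 100_000, seed=42)
print(grid_plots.shape)   # about 2 500 plots on a 2 km grid over 100 km x 100 km
```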

Systematic unaligned sample designs combine features of both simple random and systematic sample designs. With these designs, a single sample plot is assigned to a randomly selected location within each grid or array cell (Figure 1c).
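The unaligned variant can be obtained by replacing the single shared offset with an independent random position in every cell; a minimal sketch under the same rectangular-extent assumption:

```python
import numpy as np

def unaligned_systematic(cell_m, x_min, x_max, y_min, y_max, seed=None):
    """Systematic unaligned sample: one independently placed random plot per grid cell."""
    rng = np.random.default_rng(seed)
    x_edges = np.arange(x_min, x_max, cell_m)
    y_edges = np.arange(y_min, y_max, cell_m)
    plots = [(x0 + rng.uniform(0, cell_m), y0 + rng.uniform(0, cell_m))
             for x0 in x_edges for y0 in y_edges]
    return np.array(plots)

unaligned_plots = unaligned_systematic(2_000, 0, 100_000, 0, 100_000, seed=42)
print(unaligned_plots.shape)   # (2500, 2): one plot in each 2 km x 2 km cell
```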

Figure 1. (a) simple random sample design, (b) aligned systematic sample design, (c) unaligned systematic sample design, (d) unaligned, clustered, systematic sample design with the same number of plots but grouped into clusters.

Cluster sampling

For practical reasons such as increasing cost efficiency and reducing field crew travel, sample plots may be organized into clusters, thus leading to systematic cluster sampling and stratified systematic cluster sampling. In systematic cluster sampling, the clusters are distributed throughout the population using grids or polygons such as hexagons.
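A sketch of a systematic cluster design follows, assuming square clusters of four plots placed around systematically located cluster centres; the 5 km cluster spacing and 300 m within-cluster offset are illustrative values only, not recommended dimensions.

```python
import numpy as np

def clustered_systematic(cluster_spacing_m, plot_offset_m,
                         x_min, x_max, y_min, y_max, seed=None):
    """Systematic cluster sample: cluster centres on a square grid with a random
    start, and four plots at the corners of a square around each centre."""
    rng = np.random.default_rng(seed)
    x0 = rng.uniform(0, cluster_spacing_m)
    y0 = rng.uniform(0, cluster_spacing_m)
    xs = np.arange(x_min + x0, x_max, cluster_spacing_m)
    ys = np.arange(y_min + y0, y_max, cluster_spacing_m)
    corners = np.array([(-1, -1), (-1, 1), (1, -1), (1, 1)]) * plot_offset_m
    plots = [(cx + dx, cy + dy)
             for cx in xs for cy in ys for dx, dy in corners]
    return np.array(plots)

cluster_plots = clustered_systematic(5_000, 300, 0, 100_000, 0, 100_000, seed=42)
print(cluster_plots.shape)   # (1600, 2): 20 x 20 clusters of 4 plots each
```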

Several questions are relevant when planning a cluster-based sample design: (1) what is the spacing between clusters? (2) what is the shape of the cluster? (3) how many plots are there per cluster? and (4) what is the sample plot configuration? To answer these questions, preliminary information about the spatial distribution and correlation of the variables of interest is needed. Correlation as a function of the distance between field plots, estimated using variograms, can be used to compare the efficiencies of different sample designs.
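A minimal sketch of estimating an empirical semivariogram from pilot plot data is shown below. The coordinates and volume values are synthetic placeholders, and a production analysis would normally use a geostatistics package rather than this brute-force pairwise computation.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_width_m, max_dist_m):
    """Classical (Matheron) estimator: gamma(h) is the mean of 0.5*(z_i - z_j)^2
    over plot pairs whose separation distance falls in the lag bin around h."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    semisq = 0.5 * (values[:, None] - values[None, :]) ** 2
    i, j = np.triu_indices(len(values), k=1)        # use each pair of plots once
    d, g = dists[i, j], semisq[i, j]
    edges = np.arange(0.0, max_dist_m + bin_width_m, bin_width_m)
    gamma = np.array([g[(d >= lo) & (d < hi)].mean()
                      if ((d >= lo) & (d < hi)).any() else np.nan
                      for lo, hi in zip(edges[:-1], edges[1:])])
    return edges[1:], gamma

# Synthetic pilot data: 200 plots in a 10 km x 10 km block with spatially smooth "volume".
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10_000, size=(200, 2))
values = 50.0 * np.sin(coords[:, 0] / 2_000.0) + rng.normal(0.0, 5.0, 200)
lags, gamma = empirical_semivariogram(coords, values, bin_width_m=500.0, max_dist_m=5_000.0)
print(np.round(gamma, 1))   # semivariance typically rises with lag distance, then levels off
```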