2. POPULATION AND SAMPLING

2.1 Potential methods for data collection

Data collection can be classified into two general forms: census and sample.

A census is not a survey per se, as it involves collecting data from all individuals in the target population. Several European logbook programmes could be considered a census as they (theoretically) require all vessels that meet certain characteristics to provide the required data. The key advantage of a census is that (assuming perfect compliance) the results are known with certainty.

The principal disadvantage of a census is the considerable cost involved in collecting and subsequently compiling all the data. In the case of fisheries, the cost of interviewing every fisher to collect the data would be prohibitive. Logbook programmes require fishers to complete the data themselves and provide the completed forms to the appropriate authority. Provision of such data is mandatory for the target population of vessels, and is enforced through legislation that enables prosecution and penalisation of individuals who do not comply or who deliberately provide incorrect or misleading information.

While such an approach has considerable appeal, regular provision of such data would place an increased burden on both fishers and administrators. The benefits of more precision in the resulting values of the key indicators would need to outweigh the additional costs for such an exercise to be worthwhile.

Sample surveys involve the collection of data from a sample of the target population rather than from all individuals in the target population. The key advantage of the sample survey is that fewer data need to be collected and analysed.

A key assumption of the sample survey is that the sample is representative of the target population as a whole. A range of sampling methods can be employed to improve the likelihood that the sample is representative (see next paragraphs), although a risk always remains that the sample estimates are biased due to the sample being different in some way to the target population as a whole. However, as the standard error decreases with sample size, the optimal sample sizes can be determined based on the desired level of precision of the data (see paragraph 4.3).
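The link between sample size and precision can be illustrated with a minimal sketch. It assumes simple random sampling from a large population, so that the standard error of a sample mean is approximately s/√n; the function name and the value of s are hypothetical and chosen only for illustration.

```python
import math


def standard_error(s, n):
    """Approximate standard error of a sample mean: s / sqrt(n).

    Assumes simple random sampling from a large population
    (no finite population correction is applied).
    """
    return s / math.sqrt(n)


# Hypothetical standard deviation of some indicator, for illustration only.
s = 40.0
for n in (25, 100, 400):
    print(n, standard_error(s, n))
```

Under these assumptions, quadrupling the sample size halves the standard error, which is why gains in precision become progressively more expensive.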

2.2 Advantages of sampling methods

· Reduced cost

If data are secured from only a small fraction of the aggregate, expenditures are smaller than if a complete census is attempted. With large populations, results accurate enough to be useful can be obtained from samples that represent only a small fraction of the population.

· Greater speed

For the same reason, the data can be collected and summarized more quickly with a sample than with a complete count. This is a vital consideration when the information is urgently needed.

· Greater scope

In some fisheries inquiries, trained personnel or specialised equipment, limited in availability, must be used to obtain the data. A complete census is then impracticable: the choice lies between obtaining the information by sampling or not at all. Thus surveys that rely on sampling have more scope and flexibility regarding the types of information that can be obtained.

· Greater accuracy

Because personnel of higher quality can be employed and given intensive training and because more careful supervision of the field work and processing of results becomes feasible when the volume of work is reduced, a sample may produce more accurate results than the kind of complete enumeration that can be taken.

2.3 Some statistical terms

This document is not a statistics textbook; many such textbooks describe the statistical concepts related to sampling. However, knowledge of some basic statistical terms is required for a better understanding of the next sections.

2.3.1 Mean

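The arithmetic mean, or simply the mean, of a set of N numbers X₁, X₂, X₃, ..., X_N is denoted by $\bar{X}$ (read “X bar”) and is defined as:

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N} = \frac{1}{N}\sum_{j=1}^{N} X_j$$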

2.3.2 Variance and standard deviation

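The standard deviation of a set of N numbers X₁, X₂, X₃, ..., X_N is denoted by s and is defined as:

$$s = \sqrt{\frac{\sum_{j=1}^{N} (X_j - \bar{X})^2}{N}} = \sqrt{\frac{\sum x^2}{N}}$$

where x represents the deviation of each of the numbers X_j from the mean $\bar{X}$, that is $x = X_j - \bar{X}$.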

Sometimes the standard deviation for the data of a sample is defined with (N-1) replacing N in the denominator of the previous expression, because the resulting value represents a better estimate of the standard deviation of the population from which the sample is taken. For large values of N (certainly N>30) there is practically no difference between the two definitions. Also, when the better estimate is needed we can always obtain it by multiplying the standard deviation computed according to the first definition by $\sqrt{N/(N-1)}$.

The variance of a set of data is defined as the square of the standard deviation and is thus given by s².

When it is necessary to distinguish the standard deviation of a population from the standard deviation of a sample drawn from this population, we often use the symbol s for the latter and σ for the former. Thus s² and σ² would represent the sample variance and the population variance respectively.

Finally, we define the coefficient of variation as:

$$CV = \frac{s}{\bar{X}}$$

The coefficient of variation does not depend on the measurement unit and it gives an indication of the importance of the standard deviation with respect to the mean.
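The quantities defined in this section can be computed with a short sketch. It uses the definitions given above (mean as the sum divided by N, standard deviation with N or N-1 in the denominator, coefficient of variation as standard deviation divided by mean); the function names and the data values are hypothetical.

```python
import math


def mean(values):
    """Arithmetic mean: the sum of the values divided by N."""
    return sum(values) / len(values)


def std_dev(values, sample=False):
    """Standard deviation with N in the denominator, or N-1 if sample=True."""
    n = len(values)
    m = mean(values)
    sum_sq_dev = sum((v - m) ** 2 for v in values)
    return math.sqrt(sum_sq_dev / (n - 1 if sample else n))


def coefficient_of_variation(values):
    """Standard deviation (N denominator) divided by the mean."""
    return std_dev(values) / mean(values)


# Hypothetical observations, for illustration only.
catches = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(mean(catches))                      # 5.0
print(std_dev(catches))                   # 2.0
print(std_dev(catches, sample=True))      # N-1 version, slightly larger
print(coefficient_of_variation(catches))  # 0.4
```

The N-1 version equals the N version multiplied by √(N/(N-1)); with only eight observations the difference is still visible, whereas for large N it becomes negligible.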

2.3.3 Normal distribution and confidence limits

The normal distribution is a bell-shaped distribution that is used extensively in statistical applications in a wide variety of fields. Its probability density function is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Its mean is μ and its variance is σ². When x has the normal distribution with mean μ and variance σ², we write this compactly as $x \sim N(\mu, \sigma^2)$.

When the variable x is expressed in terms of standard units, z = (x - μ)/σ, the previous equation is replaced by the so-called standard form:

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

In such case we say that z is normally distributed with mean zero and variance one.

A graph of this standardised normal curve is shown in Figure 2.1. In this graph the area included between z=-1 and z=+1 is indicated as equal to 68.27% of the total area, which is one. The areas included between z=-2 and z=+2 and between z=-3 and z=+3 are equal to 95.45% and 99.73% respectively.

Figure 2.1 - Standardized normal curve
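The areas quoted for the standardised normal curve can be checked numerically. The sketch below relies on the standard identity that the area between z = -k and z = +k equals erf(k/√2); the function name is an illustrative choice.

```python
import math


def area_within(k):
    """Area under the standard normal curve between z = -k and z = +k."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    # Prints 68.27, 95.45 and 99.73 (percent of the total area).
    print(k, round(100 * area_within(k), 2))
```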

For example, the “99% confidence” figure implies that if the same sampling plan were used many times in a population, a confidence statement being made from each sample, about 99% of these statements would be correct and 1% wrong.
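This repeated-sampling interpretation can be illustrated by simulation. The sketch below is only an illustration under strong assumptions: the population is normal with known mean and standard deviation, each confidence statement has the form "the population mean lies within z·σ/√n of the sample mean", and z = 2.576 is used as the approximate two-sided 99% value. All parameter values are hypothetical.

```python
import math
import random


def coverage(mu, sigma, n, z, trials, seed=0):
    """Fraction of repeated samples whose confidence statement is correct.

    A statement is counted as correct when the true mean mu lies within
    z * sigma / sqrt(n) of the sample mean.
    """
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    correct = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(sample_mean - mu) <= half_width:
            correct += 1
    return correct / trials


# Hypothetical population and sampling plan, for illustration only.
print(coverage(mu=50.0, sigma=10.0, n=30, z=2.576, trials=2000))
```

The printed fraction should be close to 0.99, although it varies slightly with the random seed and the number of trials.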

2.4 The role of sampling theory

Sampling theory is the study of the relationships between a population and samples drawn from that population. It is of great value in many connections. For example, it is useful in the estimation of unknown population quantities (such as the population mean, variance, etc.), often called population parameters or briefly parameters, from a knowledge of the corresponding sample quantities (such as the sample mean, variance, etc.), often called sample statistics or briefly statistics.

The purpose of sampling theory is to make sampling more efficient. It attempts to develop methods of sample selection and estimation that provide, at the lowest possible cost, estimates that are precise enough for our purpose. In order to apply this principle, we must be able to predict, for any sampling procedure that is under consideration, the precision and the cost to be expected.

Sampling theory is also useful in determining whether observed differences between two samples are actually due to chance variation or whether they are really significant. The so-called tests of significance and of hypotheses are important in the theory of decisions.

In general, a study of inferences made concerning a population by use of samples drawn from it, together with indications of the accuracy of such inferences using probability theory, is called statistical inference.

2.5 Probability sampling

Probability sampling procedures have the following mathematical properties in common.

1. We are able to define the set of distinct samples S_1, S_2, ..., S_v which the procedure is capable of selecting if applied to a specific population. This means that we can say precisely what sampling units belong to S_1, S_2, and so on.

2. Each possible sample S_i has assigned to it a known probability of selection p_i.

3. We select one of the S_i by a random process in which each S_i receives its appropriate probability p_i of being selected.

4. The method for computing the estimate from the sample must be stated and must lead to a unique estimate for any specific sample.

For any sampling procedure that satisfies these properties, we are in a position to calculate the frequency distribution of the estimates it generates if repeatedly applied to the same population. The term probability sampling refers to this situation.
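For a very small population, the four properties can be followed literally. The sketch below enumerates every possible sample, assigns each an equal selection probability, and builds the frequency distribution of the estimate. The five population values, the sample size of two, the equal probabilities and the choice of the sample mean as the estimate are all assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Hypothetical values attached to five sampling units.
population = [1, 3, 5, 7, 9]
n = 2

# Property 1: the set of distinct samples (as tuples of unit indices).
samples = list(combinations(range(len(population)), n))

# Property 2: a known selection probability for each sample (equal here).
p = Fraction(1, len(samples))

# Property 4: a unique estimate (the sample mean) for each sample.
# Accumulating p per estimate gives the distribution of the estimates.
distribution = Counter()
for s in samples:
    estimate = Fraction(sum(population[i] for i in s), n)
    distribution[estimate] += p

expected = sum(est * prob for est, prob in distribution.items())
print(len(samples))  # 10 possible samples
print(expected)      # 5, equal to the population mean
for est in sorted(distribution):
    print(est, distribution[est])
```

Property 3, the random selection of one sample, is not needed to derive the distribution itself; it is what guarantees that the distribution describes the procedure actually used.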

In practice we seldom draw a probability sample by writing down the S_i and p_i as outlined above. This is too laborious with a large population, where a sampling procedure may produce billions of possible samples. The draw is most commonly made by specifying probabilities of inclusion for the individual units and drawing units, one by one or in groups, until the sample of the desired size and type is constructed.
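A unit-by-unit draw can be sketched as follows. This is one simple case only: units are drawn one at a time without replacement, each remaining unit being equally likely at every draw, which gives every unit the same inclusion probability n/N. The vessel identifiers, frame size and function name are hypothetical.

```python
import random


def draw_sample(units, n, seed=None):
    """Draw n units one by one, without replacement, with equal chances."""
    rng = random.Random(seed)
    remaining = list(units)
    chosen = []
    for _ in range(n):
        # Each unit still in the frame is equally likely to be drawn next.
        index = rng.randrange(len(remaining))
        chosen.append(remaining.pop(index))
    return chosen


# Hypothetical sampling frame of 200 vessels.
vessels = [f"V{i:03d}" for i in range(1, 201)]
sample = draw_sample(vessels, 20, seed=1)
print(sample)
```

Other designs specify unequal inclusion probabilities or draw groups of units; the loop above would need to be adapted accordingly.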

2.6 Alternatives to probability sampling

The following are some common types of non-probability sampling.

1. The sample is restricted to a part of the population that is readily accessible.

2. The sample is selected without conscious planning.

3. With a small but heterogeneous population, the sampler inspects the whole of it and selects a small sample of “typical” units, that is, units that are close to the sampler's impression of the average of the population.

4. The sample consists essentially of volunteers, in studies in which the measuring process is unpleasant or troublesome to the person being measured.

In some cases and under the right conditions, any of these methods can give useful results. They are not, however, open to the development of a sampling theory, since no element of random selection is involved. These methods, moreover, are unable to predict from the sample the accuracy to be expected in the estimates.

2.7 Bias and its effects

Sample bias largely arises as a result of inappropriate sample selection (i.e. the average of the selected group differs in characteristics from the true average of the target population). As most sample surveys are completed on a voluntary basis, bias may also be introduced through non-response. In such cases, bias may arise if the individuals who do not participate have different characteristics to the target population as a whole. As no information is subsequently obtained from these individuals, it is impossible to determine the extent of any bias that may be introduced.

Moreover, in sample survey theory it is necessary to consider biased estimators for two reasons.

1. In some of the most common problems, estimators that are convenient and suitable are found to be biased.

2. Even with estimators that are unbiased in probability sampling, errors of measurement and non response may produce biases in the numbers that we are able to compute from the data.
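The combined effect of bias and sampling variability on an estimator is commonly summarised by the mean square error. As a reminder, and using notation that does not appear elsewhere in this section (θ for the population quantity, $\hat{\theta}$ for its estimator and B for the bias), the standard decomposition is:

$$\mathrm{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right] = \mathrm{Var}(\hat{\theta}) + B^2, \qquad B = E(\hat{\theta}) - \theta$$

An estimator with a small bias may therefore be preferable to an unbiased one if its variance is sufficiently smaller.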

The use of a stratified random sample approach (see paragraph 4.2) reduces the potential for sample bias, but requires additional information on the target population prior to the sample selection. Where the complete sample for particular fleet segments cannot be achieved due to non-response, bias can be reduced by assigning weights to the individual sample responses to re-balance the data. The potential bias arising directly from non-response can also be reduced by replacing non-responding boats with boats of similar characteristics, on the assumption that the replacement boat is representative of the boat that failed to respond.
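The re-balancing of responses can be sketched as follows. The example assumes that the number of vessels in each fleet segment is known and that each response is weighted by the segment size divided by the number of responses in that segment, which is one common weighting choice among several. The segment names, sizes and response values are hypothetical.

```python
def weighted_mean(segments):
    """Population mean estimate with responses re-weighted by segment size.

    segments maps a segment name to (number of vessels in the segment,
    list of responses received from that segment).
    """
    total_vessels = sum(size for size, _ in segments.values())
    total = 0.0
    for size, responses in segments.values():
        segment_mean = sum(responses) / len(responses)
        total += size * segment_mean
    return total / total_vessels


def unweighted_mean(segments):
    """Simple mean of all responses, ignoring segment sizes."""
    values = [v for _, responses in segments.values() for v in responses]
    return sum(values) / len(values)


# Hypothetical fleet: large vessels make up 20% of the fleet
# but one third of the responses received.
segments = {
    "small": (80, [10.0, 12.0, 14.0, 12.0]),
    "large": (20, [50.0, 54.0]),
}

print(unweighted_mean(segments))  # about 25.3, pulled towards large vessels
print(weighted_mean(segments))    # 20.0
```

Weighting corrects the imbalance between segments only; it does not remove bias if non-respondents differ from respondents within the same segment.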