In cluster sampling the population is partitioned into groups, called clusters. The clusters, which are composed of elements, are not necessarily of the same size. Each element should belong to one cluster only and none of the elements of the population should be left out.

The clusters, and not the elements, are the units to be sampled. Whenever a cluster is sampled, every element within it is observed.

In cluster sampling, only a few clusters are sampled. Hence, in order to increase the precision of the estimates, the population should be partitioned into clusters in such a way that the clusters will have similar mean values. As the elements inside the clusters are not sampled, the variance within clusters does not contribute to the sampling variance of the estimators. Therefore, in order to decrease the sampling variance of the estimators the variation within the clusters should be as large as possible, while the variation between clusters should be as small as possible.

It should be noted that the criteria for partitioning a population into clusters are the opposite of those for partitioning it into *strata*: heterogeneity within clusters as opposed to homogeneity within *strata*, and similarity of cluster means as opposed to differences between *strata* means.

Cluster sampling is often more cost effective than other sampling designs, as one does not have to sample all the clusters. However, if the size of a cluster is large it might not be possible to observe all its elements. The next chapter will show that there are ways to overcome these difficulties.

In fisheries, cluster sampling has been used to estimate landings per trip in artisanal fisheries with a small number of vessels landing at many sites (beaches). Consider, for instance, a fishery with 100 small beaches, where a few vessels land at each beach. One is interested in the total catch per day of the vessels landing at these beaches, but one does not have the possibility to visit all of them. In this case each beach can be a cluster. If a beach is sampled, all its elements (vessel landings) should be observed.

Another example of cluster sampling in fisheries is the sampling of the length composition of an unsorted large catch of a species kept in fish boxes onboard a vessel. Let us assume that the catch in each box is as heterogeneous as possible. The fish boxes can then be looked upon as clusters, and when a box has been selected for sampling, all the elements (fish) inside the box have to be observed.

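To be consistent with other methods, the symbol *N* is used to designate the total number of population sampling units, which in this case are the clusters and not the elements.

The number of elements in cluster *i* is denoted by $M_i$.

The total number of elements in the population is:

$$M_o = \sum_{i=1}^{N} M_i$$

and the mean number of elements per cluster is:

$$\bar{M} = \frac{M_o}{N}$$

The value of the characteristic Y in element *j* from cluster *i* is $Y_{ij}$, and the total value of the characteristic in cluster *i* is:

$$Y_i = \sum_{j=1}^{M_i} Y_{ij}$$

The mean value per element in cluster *i* is:

$$\bar{Y}_i = \frac{Y_i}{M_i}$$

The total value of the characteristic in all the elements of the population is:

$$Y = \sum_{i=1}^{N} Y_i = \sum_{i=1}^{N} \sum_{j=1}^{M_i} Y_{ij}$$

In cluster sampling, two means can be considered:

The mean per cluster

$$\bar{Y} = \frac{Y}{N}$$

and

the mean per element

$$\bar{\bar{Y}} = \frac{Y}{M_o}$$

In cluster sampling, two types of variances can be considered. The first is the variance between clusters, and the second is the variance within clusters, or between elements within clusters.

The variance between cluster totals is:

$$S_1^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left(Y_i - \bar{Y}\right)^2$$

The variance within a cluster *i*, denoted by $S_{2i}^2$, is

$$S_{2i}^2 = \frac{1}{M_i - 1} \sum_{j=1}^{M_i} \left(Y_{ij} - \bar{Y}_i\right)^2$$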

Table 5.1 presents a summary of the main parameters of a discrete population divided into clusters that are most used in fisheries research.

In cluster sampling, *n* is the number of clusters to be sampled and $m_i$ is the number of elements sampled from cluster *i*. Because every element of a sampled cluster is observed, $m_i = M_i$.

For any sampled cluster *i*, the value of the chosen characteristic of element *j* is $y_{ij}$, and $y_i = \sum_{j=1}^{m_i} y_{ij}$ is the total value of the characteristic in cluster *i*.

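Table 5.1 **Summary of population parameters of most interest in fisheries research**

| Symbol | Parameter |
|---|---|
| $N$ | Number of clusters in the population |
| $M_i$ | Number of elements in cluster *i* |
| $M_o = \sum_{i=1}^{N} M_i$ | Total number of elements in the population |
| $\bar{M} = M_o / N$ | Mean number of elements per cluster |
| $Y_{ij}$ | Value of characteristic Y in element *j* of cluster *i* |
| $Y_i = \sum_{j=1}^{M_i} Y_{ij}$ | Total value of the characteristic Y in cluster *i* |
| $\bar{Y}_i = Y_i / M_i$ | Mean value of the characteristic Y in the elements of cluster *i* |
| $Y = \sum_{i=1}^{N} Y_i$ | Total value of characteristic Y of all the elements in the population |
| $\bar{Y} = Y / N$ | Mean value of characteristic Y per cluster |
| $\bar{\bar{Y}} = Y / M_o$ | Mean value of characteristic Y per element |
| $S_1^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2$ | Variance of characteristic Y between cluster totals |
| $S_{2i}^2 = \frac{1}{M_i - 1} \sum_{j=1}^{M_i} (Y_{ij} - \bar{Y}_i)^2$ | Variance of characteristic Y within cluster *i* |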
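
As an illustration of the parameters summarised in Table 5.1, the following minimal Python sketch computes them for a small population. The data and variable names are invented for the example, and the variances are assumed to use the divisors *N* - 1 and *M_i* - 1.

```python
from statistics import variance

# Hypothetical population: 4 clusters (e.g. beaches), each holding the
# values Y_ij of its elements (e.g. kg landed per vessel). Illustrative only.
population = [
    [10.0, 12.0, 14.0],
    [8.0, 16.0],
    [11.0, 13.0, 9.0, 15.0],
    [12.0, 12.0, 12.0],
]

N = len(population)                              # number of clusters
M = [len(cluster) for cluster in population]     # M_i, elements per cluster
M_o = sum(M)                                     # total number of elements
M_bar = M_o / N                                  # mean number of elements per cluster

Y_i = [sum(cluster) for cluster in population]   # cluster totals
Y_bar_i = [t / m for t, m in zip(Y_i, M)]        # mean per element within each cluster
Y = sum(Y_i)                                     # population total

mean_per_cluster = Y / N
mean_per_element = Y / M_o

# Variance between cluster totals (divisor N - 1) and variance within each
# cluster (divisor M_i - 1; needs at least two elements per cluster).
S1_sq = variance(Y_i)
S2_sq = [variance(cluster) for cluster in population]

print(N, M_o, Y, mean_per_cluster, mean_per_element, S1_sq)
```

Note that the mean per cluster and the mean per element differ by the factor $\bar{M}$, the mean number of elements per cluster.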

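The mean value of the characteristic Y in all the elements of cluster *i* is:

$$\bar{y}_i = \frac{y_i}{m_i} = \frac{1}{m_i} \sum_{j=1}^{m_i} y_{ij}$$

The total value of the characteristic Y in all the elements of all the clusters sampled is denoted by:

$$y = \sum_{i=1}^{n} y_i$$

The mean value of the characteristic Y per cluster is:

$$\bar{y} = \frac{y}{n}$$

The mean value of the characteristic Y per element is:

$$\bar{\bar{y}} = \frac{y}{m}$$

where $m = \sum_{i=1}^{n} m_i$ is the total number of elements in the sample.

The variance between total values of the characteristic Y in the clusters sampled is:

$$s_1^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2$$

The variance of the values of the characteristic Y within the *i*th sampled cluster is:

$$s_{2i}^2 = \frac{1}{m_i - 1} \sum_{j=1}^{m_i} \left(y_{ij} - \bar{y}_i\right)^2$$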

Table 5.2 presents a summary of the most common sample statistics in cluster sampling applied to fisheries.

Table 5.2 **Most common sample statistics in cluster sampling**

| Symbol | Statistic |
|---|---|
| $n$ | Number of clusters sampled |
| $m_i$ | Number of elements in cluster *i* (note that $m_i = M_i$) |
| $m = \sum_{i=1}^{n} m_i$ | Total number of elements in sample |
| $\bar{m} = m / n$ | Sample mean number of elements per cluster |
| $y_{ij}$ | Value of the characteristic Y in element *j* of cluster *i* |
| $y_i = \sum_{j=1}^{m_i} y_{ij}$ | Total value of the characteristic Y in the sampled cluster *i* |
| $\bar{y}_i = y_i / m_i$ | Mean value of the characteristic Y in the sampled cluster *i* |
| $y = \sum_{i=1}^{n} y_i$ | Total value of the characteristic Y in the sample |
| $\bar{y} = y / n$ | Mean value of the characteristic Y per cluster sampled |
| $\bar{\bar{y}} = y / m$ | Sample mean value of the characteristic Y per element |
| $s_1^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$ | Variance between total values of the characteristic Y in the clusters sampled |
| $s_{2i}^2 = \frac{1}{m_i - 1} \sum_{j=1}^{m_i} (y_{ij} - \bar{y}_i)^2$ | Variance of the values of characteristic Y within the sampled cluster *i* |

As mentioned in the introduction to this chapter, in cluster sampling the sampling units are the clusters. The selection of the clusters can be made by random sampling with equal probabilities (simple random sampling) or with different probabilities. A particular case of random sampling with different probabilities is when the probabilities are proportional to the sizes of the clusters.

The most important estimators in cluster sampling are the estimators of the total value, of the mean per cluster and of the mean per element.

First, let us consider an example where the clusters are selected by simple random sampling without replacement. In this case, the probability of selecting any cluster *i*, in one extraction, is constant and equal to $1/N$.

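An unbiased estimator of the population total value, *Y*, is:

$$\hat{Y} = \frac{N}{n} \sum_{i=1}^{n} y_i = \frac{N}{n}\, y$$

or $\hat{Y} = N \bar{y}$, where $\bar{y} = y/n$ is the sample mean per cluster.

The factor $N/n$ is a raising factor, which raises the sample total, *y*, to the estimator of the population total, $\hat{Y}$.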

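The sampling distribution of this estimator is approximately normal with expected value *E* and variance *V*:

$$\hat{Y} \sim \mathcal{N}(E, V)$$

where

*E = E[Ŷ] = Y* (Ŷ is an unbiased estimator of *Y*)

and

$$V = V[\hat{Y}] = N^2 \left(1 - \frac{n}{N}\right) \frac{S_1^2}{n}$$

An estimate of the sampling variance can be obtained by replacing the population variance $S_1^2$ with the sample variance $s_1^2$:

$$v[\hat{Y}] = N^2 \left(1 - \frac{n}{N}\right) \frac{s_1^2}{n}$$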
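
The calculation for simple random sampling of clusters can be sketched in Python as follows. The sketch assumes the usual expansion estimator, Ŷ = (*N*/*n*)*y*, and the variance estimate *N*²(1 - *n*/*N*)*s*₁²/*n*; the function name and the landing figures are hypothetical.

```python
from statistics import variance


def srs_cluster_total(cluster_totals, N):
    """Estimate the population total from clusters chosen by simple random
    sampling without replacement.

    cluster_totals: totals y_i observed in the n sampled clusters
    N: number of clusters in the population
    Returns (estimated total, estimated sampling variance).
    """
    n = len(cluster_totals)
    y = sum(cluster_totals)                   # sample total
    Y_hat = N / n * y                         # raised by the factor N/n
    s1_sq = variance(cluster_totals)          # variance between sampled cluster totals
    v_hat = N**2 * (1 - n / N) * s1_sq / n    # includes the finite population correction
    return Y_hat, v_hat


# Hypothetical data: 5 of 100 beaches sampled, total landings (kg) per beach.
Y_hat, v_hat = srs_cluster_total([120.0, 95.0, 140.0, 110.0, 135.0], N=100)
print(Y_hat, v_hat, v_hat ** 0.5)
```

With these made-up figures the estimated total is 12 000 kg; the square root of the estimated variance gives the standard error used to build an approximate confidence interval.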

Two cases of selecting the sample with unequal probabilities will be considered: selection with replacement and selection without replacement. For the former, the Hansen-Hurwitz estimator will be described, and the particular case of selection with probabilities proportional to the sizes of the clusters will also be studied.

Another estimator, the Horvitz-Thompson estimator, is applicable in both cases, *i.e.*, with or without replacement.

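*Hansen-Hurwitz estimator - selection with replacement*

Let $P_i$ be the probability of selecting cluster *i* in one extraction. The Hansen-Hurwitz estimator of the population total is:

$$\hat{Y} = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{P_i}$$

where $y_i$ is the total value of the characteristic in the *i*th cluster extracted.

This estimator has an approximately normal distribution with expected value *E* and variance *V*:

$$\hat{Y} \sim \mathcal{N}(E, V)$$

where

*E = E[Ŷ] = Y* (Ŷ is an unbiased estimator of *Y*)

and

$$V = V[\hat{Y}] = \frac{1}{n} \sum_{i=1}^{N} P_i \left(\frac{Y_i}{P_i} - Y\right)^2$$

with $Y_i$ the total value of the characteristic in cluster *i* of the population.

An estimate of the sampling variance would be:

$$v[\hat{Y}] = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left(\frac{y_i}{P_i} - \hat{Y}\right)^2$$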
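
A minimal Python sketch of the Hansen-Hurwitz calculation is given below. It assumes the standard form of the estimator, the mean of the ratios *y_i*/*P_i* over the *n* extractions, with the variance estimated from the spread of those ratios. The function name and the data are hypothetical.

```python
def hansen_hurwitz(cluster_totals, probs):
    """Hansen-Hurwitz estimate of the population total and of its variance.

    cluster_totals: totals y_i of the n clusters drawn with replacement
                    (a cluster drawn twice appears twice)
    probs: selection probability P_i of each drawn cluster, in the same order
    """
    n = len(cluster_totals)
    ratios = [y / p for y, p in zip(cluster_totals, probs)]   # y_i / P_i
    Y_hat = sum(ratios) / n
    v_hat = sum((r - Y_hat) ** 2 for r in ratios) / (n * (n - 1))
    return Y_hat, v_hat


# Hypothetical sample: three extractions with replacement.
Y_hat, v_hat = hansen_hurwitz([20.0, 50.0, 28.0], [0.10, 0.20, 0.14])
print(Y_hat, v_hat)
```

Each ratio *y_i*/*P_i* is itself an estimate of the total, which is why their sample variance divided by *n* estimates the sampling variance.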

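*Selection with probabilities proportional to cluster sizes*

Let us consider the special case where the selection probability is proportional to the size of the clusters, $P_i = M_i / M_o$.

In this case, the estimator of the total value, $\hat{Y}$, its sampling variance and all the other expressions can be obtained by replacing $P_i$, in the case previously described, by $M_i / M_o$:

$$\hat{Y} = \frac{M_o}{n} \sum_{i=1}^{n} \frac{y_i}{M_i}$$

or

$$\hat{Y} = \frac{M_o}{n} \sum_{i=1}^{n} \bar{y}_i$$

The sampling variance is:

$$V[\hat{Y}] = \frac{1}{n} \sum_{i=1}^{N} \frac{M_i}{M_o} \left(\frac{M_o Y_i}{M_i} - Y\right)^2$$

considering that:

$$\frac{Y_i}{M_i} = \bar{Y}_i$$

and

$$Y = M_o \bar{\bar{Y}}$$

Another expression of this variance can also be obtained:

$$V[\hat{Y}] = \frac{M_o}{n} \sum_{i=1}^{N} M_i \left(\bar{Y}_i - \bar{\bar{Y}}\right)^2$$

An estimate of this variance is:

$$v[\hat{Y}] = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left(M_o \bar{y}_i - \hat{Y}\right)^2$$

or

$$v[\hat{Y}] = \frac{M_o^2}{n(n-1)} \sum_{i=1}^{n} \left(\bar{y}_i - \frac{\hat{Y}}{M_o}\right)^2$$

Note that in order to use this estimate one needs to know the total number of elements in the population, $M_o$.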

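An estimator of the mean value per element is:

$$\hat{\bar{\bar{Y}}} = \frac{\hat{Y}}{M_o}$$

or

$$\hat{\bar{\bar{Y}}} = \frac{1}{n} \sum_{i=1}^{n} \bar{y}_i$$

which has a sampling variance given by:

$$V[\hat{\bar{\bar{Y}}}] = \frac{V[\hat{Y}]}{M_o^2}$$

An estimate of the sampling variance can be calculated by replacing the variance of the estimator of the population total by its sample estimate:

$$v[\hat{\bar{\bar{Y}}}] = \frac{v[\hat{Y}]}{M_o^2}$$

resulting in:

$$v[\hat{\bar{\bar{Y}}}] = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left(\bar{y}_i - \frac{\hat{Y}}{M_o}\right)^2$$

Note that in order to use this estimate one needs to know the total number of elements in the population, $M_o$.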

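The estimator of the mean value per cluster is:

$$\hat{\bar{Y}} = \frac{\hat{Y}}{N} = \frac{\bar{M}}{n} \sum_{i=1}^{n} \bar{y}_i$$

with a sampling variance:

$$V[\hat{\bar{Y}}] = \frac{V[\hat{Y}]}{N^2}$$

An estimate of this sampling variance can be obtained from $v[\hat{Y}]$ and is expressed as:

$$v[\hat{\bar{Y}}] = \frac{v[\hat{Y}]}{N^2} = \frac{\bar{M}^2}{n(n-1)} \sum_{i=1}^{n} \left(\bar{y}_i - \frac{\hat{Y}}{M_o}\right)^2$$

In order to use this estimator one needs to know $M_o$, or at least the mean number of elements per cluster, $\bar{M} = M_o / N$.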
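
The three estimators for selection with probabilities proportional to cluster sizes can be sketched together in Python. The sketch assumes the Hansen-Hurwitz formulas with *P_i* = *M_i*/*M_o*, so that every quantity is built from the sampled cluster means; the function name, its arguments and the figures are hypothetical.

```python
def pps_estimates(cluster_means, M_o, N):
    """Estimates under selection with replacement and P_i = M_i / M_o.

    cluster_means: mean per element in each of the n extracted clusters
    M_o: total number of elements in the population
    N: number of clusters in the population
    """
    n = len(cluster_means)
    mean_el = sum(cluster_means) / n          # mean per element
    total = M_o * mean_el                     # population total
    mean_cl = total / N                       # mean per cluster (= M_bar * mean_el)

    # Variance estimates: all three are rescalings of the same sum of squares.
    v_mean_el = sum((m - mean_el) ** 2 for m in cluster_means) / (n * (n - 1))
    v_total = M_o ** 2 * v_mean_el
    v_mean_cl = v_total / N ** 2

    return {
        "total": (total, v_total),
        "mean_per_element": (mean_el, v_mean_el),
        "mean_per_cluster": (mean_cl, v_mean_cl),
    }


# Hypothetical sample: three clusters drawn from N = 5 clusters, M_o = 50 elements.
estimates = pps_estimates([4.0, 5.0, 6.0], M_o=50, N=5)
for name, (value, var) in estimates.items():
    print(name, value, var)
```

Under this design the sampled cluster means are simply averaged, without weighting by cluster size, because the larger clusters are already more likely to be drawn.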

Selecting clusters with probabilities proportional to their sizes is not always easy. A simple procedure for selecting *n* clusters with probabilities proportional to their sizes ($P_i = M_i / M_o$), which can easily be used in fisheries research, is given below:

- calculate the cumulative numbers of elements of the population in each cluster;
- assign intervals of “selection numbers” to each cluster, based on these cumulative numbers;
- use the “selection numbers” to choose the *n* clusters to be sampled, with probabilities proportional to their sizes. For this purpose, select (applying a simple random sampling design) one of the “selection numbers” to get the corresponding cluster;
- repeat the selection of “selection numbers” to obtain the required number of clusters.

A simple example to illustrate the procedure:

Consider a situation where one wishes to select three out of five boats landing fish on a beach. The boats are considered as the clusters to be sampled. Each boat carries a different number of fish boxes to be landed. The proportion of the total number of fish boxes carried by each of the five boats is taken as its selection probability, which is therefore proportional to the size of the cluster. Table 5.3 shows the original data and how to calculate what can be designated as the “boat selection numbers”.

Table 5.3 **Original data and calculation of “boat selection numbers”**

| Boat | Number of fish boxes | Cumulative numbers | Boat selection numbers | Selection probability |
|---|---|---|---|---|
| 1 | 5 | 5 | 1–5 | 5/50 = 0.10 |
| 2 | 10 | 15 | 6–15 | 10/50 = 0.20 |
| 3 | 7 | 22 | 16–22 | 7/50 = 0.14 |
| 4 | 13 | 35 | 23–35 | 13/50 = 0.26 |
| 5 | 15 | 50 | 36–50 | 15/50 = 0.30 |

By repeating three times a simple random sampling of one out of fifty numbers, one will get three boat selection numbers corresponding to the boats to be sampled (ignore the selected numbers corresponding to clusters already chosen).
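
The selection procedure for this example can be sketched in Python as follows. The box counts are those of the five boats above; the function names and the use of `bisect_left` to find the interval containing a selection number are choices made for this illustration, and numbers falling on a boat already chosen are ignored and drawn again, as described in the text.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def cluster_for_number(cumulative, number):
    """Return the 1-based cluster whose selection interval contains `number`."""
    return bisect_left(cumulative, number) + 1


def select_pps(sizes, n, rng):
    """Choose n distinct clusters using cumulative 'selection numbers'."""
    if n > len(sizes):
        raise ValueError("cannot select more clusters than exist")
    cumulative = list(accumulate(sizes))      # 5, 15, 22, 35, 50 for the boats
    total = cumulative[-1]
    chosen = []
    while len(chosen) < n:
        number = rng.randint(1, total)        # one selection number from 1..total
        cluster = cluster_for_number(cumulative, number)
        if cluster not in chosen:             # ignore boats already chosen
            chosen.append(cluster)
    return chosen


boxes = [5, 10, 7, 13, 15]                    # fish boxes carried by boats 1 to 5
rng = random.Random(2024)                     # arbitrary seed, for repeatability only
print(select_pps(boxes, 3, rng))
```

For example, selection number 16 falls in the interval 16–22 and therefore selects boat 3.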

*Horvitz-Thompson estimator - selection with or without replacement*

It is convenient, before describing this estimator and its sampling characteristics, to show how inclusion probabilities can be calculated.

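Let $\pi_i$ denote the probability of including cluster *i* in the sample, that is, the probability that cluster *i* is selected at least once in *n* extractions made with replacement.

To derive the relation between $\pi_i$ and $P_i$, note that the probability of not selecting cluster *i* in one extraction is $1 - P_i$, so the probability of not selecting it in any of *n* independent extractions is $(1 - P_i)^n$. Hence:

$$\pi_i = 1 - (1 - P_i)^n$$

Let us now consider the probability, $\pi_{ij}$, that both cluster *i* and cluster *j* are included in the sample.

In *n* independent extractions the probability of extracting neither cluster *i* nor cluster *j* will be $[1 - (P_i + P_j)]^n$. The probability that cluster *i* or cluster *j* (or both) is included in the sample is therefore $1 - [1 - (P_i + P_j)]^n$.

Alternatively, the same probability, that either cluster *i* or *j* be included in the sample, could also be expressed as the probability of including cluster *i* plus the probability of including cluster *j* minus the probability of including both *i* and *j*, that is, $(\pi_i + \pi_j) - \pi_{ij}$.

The two last expressions are different ways to refer to the same probability, thus:

$$\pi_i + \pi_j - \pi_{ij} = 1 - [1 - (P_i + P_j)]^n$$

Finally, the inclusion probability $\pi_{ij}$ can be calculated as:

$$\pi_{ij} = (\pi_i + \pi_j) - \{1 - [1 - (P_i + P_j)]^n\}$$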

The calculations of the inclusion probabilities, for the example presented previously, are shown below.

| Boat, *i* | Prob. of selection, $P_i$ | Prob. of inclusion, $\pi_i$ | $\pi_{ij}$, *j* = 1 | $\pi_{ij}$, *j* = 2 | $\pi_{ij}$, *j* = 3 | $\pi_{ij}$, *j* = 4 |
|---|---|---|---|---|---|---|
| 1 | 5/50 = 0.10 | 0.271 | | | | |
| 2 | 10/50 = 0.20 | 0.488 | 0.102 | | | |
| 3 | 7/50 = 0.14 | 0.364 | 0.074 | 0.139 | | |
| 4 | 13/50 = 0.26 | 0.595 | 0.128 | 0.240 | 0.175 | |
| 5 | 15/50 = 0.30 | 0.657 | 0.144 | 0.270 | 0.197 | 0.337 |
| Total | 1.00 | | | | | |

Calculations (with *n* = 3):

| Inclusion probabilities, $\pi_i$ | Joint inclusion probabilities, $\pi_{ij}$ |
|---|---|
| $\pi_1 = 1 - (1 - 0.10)^3 = 0.271$ | |
| $\pi_2 = 1 - (1 - 0.20)^3 = 0.488$ | $\pi_{12} = 0.271 + 0.488 - \{1 - [1 - (0.10 + 0.20)]^3\} = 0.102$ |
| $\pi_3 = 1 - (1 - 0.14)^3 = 0.364$ | $\pi_{13} = 0.271 + 0.364 - \{1 - [1 - (0.10 + 0.14)]^3\} = 0.074$ |
| | $\pi_{23} = 0.488 + 0.364 - \{1 - [1 - (0.20 + 0.14)]^3\} = 0.139$ |
| etc. | |
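
The inclusion probabilities of the example can be checked with a short Python sketch. It assumes *n* = 3 independent extractions with the selection probabilities of the five boats, and uses the relations π_i = 1 - (1 - P_i)^n and π_ij = π_i + π_j - {1 - [1 - (P_i + P_j)]^n}; the function name is hypothetical.

```python
def inclusion_probabilities(P, n):
    """Inclusion probabilities for n independent extractions with replacement.

    P: selection probabilities P_i of the clusters (summing to 1)
    Returns (pi, pij): pi is a list of pi_i, pij a dict keyed by 1-based
    pairs (i, j) with i < j.
    """
    pi = [1 - (1 - p) ** n for p in P]
    pij = {}
    for i in range(len(P)):
        for j in range(i + 1, len(P)):
            # Probability that cluster i or cluster j (or both) is included.
            either = 1 - (1 - (P[i] + P[j])) ** n
            pij[(i + 1, j + 1)] = pi[i] + pi[j] - either
    return pi, pij


P = [0.10, 0.20, 0.14, 0.26, 0.30]            # boats 1 to 5
pi, pij = inclusion_probabilities(P, n=3)

for boat, value in enumerate(pi, start=1):
    print(f"pi_{boat} = {value:.3f}")
for (i, j), value in sorted(pij.items()):
    print(f"pi_{i}{j} = {value:.3f}")
```

Because a boat can be extracted more than once, the inclusion probabilities π_i do not add up to the number of extractions.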

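The Horvitz-Thompson estimator of the total value of the population is:

$$\hat{Y} = \sum_{i=1}^{\nu} \frac{y_i}{\pi_i}$$

where $y_i$ is the total value of the variable in the distinct sampled cluster *i*, and $\nu$ is the number of distinct clusters in the sample (the effective sample size).

The estimator is unbiased and its sampling variance can be written as:

$$V[\hat{Y}] = \sum_{i=1}^{N} \frac{1 - \pi_i}{\pi_i} Y_i^2 + \sum_{i=1}^{N} \sum_{j \neq i} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j} Y_i Y_j$$

An unbiased estimate of this variance is:

$$v[\hat{Y}] = \sum_{i=1}^{\nu} \frac{1 - \pi_i}{\pi_i^2} y_i^2 + \sum_{i=1}^{\nu} \sum_{j \neq i} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j} \frac{y_i y_j}{\pi_{ij}}$$

Note that all the inclusion probabilities, $\pi_i$ and $\pi_{ij}$, should be different from zero.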
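
A minimal Python sketch of the Horvitz-Thompson calculation follows. It assumes the standard textbook forms: the total is the sum of *y_i*/π_i over the distinct sampled clusters, and the variance estimate combines a term in (1 - π_i)/π_i² with a term in the joint inclusion probabilities π_ij. The function name and the cluster totals are hypothetical; the inclusion probabilities are the rounded values of boats 2, 4 and 5 in the example.

```python
def horvitz_thompson(y, pi, pij):
    """Horvitz-Thompson total and variance estimate.

    y   : dict {cluster label: total y_i}, distinct sampled clusters only
    pi  : dict {cluster label: inclusion probability pi_i}
    pij : dict {(i, j) with i < j: joint inclusion probability pi_ij}
    """
    labels = sorted(y)
    Y_hat = sum(y[i] / pi[i] for i in labels)

    v_hat = sum((1 - pi[i]) / pi[i] ** 2 * y[i] ** 2 for i in labels)
    for pos, i in enumerate(labels):
        for j in labels[pos + 1:]:
            joint = pij[(i, j)]
            weight = (joint - pi[i] * pi[j]) / (pi[i] * pi[j] * joint)
            # Factor 2: the double sum over i != j counts each pair twice.
            v_hat += 2 * weight * y[i] * y[j]
    return Y_hat, v_hat


# Hypothetical totals for the distinct boats 2, 4 and 5.
y = {2: 100.0, 4: 130.0, 5: 150.0}
pi = {2: 0.488, 4: 0.595, 5: 0.657}
pij = {(2, 4): 0.240, (2, 5): 0.270, (4, 5): 0.337}
Y_hat, v_hat = horvitz_thompson(y, pi, pij)
print(Y_hat, v_hat)
```

The cross-product weights are negative whenever π_ij is smaller than π_i π_j, which is how this variance estimate can turn out negative for some samples.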

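These estimates of the variance can be negative. A way to avoid this inconvenience is as follows:

Calculate, for each effectively sampled (distinct) cluster *i*, the following $t_i$ statistic:

$$t_i = \nu \frac{y_i}{\pi_i}$$

where $\nu$ is the number of distinct clusters in the sample.

Each of the $t_i$ values calculated can be considered as an estimate of the total value, *Y*.

The mean of the $t_i$ over all clusters effectively sampled is the Horvitz-Thompson estimator of the total value:

$$\hat{Y} = \bar{t} = \frac{1}{\nu} \sum_{i=1}^{\nu} t_i = \sum_{i=1}^{\nu} \frac{y_i}{\pi_i}$$

The estimate, v[Ŷ], of the sampling variance of this estimator is:

$$v[\hat{Y}] = \frac{N - \nu}{N} \, \frac{s_t^2}{\nu}$$

where

$$\bar{t} = \frac{1}{\nu} \sum_{i=1}^{\nu} t_i$$

and

$$s_t^2 = \frac{1}{\nu - 1} \sum_{i=1}^{\nu} \left(t_i - \bar{t}\right)^2$$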
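
This alternative can be sketched in Python as below. The sketch assumes the form commonly given in sampling textbooks: *t_i* = ν*y_i*/π_i for each of the ν distinct clusters, with the variance estimated as [(*N* - ν)/*N*]·*s_t*²/ν, where *s_t*² is the sample variance of the *t_i*. The function name and the data are hypothetical.

```python
from statistics import mean, variance


def ht_alternative_variance(y, pi, N):
    """Horvitz-Thompson total with a variance estimate that cannot be negative.

    y  : totals y_i of the nu distinct sampled clusters (nu >= 2)
    pi : inclusion probabilities pi_i, in the same order
    N  : number of clusters in the population
    """
    nu = len(y)
    t = [nu * yi / p for yi, p in zip(y, pi)]   # each t_i estimates the total
    Y_hat = mean(t)                             # equals the sum of y_i / pi_i
    s_t_sq = variance(t)                        # divisor nu - 1
    v_hat = (N - nu) / N * s_t_sq / nu
    return Y_hat, v_hat


# Hypothetical totals for three distinct boats out of N = 5.
Y_hat, v_hat = ht_alternative_variance(
    [100.0, 130.0, 150.0], [0.488, 0.595, 0.657], N=5
)
print(Y_hat, v_hat)
```

Because it is built from a sample variance, this estimate is never negative, although it is no longer exactly unbiased.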