

2. The three worlds of sampling

2.1 THE POPULATION WORLD

Let us consider a population of several elements, where Yi is the value in element i of a characteristic represented by the variable Y. For example, if the total lengths of sardines from a landing box are measured, the characteristic, Y, is the total length of a sardine, and the ith sardine measured will have length Yi. In this case Y is a continuous variable. Another characteristic could be the age of the sardines, as it is measured in fisheries. In this case age is considered a discrete variable, Y, which can take the values 0, 1, 2, …, i, …

The distribution of the values of a characteristic in the population can be represented in the form of a list, a table, a function, a graph, etc. The distribution may be characterized by parameters, for instance, the mean, the variance, the standard deviation, the quantiles, etc. The population is usually unknown and, therefore, these parameters cannot be calculated.

Greek alphabet letters or Latin alphabet upper case letters will be used to denote the parameters of the population world.

Total and mean values

Populations can be finite or infinite. The total number of elements of a finite population is the size of the population and is denoted by N. When the number of elements of the population is very large, N can be considered as infinite. For example, the population of sardines landed in a country during one year is finite, but for some statistical purposes the population can be considered infinite.

The population mean of a characteristic Y is represented by Ȳ or µ.

If the population is finite the total value of the characteristic Y will be:

Y = Y1 + Y2 + … + YN = ΣYi

Then the mean will be:

µ = Ȳ = Y/N

In this case the total value can be expressed as:

Y = Nµ

Note that Y denotes not only the variable, but also the population total value.

Dispersion measures

Several measures of dispersion of the values of the characteristic in the population can be defined. The variance, the standard deviation, the coefficient of variation and the range are the most common ones.

The population variance of the characteristic Y is represented by σ².

In order to define the variance let us consider the deviation of a value Yi from the mean, that is:

Yi − µ

For finite populations the sum of squares of the deviations is represented by:

SS = Σ(Yi − µ)²

The variance is then defined as:

σ² = SS/N = (1/N) Σ(Yi − µ)²

A modified variance, S², is introduced in some sampling manuals, with the purpose of simplifying formulas and keeping the parallelism between the formulas in the population and the corresponding formulas in the samples.

S² is defined as:

S² = SS/(N − 1)

Note that (N − 1)S² = Nσ². The two variances are practically the same for large population sizes.

The population standard deviation of the characteristic Y is represented by σ (or S) and it is defined as σ = √σ² (or S = √S²). The standard deviation is also a measure of dispersion. Compared with the variance, it has the advantage of being expressed in the same units as the variable, but the variance is preferred in most cases for theoretical reasons.

The coefficient of variation is defined as CV = σ/µ and it is a relative measure of dispersion, making it possible to compare the dispersions of two populations with very different absolute values. For example, the lengths of sardines and the lengths of some tuna species have standard deviations with different absolute values, but in terms of CVs, that is, values relative to the means, the dispersions can be comparable.

The range, i.e., the difference between the largest and the smallest value of the population, is also a dispersion measure that can be useful in some cases.
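
As an illustration, when a whole (small) finite population is known these parameters can be computed directly; the values below are hypothetical, invented only for the sketch:

```python
import math

# Hypothetical finite population of sardine lengths (cm); N is kept small
# purely for illustration; real populations are usually unknown.
Y = [17.5, 18.2, 19.0, 19.5, 20.1, 20.8]
N = len(Y)

total = sum(Y)                        # population total, Y
mu = total / N                        # population mean, µ = Y/N
ss = sum((y - mu) ** 2 for y in Y)    # sum of squares of the deviations, SS
var = ss / N                          # population variance, σ² = SS/N
S2 = ss / (N - 1)                     # modified variance, S² = SS/(N − 1)
sigma = math.sqrt(var)                # standard deviation, σ
cv = sigma / mu                       # coefficient of variation, CV = σ/µ
rng = max(Y) - min(Y)                 # range
```

Note that the identity (N − 1)S² = Nσ² holds exactly in this computation, since both sides equal SS.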

Proportions

Some characteristics can be classified into two categories. For instance, fish maturity can be classified into “adults” and “not adults”. In these cases the proportion of the total population elements that belongs to one or the other category, as well as the number of elements in each category, are the parameters to be estimated. This type of characteristic is qualitative, that is, the characteristic of the elements is not measured but its quality (to belong or not to belong to a category) is observed. These characteristics are called attributes.

An attribute can be represented by a variable Y, which takes the value 1 if the element belongs to a category and 0 otherwise.

Let N be the number of elements in a finite population. The proportion of elements of the population that belongs to the category is represented by P and the proportion of the elements that does not belong to the category is Q = 1 − P. The product NP is the total number of elements belonging to the category and NQ is the total number of elements that do not belong to that category. Then the characteristic Y can be represented in the form:

Yi = 1 for the NP elements belonging to the category
Yi = 0 for the NQ elements not belonging to the category

with P + Q = 1

The population total value of the attribute Y is:

Y = (1 × NP) + (0 × NQ) = NP

The population mean of the characteristic Y is µ = Ȳ and can be calculated by:

µ = Y/N = NP/N = P

It is important to note that the proportion P of elements belonging to the category of interest is given by the mean of the characteristic Y. This result simplifies greatly the analysis of proportions, as most results can be obtained directly from those for mean values.

The population variance is obtained from:

σ² = (1/N) Σ(Yi − µ)² = P − P² = P(1 − P)

and hence σ² = PQ

The modified population variance is:

S² = NPQ/(N − 1)

The population standard deviation will be: σ = √(PQ) or S = √(NPQ/(N − 1))

Note that as previously mentioned the population is usually unknown and none of these parameters can be calculated.

2.2 THE SAMPLE WORLD

A sample of size n was drawn from a population with a variable Y. The observed value of the characteristic Y of the element i in the sample will be designated as yi. Therefore, a sample of size n will be formed by the values y1, y2, …, yn. The observed values can be sorted by size, as y(1) ≤ y(2) ≤ … ≤ y(i) ≤ … ≤ y(n), where the sub-indices indicate the orders of magnitude.

When the sample size is large, the values can be grouped into classes. The classes will be denoted by j = 1, 2, …, k, where k is the total number of classes. The class interval is the difference between the upper limit, yj+1, and the lower limit, yj, of class j, that is, yj+1 − yj.

In fisheries research, the classes should have constant intervals and the total number of classes should not be less than 12. The central value of the jth class, ycentral j, is:

ycentral j = (yj + yj+1)/2

The number of elements inside each class is the absolute frequency of the class. The quotient of the number of elements in each class by the total number of elements in the sample is the relative frequency, which is often expressed as a percentage (%) and sometimes as per thousand (‰). Frequencies can be accumulated and then they are called cumulative frequencies. For example, if we group the data into k classes (1, 2, 3, …, k), the cumulative frequency of the first class will be the frequency of class 1, the cumulative frequency of the second class will be the sum of the frequency of class 1 plus the frequency of class 2, and so on, up to the frequency of the last class, k, which will be the result of adding the frequencies of all classes 1, 2, …, k and should be equal to the size of the sample considering absolute frequencies or equal to 1 considering relative frequencies. The absolute, relative or cumulative frequencies can be graphically represented by histograms.
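
The grouping into classes and the three kinds of frequencies described above can be sketched as follows; the observations and class limits are hypothetical:

```python
# Group hypothetical length observations (cm) into classes of constant 0.5 cm interval.
data = [17.3, 17.7, 18.1, 18.1, 18.3, 19.4, 19.6, 20.1, 20.2, 21.7]
n = len(data)
lower, width, k = 17.0, 0.5, 10      # first lower limit, class interval, number of classes

abs_freq = [0] * k
for y in data:
    j = int((y - lower) // width)    # class index 0..k-1
    abs_freq[j] += 1

rel_freq = [f / n for f in abs_freq]   # relative frequencies (sum to 1)
cum_freq = []                          # cumulative absolute frequencies
running = 0
for f in abs_freq:
    running += f
    cum_freq.append(running)

central = [lower + (j + 0.5) * width for j in range(k)]   # class central values
```

The last cumulative absolute frequency equals the sample size, and the relative frequencies sum to 1, exactly as stated in the text.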

Values calculated from the sample data are called statistics. Latin alphabet lower case letters will be used to denote the statistics of the sample world.

The statistics of location describe the central position of the values of a characteristic, while the statistics of dispersion give an idea of the dispersion of the values in the sample. Examples of statistics of location are the arithmetic mean (commonly called mean or average), the median and the mode. The range, variance, standard deviation and coefficient of variation are examples of statistics of dispersion.

The total of the observed values is designated by y and it is calculated as:

y = y1 + y2 + … + yn = Σyi

Statistics of location
The mean

The arithmetic mean is the most common statistic of location. It is the quotient between the total value, y, and the sample size, n, that is:

ȳ = y/n = (1/n) Σyi

When the sample is organized in the form of a frequency table, the mean can be calculated as:

ȳ = (1/n) Σ fj ycentral j

where fj is the frequency of class j.

Note that Σfj = n or Σfj = 1 depending on whether the fj represent absolute or relative frequencies.

The median

In an ordered array, the median is defined as the value that separates the set of observations into two parts of equal sizes. When the sample is composed of an odd number of observations the median is the central value, that is, the element of order (n + 1)/2. When the sample has an even number of observations the median can be calculated as the midpoint between the (n/2)th and the (n/2 + 1)th observations.
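
A minimal sketch of the median rule for odd and even sample sizes:

```python
def median(values):
    """Median of a sample: the central value for odd n, the midpoint of the
    two central values for even n (the sample is sorted first)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                    # the element of order (n + 1)/2
    return (s[mid - 1] + s[mid]) / 2     # midpoint of the (n/2)th and (n/2 + 1)th
```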

Quantiles

The median is just one of a family of statistics called quantiles that divide the frequency distribution into several equal parts. The quantiles are designated as quartiles when the number of parts is four. The first quartile cuts the frequency distribution at 25% of the total, the second quartile at 50%, this being the median, as seen before, and the third quartile at 75% of the total.

Other quantiles are also used, for instance, deciles (division into 10 equal parts), percentiles (division into 100 equal parts) and per thousand parts (division into 1000 equal parts). The percentile of order p is the value below which the smallest p% of the observations of the frequency distribution fall. For example, the first quartile will be the percentile of order 25.

The mode

The mode refers to the most frequently observed value of the sample. When the observations are grouped into classes, the class with the highest frequency is called the modal class. In this case, the central value of this class can be taken as the mode.

Some distributions present several local modes, as for instance most of the length compositions of fish landings.

Statistics of dispersion

The range

The range is the difference between the largest and the smallest observed value in the sample, i.e. Range = ylargest - ysmallest.

The difference y75% - y25% is called the inter-quartile range, which is another useful statistic of dispersion.

The variance

The sample variance, s², is the quotient between two quantities, the sum of squares (ss) of the deviations of each observation yi from the arithmetic mean ȳ and the size of the sample minus one:

s² = ss/(n − 1)

and

ss = Σ(yi − ȳ)²

There are other expressions to calculate the sum of squares (ss):

ss = Σyi² − (Σyi)²/n = Σyi² − nȳ²
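
The equivalence of these expressions for the sum of squares can be checked numerically on a small hypothetical sample:

```python
# Check ss = Σ(yi − ȳ)² = Σyi² − (Σyi)²/n = Σyi² − n·ȳ² on hypothetical values.
y = [19.1, 19.3, 19.6, 20.1, 20.2]
n = len(y)
ybar = sum(y) / n

ss_deviations = sum((v - ybar) ** 2 for v in y)        # definition
ss_shortcut1 = sum(v * v for v in y) - sum(y) ** 2 / n # first shortcut
ss_shortcut2 = sum(v * v for v in y) - n * ybar ** 2   # second shortcut

s2 = ss_deviations / (n - 1)   # the sample variance
```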

The standard deviation

The standard deviation is the square root of the variance:

s = √s²

The coefficient of variation

The coefficient of variation is the quotient between the standard deviation and the arithmetic mean:

cv = s/ȳ

The parallelism between the sample statistics and the finite population parameters should be noted.

Proportions

As in the population, the proportion of elements of the sample belonging to the category of interest can be calculated for a sample taken from that population. The characteristic of interest, yi is then such that yi = 1 if element i belongs to the category or yi = 0 if it does not.

Under these conditions, the proportion of the sample elements belonging to the category is defined by the sample mean of y:

p = ȳ = (1/n) Σyi

The relation p + q = 1 is always valid and can be used to calculate q, which is the proportion of the sample elements that do not belong to the category.

The total value of the variable in the sample is np.
The sample variance can be calculated as:

s² = npq/(n − 1)

and the sample standard deviation is:

s = √(npq/(n − 1))
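
A minimal sketch of these formulas for an attribute sample (the observations below are hypothetical):

```python
import math

# Attribute sample: yi = 1 if element i belongs to the category, 0 otherwise
# (hypothetical observations).
y = [1, 0, 1, 1, 0, 1, 0, 1]
n = len(y)

p = sum(y) / n                  # sample proportion = sample mean of y
q = 1 - p
s2 = n * p * q / (n - 1)        # sample variance of an attribute
s = math.sqrt(s2)               # sample standard deviation
```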

Sample statistics - example

The original measurements of total lengths (expressed in cm) of a sample of 32 sardines landed from a trip of a purse seiner are presented in the following list, arranged in increasing order:

17.3, 17.7, 17.8, 18.1, 18.1, 18.3, 18.3, 18.7, 18.7, 19.1, 19.3, 19.3, 19.3, 19.4, 19.6, 19.7, 19.7, 19.8, 19.8, 20.1, 20.1, 20.1, 20.1, 20.1, 20.2, 20.2, 20.3, 20.6, 20.6, 20.6, 21.3, 21.7

These measurements were grouped into classes of 0.5 cm interval. The results are shown in Table 2.1.

Table 2.1
Distribution of total length measurements (in cm)

Class Interval (cm) | Class Central Value (cm) | Absolute Frequencies | Relative Frequencies (%) | Cumulative Absolute Frequencies
17.0-17.5 | 17.25 | 1 | 3 | 1
17.5-18.0 | 17.75 | 2 | 6 | 3
18.0-18.5 | 18.25 | 4 | 13 | 7
18.5-19.0 | 18.75 | 2 | 6 | 9
19.0-19.5 | 19.25 | 5 | 16 | 14
19.5-20.0 | 19.75 | 5 | 16 | 19
20.0-20.5 | 20.25 | 8 | 25 | 27
20.5-21.0 | 20.75 | 3 | 9 | 30
21.0-21.5 | 21.25 | 1 | 3 | 31
21.5-22.0 | 21.75 | 1 | 3 | 32
Total | | 32 | 100 | -

The following figures represent the histograms of the relative frequencies (%) (Figure 2.1) and of the cumulative absolute frequencies (Figure 2.2).

Figure 2.1
Histogram of relative frequencies of total lengths

Figure 2.2
Histogram of the cumulative absolute frequencies of total lengths

The most common statistics of location and dispersion were calculated from the original values:

Total value: y = Σyi = 624.0 cm

Arithmetic mean: ȳ = y/n = 624.0/32 = 19.5 cm

Median: 19.7 cm

Modal class: 20.0-20.5 cm

Range: ylargest − ysmallest = 4.4 cm

Sum of squares: ss = Σ(yi − ȳ)² = 34.74 cm²

Variance: s² = ss/(n − 1) = 34.74/31 = 1.12 cm²

Standard deviation: s = √s² = 1.05 cm

Coefficient of variation: cv = s/ȳ = 5.4%

Proportion of fish above 20 cm: p = 13/32 = 0.41
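
These statistics can be reproduced directly from the 32 original measurements; the only input is the data list copied from the text:

```python
import math

# The 32 total lengths (cm) from the text, already in increasing order.
lengths = [17.3, 17.7, 17.8, 18.1, 18.1, 18.3, 18.3, 18.7, 18.7, 19.1,
           19.3, 19.3, 19.3, 19.4, 19.6, 19.7, 19.7, 19.8, 19.8, 20.1,
           20.1, 20.1, 20.1, 20.1, 20.2, 20.2, 20.3, 20.6, 20.6, 20.6,
           21.3, 21.7]
n = len(lengths)                               # 32

total = sum(lengths)                           # total value, 624.0 cm
mean = total / n                               # arithmetic mean, 19.5 cm
med = (lengths[15] + lengths[16]) / 2          # median (even n), 19.7 cm
rng = max(lengths) - min(lengths)              # range, 4.4 cm
ss = sum((y - mean) ** 2 for y in lengths)     # sum of squares, about 34.74 cm²
s2 = ss / (n - 1)                              # variance, about 1.12 cm²
s = math.sqrt(s2)                              # standard deviation, about 1.05 cm
cv = s / mean                                  # coefficient of variation, about 5.4 %
p_above_20 = sum(1 for y in lengths if y > 20) / n   # proportion above 20 cm
```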

2.3 THE SAMPLING WORLD

Revision of probabilities and some useful distributions

Before starting the discussion of the world of sampling it would be advantageous to review some concepts of probabilities and some probability distributions, as the sampling world is mainly a world of probabilities.

Even if not rigorous, the concept of probability will be presented in a simple and practical way.

In a sample, the relative frequencies are calculated after taking the sample and observing or measuring the characteristics, i.e., the relative frequencies are calculated a posteriori.

Before the extraction of the sample, i.e. a priori, the concept of relative frequency should be replaced by a new concept, that of probability. For example, if one randomly selects a vessel of a fleet, there will be a certain probability that this vessel is a purse seiner. If we assume that 19% of the total vessels in a fishing harbour are purse seiners, the probability of randomly selecting a purse seiner will be 19% or 0.19. In that case, the properties of relative frequencies can be transformed into properties of probabilities.

In the theory of probabilities it is convenient to define random variables, that is, variables that have probabilities associated with their values. Mathematically, discrete and continuous variables are studied differently. Thus, the probability that a variable X takes a particular value, x, can be defined when X is a discrete variable. When X is a continuous variable, however, this probability is always equal to 0. For continuous variables, what is defined is not the probability that X will take the value x, but rather the probability that X will take a value within an interval of two values x1 and x2.

An example of a discrete variable is the age of the fishes in an age composition. The values of this variable (age) are 0, 1, 2, 3, etc. years, and are attributed to each fish otolith observed.

The probability of selecting a fish of a certain age from a box is associated with the number of fishes of that age in the box. However, age can also be an example of a continuous variable when it is taken as the time elapsed since birth up to the moment of capture. In this case, only the probability of a fish having an age within an interval of time is considered.

Probability and distribution functions of discrete variables

The probability function, also called probability mass function, P(x), defines the probability that a discrete variable, X, takes a value x:

P(x) = Pr {X = x}

The distribution function, also called probability distribution function, F(x), gives the probability that the variable X will take a value less than or equal to a certain value, x:

F(x) = Pr{X ≤ x}

i.e. F(x) = ∑P(xi), where the summation extends to all values less than or equal to x.

Two important parameters of the probability distribution are the mean, E, and the variance, V. The mean, which in probability theory is also called the expected value of X, is the sum of the products of the values, xi, times their probabilities, P(xi):

E[X] = ∑[xi P(xi)]

The variance is defined as the expected value of the square of the deviations of the values of variable X relative to its mean, that is:

V[X] = E[(X − E[X])²] = Σ[(xi − E[X])² P(xi)]

Another expression of the variance is:

V[X] = E[X²] − E²[X]

The standard deviation is the square root of the variance.
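
These definitions translate directly into a short computation; the age probabilities below are hypothetical:

```python
import math

# Hypothetical probability function P(x) of fish age x (years).
P = {0: 0.10, 1: 0.30, 2: 0.35, 3: 0.20, 4: 0.05}

E = sum(x * p for x, p in P.items())                   # E[X] = Σ xi·P(xi)
V = sum((x - E) ** 2 * p for x, p in P.items())        # V[X] = Σ (xi − E[X])²·P(xi)
V_alt = sum(x * x * p for x, p in P.items()) - E ** 2  # V[X] = E[X²] − E²[X]
sd = math.sqrt(V)                                      # standard deviation

assert abs(sum(P.values()) - 1) < 1e-9  # the probabilities must sum to 1
```

Both expressions for the variance give the same value, as the identity in the text states.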

If the variable X is continuous, the definition of the parameters E and V needs differential and integral calculus, as is shown below.

Density and distribution functions of continuous variables

In the case of a continuous variable X, a function, f(x), called a density function, or probability density function, is defined to obtain the probabilities and the distribution function. Among the properties of this function it is useful to mention that f(x) ≥ 0 and that the total area under the curve is equal to 1.

The distribution function, F(x), i.e. the probability that the variable X takes a value smaller than or equal to x, is therefore defined by:

F(x) = ∫ f(x) dx, with the integral taken from −∞ to x

Note that x is used as the reference value and also as the generic value.

The probability that X takes a value within the interval limited by the extremes x=x1 and x=x2 is given by:

Pr{x1 ≤ X ≤ x2} = F(x2) − F(x1)

The expected value (or mean) of X is defined as:

E[X] = ∫ x f(x) dx, with the integral taken over all values of X

The variance is defined as:

V[X] = E[(X − E[X])²] = ∫ (x − E[X])² f(x) dx

As for discrete variables, another expression of the variance is:

V[X] = E[X²] − E²[X]

The standard deviation is the square root of the variance.

Some probability distributions useful for sampling theory

The normal distribution

One of the most important distributions of the theory of probabilities is the normal distribution, also designated as De Moivre distribution (1733), Gauss distribution (1809) or Laplace distribution (1813).

It is a distribution of a continuous variable, X, characterized by the following density function:

f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²))

where −∞ < x < +∞ and the parameters µ and σ² are the mean and the variance.

The density function is symmetrical relative to the vertical ordinate passing through the mean µ.

The distribution function, F(x), the expected value, E[X], and the variance, V[X] could be calculated (by numerical methods) from the integral of the expression indicated above, but there is no need to present those methods in this manual.

The following notation is usually used to indicate that X follows a normal distribution:

X ~ N(µ, σ²)

The median and the mode of this distribution are equal to the mean, μ.

A useful theorem for sampling theory is:

If A and B are constants and X ~ N(µ, σ²)

then (BX + A) ~ N(Bµ + A, B²σ²)

The standard normal distribution

Applying the previous theorem with B = 1/σ and A = −µ/σ gives:

(X − µ)/σ ~ N(0, 1)

The new variable (X − µ)/σ is said to be a variable in standard measures and is, in this case, called the standard normal variable Z. Any normal distribution can be reduced to its standard form using the relation:

z = (x − µ)/σ

Note that this expression is equivalent to: x = µ + zσ

Some particular probability values of the normal distribution should be mentioned:

In terms of Z:                      In terms of X:
Pr{−1 < Z < +1} = 0.68              Pr{µ − 1σ < X < µ + 1σ} = 0.68
Pr{−1.96 < Z < +1.96} = 0.95        Pr{µ − 1.96σ < X < µ + 1.96σ} = 0.95
Pr{−2.58 < Z < +2.58} = 0.99        Pr{µ − 2.58σ < X < µ + 2.58σ} = 0.99

Note that if one is using a normal variable X then the probability that X takes values between x1 and x2 is equal to the probability that the standard normal variable Z takes values between the corresponding z1 = (x1 − µ)/σ and z2 = (x2 − µ)/σ.
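
These tabulated probabilities can be verified with the distribution function of the standard normal distribution; Python's statistics.NormalDist provides it directly:

```python
from statistics import NormalDist

Z = NormalDist(0, 1)   # the standard normal distribution

def prob_between(z1, z2):
    """Pr{z1 < Z < z2}, computed as F(z2) − F(z1) with the distribution function."""
    return Z.cdf(z2) - Z.cdf(z1)

# The particular values quoted above, to two decimal places:
assert round(prob_between(-1, 1), 2) == 0.68
assert round(prob_between(-1.96, 1.96), 2) == 0.95
assert round(prob_between(-2.58, 2.58), 2) == 0.99
```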

Figure 2.3 represents the density function of the standard normal distribution between Z=-3 and Z=+3.

In this graphical form, areas represent the probabilities. For instance, the probability that the standard normal variable Z takes values between z1 and z2 is the area limited by the curve, by the horizontal axis and by the two vertical ordinates passing through the values z=z1 and z=z2.

Remember that the total area under the density curve is equal to 1.

Figure 2.4 represents the distribution function of the standard normal variable Z.

Figure 2.3
Density function of the standard normal distribution

Figure 2.4
Distribution function of the standard normal variable Z

In this graphical form the probability that the variable Z is smaller than a value z is the ordinate passing through this z point.

Note that the probabilities are given by areas when using the density function and by ordinates when using the distribution function.

The t-student distribution

The t-student distribution was introduced by Gosset in 1908.

It is a distribution of a continuous variable with one parameter denoted by ν (degrees of freedom) and it is generally designated by t(ν) or tν.

The mathematical expressions of the density and distribution functions of the t-student variable are not discussed in this manual.

The graphical representation of the probability density function is similar to that of the standard normal variable, but more dispersed, with heavier tails. The dispersion depends on the number of degrees of freedom: the fewer the degrees of freedom, the larger the dispersion. The mean, the median and the mode of this distribution are all equal to zero.

The t-student distribution is important in statistical methods and in particular for calculating confidence intervals for parameters of the population.

Associated with the normal and t-student distributions are the chi-square (χ²) distribution and the F-distribution, which are useful in many statistical methods, but are not dealt with in this manual.

Bernoulli distribution

This distribution is attributed to Bernoulli (1713).

Consider a discrete variable X which takes the value 1 with probability P and the value 0 with probability Q = 1 − P. In symbolic terms this is:

Pr{X = 1} = P and Pr{X = 0} = Q, with P + Q = 1

The most important parameters of this distribution are:

Expected value: E[X] = µ = P

Variance: V[X] = σ² = PQ

Standard deviation: σ = √(PQ)

The binomial distribution, the multinomial distribution, and other probability distributions are associated with the Bernoulli distribution. They are, in certain cases, important to the sampling world. They are combinations of Bernoulli distributions, with the same parameter P in the case of the binomial distribution and with different parameters P in the case of the multinomial distribution.
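
A minimal sketch of the Bernoulli parameters and of the binomial connection (the value of P is hypothetical):

```python
import math

# Bernoulli variable: X = 1 with probability P, X = 0 with probability Q = 1 − P.
P = 0.19          # hypothetical value, e.g. the proportion of purse seiners
Q = 1 - P

mean = P                   # E[X] = µ = P
variance = P * Q           # V[X] = σ² = PQ
sd = math.sqrt(P * Q)      # σ = √(PQ)

# A binomial variable is the sum of n independent Bernoulli variables with the
# same P; its mean and variance are therefore nP and nPQ.
n = 50
binom_mean, binom_var = n * P, n * P * Q
```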

Introduction to the world of sampling

As previously mentioned an estimator is a statistic used to estimate a parameter.

Let us consider an estimator θ̂ of the population parameter θ. From a sample of size n, taken with a certain criterion, one can calculate a value for this estimator, which is called an estimate of θ.

Sampling distribution of an estimator

The set of estimates that could be calculated from all possible samples (selected with the same criteria) is, by definition, the sampling distribution of the estimator.

This sampling distribution is the basis for measuring the precision and the error of the estimation of the population parameter of interest.

The sampling distribution of an estimator (or in the general case of a statistic) is a probability distribution, because it is the expected distribution of all the possible samples, which could have been selected under the same conditions. Therefore, probability theory can be applied to obtain the properties of the sampling distributions.

The sampling distribution of an estimator θ̂ is denoted as:

θ̂ ~ F(E, V)

where F(E, V) indicates a probability or density distribution with expected mean, E, and expected variance, V.

In the case of an approximate distribution, the symbol ≈ will be used, as for instance:

θ̂ ≈ N(E, V)

Expected value of the estimator

Let us consider an estimator θ̂. The expected value or sampling mean of this estimator will be denoted by E[θ̂]. This expected value does not always coincide with the population parameter θ. The difference E[θ̂] − θ is called the bias. When E[θ̂] = θ the estimator, θ̂, is an unbiased estimator of the population parameter, θ.

Sampling variance and error of the estimator

The sampling variance, V[θ̂] or σ²[θ̂], of the estimator is the expected value E[θ̂ − E(θ̂)]². This means that the sampling variance measures the spread of the estimator around its expected value.

Another measure of dispersion is the mean square error, defined as MSE[θ̂] = E[θ̂ − θ]². The MSE is a measure of the dispersion of the sampling distribution of the estimator around the population parameter θ, while V[θ̂] is a measure of dispersion around the expected value of the estimator.

It can be proven that MSE[θ̂] = V[θ̂] + bias².

For an unbiased estimator the sampling variance is equal to the MSE.

The accuracy of an estimator refers to the difference between the estimator and the parameter θ, while the precision refers to the difference between the estimator and its expected value E[θ̂].

The sampling standard deviation is called the error of the estimator and is denoted by σ[θ̂]. Estimates of the sampling variance and of the error, denoted by s²[θ̂] and s[θ̂] respectively, can be obtained from the size and the variance of the sample.
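
A small simulation sketch (with hypothetical population values) illustrates these ideas for the sample mean as an estimator: the mean of many estimates approximates the population mean (unbiasedness) and their variance approximates σ²/n (the sampling variance):

```python
import random
import statistics

random.seed(42)

# Hypothetical population parameters and sample size.
mu, sigma, n = 19.5, 1.05, 30
replicates = 20000

# Estimates of µ from many samples selected under the same conditions:
estimates = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(replicates)]

expected_value = statistics.fmean(estimates)   # close to µ: the sample mean is unbiased
sampling_var = statistics.variance(estimates)  # close to σ²/n
error = sampling_var ** 0.5                    # the error of the estimator
```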

Confidence intervals

The sampling distribution of an unbiased estimator gives the opportunity to establish the probability that an interval, (l1, l2), calculated from the sample values, will contain the population parameter, θ.

The relation Pr{l1 ≤ θ ≤ l2} = C allows one to calculate the confidence interval (l1, l2), but the solution of this equation is not unique. One should take the smallest of all intervals corresponding to the adopted probability. This probability is the confidence level, C. The interval is the confidence interval, CI, and its extremes are the confidence limits.

In some cases, it is preferable to take an interval (not necessarily the smallest) corresponding to the probability C but with equal probabilities in both tails of the sampling distribution, which implies:

Pr{θ < l1} = (1 − C)/2 and Pr{θ > l2} = (1 − C)/2

or, alternatively, in the equivalent form:

Pr{l1 ≤ θ ≤ l2} = C

In practical terms, to calculate the limits l1 and l2 of a confidence interval one should adopt the desired level of confidence, C, and a sample size, n. Then, based on the sampling distribution of the estimator, the expressions of l1 and l2 can be derived. Finally, one selects a sample and estimates l1 and l2.

Let us consider the particular case where θ̂ has a normal sampling distribution, that is, θ̂ ~ N[E, V], the population variance, σ², is known and the sampling fraction is negligible. Then, according to sampling theory, the confidence limits are:

l1 = θ̂ − zσ[θ̂] and l2 = θ̂ + zσ[θ̂]

where z is the value of the standard normal distribution corresponding to the confidence level C and σ[θ̂] is the square root of V[θ̂].

The range of the confidence interval, l2 − l1, is in this case equal to 2zσ[θ̂].

If the population variance, σ², is unknown, the expression above becomes:

l1 = θ̂ − tn−1 s[θ̂] and l2 = θ̂ + tn−1 s[θ̂]

where tn−1 is the value of the t-student distribution, with n − 1 degrees of freedom, corresponding to the confidence level C, and s[θ̂] is the square root of the estimated sampling variance, calculated from the sample.

The confidence interval range, l2 − l1, is equal to 2tn−1 s[θ̂].

Note that for a large sample size, let us say n larger than 100, there is practically no difference between the t-distribution and the Z-distribution.

Let us take an example in which the estimator follows a normal sampling distribution. A sample of 100 elements gave an estimate of θ equal to 13.40 and a variance s² = 30.25. The error is the square root of s²/n = 30.25/100, i.e., 0.55.

A 95% confidence interval (symmetrical) for the parameter θ can be expressed as:

13.40 ± 1.96 × 0.55, or 13.40 ± 1.96 × error

The factor 1.96 corresponds to the 95% confidence level and is obtained from the normal distribution.

The confidence limits can be presented in different ways. Some authors present the limits in absolute values; others prefer to give the limits in relative terms. In the above example the 95% confidence limits in absolute terms would be:

13.40 − 1.08 = 12.32 and 13.40 + 1.08 = 14.48

or (12.32, 14.48)

Note that the interval should never be presented as the mean plus or minus the error, i.e. 13.40 ± 0.55, because, according to the sampling distribution of the estimator, the confidence limits are defined as the mean plus or minus (approximately) twice the error.

The limits in relative terms would be:

13.40 × (1 ± 0.0806)

where 0.0806 = 1.08/13.40

or 13.40 ± 8.06%, i.e. only ±8.06% of the estimated mean.
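
The numbers in this example can be reproduced step by step; the estimate 13.40, the variance 30.25 and the 95% factor 1.96 come from the text, the rest is arithmetic:

```python
import math

n = 100
estimate = 13.40                  # estimate of the parameter (from the text)
s2 = 30.25                        # sample variance (from the text)
error = math.sqrt(s2 / n)         # error of the estimator: 0.55
z = 1.96                          # factor for the 95 % confidence level

half_width = round(z * error, 2)  # 1.96 × 0.55 = 1.078, rounded to 1.08
l1 = estimate - half_width        # lower confidence limit, 12.32
l2 = estimate + half_width        # upper confidence limit, 14.48
relative = half_width / estimate  # limits in relative terms, about 8.06 %
```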

Sample size

As mentioned above, to calculate the confidence interval, one needs to establish the confidence level, C, and the sample size, n. Based on these assumptions one calculates the error and the confidence limits.

The same expressions can be used to estimate the sample size, but in this case it is necessary to start by adopting a confidence interval, CI, and the error. From the sampling distribution of the estimator one can then derive the expression that gives the sample size.

In the above example it is easy to see that:

CI = l2 - l1 = 2 × z × error

Then considering that the error is σ/√n, it will be:

CI = 2zσ/√n

and the sample size is:

n = (2zσ/CI)²

If the confidence interval were calculated with the t-student distribution, the calculation would be more tedious as the sample size n appears also in the degrees of freedom of t.

An iterative method can be used to solve the equation, which in this case is:

n = (2tn−1 s/CI)²

The first trial can be done with the Z distribution instead of the t-student distribution. The value of n obtained will then be the next value to use to obtain the tn-1 value corresponding to the confidence level C. The process is repeated until arriving at a convergence of the n values, with a given approximation.
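
The z-based first trial can be sketched as follows (all numbers are hypothetical); for the t-based interval one would recompute tn−1 from the resulting n and call the same function again with the t value in place of z, repeating until n converges:

```python
import math

def sample_size(sigma, ci_width, z=1.96):
    """Smallest n such that the confidence interval width 2·z·σ/√n does not
    exceed ci_width, i.e. n = (2·z·σ/CI)² rounded up to a whole number."""
    return math.ceil((2 * z * sigma / ci_width) ** 2)

# Hypothetical target: σ = 5.5 and a total interval width of 2.2 at 95 % confidence.
n = sample_size(5.5, 2.2)    # first trial with the Z distribution
```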

The following statements summarize the links between the confidence level, precision, error and confidence limits:

“The greater the error, the larger will be the confidence interval and the smaller the precision. The smaller the error, the shorter will be the confidence interval and the greater the precision”.

