# Basic Sampling Theory and Sampling Design

CONTENTS

1. Population, frame, sampling units, survey units
2. Method of selection
   - 2.1 Simple random sample (SRS)
3. Estimation of population mean from a sample and precision of estimate
   - 3.1 Estimation of population total and precision
   - 3.2 Sample size
4. Estimation of proportions and their uses
5. Stratified sampling
   - 5.1 Sample size in different strata
6. Ratio estimation
7. Unequal probability sampling
   - 7.1 Method of selection
   - 7.2 Method of estimation
8. Two stage sampling
   - 8.1 Selection of first stage units at random
   - 8.2 Selection of first-stage units with probability proportional to size (pps)

1. Population, Sampling Frame, Sampling Units, Survey Units

Whenever a survey is contemplated, it is first necessary to specify the units to be included in the survey and their geographical context. All rigorous sampling demands a subdivision of the material to be sampled into units, termed “sampling units”, which form the basis of the actual sampling process. Clear and unambiguous definition demands the existence or construction of a list of the sampling units (the sampling frame). In the case of a Catch Assessment Survey (traditional and artisanal fisheries), the following hierarchy of sampling units can be introduced:

- Primary sampling units (PSU's): landing places
- Secondary sampling units (SSU's): fishing economic units

Items of information on the survey characteristics are collected from the above SSU's, which are also called “survey units”.

For data collection, one of the following two survey methods can be used:

(a) The census method. This implies complete enumeration of the survey population, i.e., information is obtained from all the survey units in the population.

(b) The sampling method, in which information is obtained from a properly selected fraction of the units of the survey population. In large-scale surveys, the sample is selected from the existing sampling frame.

2. Selection of Sample Units

If there are N sampling units in the population and we want to draw a simple random sample¹ of size n, we can in principle work out all possible samples of size n and select one of them at random. The number of distinct samples of size n which can be selected from a population of N units is given by:

$${}^{N}C_{n} = \frac{N!}{n!\,(N-n)!}$$

where ! stands for factorial, e.g., 3! = 1 × 2 × 3. For example, if N = 4 and n = 2, the number of distinct samples which can be selected is given by:

$${}^{4}C_{2} = \frac{4!}{2!\,2!} = 6$$
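As a quick check of the counting rule, the sketch below evaluates N!/(n!(N−n)!) and also lists the samples explicitly for N = 4, n = 2. It assumes Python 3.8 or later (for `math.comb`); the function name `count_srs_samples` and the unit labels A to D are illustrative only.

```python
import math
from itertools import combinations


def count_srs_samples(population_size: int, sample_size: int) -> int:
    """Number of distinct samples of size n from N units: N! / (n! (N - n)!)."""
    return math.comb(population_size, sample_size)


# Hypothetical unit labels for a population of N = 4.
units = ["A", "B", "C", "D"]
all_samples = list(combinations(units, 2))

print(count_srs_samples(4, 2))
print(all_samples)
```

For the 28 landing sites used in the next example, the same function shows why full enumeration quickly becomes impractical even for modest N.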

In practice, when N is large, it is not possible to enumerate all possible distinct samples and then select one of them. Normally, a simple random sample is drawn unit by unit. The units in the population are marked serially from 1 to N. We then refer to a table of random numbers (see Appendix Table 1) and draw from this table a series of n numbers lying between 1 and N, taking care to reject numbers above N and not allowing the same numbers to appear in the series more than once. The units in the population marked as per the number selected in the series constitute our sample of n selected units. It has been proved that this method produces simple random samples.

Example

There are N=28 landing sites in a district. We want a simple random sample of n=5 landing sites.

Since N=28 is a two-digit number, we refer to any row of two-digit numbers in the Random Number Table. Referring to the first row of two-digit numbers, we find the consecutive numbers are: 23, 5, 14, 38, 97, 11, 43, 93, 49, 36, 7, etc.

Now select those that lie between 1 and 28, until we have selected a series of 5 numbers. The selected series is: 23, 5, 14, 11 and 7. The landing sites marked with these numbers in the population constitute our sample.

¹ This means that every unit in the population has an equal and non-zero probability of being selected in the sample.
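The unit-by-unit procedure can be sketched in code as follows. This is a minimal illustration, assuming the random numbers are supplied as a list read from the table; the function name `draw_srs_from_stream` is hypothetical.

```python
def draw_srs_from_stream(random_numbers, population_size, sample_size):
    """Pick the first `sample_size` distinct numbers in 1..N from a stream."""
    selected = []
    for number in random_numbers:
        # Reject numbers outside 1..N and numbers already drawn.
        if 1 <= number <= population_size and number not in selected:
            selected.append(number)
            if len(selected) == sample_size:
                return selected
    raise ValueError("random-number stream exhausted before the sample was complete")


# Two-digit numbers quoted in the example above.
stream = [23, 5, 14, 38, 97, 11, 43, 93, 49, 36, 7]
sample = draw_srs_from_stream(stream, population_size=28, sample_size=5)
print(sample)  # [23, 5, 14, 11, 7]
```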

3. Estimation of Population Mean from a Sample and Precision of Estimate

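If there are N units in the population and we measure a desired characteristic (y) on all units in the population, then we have the population total and the population mean:

$$Y = \sum_{i=1}^{N} y_i, \qquad \bar{Y} = \frac{Y}{N}$$

The variability in the measured characteristic among the population units is given by the variance per unit, $S_y^2$:

$$S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(y_i - \bar{Y}\right)^2$$

Now, if we draw a sample of n units from the N units in the population, we can define the sample mean:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

and the variance per unit in the sample is given by:

$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$

If the same method of measurement of the desired characteristic is employed both for the population units and the sample units, the absolute value of the precision of the sample mean is given by:

$$\left|\bar{y} - \bar{Y}\right|$$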

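Generally, the population mean $\bar{Y}$ is not known, and the main purpose of sampling is to get an estimate of $\bar{Y}$ from the sample and also to have a measure of the precision of that estimate. In SRS we can form ${}^{N}C_{n}$ samples (of n units) from a population of N units, and hence a series of ${}^{N}C_{n}$ sample means $\bar{y}$. The expected value $E(\bar{y})$ is equal to $\bar{Y}$, and thus $\bar{y}$ is an unbiased estimate of $\bar{Y}$. It has also been proved that in the case of SRS selection, the variance of $\bar{y}$ is given by:

$$S_{\bar{y}}^2 = \frac{N-n}{N}\cdot\frac{S_y^2}{n}$$

or,

$$S_{\bar{y}}^2 = (1-f)\,\frac{S_y^2}{n}, \qquad f = \frac{n}{N}$$

where $S_y^2$ is the population variance per unit and f is the sampling fraction. The standard error of the sample mean is given by:

$$S_{\bar{y}} = \sqrt{\frac{N-n}{N}}\cdot\frac{S_y}{\sqrt{n}}$$

or,

$$S_{\bar{y}} = \sqrt{1-f}\;\frac{S_y}{\sqrt{n}}$$

$S_{\bar{y}}$ measures the degree of scatter of the possible sample means around $\bar{Y}$. The smaller it is, the smaller the probability of a large deviation of $\bar{y}$ from $\bar{Y}$. For n > 30, it has been shown that at the 95% probability level the population mean will lie in the interval:

$$\bar{y} \pm 1.96\,S_{\bar{y}}$$

Thus we see that $S_{\bar{y}}$ provides a measure of the precision of the sample estimate.

We generally do not know $S_y$, which is needed to calculate $S_{\bar{y}}$. In SRS, an unbiased estimate of $S_y^2$ is provided by the sample variance per unit, $s_y^2$, so that $S_{\bar{y}}$ is estimated by $s_{\bar{y}} = \sqrt{1-f}\;s_y/\sqrt{n}$.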

3.1 Estimation of Population Total and Precision

Example 3.1a

In a landing site, 30 boats land their catch on a particular day, and the catches (yi) of 10 boats selected at random are examined. Estimate the total catch of the day and its standard error and coefficient of variation: N = 30; n = 10.

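| Sample boat | Catch (kg), $y_i$ | $y_i^2$ |
|---|---|---|
| 1 | 12 | 144 |
| 2 | 8 | 64 |
| 3 | 4 | 16 |
| 4 | 6 | 36 |
| 5 | 0 | 0 |
| 6 | 16 | 256 |
| 7 | 5 | 25 |
| 8 | 9 | 81 |
| 9 | 11 | 121 |
| 10 | 9 | 81 |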

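The various estimates, using $\sum y_i = 80$ and $\sum y_i^2 = 824$, are:

$$\bar{y} = \frac{80}{10} = 8 \text{ kg}, \qquad s_y^2 = \frac{824 - 10 \times 8^2}{9} = 20.44$$

$$\hat{Y} = N\,\bar{y} = 30 \times 8 = 240 \text{ kg}$$

$$v(\hat{Y}) = N^2\,\frac{N-n}{N}\cdot\frac{s_y^2}{n} = 900 \times \frac{20}{30} \times \frac{20.44}{10} = 1\,226.7$$

$$s_{\hat{Y}} = \sqrt{1\,226.7} = 35.0 \text{ kg}, \qquad cv(\hat{Y}) = \frac{35.0}{240} \times 100 = 14.6\%$$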
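The calculation for this example can be reproduced with a short script. It is a sketch that assumes the SRS formulas of Section 3, with the finite population factor (N − n)/N retained, and estimates the total as N times the sample mean; the function name `srs_total_estimate` is illustrative.

```python
import math


def srs_total_estimate(sample_values, population_size):
    """Estimate a population total from a simple random sample drawn without replacement."""
    n = len(sample_values)
    sample_mean = sum(sample_values) / n
    # Sample variance per unit, with the n - 1 divisor.
    sample_var = sum((y - sample_mean) ** 2 for y in sample_values) / (n - 1)
    # Variance of the mean, including the factor (N - n) / N.
    var_mean = (population_size - n) / population_size * sample_var / n
    total = population_size * sample_mean
    var_total = population_size ** 2 * var_mean
    se_total = math.sqrt(var_total)
    return {
        "mean": sample_mean,
        "total": total,
        "var_total": var_total,
        "se_total": se_total,
        "cv_percent": 100 * se_total / total,
    }


catches_kg = [12, 8, 4, 6, 0, 16, 5, 9, 11, 9]
result = srs_total_estimate(catches_kg, population_size=30)
print(result)
```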

3.2 Sample Size

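In Section 3 we have seen that the estimated variance of the sample mean is:

$$s_{\bar{y}}^2 = \frac{N-n}{N}\cdot\frac{s_y^2}{n}$$

Therefore,

$$s_{\bar{y}} = \sqrt{\frac{N-n}{N}}\cdot\frac{s_y}{\sqrt{n}}$$

When N is large,

$$s_{\bar{y}} \approx \frac{s_y}{\sqrt{n}}$$

Now, for large N, at the 95% probability level, the population mean will lie within the interval $\bar{y} \pm 1.96\,s_{\bar{y}}$, or roughly within $\bar{y} \pm 2\,s_{\bar{y}}$. Therefore, $\dfrac{2\,s_{\bar{y}}}{\bar{y}} \times 100$ represents the percentage accuracy of the mean at the 5% significance level.

Thus, setting this quantity equal to a and solving for n, the sample size required for an a% accuracy of the mean at the 5% significance level is given by:

$$n = \left(\frac{200\,s_y}{a\,\bar{y}}\right)^2$$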

Example 3.2a

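In a survey, a sample of n = 18 gave a mean of $\bar{y}$ = 589.44 kg and $s_y$ = 531.79 kg. How many units would be needed if it were desired to estimate the mean, at the 5% significance level, (a) within 10%, (b) within 5%, and (c) within 1% of the population mean?

We have,

$$\frac{s_y}{\bar{y}} = \frac{531.79}{589.44} = 0.9022$$

Therefore,

$$n = \left(\frac{200 \times 0.9022}{a}\right)^2 = \left(\frac{180.44}{a}\right)^2$$

(a) Number of units required for getting $\bar{y}$ with an accuracy of 10% is,

$$n = (18.044)^2 = 325.6, \text{ i.e., 326 units}$$

(b) For an accuracy of 5%,

$$n = (36.088)^2 = 1\,302.3, \text{ i.e., 1 303 units}$$

(c) For an accuracy of 1%,

$$n = (180.44)^2 \approx 32\,558, \text{ i.e., about 32 560 units}$$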
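The sample-size rule of this section can be wrapped in a small helper. The sketch below assumes a large population (no finite population factor) and the rough multiplier 2 in place of 1.96, as in the text; the function name and the `multiplier` parameter are illustrative, and results are rounded up to whole units.

```python
import math


def sample_size_for_accuracy(mean, sd, accuracy_pct, multiplier=2.0):
    """Units needed so that multiplier * s / (sqrt(n) * mean) * 100 <= accuracy_pct."""
    relative_sd = sd / mean
    return math.ceil((multiplier * 100 * relative_sd / accuracy_pct) ** 2)


# Values from Example 3.2a.
for accuracy in (10, 5, 1):
    print(accuracy, sample_size_for_accuracy(589.44, 531.79, accuracy))
```

Halving the tolerated error roughly quadruples the required sample, which is why the 1% target is so much more demanding than the 10% target.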

Example 3.2b

In Example 3.1a, if we had desired an estimate with a cv of 5%, what size of sample would be needed?

We have,

Therefore,

and,

4. Estimation of Proportions and Their Uses

Let there be N units in the population, of which $N_i$ belong to class i, so that the proportion belonging to class i is $P_i = N_i/N$. We want to estimate $N_i$ and $P_i$ from a simple random sample of n units, of which $n_i$ are in class i, so that $p_i = n_i/n$.

It has been shown that an unbiased estimate $\hat{P}_i$ of $P_i$ is given by $p_i$, so that $\hat{P}_i = p_i = n_i/n$, and an unbiased estimate of $N_i$ (where $N_i$ is the number in class i in the population) is given by $\hat{N}_i = N\,p_i$.

An unbiased estimate of the variance of $p_i$ is given by:

$$v(p_i) = \frac{N-n}{N}\cdot\frac{p_i\,q_i}{n-1}, \qquad q_i = 1 - p_i$$

When n/N is small, i.e., n is small compared to N, or N is very large,

$$v(p_i) \approx \frac{p_i\,q_i}{n-1}$$

An unbiased estimate of the variance of $\hat{N}_i$ is given by:

$$v(\hat{N}_i) = N^2\,v(p_i) = N(N-n)\,\frac{p_i\,q_i}{n-1}$$

If the magnitude of N is itself an estimate, the estimated variance of Ni is given by:

Example 4.1

A random sample of 82 boats was taken out of 820 boats. It was found that 32 were using lines. Estimate the proportion and number of boats using lines.
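A possible worked solution for this example is sketched below. It assumes the usual unbiased variance estimator for a proportion under SRS without replacement, (N − n)/N · pq/(n − 1); the function name `proportion_estimate` is illustrative.

```python
import math


def proportion_estimate(class_count, sample_size, population_size):
    """Estimate a class proportion, the class total, and their variances under SRS."""
    p = class_count / sample_size
    q = 1 - p
    # Variance of p with the finite population factor (N - n) / N.
    var_p = (population_size - sample_size) / population_size * p * q / (sample_size - 1)
    n_hat = population_size * p
    var_n_hat = population_size ** 2 * var_p
    return p, n_hat, var_p, var_n_hat


p, n_hat, var_p, var_n_hat = proportion_estimate(32, 82, 820)
print(f"p = {p:.3f}, estimated boats using lines = {n_hat:.0f}")
print(f"standard error of the estimated number = {math.sqrt(var_n_hat):.1f}")
```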

Example 4.2

The number of cod landed was 2 000. A sample of 100 cod was taken and their ages determined; the distribution is as follows:

| Age | 8 | 9 | 10 | 11 | 12 | Total |
|---|---|---|---|---|---|---|
| Number ($n_i$) | 14 | 54 | 7 | 19 | 6 | 100 |

Find out the estimated number of cods in each age group in the total landings and the variance of these estimates.

Here we have: N = 2 000; n = n1 + n2 + n3 + n4 + n5 = 100

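| Age | 8 | 9 | 10 | 11 | 12 | Total |
|---|---|---|---|---|---|---|
| $n_i$ | 14 | 54 | 7 | 19 | 6 | 100 |
| $p_i$ | 0.14 | 0.54 | 0.07 | 0.19 | 0.06 | |
| $q_i$ | 0.86 | 0.46 | 0.93 | 0.81 | 0.94 | |
| $p_i q_i$ | 0.12 | 0.25 | 0.07 | 0.15 | 0.06 | |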
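The estimated numbers at age and their variances can be computed as sketched below. The variance formula used, N(N − n) p q/(n − 1), is an assumption (the unbiased SRS form with the finite population factor); variable names are illustrative.

```python
ages = [8, 9, 10, 11, 12]
counts = [14, 54, 7, 19, 6]   # n_i from the age sample
N = 2000                      # cod landed
n = sum(counts)               # cod aged

estimates = {}
for age, n_i in zip(ages, counts):
    p = n_i / n
    q = 1 - p
    number_at_age = N * p
    # Assumed form: v(N_i) = N (N - n) p q / (n - 1).
    variance_at_age = N * (N - n) * p * q / (n - 1)
    estimates[age] = (number_at_age, variance_at_age)

for age, (number_at_age, variance_at_age) in estimates.items():
    print(age, round(number_at_age), round(variance_at_age, 1))
```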

5. Stratified Sampling

It has been seen that in simple random sampling the variance of the mean, $v(\bar{y})$, depends, apart from the sample size n, on the variability of the characteristic in the population, i.e., on $S_y^2$. If the population is heterogeneous, i.e., measurements vary considerably from one unit to another, then by using auxiliary information it may be possible to divide it into sub-populations (or strata), each of which is internally homogeneous.

Let us suppose that there are N units in the population and these are stratified into k strata with Ni units in the ith stratum. Let a sample of n units be drawn, of which ni are from the ith stratum. Let yij be the measurement of the jth unit in the ith stratum.

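Then we have the following for the sample mean and the sample variance in the ith stratum:

$$\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}, \qquad s_i^2 = \frac{1}{n_i-1}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2$$

and, for the estimated total of the ith stratum,

$$\hat{Y}_i = N_i\,\bar{y}_i$$

We also have, for the estimated population total and the estimated mean per unit,

$$\hat{Y} = \sum_{i=1}^{k} N_i\,\bar{y}_i, \qquad \bar{y}_{st} = \frac{\hat{Y}}{N}$$

The unbiased estimates of variances are:

$$v(\hat{Y}) = \sum_{i=1}^{k} N_i^2\,\frac{N_i-n_i}{N_i}\cdot\frac{s_i^2}{n_i}, \qquad v(\bar{y}_{st}) = \frac{v(\hat{Y})}{N^2}$$

If the sampling fraction $n_i/N_i$ is negligible for all strata, then we have:

$$v(\hat{Y}) = \sum_{i=1}^{k} \frac{N_i^2\,s_i^2}{n_i} \qquad (5.5)$$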

Example 5

Out of 200 boats in a district, 70 were engaged in line fishing, 120 in gillnet fishing, and 10 in beach-seine fishing. For the purpose of estimating catch, 5 line fishing boats, 7 gillnet boats, and 3 beach-seine boats were selected, and their catches in tons for the month of January were noted as follows:

- Line boats: 2, 3, 4, 5, 6
- Gillnet boats: 7, 8, 9, 10, 12, 13, 11
- Beach-seine boats: 20, 23, 26

What was the estimated total catch in the district in January and the variance of the estimates? What is the mean catch per boat and its variance?

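Here, retaining the factor $(N_i - n_i)/N_i$ in each stratum, we have:

| Stratum | $N_i$ | $n_i$ | $\bar{y}_i$ (t) | $s_i^2$ |
|---|---|---|---|---|
| Line boats | 70 | 5 | 4 | 2.5 |
| Gillnet boats | 120 | 7 | 10 | 4.67 |
| Beach-seine boats | 10 | 3 | 23 | 9 |

$$\hat{Y} = 70 \times 4 + 120 \times 10 + 10 \times 23 = 1\,710 \text{ t}$$

$$v(\hat{Y}) = 70^2 \times \frac{65}{70} \times \frac{2.5}{5} + 120^2 \times \frac{113}{120} \times \frac{4.67}{7} + 10^2 \times \frac{7}{10} \times \frac{9}{3} = 2\,275 + 9\,040 + 210 = 11\,525$$

The mean catch per boat and its variance are:

$$\bar{y}_{st} = \frac{1\,710}{200} = 8.55 \text{ t}, \qquad v(\bar{y}_{st}) = \frac{11\,525}{200^2} = 0.288$$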

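Note: If there had been no stratification, and we had chosen a simple random sample of 15 units whose catches were as in Example 5 ($\sum y = 159$, $\sum y^2 = 2\,423$), we would have:

$$\bar{y} = \frac{159}{15} = 10.6 \text{ t}$$

and,

Ŷ = 10.6 × 200 = 2 120 t

Therefore,

$$s_y^2 = \frac{2\,423 - 15 \times 10.6^2}{14} = 52.69$$

and,

$$v(\hat{Y}) = N^2\,\frac{N-n}{N}\cdot\frac{s_y^2}{n} = 200^2 \times \frac{185}{200} \times \frac{52.69}{15} = 129\,958$$

Therefore,

$$cv(\hat{Y}) = \frac{\sqrt{129\,958}}{2\,120} \times 100 = 17.0\%$$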

Clearly, by stratification, we have obtained an estimate with lower cv(Ŷ) than in the case of a simple random selection.
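The comparison between the stratified and the unstratified treatment of the Example 5 data can be checked numerically. The sketch below assumes the variance estimators with the finite population factor retained in both cases; function names are illustrative.

```python
import math
from statistics import mean, variance


def stratified_total(strata):
    """strata: list of (stratum size N_i, list of sampled values)."""
    total = 0.0
    var_total = 0.0
    for size, sample in strata:
        n_i = len(sample)
        total += size * mean(sample)
        # N_i^2 * (N_i - n_i)/N_i * s_i^2 / n_i
        var_total += size ** 2 * (size - n_i) / size * variance(sample) / n_i
    return total, var_total


def srs_total(population_size, sample):
    """Same data treated as one simple random sample."""
    n = len(sample)
    total = population_size * mean(sample)
    var_total = population_size ** 2 * (population_size - n) / population_size * variance(sample) / n
    return total, var_total


strata = [
    (70, [2, 3, 4, 5, 6]),               # line boats
    (120, [7, 8, 9, 10, 12, 13, 11]),    # gillnet boats
    (10, [20, 23, 26]),                  # beach-seine boats
]
st_total, st_var = stratified_total(strata)
pooled = [y for _, sample in strata for y in sample]
un_total, un_var = srs_total(200, pooled)

cv_stratified = 100 * math.sqrt(st_var) / st_total
cv_unstratified = 100 * math.sqrt(un_var) / un_total
print(st_total, round(st_var), round(cv_stratified, 1))
print(un_total, round(un_var), round(cv_unstratified, 1))
```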

5.1 Sample Size in Different Strata

In Example 5, we selected a sample of 15 units, and the allocation of number of units in the different strata was done arbitrarily.

Now, when the sampling fraction is negligible, we know from equation (5.5) that the variance of the estimated population total is given by:

$$v(\hat{Y}) = \sum_{i=1}^{k} \frac{N_i^2\,s_i^2}{n_i}$$

This equation suggests two methods of allocation of n among the different strata:

(a) Proportional allocation:

In this method, $n_i$ is proportional to $N_i$. If the within-stratum variances are equal, the method gives the smallest sampling variance, i.e., the most efficient estimates. Generally, proportional allocation is used when information on the strata variances is not available.

(b) Optimum allocation:

When the within-strata variances differ greatly from stratum to stratum, the proportional allocation no longer provides best estimates. In such cases, it is better that the sampling fraction is taken proportional to the stratum standard deviation.

For further details on these, one is referred to books of sampling designs (e.g., Yates, Bazigos, 1974).
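The two allocation rules can be sketched as follows. The formulas used are the standard textbook forms and are assumptions here: $n_i = n\,N_i/N$ for proportional allocation, and $n_i = n\,N_i S_i/\sum_j N_j S_j$ for optimum allocation with equal sampling cost per unit, which makes the sampling fraction $n_i/N_i$ proportional to $S_i$. Function names are illustrative, and the results are left unrounded.

```python
import math


def proportional_allocation(stratum_sizes, n):
    """n_i = n * N_i / N."""
    total = sum(stratum_sizes)
    return [n * size / total for size in stratum_sizes]


def optimum_allocation(stratum_sizes, stratum_sds, n):
    """n_i = n * N_i * S_i / sum(N_j * S_j), assuming equal cost per unit."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * weight / total for weight in weights]


# Strata sizes from Example 5, with the sample variances used as rough guesses of S_i^2.
sizes = [70, 120, 10]
sds = [math.sqrt(2.5), math.sqrt(28 / 6), 3.0]
prop = proportional_allocation(sizes, 15)
opt = optimum_allocation(sizes, sds, 15)
print([round(v, 2) for v in prop])
print([round(v, 2) for v in opt])
```

In practice the allocations would be rounded to whole units, usually with a minimum of two units per stratum so that a within-stratum variance can be estimated.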

Example 5.1

The following catches (kg) were obtained in 18 hauls of a trawl survey:

- Hauls 1–10: 200, 440, 600, 640, 700, 800, 900, 1 020, 1 600, 1 920
- Hauls 11–15: 20, 10, 340, 400, 720
- Hauls 16–18: 40, 100, 160

(a) If the trawl net covered 40 ha per haul, if 50% of all fish in its path was caught, and if the total survey area was $6 \times 10^{6}$ ha, estimate the total abundance of fish.

(b) If the first 10 hauls were taken in depths 0–20 m, the next 5 in depths 20–40 m, and the last three in depths over 40 m and the areas of the depth zones are $1 \times 10^{6}$, estimate of abundance?

(c) Find the variances of the above two estimates.

Solution

(a) Unstratified Sample

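Let $\bar{y}$ be the mean catch per haul. If a is the area swept by each haul, the catch per hectare is $\bar{y}/a$. Since the net catches only 50%, i.e., the catchability coefficient q is 1/2, the density of stock per hectare is $\bar{y}/(aq)$.

Therefore, writing $\hat{B}$ for the estimated abundance in the survey area A, the estimate and its variance are:

$$\hat{B} = \frac{A}{aq}\,\bar{y}$$

and,

$$v(\hat{B}) = \left(\frac{A}{aq}\right)^2 \frac{s_y^2}{n}$$

where n is the number of sample hauls.

Now we have,

$$\bar{y} = \frac{10\,610}{18} = 589.44 \text{ kg}, \qquad s_y = 531.79 \text{ kg}, \qquad \frac{A}{aq} = \frac{6 \times 10^{6}}{40 \times 0.5} = 3 \times 10^{5}$$

$$\hat{B} = 3 \times 10^{5} \times 589.44 = 1.768 \times 10^{8} \text{ kg}$$

$$v(\hat{B}) = \left(3 \times 10^{5}\right)^2 \times \frac{(531.79)^2}{18} = 1.414 \times 10^{15} \text{ kg}^2$$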
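The unstratified calculation can be reproduced with the sketch below. It assumes the swept-area reasoning of the text (density = mean catch divided by a·q, raised to the survey area A) and ignores the finite population factor; the function name `swept_area_abundance` is illustrative.

```python
import math
from statistics import mean, variance


def swept_area_abundance(catches, survey_area, swept_area, catchability):
    """Raise the mean catch per haul to the survey area; returns (estimate, variance)."""
    raising_factor = survey_area / (swept_area * catchability)
    n = len(catches)
    abundance = raising_factor * mean(catches)
    var_abundance = raising_factor ** 2 * variance(catches) / n
    return abundance, var_abundance


hauls_kg = [200, 440, 600, 640, 700, 800, 900, 1020, 1600, 1920,
            20, 10, 340, 400, 720,
            40, 100, 160]
abundance, var_abundance = swept_area_abundance(
    hauls_kg, survey_area=6e6, swept_area=40, catchability=0.5
)
print(f"abundance = {abundance:.4g} kg")
print(f"cv = {100 * math.sqrt(var_abundance) / abundance:.1f}%")
```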

(b) Stratified Sample

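In this case, with $A_i$ the area of the ith depth zone, $n_i$ the number of hauls taken in it, and $\bar{y}_i$ and $s_i^2$ the mean and variance of the catch per haul in that zone,

$$\hat{B}_{st} = \sum_i \frac{A_i}{aq}\,\bar{y}_i, \qquad v(\hat{B}_{st}) = \sum_i \left(\frac{A_i}{aq}\right)^2 \frac{s_i^2}{n_i}$$

The numerical calculations may be done conveniently in a tabular fashion: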

6. Ratio Estimation

This is another method in which use is made of auxiliary information to increase the precision. Let us suppose we have selected at random n units out of the N units in the population, and for each of these selected units we have measured (x, y), where y is the survey variate and x is another, correlated variate. The population total of the x-variate is known to be:

$$X = \sum_{i=1}^{N} x_i$$

but y may not be known for each unit of the population except for those in the sample. In this case, an estimate of the population total Y of the survey variate is given by $\hat{Y}_{rat} = \hat{R}\,X$, where the estimate $\hat{R}$ is obtained from the sample as:

$$\hat{R} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} = \frac{\bar{y}}{\bar{x}}$$

The variance of the ratio estimate $\hat{Y}_{rat}$ is given by:

$$v(\hat{Y}_{rat}) = \frac{N(N-n)}{n}\left(s_y^2 + \hat{R}^2 s_x^2 - 2\,\hat{R}\,r\,s_x s_y\right) \qquad (6.1)$$

where $s_x^2$ and $s_y^2$ are the sample variances of x and y, and r is the estimated coefficient of correlation between x and y.

Example 6.1

There are 50 landing centres in a country where shrimp trawlers land. The shrimp trawlers are registered, and their total number is known from the Registration Record to be 280. Now, 5 landing centres are selected at random, and the catch (y) and the number of trawlers (x) at each of the 5 landing centres are obtained. Make a ratio estimate $\hat{Y}_{rat}$ of the total landings by the shrimp trawlers in the country.

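We have,

- Landing centres: total N = 50; sample n = 5
- Trawlers: total X = 280

| Sample landing centre | No. of trawlers (x) | Catch (y) (t) | x² | y² | xy |
|---|---|---|---|---|---|
| 1 | 2 | 22 | 4 | 484 | 44 |
| 2 | 10 | 95 | 100 | 9 025 | 950 |
| 3 | 7 | 62 | 49 | 3 844 | 434 |
| 4 | 3 | 33 | 9 | 1 089 | 99 |
| 5 | 8 | 83 | 64 | 6 889 | 664 |
| Total | 30 | 295 | 226 | 21 331 | 2 191 |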

Therefore, $\hat{R} = 295/30 = 9.83$ t per trawler, and $\hat{Y}_{rat} = \hat{R}\,X = 9.83 \times 280 = 2\,752.40$ t

and from equation (6.1),
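The ratio estimate and its variance for this example can be checked with the sketch below. It assumes the usual approximate variance of a ratio estimate under SRS, written with the sample covariance $s_{xy}$ in place of $r\,s_x s_y$ (the two are identical); the function name `ratio_estimate` is illustrative. Because $\hat{R}$ is not rounded to 9.83 in the code, the total comes out slightly above 2 752.40 t.

```python
import math


def ratio_estimate(x, y, population_size, x_total):
    """Ratio estimate of the population total of y and its approximate variance."""
    n = len(x)
    r_hat = sum(y) / sum(x)
    y_rat = r_hat * x_total
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    # s_y^2 + R^2 s_x^2 - 2 R s_xy, where s_xy = r * s_x * s_y.
    residual_var = s_yy + r_hat ** 2 * s_xx - 2 * r_hat * s_xy
    var_y_rat = population_size * (population_size - n) / n * residual_var
    return r_hat, y_rat, var_y_rat


trawlers = [2, 10, 7, 3, 8]
catch_t = [22, 95, 62, 33, 83]
r_hat, y_rat, var_y_rat = ratio_estimate(trawlers, catch_t, population_size=50, x_total=280)
print(round(r_hat, 2), round(y_rat, 1), round(var_y_rat, 1))
print(f"cv = {100 * math.sqrt(var_y_rat) / y_rat:.1f}%")
```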

7. Unequal Probability Sampling

We have seen that stratification and ratio estimation can increase the precision of an estimate. Another technique used for this purpose is pps sampling, i.e., sampling in which the units are selected with probabilities proportional to their sizes. It is widely used where sampling of clusters is preferred to direct sampling of individual units, either because it is more economical to sample a fixed number of individual units when they are grouped in clusters, or because a reliable frame of individual units is sometimes not available.

7.1 Method of Selection

Suppose there are 10 landing sites, with the number of boats at each landing site shown in Col. 2. We want to select 3 sites with pps.

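| Landing site (1) | No. of boats (2) | Cumulative total (3) | Allotted numbers (4) | Selected random no. and fishing site (5) |
|---|---|---|---|---|
| 1 | 12 | 12 | 001–012 | Random no. 011, fishing site 01 |
| 2 | 5 | 17 | 013–017 | |
| 3 | 20 | 37 | 018–037 | Random no. 027, fishing site 03 |
| 4 | 2 | 39 | 038–039 | |
| 5 | 30 | 69 | 040–069 | Random no. 064, fishing site 05 |
| 6 | 15 | 84 | 070–084 | |
| 7 | 8 | 92 | 085–092 | |
| 8 | 6 | 98 | 093–098 | |
| 9 | 8 | 106 | 099–106 | |
| 10 | 14 | 120 | 107–120 | |
| Total | 120 | | | |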

Column 3 is the cumulative total. Now each landing site is given a number proportional to its size. Thus the landing site 1 gets 12 numbers, 001–012, allotted to it, the landing centre 5 gets 30 numbers from 040–069 allotted to it, and so on. Then we use the random number table and select 3 numbers between 1 and 120. These selected numbers are: 011, 027 and 064. The corresponding fishing sites selected are: 01, 03 and 05.

It may be noted that in this method of selection a unit with a larger size has a higher chance of selection than a unit of a smaller size.
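The cumulative-total method can be sketched in code as follows. This is an illustration only: the function name `select_pps` is hypothetical, the random numbers are passed in as a list, and the same site may be selected more than once (selection with replacement), which is an assumption not stated in the text.

```python
from bisect import bisect_left
from itertools import accumulate


def select_pps(sizes, random_numbers):
    """Map each random number in 1..sum(sizes) to the site whose allotted range contains it."""
    cumulative = list(accumulate(sizes))
    selected = []
    for number in random_numbers:
        if not 1 <= number <= cumulative[-1]:
            continue  # reject numbers outside the allotted ranges
        # First site whose cumulative total is >= the random number (sites numbered from 1).
        selected.append(bisect_left(cumulative, number) + 1)
    return selected


boats = [12, 5, 20, 2, 30, 15, 8, 6, 8, 14]
sites = select_pps(boats, [11, 27, 64])
print(sites)  # [1, 3, 5]
```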

7.2 Method of Estimation

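Let there be N primary sampling units (fishing sites) and let $x_i$ be the number of secondary units (boats) in the ith landing site. If n primary units are selected with pps, then the probability of selecting the ith unit in the sample is $p_i = x_i/\sum x_i$.

The estimate of the population total Y is given by:

$$\hat{Y} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{p_i}$$

where $y_i$ is the measurement of the ith unit in the sample; and the estimated variance of $\hat{Y}$ is given by:

$$v(\hat{Y}) = \frac{1}{n(n-1)}\left[\sum_{i=1}^{n} t_i^2 - \frac{\left(\sum_{i=1}^{n} t_i\right)^2}{n}\right], \qquad t_i = \frac{y_i}{p_i}$$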

Example 7.2

There are 20 fishing sites in a district. The number of boats at each centre is known, i.e., xi = number of boats at the ith centre is known, and therefore X = ∑xi is known to be 496. Four fishing sites are selected out of 20 fishing sites with pps. In the table below, Col. 1 gives the 4 fishing sites selected in the sample, Col. 2 gives the number of boats (x) in these sites, and Col. 3 gives the landings at these sites during a month. Estimate the total monthly landings Ŷ and v(Ŷ).

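| Sample site (1) | No. of boats, $x_i$ (2) | Landings (t), $y_i$ (3) | $p_i = x_i/X$ (4) | $t_i = y_i/p_i$ (5) | $t_i^2$ (6) |
|---|---|---|---|---|---|
| 1 | 22 | 81 | 0.0443 | 1 828 | 3 341 584 |
| 2 | 30 | 118 | 0.0605 | 1 950 | 3 802 500 |
| 3 | 30 | 118 | 0.0605 | 1 950 | 3 802 500 |
| 4 | 42 | 170 | 0.0847 | 2 007 | 4 028 049 |
| Total | | | | 7 735 | 14 974 633 |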
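A numerical solution for this example is sketched below, assuming the pps estimator $\hat{Y} = \frac{1}{n}\sum y_i/p_i$ with variance estimated from the spread of the $y_i/p_i$ values. The code keeps $p_i$ unrounded, so its results differ slightly from those obtained with the rounded table entries (7 735/4 = 1 933.75 t); the function name is illustrative.

```python
import math


def pps_total_estimate(sizes, values, size_total):
    """Estimate a population total from a pps sample of first-stage units."""
    n = len(sizes)
    # t_i = y_i / p_i with p_i = x_i / X.
    t = [y / (x / size_total) for x, y in zip(sizes, values)]
    total = sum(t) / n
    var_total = sum((ti - total) ** 2 for ti in t) / (n * (n - 1))
    return total, var_total


boats = [22, 30, 30, 42]
landings_t = [81, 118, 118, 170]
total, var_total = pps_total_estimate(boats, landings_t, size_total=496)
print(round(total, 1), round(var_total, 1))
print(f"cv = {100 * math.sqrt(var_total) / total:.2f}%")
```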

8. Two-Stage Sampling

In two-stage sampling, a sample of first-stage units is chosen first, and within each of the selected first-stage units a further sample of survey units is chosen. The first-stage units may be selected by simple random sampling, or they can be selected with probability proportional to their sizes.

8.1 Selection of First-Stage Units at Random (SRS)

Let us have:

N = Number of first-stage units

n = Number of first-stage sample units

Mi = Number of survey units in the ith first-stage unit

mi = Number of survey units selected in the ith first-stage unit

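The unbiased estimate of the population total of the survey characteristic (y) is given by:

$$\hat{Y} = \frac{N}{n}\sum_{i=1}^{n} M_i\,\bar{y}_i$$

where $\bar{y}_i$ is the sample mean of the $m_i$ survey units selected in the ith first-stage unit. Its estimated variance is the sum of a contribution from differences between first-stage units and a contribution from differences within them:

$$v(\hat{Y}) = \frac{N(N-n)}{n}\,s_b^2 + \frac{N}{n}\sum_{i=1}^{n}\frac{M_i\,(M_i-m_i)}{m_i}\,s_i^2$$

where $s_b^2$ is the sample variance (divisor n − 1) of the estimated unit totals $M_i\,\bar{y}_i$, and $s_i^2$ is the sample variance among the survey units selected in the ith first-stage unit.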

Example 8.1

Let there be 8 fishing sites (N=8). We first select n=3 fishing sites at random and for each fishing site we select 3 traps and measure their catch. The number of traps existing at each selected fishing site and the catches of each selected trap are shown below. Calculate the estimated total catch of trap fisheries and its variance.

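| Sample site | 1 | 2 | 3 |
|---|---|---|---|
| No. of traps at each site ($M_i$) | 6 | 9 | 7 |
| No. of traps selected ($m_i$) | 3 | 3 | 3 |
| Catches of selected traps | 13 | 5 | 12 |
| | 9 | 7 | 8 |
| | 6 | 10 | 13 |
| Sample total | 28 | 22 | 33 |
| $s_i^2$ | 12.3 | 6.3 | 7 |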

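Estimated total landings,

$$\hat{Y} = \frac{8}{3}\left(6 \times \frac{28}{3} + 9 \times \frac{22}{3} + 7 \times \frac{33}{3}\right) = \frac{8}{3}\,(56 + 66 + 77) = 530.7$$

and its estimated variance,

$$v(\hat{Y}) = 1\,473.3 + 673.3 = 2\,146.6$$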

It may be noted that the contribution of 1 473.3 to v(Ŷ), which is due to differences in the catches between the fishing sites, is much greater than the contribution of 673.3, which is due to differences among second-stage units within the first-stage units.
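The two variance contributions can be reproduced approximately with the sketch below. It assumes simple random sampling without replacement at both stages, with a between-site term N(N − n)/n · s_b² and a within-site term (N/n) Σ M_i(M_i − m_i) s_i²/m_i. With unrounded intermediate values it gives about 1 471 and 676, close to the 1 473.3 and 673.3 quoted above; the small differences are presumably due to rounding. The function name is illustrative.

```python
from statistics import mean, variance


def two_stage_srs_total(first_stage_count, clusters):
    """clusters: list of (M_i, sampled values). Returns (total, between, within)."""
    N = first_stage_count
    n = len(clusters)
    # Estimated total of each selected first-stage unit: M_i * ybar_i.
    unit_totals = [M * mean(sample) for M, sample in clusters]
    total = N / n * sum(unit_totals)
    between = N * (N - n) / n * variance(unit_totals)
    within = N / n * sum(
        M * (M - len(sample)) / len(sample) * variance(sample)
        for M, sample in clusters
    )
    return total, between, within


sites = [
    (6, [13, 9, 6]),
    (9, [5, 7, 10]),
    (7, [12, 8, 13]),
]
total, between, within = two_stage_srs_total(8, sites)
print(round(total, 1), round(between, 1), round(within, 1), round(between + within, 1))
```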

8.2 Selection of First-Stage Units with PPS

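The estimated catch in the ith fishing site is given by:

$$\hat{Y}_i = M_i\,\bar{y}_i$$

where $M_i$ is the number of survey units in the ith fishing site and $\bar{y}_i$ is the mean of the $m_i$ survey units sampled in it. The unbiased estimate of the population total is given by:

$$\hat{Y} = \frac{1}{n}\sum_{i=1}^{n}\frac{\hat{Y}_i}{p_i}$$

where $p_i$ is the probability of selecting the ith first-stage unit. The estimated variance of $\hat{Y}$ is given by:

$$v(\hat{Y}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{\hat{Y}_i}{p_i} - \hat{Y}\right)^2$$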

Example 8.2

Three fishing sites were chosen with pps, and within each sample fishing site a simple random sample of boats was selected. In the table below we give the catches (in kg) of the selected sample. Calculate Ŷ and cv(Ŷ).

APPENDIX TABLE 1:
Table of Random Numbers
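| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 70 | 43 | 69 | 38 | 81 | 87 | 42 | 12 | 20 | 41 | 15 |
| 26 | 99 | 82 | 78 | 99 | 05 | 22 | 99 | 52 | 32 | 80 | 91 |
| 72 | 53 | 95 | 81 | 07 | 98 | 14 | 74 | 52 | 58 | 73 | 10 |
| 22 | 08 | 08 | 68 | 37 | 16 | 36 | 62 | 20 | 02 | 35 | 98 |
| 21 | 61 | 90 | 53 | 85 | 72 | 86 | 94 | 87 | 18 | 50 | 11 |
| 47 | 38 | 55 | 66 | 50 | 96 | 96 | 78 | 34 | 45 | 52 | 78 |
| 96 | 68 | 13 | 07 | 31 | 29 | 70 | 09 | 16 | 66 | 81 | 09 |
| 45 | 92 | 93 | 44 | 87 | 72 | 26 | 75 | 82 | 31 | 72 | 69 |
| 78 | 85 | 71 | 45 | 32 | 16 | 57 | 91 | 52 | 05 | 93 | 20 |
| 51 | 99 | 50 | 88 | 62 | 54 | 90 | 51 | 01 | 39 | 18 | 70 |
| 67 | 62 | 30 | 02 | 88 | 17 | 37 | 25 | 42 | 86 | 00 | 32 |
| 03 | 08 | 89 | 77 | 12 | 41 | 15 | 25 | 52 | 30 | 93 | 11 |
| 45 | 10 | 04 | 66 | 94 | 70 | 33 | 74 | 97 | 23 | 40 | 97 |
| 62 | 48 | 46 | 97 | 04 | 36 | 31 | 27 | 29 | 84 | 85 | 35 |
| 59 | 59 | 33 | 63 | 53 | 43 | 60 | 30 | 15 | 81 | 67 | 59 |
| 72 | 63 | 67 | 17 | 24 | 55 | 68 | 32 | 24 | 80 | 13 | 92 |
| 46 | 28 | 15 | 70 | 28 | 98 | 53 | 36 | 03 | 89 | 83 | 74 |
| 21 | 03 | 09 | 16 | 31 | 48 | 05 | 10 | 98 | 62 | 14 | 15 |
| 84 | 82 | 53 | 39 | 92 | 14 | 07 | 84 | 04 | 01 | 66 | 17 |
| 75 | 68 | 40 | 90 | 39 | 95 | 46 | 10 | 94 | 68 | 39 | 10 |
| 42 | 77 | 29 | 80 | 73 | 38 | 92 | 11 | 81 | 72 | 50 | 88 |
| 63 | 55 | 09 | 84 | 66 | 56 | 92 | 13 | 97 | 14 | 87 | 27 |
| 54 | 29 | 70 | 14 | 85 | 95 | 79 | 72 | 77 | 48 | 57 | 92 |
| 42 | 97 | 50 | 61 | 19 | 55 | 38 | 55 | 85 | 57 | 85 | 08 |
| 52 | 30 | 47 | 73 | 26 | 54 | 18 | 05 | 75 | 92 | 95 | 08 |
| 88 | 44 | 33 | 02 | 47 | 97 | 47 | 04 | 12 | 38 | 93 | 25 |
| 49 | 91 | 93 | 73 | 14 | 15 | 01 | 47 | 02 | 70 | 30 | 96 |
| 45 | 42 | 46 | 06 | 93 | 60 | 41 | 09 | 31 | 29 | 52 | 49 |
| 50 | 69 | 74 | 10 | 51 | 89 | 66 | 51 | 57 | 21 | 54 | 95 |
| 18 | 56 | 73 | 16 | 02 | 87 | 41 | 05 | 13 | 87 | 13 | 61 |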
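| 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 76 | 96 | 85 | 27 | 81 | 21 | 75 | 39 | 43 | 77 | 80 | 81 |
| 38 | 51 | 09 | 17 | 41 | 85 | 13 | 20 | 66 | 59 | 22 | 20 |
| 40 | 91 | 90 | 51 | 74 | 23 | 54 | 88 | 84 | 12 | 16 | 77 |
| 44 | 53 | 23 | 87 | 91 | 53 | 86 | 97 | 42 | 80 | 83 | 37 |
| 31 | 25 | 22 | 30 | 16 | 17 | 32 | 34 | 00 | 07 | 25 | 52 |
| 36 | 35 | 20 | 92 | 81 | 12 | 15 | 28 | 42 | 98 | 67 | 52 |
| 36 | 12 | 17 | 03 | 83 | 93 | 48 | 64 | 50 | 32 | 57 | 94 |
| 25 | 51 | 40 | 74 | 85 | 16 | 86 | 09 | 22 | 62 | 06 | 38 |
| 72 | 38 | 33 | 97 | 36 | 58 | 90 | 91 | 23 | 91 | 19 | 04 |
| 17 | 20 | 75 | 03 | 85 | 53 | 06 | 41 | 29 | 78 | 51 | 15 |
| 75 | 57 | 37 | 77 | 67 | 60 | 70 | 44 | 56 | 91 | 03 | 49 |
| 12 | 47 | 35 | 37 | 15 | 17 | 96 | 24 | 95 | 08 | 39 | 55 |
| 73 | 67 | 55 | 64 | 16 | 38 | 58 | 74 | 29 | 71 | 49 | 62 |
| 16 | 02 | 29 | 14 | 16 | 78 | 44 | 49 | 34 | 05 | 46 | 96 |
| 48 | 98 | 13 | 29 | 19 | 71 | 98 | 71 | 19 | 51 | 86 | 82 |
| 73 | 65 | 42 | 09 | 39 | 92 | 56 | 68 | 36 | 54 | 55 | 46 |
| 22 | 96 | 06 | 41 | 55 | 75 | 08 | 62 | 55 | 19 | 15 | 15 |
| 57 | 26 | 11 | 28 | 98 | 16 | 85 | 39 | 67 | 49 | 02 | 30 |
| 47 | 76 | 60 | 92 | 22 | 79 | 70 | 66 | 78 | 13 | 97 | 42 |
| 31 | 80 | 30 | 86 | 08 | 54 | 39 | 88 | 38 | 46 | 74 | 21 |
| 91 | 55 | 48 | 36 | 26 | 40 | 17 | 70 | 39 | 94 | 05 | 76 |
| 83 | 70 | 10 | 91 | 20 | 64 | 12 | 33 | 15 | 59 | 43 | 28 |
| 28 | 35 | 53 | 14 | 30 | 57 | 07 | 34 | 09 | 56 | 26 | 81 |
| 86 | 91 | 62 | 94 | 83 | 96 | 96 | 17 | 02 | 10 | 89 | 71 |
| 24 | 86 | 86 | 52 | 67 | 59 | 63 | 22 | 28 | 76 | 43 | 45 |
| 43 | 73 | 70 | 73 | 19 | 41 | 04 | 60 | 25 | 42 | 09 | 50 |
| 52 | 69 | 34 | 01 | 65 | 33 | 19 | 62 | 22 | 41 | 29 | 65 |
| 01 | 15 | 92 | 69 | 53 | 78 | 68 | 58 | 74 | 08 | 05 | 11 |
| 94 | 46 | 83 | 72 | 49 | 19 | 98 | 09 | 56 | 83 | 25 | 40 |
| 44 | 42 | 06 | 32 | 95 | 17 | 32 | 67 | 80 | 84 | 09 | 69 |
| 81 | 58 | 85 | 33 | 16 | 11 | 87 | 12 | 17 | 39 | 12 | 11 |
| 60 | 25 | 84 | 42 | 22 | 94 | 38 | 96 | 52 | 03 | 38 | 97 |
| 53 | 12 | 75 | 59 | 76 | 42 | 73 | 48 | 95 | 57 | 51 | 31 |
| 02 | 68 | 01 | 17 | 09 | 00 | 38 | 12 | 31 | 52 | 22 | 24 |
| 09 | 68 | 53 | 92 | 82 | 11 | 96 | 03 | 47 | 31 | 35 | 59 |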

Tables of Random Numbers (from Bazigos, 1974)