

5. Statistical methods in the analysis of epidemiological data


5.1 Introduction
5.2 Estimating population parameters
5.3 Formulating and testing statistical hypotheses in large-sized samples
5.4 Formulating and testing hypotheses in small-sized samples
5.5 Matched comparisons
5.6 A word of warning
5.7 Linear correlation and regression
5.8 Time series


5.1 Introduction

In this chapter readers will be introduced to some of the simpler statistical techniques used in the analysis and interpretation of epidemiological data. At this stage, it may be of use to make a few general points about analysing epidemiological data.

· Look at the data to gain an insight into the problem being studied. Some of the useful methods for setting out data were outlined in Chapter 3.

· If data generated by other investigators are being used, find out as much as possible about how the data were generated. This may reveal significant omissions or biases in the data which may influence the analysis.

· Do not ignore anomalies in the data; investigate them. Often such anomalies provide valuable clues to a deeper understanding of the problem being investigated.

· Avoid the temptation to use complicated statistical techniques if the quality of the data does not warrant it. Above all, avoid using such techniques to try and establish relationships between variables unless you can satisfy yourself that there are valid biological reasons for such relationships.

· Be cautious about making inferences from sampled to target populations. Your own experience should normally tell you whether such inferences are valid or not. If any inference is made, the populations involved should be clearly defined and the fact that an inference is being made clearly stated.

· When setting out findings, display the data used and the analyses undertaken in a simple, clear and concise form. A series of simple tables or graphs is preferable to one complicated table or graph. Long, complicated data sets should be placed in an appendix. Any limitations in the data presented should be clearly stated.

· Look at the data during the study, not just when it has been completed. This may enable the study design to be modified so as to include lines of inquiry which appear promising and to disregard those which do not.

· Finally, remember that a "negative" result, i.e. one that does not prove the hypothesis, is often as valuable as a "positive" one. Do not be afraid to record negative findings.

5.2 Estimating population parameters


5.2.1 Estimating a population mean
5.2.2 Sample size needed to estimate a population mean
5.2.3 Estimating a population proportion or rate from a simple random sample
5.2.4 Estimating a rate or proportion from a cluster sample


5.2.1 Estimating a population mean

Using the data in Table 1, we calculated that the mean weight of a sample of 150 chickens randomly selected at a large market was 1.3824 kg. Since the chickens were selected at random, the same data can be used to derive general statements about the population from which the sample was drawn. In particular, we would like to know how precise the information is that we can obtain about the mean weight of all the chickens offered for sale in the market on the day we selected the sample.

Although our intuition tells us that the mean weight of the sample ought to be something like the mean weight for the whole population from which it was drawn, the sample mean will hardly ever have exactly the same value as the population mean. There are many millions of different possible samples of 150 chickens which could result from a total of 4000, and each possible sample of 150 chickens will have its own mean value. These means will mostly be different from one sample to another. We cannot know for sure in any particular case how close the mean value of the sample is to the population mean in which we are interested.

Furthermore, statistical methods of analysis cannot remove this uncertainty. Nevertheless, the theory of statistical inference does provide us with the means to measure it. For example, we will be able to say that "we can be 95% certain that the true population mean weight lies in the interval 1.3521 to 1.4127 kg" or that "we can be 99% sure that the true population mean weight lies in the interval 1.3425 to 1.4223 kg". Such statements about a population mean will always be possible provided that the information was obtained in a reasonably large random sample - a sample size greater than 50 ought to be enough.

There are four steps involved in the calculation of intervals. We will work through these steps using the example of chicken liveweights, and then state them in general terms.

· First, we have to calculate the mean chicken weight in the sample (1.3824 kg), which we shall use as an estimate of the population mean.

· We then calculate the standard error of the estimated mean using the rule:

$$SE = S\sqrt{\frac{1-f}{n}}$$

where: S = standard deviation of the sample,
n = sample size (150), and
f = sampling fraction i.e. the proportion of the total population which was sampled, in this case f = 150/4000.

In Chapter 3 we calculated the standard deviation of the sample as 0.1931 so that:

$$SE = 0.1931 \times \sqrt{\frac{1 - 150/4000}{150}} = 0.0155$$

· Third, we have to decide how sure we wish to be that the interval we state will actually include the true value. Generally, 90%, 95% or 99% confidence is demanded, and the resulting interval is called a 90% (or 95% or 99%) confidence interval. There is a special multiplier corresponding to each of these levels of confidence (Table 16).

Table 16. Multipliers to give 90%, 95%, 99%, and 99.9% confidence that a stated interval includes the true population mean value.

| Confidence | 90% | 95% | 99% | 99.9% |
|---|---|---|---|---|
| Multiplier | 1.64 | 1.96 | 2.58 | 3.30 |

· Fourth, we calculate the interval from the formula:

Estimated mean ± multiplier x standard error of estimated mean.

For a 95% confidence interval, we have:

1.3824 ± 1.96 x 0.0155
or 1.3824 ± 0.0303.
or 1.3824 - 0.0303 to 1.3824 + 0.0303
i.e. 1.3521 to 1.4127 kg.

To sum up, the four stages in the calculation of a confidence interval for the true value of a population mean are:

i) Calculate an estimated mean of the sample.
ii) Calculate the standard error of the estimate.
iii) Decide on the level of confidence required.
iv) Calculate the interval from the formula:

Estimated mean ± multiplier x standard error.
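The four steps can be sketched in a few lines of Python. This is only an illustration: the function name `mean_confidence_interval` is hypothetical, and the standard error rule SE = S x sqrt((1 - f)/n) is assumed here because it reproduces the value 0.0155 used in the worked example.

```python
import math

# Multipliers from Table 16, keyed by confidence level (%).
MULTIPLIERS = {90: 1.64, 95: 1.96, 99: 2.58, 99.9: 3.30}


def mean_confidence_interval(mean, sd, n, population_size, confidence=95):
    """Confidence interval for a population mean from a simple random sample.

    Assumes SE = sd * sqrt((1 - f) / n), where f = n / population_size.
    """
    f = n / population_size                    # sampling fraction
    se = sd * math.sqrt((1 - f) / n)           # step ii: standard error
    half_width = MULTIPLIERS[confidence] * se  # steps iii and iv
    return se, mean - half_width, mean + half_width


# Chicken liveweight example: 150 chickens sampled out of 4000.
se, lower, upper = mean_confidence_interval(1.3824, 0.1931, 150, 4000)
print(f"SE = {se:.4f}, 95% CI = {lower:.4f} to {upper:.4f} kg")
```

Passing `confidence=99` instead gives the wider 99% interval quoted earlier in this section.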

The actual formulae used to calculate the estimate (step i) and its standard error (step ii) will depend on how the data were collected. The above calculations are appropriate for a simple random sample taken from a population which consists of a single group. In reality, however, we often use cluster samples.

We will illustrate now what difference cluster sampling would make to the estimation of the population mean. Table 17 gives the weights of chickens offered for sale by five traders selected at random from 132 chicken traders in the market.

The total and mean weights of chickens sold by each trader are given in Table 18.

The population mean will again be estimated by dividing the total weight by the number of chickens sampled i.e.:

207.36/150= 1.3824 kg

Table 17. Weights (kg) of chickens offered for sale by five traders.

Trader 1
1.40 1.09 1.74 1.48 1.82 1.09 1.52 1.41 1.83 1.22
1.34 1.68 1.25 1.65 1.14 1.33 1.06 1.71 1.17 1.51

Trader 2
1.36 1.34 1.03 1.24 1.06 1.12 1.15 1.57 1.38 1.40
1.39 1.31 1.50 1.10 1.45 1.34 1.38 1.35 1.49 1.58
1.25 1.42 1.64 1.57 1.53 1.18 1.39 1.34 1.13 1.23

Trader 3
1.17 1.88 1.30 1.27 1.01 1.63 1.47 1.23 1.48 1.48
1.37 1.42 1.22 1.47 1.31 1.05 1.61 1.41 1.17 1.45
1.43 1.22 1.40 1.14 1.53 1.25 1.02 1.30 1.35 1.37
1.69 1.37 1.11 1.30 1.05 1.19 1.36 1.63 1.44 1.29

Trader 4
1.35 1.59 1.94 1.51 1.78 1.37 1.11 1.38 1.53 1.44

Trader 5
1.47 1.39 1.55 1.76 1.43 1.37 1.67 1.36 1.31 1.41
1.36 1.26 1.17 1.15 1.79 1.46 1.35 1.29 1.50 1.26
1.36 1.41 1.36 1.32 1.08 1.28 1.33 1.29 1.42 1.50
1.32 1.39 1.20 1.68 1.20 1.35 1.56 1.57 1.37 1.27
1.25 1.38 1.56 1.60 1.74 1.40 1.11 1.60 1.21 1.44

Table 18. Total and mean weights of chickens sold by each trader.

| Trader | No. of chickens (X) | Total weight (Y) (kg) | Mean weight (kg) |
|---|---|---|---|
| 1 | 20 | 28.44 | 1.4220 |
| 2 | 30 | 40.22 | 1.3407 |
| 3 | 40 | 53.84 | 1.3460 |
| 4 | 10 | 15.00 | 1.5000 |
| 5 | 50 | 69.86 | 1.3972 |
| Total | 150 | 207.36 | |


The standard error has to be calculated differently, however, as follows:

Let f = 5/132, the sample fraction of traders sampled.
Let m = 5, the number of traders sampled.
Let n = 150, the total number of chickens sampled.

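Then the standard error (SE) is given by:

$$SE = \frac{m}{n}\sqrt{\frac{(1-f)\,W}{m(m-1)}}$$

where:

$$W = R^2 \sum X^2 - 2R \sum XY + \sum Y^2$$

and X and Y are the number of chickens and the total weight for each trader (Table 18).

The estimated mean (R) = 207.36/150 = 1.3824

$$\sum X^2 = 20^2 + 30^2 + 40^2 + 10^2 + 50^2 = 5500$$
$$\sum Y^2 = 28.44^2 + \dots + 69.86^2 = 10430.6472$$
$$\sum XY = 20 \times 28.44 + \dots + 50 \times 69.86 = 7572.0$$

Thus:

$$W = (1.3824)^2 \times 5500 - (2 \times 1.3824 \times 7572) + 10430.6472 = 6.2453$$

So:

$$SE = \frac{5}{150}\sqrt{\frac{(1 - 5/132) \times 6.2453}{5 \times 4}} = 0.0183$$

This is an increase of nearly 20% on the standard error we calculated using simple random sampling. As a result, the 95% confidence interval would be:

Estimated mean ± multiplier x standard error of estimated mean

i.e. 1.3824 ± 1.96 x 0.0183
or 1.3824 ± 0.0359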
or 1.3465 to 1.4183 kg.
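The cluster calculation is easy to get wrong by hand, so a short Python sketch may help. The function name `cluster_mean_se` is hypothetical, and the standard error rule SE = (m/n) x sqrt((1 - f) x W / (m(m - 1))) is an assumption: it is the form that reproduces the value 0.0183 from the figures in Table 18.

```python
import math


def cluster_mean_se(sizes, totals, clusters_in_population):
    """Estimated mean, W and standard error from a cluster sample.

    sizes  -- number of units in each sampled cluster (X)
    totals -- total of the measured variable in each sampled cluster (Y)
    Assumes SE = (m / n) * sqrt((1 - f) * W / (m * (m - 1))).
    """
    m = len(sizes)                      # number of clusters sampled
    n = sum(sizes)                      # number of units sampled
    f = m / clusters_in_population      # fraction of clusters sampled
    r = sum(totals) / n                 # estimated mean (R)
    sum_x2 = sum(x * x for x in sizes)
    sum_y2 = sum(y * y for y in totals)
    sum_xy = sum(x * y for x, y in zip(sizes, totals))
    w = r * r * sum_x2 - 2 * r * sum_xy + sum_y2
    se = (m / n) * math.sqrt((1 - f) * w / (m * (m - 1)))
    return r, w, se


# Table 18: five traders sampled out of 132.
chickens = [20, 30, 40, 10, 50]
weights = [28.44, 40.22, 53.84, 15.00, 69.86]
r, w, se = cluster_mean_se(chickens, weights, 132)
print(f"R = {r:.4f}, W = {w:.4f}, SE = {se:.4f}")
print(f"95% CI = {r - 1.96 * se:.4f} to {r + 1.96 * se:.4f} kg")
```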

The interval now spans 1.4183 - 1.3465 = 0.0718 kg or 71.8 g, compared to the 60.6 g spanned by the interval calculated using a simple random sample. This demonstrates that if the sample is clustered, our knowledge of the population mean will be less precise. There are two reasons for this. First, with a simple random sample we fix the sample size in advance. When we choose a number of traders, we do not know in advance how many chickens they will have for sale, and this introduces an extra element of uncertainty. Second, it may happen that one of the chosen traders specialises in either unusually large or unusually small chickens. The sample will then contain a disproportionately large number of heavy or light chickens and, for that reason, it will be more variable than a simple random sample.

On the other hand, before we can take a simple random sample we have to know how many chickens there are offered for sale, which may not be easy to establish. It will be much easier to count the number of traders. The chickens in a simple random sample are also likely to be distributed over a large number of traders, and it will take much more time to find and weigh them than to weigh all the chickens of a few traders.

For these and other reasons discussed in the previous chapter, some degree of clustering will be required in most practical surveys. Remember that clustering will nearly always increase the standard error and hence the uncertainty involved in the estimation of population means and proportions. This is especially true for variables associated with infectious diseases.

Although confidence intervals can always be calculated from the formula used above, how to calculate the standard error will not always be obvious. Indeed, if a survey is carried out using a complex sampling method, it may not be simple even to obtain an estimate of the mean. The possible options are so numerous and some of the corresponding formulae so complex that it is not appropriate to attempt to discuss them here. It is better to consult a statistician with some knowledge of sampling theory, or relevant textbooks (e.g. Raj, 1968; or Yates, 1981).

5.2.2 Sample size needed to estimate a population mean

We are now in a position to establish a method for judging how large a sample we may need to estimate a population mean with a given precision, at least when random sampling is used. We will demonstrate the principle by working through a hypothetical example, after which we will define the general procedure.

Example: Suppose we were to return to the market on another day and tried to estimate the mean chicken liveweight in such a way that we would be 95% confident that the estimated mean value will not differ from the true mean value by more than 0.02 kg.

From the previous section we know that, for a simple random sample, we can be 95% confident that the true mean value lies inside the interval:

Sample mean ± 1.96 x standard error of the sample mean.

In other words, we can be 95% sure that the difference between the sample mean and the true mean is not greater than 1.96 x SE.

In our present example we require that this difference should not be greater than 0.02 kg. If we find the sample size for which 1.96 x SE = 0.02 kg, we will know that any sample at least this large will meet the specification.

For a simple random sample, the standard error of the sample mean is:

$$SE = S\sqrt{\frac{1-f}{n}}$$

where:
S = standard deviation of chicken weights,
n = sample size, and
f = fraction of the population being sampled.

We therefore have to solve the equation:

$$1.96 \times S\sqrt{\frac{1-f}{n}} = 0.02$$

If f turns out to be less than 10%, the factor (1 - f) can be ignored, and we will assume for the moment that this is the case. The equation then simplifies to:

$$1.96 \times \frac{S}{\sqrt{n}} = 0.02$$

or

$$1.96^2 \times S^2 = (0.02)^2 \times n$$

or

$$n = \frac{1.96^2 \times S^2}{(0.02)^2}$$

The next problem is that we do not know the value of S. If we have no idea what this value is, we cannot estimate the required sample size, and we will have to take the largest sample we can afford with the resources available for the study. In our present example, we can use the value of S calculated in Section 3.6, since it seems reasonable to assume that the variability in the weights of the chickens offered for sale will not change dramatically from one market day to the next. Thus writing 0.1931 for S in the above equation, we get:

$$n = \frac{1.96^2 \times (0.1931)^2}{(0.02)^2} = 358.1$$

and a sample size of about 360 is indicated.

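If only 3000 chickens are available on the day we carry out the study, this sample would be a large proportion (greater than 10%) of the total, and it is then appropriate to use a more exact formula, obtained by keeping the factor (1 - f) = (1 - n/3000) in the equation:

$$n = \frac{3000 \times 1.96^2 \times (0.1931)^2}{3000 \times (0.02)^2 + 1.96^2 \times (0.1931)^2} \approx 320$$

In general, the two formulae can be stated thus:

Approximate formula:

$$n = \frac{(\text{multiplier})^2 \times S^2}{d^2}$$

Exact formula:

$$n = \frac{N \times (\text{multiplier})^2 \times S^2}{N \times d^2 + (\text{multiplier})^2 \times S^2}$$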

where:
d = maximum difference to be tolerated between the sample mean and the true mean, and
N = total population size.

The multiplier, chosen from Table 16, depends on the level of confidence required to ensure that the specification will be met.

Note that to apply any of the formulae provided above, the value of S has to be known before the study is carried out. If it is the first study of a particular variable under the prevailing conditions, it may not be possible to suggest a plausible value for S. In that case there is no way of deciding what sample size will be required to provide a given precision with a given level of confidence.

Note further that the formulae are relevant only when the sampling units are chosen by simple random sampling. If a clustered sample is used, the estimated sample size should be increased by a factor of four to give a rough estimate of the total number of units which will need to be sampled to meet the specification.
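As a rough illustration, the sample size calculation for a simple random sample can be sketched as follows. The function name `sample_size_for_mean` is hypothetical. The exact version is assumed to come from keeping the factor (1 - n/N) in the equation multiplier x SE = d, which is algebraically the same as dividing the approximate answer n0 by (1 + n0/N).

```python
import math


def sample_size_for_mean(sd, d, multiplier=1.96, population_size=None):
    """Sample size needed to estimate a mean to within d (simple random sample).

    Without population_size the approximate formula is used; with it, the
    finite-population version n0 / (1 + n0 / N) is returned (an assumption
    derived from keeping the factor (1 - f) in the equation).
    """
    n0 = (multiplier * sd / d) ** 2
    if population_size is None:
        return n0
    return n0 / (1 + n0 / population_size)


approx = sample_size_for_mean(0.1931, 0.02)
exact = sample_size_for_mean(0.1931, 0.02, population_size=3000)
# Round up, since a fraction of a chicken cannot be sampled.
print(math.ceil(approx), math.ceil(exact))
```

For a clustered sample, the text's rule of thumb would multiply the result by four.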

5.2.3 Estimating a population proportion or rate from a simple random sample

In many ways, estimating a population proportion or rate is similar to estimating a population mean. Proportions and rates play a central role in epidemiological investigations, and there are one or two rather special pitfalls to avoid in their estimation. The following discussion will be confined to estimating point prevalence (P) and attack rate (A).

Let us first estimate a prevalence whose true value in the whole target population is P, a fraction between 0 and 1. For example, suppose that 850 animals were chosen at random and 62 were found to be diseased. The sample prevalence (p), which will be used as an estimate of the population prevalence (P), will then be: p = 62/850 = 0.0729

The standard error (SE) of this estimated prevalence can be obtained from the formula:

$$SE = \sqrt{\frac{p(1-p)(1-f)}{n}}$$

where:
n = the sample size, which, in random sampling, is fixed before the sample is taken, and
f = fraction of the total population sampled.

Thus:

$$SE = \sqrt{\frac{0.0729 \times (1 - 0.0729) \times (1-f)}{850}}$$

If f is less than, say, 10%, little information is lost by ignoring the factor (1 - f). SE then is 0.0089.

We can indicate how precise we believe the estimate to be by constructing a confidence interval using the multipliers in Table 16. For example, a 95% confidence interval for the true prevalence would be given by:

Estimated prevalence ± 1.96 x standard error of the estimate

i.e. 0.0729 ± 1.96 x 0.0089
or 0.0729 ± 0.0174.

Hence we would be 95% confident that the true prevalence lay between the limits 0.0555 and 0.0903. It is more common to state the limits in percentage terms i.e. 5.55% and 9.03%. If these limits are too far apart for the purposes of the study, the sample size is too small. (See Section 4.4).
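The prevalence calculation can be sketched in Python as below. The function name `proportion_ci` is hypothetical, the factor (1 - f) is ignored as in the worked example, and the standard error is assumed to be sqrt(p(1 - p)/n), which reproduces the value 0.0089.

```python
import math


def proportion_ci(cases, n, multiplier=1.96):
    """Estimated proportion, its standard error and a confidence interval.

    Assumes a simple random sample and ignores the factor (1 - f).
    """
    p = cases / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, p - multiplier * se, p + multiplier * se


p, se, lower, upper = proportion_ci(62, 850)
print(f"p = {p:.4f}, SE = {se:.4f}, 95% CI = {lower:.2%} to {upper:.2%}")
```

Because the code does not round intermediate values, the limits may differ from the hand calculation in the last decimal place.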

The attack rate (A) for a population can be estimated in a similar way. For example, suppose that we chose 1500 healthy animals at random from a population of, say, 18 000 animals and, by the end of the observation period, we find that 437 of these have suffered the relevant disease. The estimated attack rate (a) would be:

a = 437/1500 = 0.2913 or 29.13%.

The sampling fraction is 1500/18 000 = 0.0833, which is just over 8%. The standard error of the estimate is:

$$SE = \sqrt{\frac{0.2913 \times 0.7087 \times (1 - 0.0833)}{1500}} = 0.0112$$


If we had ignored the factor (1-f), we would have calculated the standard error to be 0.0117, which supports the previous statement that the factor can be safely ignored if less than 10% of the total population has been sampled.
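The effect of the factor (1 - f) on the attack rate example can be checked with a small sketch. The function name `proportion_se` is hypothetical, and the formula sqrt(p(1 - p)(1 - f)/n) is an assumption consistent with the value 0.0117 quoted for the case where the factor is ignored.

```python
import math


def proportion_se(cases, n, population_size=None):
    """Standard error of a proportion from a simple random sample.

    If population_size is given, the factor (1 - f) with f = n / population_size
    is applied; otherwise it is ignored.
    """
    p = cases / n
    variance = p * (1 - p) / n
    if population_size is not None:
        variance *= 1 - n / population_size
    return math.sqrt(variance)


without_factor = proportion_se(437, 1500)
with_factor = proportion_se(437, 1500, population_size=18000)
print(f"ignoring (1 - f): {without_factor:.4f}, including it: {with_factor:.4f}")
```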

Note that the correct estimation of a population proportion or rate from a simple random sample depends on the occurrence of a sufficient number of cases in the sample. However many animals are examined, reliable estimation is not possible if fewer than five cases are discovered in total.

5.2.4 Estimating a rate or proportion from a cluster sample

Table 19 shows the numbers of sampled and diseased animals on 12 farms chosen at random from 943 farms containing the total population at risk.

Table 19. Results of a survey of 12 farms chosen at random from 943 farms available.

| Farm | Total No. of animals (n) | Number diseased | Proportion diseased |
|---|---|---|---|
| 1 | 183 | 22 | 0.120 |
| 2 | 92 | 12 | 0.130 |
| 3 | 416 | 37 | 0.089 |
| 4 | 203 | 23 | 0.113 |
| 5 | 107 | 17 | 0.159 |
| 6 | 388 | 32 | 0.082 |
| 7 | 79 | 36 | 0.456 |
| 8 | 243 | 29 | 0.119 |
| 9 | 314 | 24 | 0.076 |
| 10 | 83 | 17 | 0.205 |
| 11 | 113 | 59 | 0.522 |
| 12 | 294 | 26 | 0.088 |
| Total | 2515 | 334 | |


If we ignore the fact that the data were collected in a clustered fashion, we would reach the following conclusions:

i) The estimated prevalence p = 334/2515 = 0.133
ii) The standard error of the estimate is:

$$SE = \sqrt{\frac{p(1-p)(1-f)}{n}}$$

A minor problem here is that we do not know f, the fraction of all the animals on the 943 farms that was sampled. However, since we have chosen 12 of the 943 farms, i.e. 1.3%, we can guess that the 2515 animals sampled are well under 10% of the total and can therefore safely ignore the factor (1 - f):

$$SE = \sqrt{\frac{0.133 \times 0.867}{2515}} = 0.0068$$

iii) A 95% confidence interval for the true population prevalence would then be:

0.133 ± 1.96 x 0.0068

This procedure would be incorrect. Because of the clustered nature of the sample, the standard error must be calculated in a different way. This point is frequently misunderstood, especially when estimating rates and proportions.

The correct approach involves three steps:

i) Estimate the prevalence:
p = total with disease/total examined
p = 334/2515 = 0.133, as before.

(It is not uncommon to find the prevalence for the population being estimated by calculating the mean of the prevalences of the sampled herds, thus:

p = (0.120 + 0.130 + ... + 0.522 + 0.088)/12 = 0.180

If this were done, the estimate of the true prevalence would be 18% rather than the 13.3% estimated earlier. Note that the mean of the sampled herd prevalences will give a misleading impression unless the herds are all of a similar size or the herd prevalences are roughly equal. Neither is true here.)

ii) To obtain the standard error we need first to calculate three quantities.
- The sum of squares of the herd sizes (H):
$$\sum H^2 = 183^2 + 92^2 + \dots + 113^2 + 294^2 = 688\,191$$
- The sum of squares of the number of cases (C) in each herd:
$$\sum C^2 = 22^2 + 12^2 + \dots + 59^2 + 26^2 = 10\,998$$
- The sum of the products obtained by multiplying each herd size by the number of cases (HC):
$$\sum HC = 183 \times 22 + 92 \times 12 + \dots + 113 \times 59 + 294 \times 26 = 72\,575$$

These three quantities are combined, together with the estimated prevalence (p), into a single value (W) by the formula:

$$W = p^2 \left(\sum H^2\right) - 2p \left(\sum HC\right) + \left(\sum C^2\right)$$
$$W = (0.133)^2 \times 688\,191 - (2 \times 0.133 \times 72\,575) + 10\,998$$
$$W = 3866.46$$

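The standard error of the estimated prevalence can then be calculated by:

$$SE = \frac{m}{n}\sqrt{\frac{(1-f)\,W}{m(m-1)}}$$

where:
m = number of clusters in the sample (12 in our example),
n = total number of animals in the sample (2515), and
f = fraction of clusters sampled.

Since f in this case is small enough, it can be ignored and the standard error will be:

$$SE = \frac{12}{2515}\sqrt{\frac{3866.46}{12 \times 11}} = 0.0258$$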

iii) The correct 95% confidence interval for the true prevalence then is:

0.133 ± 1.96 x 0.0258 i.e. 0.0824 to 0.1836

Note that if the data were analysed ignoring the clustered nature of the sampling, we would conclude, erroneously, that we could be 95% confident that the prevalence of the disease in the whole population was between 12% and 14.6%. If the sample is analysed correctly, the prevalence is between 8.2% and 18.4%, which is a much wider interval.
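The contrast between the two analyses of Table 19 can be reproduced with the sketch below. The names `FARMS`, `cluster_prevalence` and `mean_of_herd_prevalences` are hypothetical, and the cluster standard error is assumed to be (m/n) x sqrt(W / (m(m - 1))) with the factor (1 - f) ignored, which is the form that gives 0.0258. The code keeps p unrounded, so its W differs slightly from the hand-calculated 3866.46, but the standard error agrees to four decimal places.

```python
import math

# (animals examined, animals diseased) for the 12 farms in Table 19.
FARMS = [(183, 22), (92, 12), (416, 37), (203, 23), (107, 17), (388, 32),
         (79, 36), (243, 29), (314, 24), (83, 17), (113, 59), (294, 26)]


def cluster_prevalence(farms):
    """Pooled prevalence with its naive and cluster-based standard errors.

    The factor (1 - f) is ignored in both standard errors.
    """
    m = len(farms)
    n = sum(h for h, _ in farms)
    p = sum(c for _, c in farms) / n
    sum_h2 = sum(h * h for h, _ in farms)
    sum_c2 = sum(c * c for _, c in farms)
    sum_hc = sum(h * c for h, c in farms)
    w = p * p * sum_h2 - 2 * p * sum_hc + sum_c2
    se_naive = math.sqrt(p * (1 - p) / n)                # treats animals as a simple random sample
    se_cluster = (m / n) * math.sqrt(w / (m * (m - 1)))  # treats farms as the sampling units
    return p, se_naive, se_cluster


def mean_of_herd_prevalences(farms):
    """Unweighted mean of the herd prevalences (the misleading estimate)."""
    return sum(c / h for h, c in farms) / len(farms)


p, se_naive, se_cluster = cluster_prevalence(FARMS)
print(f"p = {p:.3f}, naive SE = {se_naive:.4f}, cluster SE = {se_cluster:.4f}")
print(f"ratio of standard errors = {se_cluster / se_naive:.1f}")
print(f"mean of herd prevalences = {mean_of_herd_prevalences(FARMS):.3f}")
```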

This has occurred because the clustering has increased the standard error by a factor of almost four. Such large increases in the standard error can be expected whenever the prevalence or attack rate varies noticeably from herd to herd, and will be particularly troublesome for highly infectious diseases when a herd is likely to be in one of two conditions, either completely free of infection or almost entirely infected.

The implication is that when a cluster sample is taken, the minimum number of cases required for a reliable estimation of prevalence or an attack rate will be several times larger than the 5 suggested as being sufficient in a simple random sample. The minimum would be 20 cases, but if all of them were in the same herd there would be problems.

It may be better, therefore, to confine the analysis to an estimation of the proportion of infected herds rather than animals. If the herds are sampled in such a way that each herd is considered as a single unit, there will be no clustering involved, and we can use the procedure applicable for the estimation of a proportion based on a simple random sample.

The problem just discussed is only one example of the way in which the actual sampling process can affect the statistical analysis and the conclusions based on it. There is a wide range of possible sampling schemes, each of which may require a different formula both for estimating the prevalence or attack rate and calculating the standard error of the estimate. A detailed account of these possibilities can be found in Yates (1981), Raj (1968) or Cochran (1977). The latter two books are rather mathematical; Raj (1972), although less comprehensive, may be easier to understand.

5.3 Formulating and testing statistical hypotheses in large-sized samples


5.3.1 Testing for a difference in two means
5.3.2 Testing for a difference in two proportions
5.3.3 Sample size for detecting differences between two proportions in prospective and cross-sectional studies
5.3.4 Sample size for detecting differences between two proportions in retrospective studies
5.3.5 Testing for differences in prevalence between several groups simultaneously
5.3.6 Testing for differences in several means simultaneously


One of the common aims of an epidemiological study is to compare two different populations of the same species. For example, we may wish to know whether a given disease is equally prevalent under two different management systems or prophylactic regimes; or we may want to test the possible economic benefits of anthelminthics by investigating whether treated animals gain weight more rapidly than those left untreated.

5.3.1 Testing for a difference in two means

Let us suppose that an experiment was carried out to compare the weight gains of 50 pigs treated with anthelminthics with the gains of 63 untreated pigs of the same strain and age, kept under the same management system over the same time period. The mean weight gains and the standard deviation of the weight gains were calculated for each group (Table 20). On average, the treated pigs gained more weight than those in the untreated sample. Could this be due to the specific, individual characteristics of the pigs chosen, by chance, for each sample? How can we decide whether this apparent improvement is just a chance effect?

Table 20. Weight gains of two groups of pigs of which one was treated with an anthelminthic.

| | Treated group | Untreated group |
|---|---|---|
| Number of animals | 50 | 63 |
| Mean weight gain (kg) | 6.0 | 5.3 |
| Standard deviation | 1.6 | 1.9 |

First we must estimate the mean extra gain in a treated pig. This mean difference (MD) is easily calculated as:

MD = 6.0-5.3 = 0.7 kg

As usual, we will also need to calculate the standard error of the estimated mean difference (SEMD). We can do this by using the formula:

where:
nt, nu = numbers of treated and untreated animals, and

St, Su = standard deviations of weight gains in the respective groups.

Thus we have:

SEMD = 0.34 kg

Note that this is the correct method of calculating SEMD only if the two samples are chosen by simple random sampling. A more general method will be given later.

We now set up a working hypothesis, called by statisticians the null hypothesis, usually hoping that we can show it to be false. When comparing two means or proportions, the working hypothesis will always be that the two means or proportions in the two populations are equal. To test the hypothesis we need to know the value of the test statistic Z, which is calculated by dividing the estimated mean difference by its standard error:

Z = MD/SEMD

Z = 0.7/0.34= 2.059.

The next step depends on the experimental hypothesis, called by statisticians the alternative hypothesis, which we are trying to prove. There are two possibilities. The first is that we know in advance which mean or proportion is likely to be the larger; in our example, we expect, or at least hope, that the treated animals will do better. This is called a one-sided alternative hypothesis.

To illustrate why the hypothesis is one-sided, let us plot the two mean weight gains on a line, thus:

Untreated (5.3 kg) -------- Treated (6.0 kg)

The mean for the treated group is on the right of the mean for the untreated group i.e. it has a larger value. If it had been on the left, i.e. was smaller than the mean for the untreated animals, there would have been no possibility of the experiment supporting the hypothesis that the treatment produced higher weight gains on average. In other words, the result we are testing for can be obtained only if the mean for the treated animals appears on the "correct" side of the mean for the untreated group.

There will be occasions when this restriction is not appropriate. For example, there may be two types of management operating in a particular area, and we may wish to test whether the attack rate of a disease differs with the management regime. This will be true if the rates are sufficiently different, no matter whether the rate under the first management system lies to the left (i.e. is smaller) or to the right (i.e. is larger) of the rate under the second system. This is a two-sided experimental hypothesis, and an example is given in the next section.

If the sample of treated pigs does not have a higher mean, the analysis ends with the statement that there is no evidence that anthelminthics aid weight gain. If the treated sample does better, we need to assess whether the apparent improvement could easily be explained by sampling fluctuations or whether the evidence is so strong that a chance mechanism is an unlikely explanation. The key to the problem is the value of the test statistic Z which has to be compared with a set of fixed numbers, known as critical values of the test statistic (Table 21).

Table 21. Critical values of Z for comparing means or proportions.

| Hypothesis | Significance level: 10% | 5% | 1% | 0.1% |
|---|---|---|---|---|
| One-sided | 1.28 | 1.64 | 2.33 | 3.09 |
| Two-sided | 1.64 | 1.96 | 2.58 | 3.30 |

N.B. This table should be used only if the sample sizes are sufficiently large.

In our example we have used a one-sided experimental hypothesis, since we are investigating whether anthelminthics will increase the rate of weight gain. We will therefore consult the first row of Table 21. The first number in the row is smaller than the value of the test statistic produced by the data. If the test statistic were less than 1.28, we would say that the difference in mean weight gain is not significant. If it were greater than 1.28 but smaller than 1.64 we would say that the difference in mean weight gain is significant at the 10% level but not at the 5% level, and so on. In the present case Z is 2.059, which is greater than 1.64 but less than 2.33, so we can say that the difference in mean weight gain is significant at the 5% level but not at the 1% level. The larger the value of the test statistic, the more significant is the result.

It is an unfortunate perversity of historical statistics that has led to the 5% significance level being "more significant" than the 10% significance level. The significance level is the probability that any apparent difference is due entirely to chance features of the sample. Clearly, the smaller this probability is, the stronger is the support for the experimental hypothesis. If there is a 5% probability that the apparent difference is a random effect, we can be 95% confident that the difference is a real effect. If the probability that the difference is a random effect is 1%, we can be 99% confident that it is a real effect. It is because of this correspondence between significance and confidence levels that the values in Table 16 are identical to those in the bottom row of Table 21.
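The tabulated critical values correspond to quantiles of the standard normal distribution, which is why the two-sided row and the confidence multipliers coincide. The sketch below recomputes them with Python's standard library; the function name `critical_value` is hypothetical.

```python
from statistics import NormalDist


def critical_value(significance, two_sided=True):
    """Critical value of Z for a given significance level (as a fraction)."""
    # A two-sided test splits the significance level between both tails.
    tail = significance / 2 if two_sided else significance
    return NormalDist().inv_cdf(1 - tail)


for level in (0.10, 0.05, 0.01, 0.001):
    one = critical_value(level, two_sided=False)
    two = critical_value(level)
    print(f"{level:.1%}: one-sided {one:.2f}, two-sided {two:.2f}")
```

The computed two-sided value at the 0.1% level is close to 3.29, which the tables round to 3.30.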

If our hypothesis test indicates that there is evidence of a difference, a 95% confidence interval for the size of the difference can be estimated as usual by:

Mean difference ± 1.96 x SEMD i.e. 0.7 ± 1.96 x 0.34.

Hence we could say that we are 95% confident that the use of anthelminthics in pigs in this experiment is associated with an increase in weight gain between 0.034 and 1.366 kg per animal over the relevant time period.
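The test and the interval for the pig example can be sketched as follows. The function name `strictest_significance` is hypothetical, and the standard error of 0.34 is simply taken from the text rather than recomputed.

```python
# Critical values of Z from Table 21, as (significance level in %, critical value),
# listed from the least to the most demanding level.
ONE_SIDED = [(10, 1.28), (5, 1.64), (1, 2.33), (0.1, 3.09)]
TWO_SIDED = [(10, 1.64), (5, 1.96), (1, 2.58), (0.1, 3.30)]


def strictest_significance(z, critical_values):
    """Smallest significance level (%) whose critical value z exceeds, or None."""
    reached = None
    for level, value in critical_values:
        if z > value:
            reached = level
    return reached


mean_difference = 6.0 - 5.3   # treated minus untreated mean gain (kg)
semd = 0.34                   # standard error of the difference, as quoted in the text
z = mean_difference / semd
level = strictest_significance(z, ONE_SIDED)
lower = mean_difference - 1.96 * semd
upper = mean_difference + 1.96 * semd
print(f"Z = {z:.3f}, significant at the {level}% level")
print(f"95% CI for the difference = {lower:.3f} to {upper:.3f} kg")
```

A return value of `None` means that the difference is not significant even at the 10% level.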

5.3.2 Testing for a difference in two proportions

Our second example shows how to test for a difference between two proportions. Suppose that two very large herds are managed under different husbandry systems. Random samples of 45 animals from the first herd, and of 58 animals from the second, were chosen as sentinel groups just before the rainy season began, and the attack rate for a common wet-season complaint was recorded for each group (Table 22).

Table 22. Attack rates of a common wet-season complaint in two sample groups of animals managed under different husbandry systems.

| | No. of susceptible animals | No. of infected animals | Attack rate |
|---|---|---|---|
| Herd 1 | 45 | 18 | 18/45 |
| Herd 2 | 58 | 15 | 15/58 |

The estimated attack rate for the first herd is P1 = 18/45 = 0.4000 and for the second herd it is P2 = 15/58 = 0.2586. The test statistic (Z) appropriate to the working hypothesis of equal attack rates in the two herds is obtained thus:

$$Z = \frac{(P_1 - P_2) - \frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}{\sqrt{P(1-P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

The difference between the sample attack rates (P1 - P2) is calculated by subtracting the smaller estimated attack rate from the larger; n1 and n2 are the two sample sizes; and P is obtained by dividing the total number of infected animals by the total sample size i.e. P = 33/103 = 0.3204.

Substituting for all these values from Table 22 we get:

$$Z = \frac{(0.4000 - 0.2586) - \frac{1}{2}\left(\frac{1}{45} + \frac{1}{58}\right)}{\sqrt{0.3204 \times 0.6796 \times \left(\frac{1}{45} + \frac{1}{58}\right)}} = \frac{0.1217}{0.0927} = 1.31$$

If there is no prior reason to suspect that the attack rate will be higher under one of the two management systems studied, but we simply wish to investigate whether there is a difference, the correct experimental hypothesis is that the herd attack rates may be different. This is a two-sided hypothesis, since either system might give a higher attack rate, and we will test the hypothesis by comparing Z with its critical values in the second row of Table 21. Since the calculated value of Z, 1.31, is less than the first tabulated value, 1.64, we would conclude that the apparent difference in attack rates could be due entirely to random differences in the chosen samples and that the herd attack rates could be the same in the two herds.
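A sketch of the calculation in Python is given below. The function name `two_proportion_z` is hypothetical. The test statistic is assumed to include a continuity correction of half of (1/n1 + 1/n2) in the numerator, because that is the form which reproduces the value 1.31; without the correction the same data would give a Z of about 1.5.

```python
import math


def two_proportion_z(cases1, n1, cases2, n2):
    """Z statistic for the difference between two proportions.

    Assumes a pooled standard error and a continuity correction of
    0.5 * (1/n1 + 1/n2) subtracted from the absolute difference.
    """
    p1, p2 = cases1 / n1, cases2 / n2
    pooled = (cases1 + cases2) / (n1 + n2)
    correction = 0.5 * (1 / n1 + 1 / n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (abs(p1 - p2) - correction) / se


z = two_proportion_z(18, 45, 15, 58)
print(f"Z = {z:.2f}")  # compare with the two-sided critical values in Table 21
```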

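If the test indicates a likely difference, we can calculate an approximate 95% confidence interval for the difference as follows:

$$ (P_1 - P_2) \pm \left[ 1.96 \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}} + \frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \right] $$

where the multiplier (1.96) is chosen from Table 16. Despite having found no real evidence of a significant difference, if we carry out the calculation, we get the interval:

0.1414 ± (1.96 x 0.0929 + 0.0197) = 0.1414 ± 0.2019

i.e. -0.061 to 0.343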

This interval includes the value 0 which indicates the possibility that there is no real difference, a conclusion we have already reached by testing the hypothesis.
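The interval can be checked with the short sketch below. The function name is hypothetical, and the form of the half-width (1.96 standard errors plus a half-unit continuity correction) is an assumption chosen because it reproduces the limits -0.061 and 0.343 given above.

```python
import math

def diff_confidence_interval(x1, n1, x2, n2, multiplier=1.96):
    """Approximate confidence interval for P1 - P2 (assumed form, see lead-in)."""
    p1, p2 = x1 / n1, x2 / n2
    # Standard error of the difference, using each herd's own proportion
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Assumption: a continuity correction is added to the half-width
    half_width = multiplier * se + 0.5 * (1 / n1 + 1 / n2)
    diff = p1 - p2
    return diff - half_width, diff + half_width

low, high = diff_confidence_interval(18, 45, 15, 58)   # counts from Table 22
print(round(low, 3), round(high, 3))
```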

The procedures described for testing whether the mean of a variable, or the proportion of cases, varies between two herds are correct under the assumption that both samples have been collected by simple random sampling. It is not difficult to extend them to more complex sampling schemes, provided that we have an estimate of the relevant quantity for each herd and have also correctly calculated the standard errors of these estimates. We can calculate the standard error of the difference (SED) by:

$$ \mathrm{SED} = \sqrt{SE_1^2 + SE_2^2} $$

where SE1 and SE2 are the standard errors of the two estimates. Note that there is a plus sign under the square root symbol. The test statistic (Z) can be calculated by:

$$ Z = \frac{\text{estimate}_1 - \text{estimate}_2}{\mathrm{SED}} $$

If the sample sizes are fairly large, a test can be carried out by comparing this value with the critical values in Table 21.

5.3.3 Sample size for detecting differences between two proportions in prospective and cross-sectional studies

The detection of a difference between two proportions is often one of the purposes of an epidemiological study. The proportions might be prevalences in a cross-sectional study, attack rates or incidence rates in a cohort study, and so on. Unfortunately, the sample size required will depend on the true values of both proportions, as well as on the significance level at which the test will be carried out and the confidence we require that the difference will be detected.

An approximate formula for the calculation of the sample size (n) required from each group is:

$$ n = \frac{\left[ C_1 \sqrt{2\bar{P}(1 - \bar{P})} + C_2 \sqrt{P_1(1 - P_1) + P_2(1 - P_2)} \right]^2}{(P_1 - P_2)^2} $$

where:

P1, P2 = true values of the proportions in the two populations we wish to compare;

P̄ = 1/2(P1 + P2);

C1 = critical value corresponding to the significance level required (chosen from the bottom row of Table 21); and

C2 = critical value corresponding to the chance we are willing to accept of failing to detect a difference of this type (chosen from the top row of Table 21).

Example: Suppose we are going to try a new farm management method in the hope that it will reduce the incidence of a common disease. We intend to take a sample of animals managed under a "standard" system and another of animals managed under a new system. From previous experience we expect the first group to suffer an attack rate of approximately 20% (i.e. P1 = 0.2). We wish to discover whether this attack rate could be reduced to 15% (i.e. P2 =0.15).

Let us suppose that we would like the difference to be significant at the 5% level (C1 = 1.96) and that we are willing to accept only a 1% probability that the difference will not be detected (C2 = 2.33). Then we find that n = 2120, which means that the total sample is 2 x 2120 = 4240 animals. We can reduce this by increasing to 5% the probability that we fail to detect the difference, and then we get n = 1494 with a total sample of nearly 3000.

The size of the sample depends mostly on the magnitude of the difference we want to detect. If we reduce P2 to 0.1, so that we are now trying to detect the difference between attack rates of 20% and 10%, we find that n = 328 and the total sample size drops from 2988 to 656.
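The three sample sizes quoted in this example can be reproduced with the sketch below. The function name is hypothetical and the formula is the reconstruction that matches the published figures (about 2120, 1494 and 328); treat it as an approximation rather than a definitive rule.

```python
import math

def sample_size_per_group(p1, p2, c1, c2):
    """Approximate number of animals needed in each of the two groups."""
    p_bar = (p1 + p2) / 2
    term_null = c1 * math.sqrt(2 * p_bar * (1 - p_bar))
    term_alt = c2 * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return (term_null + term_alt) ** 2 / (p1 - p2) ** 2

n_a = sample_size_per_group(0.20, 0.15, c1=1.96, c2=2.33)  # 1% chance of missing the difference
n_b = sample_size_per_group(0.20, 0.15, c1=1.96, c2=1.64)  # 5% chance of missing the difference
n_c = sample_size_per_group(0.20, 0.10, c1=1.96, c2=1.64)  # larger difference to detect
print(round(n_a, 1), round(n_b, 1), round(n_c, 1))
```

Halving the target attack rate instead of reducing it by a quarter cuts the required sample to roughly a fifth, which illustrates how strongly n depends on the size of the difference.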

The formula given above will slightly underestimate the sample size for studies in which the animals are not paired, and may overestimate it slightly for studies where they are. However, given the degree of arbitrariness which will usually be involved in assuming values for the true proportions, it is to be expected that the indicated sample size will never be better than a rough approximation.

5.3.4 Sample size for detecting differences between two proportions in retrospective studies

The procedure for estimating the sample sizes required in case-control studies is similar to that described in the previous section. However, there is one important exception: unlike in cohort studies where one is comparing the proportions of disease in two groups - one with and one without the determinant - in case-control studies one is comparing the proportions with the determinant in two groups - one with the disease (cases) and one without the disease (controls).

The formula for the sample size required in these studies is the same as that given in Section 5.3.3, with the exception that P1 and P2 now refer to the proportions with the determinant in the two populations we wish to consider.

5.3.5 Testing for differences in prevalence between several groups simultaneously

We may want to consider the question of whether several herds or other groups of animals suffer from the same prevalence of a given disease. The technique will be demonstrated using an example involving three groups, but it is easily extended to as many groups as may be required. From Table 23 we see that the sample prevalences from the three herds are not exactly equal. We would not expect them to be, even if the herd prevalences were the same, because of fluctuations in random sampling.

Table 23. Prevalences of a disease in samples of animals taken from three different herds.

                          Herd 1   Herd 2   Herd 3   Total
Size of sample            68       52       73       193
No. of infected animals   12       11       20       43
Sample prevalence         0.176    0.212    0.274    0.223

The question we would like to resolve is whether the differences are sufficiently large in the samples to indicate a real difference in the herds from which they were drawn. To answer this, we must first present the data in the slightly different form of Table 24, in which each animal of the overall sample contributes to one and only one of the cells of the table. Such a table of frequency counts is often called a contingency table.

We now calculate the numbers which we would expect to see in the different cells of the table if a total of 43 infected animals and 150 animals free of infection were to be found in samples of 68, 52 and 73, respectively, from three herds with the same disease prevalences. These numbers are called expected frequencies (Table 25), and they have been calculated using the following simple rule:

The expected frequency ei,j of the cell in the i-th row and the j-th column of a table is obtained by multiplying the total of the i-th row, ri, by the total of the j-th column, cj, and dividing the product by the grand total, N. Symbolically, we can write:

ei,j = (ri x cj) / N

Example: The expected frequency of the cell in the first row and second column in Table 25 is:

e1,2 = (r1 x c2) / N = (150 x 52) / 193 = 40.4

This is very similar to the observed frequency O1,2 of the same cell in Table 24, which was 41.

Table 24. Contingency table based on the data from Table 23.

                              Herd 1    Herd 2    Herd 3    Total
No. of animals not infected   56        41        53        150 (r1)
No. of animals infected       12        11        20        43 (r2)
Total                         68 (c1)   52 (c2)   73 (c3)   193 (N)

Table 25. Expected frequencies for Table 24.

                              Herd 1   Herd 2   Herd 3   Total
No. of animals not infected   52.8     40.4     56.7     150
No. of animals infected       15.2     11.6     16.3     43
Total                         68.0     52.0     73.0     193

Note: The row and column totals of the expected frequencies will be the same as for the original contingency table of observed frequencies, except for small rounding errors. For example, the total for row 1 seems to be 149.9 instead of 150, but this is because we have rounded all the expected frequencies to one decimal place.

The next step is to calculate a measure of the deviation of the observed frequency from the expected frequency for each cell. We do this by squaring the difference between the observed and expected frequencies and dividing the result by the expected frequency of the cell. Thus:

Deviance = (Observed frequency - Expected frequency)² / Expected frequency

Using this formula the deviance for the cell in the first row and first column of Table 24 is:

(56 - 52.8)² / 52.8 = 0.19

Table 26 shows deviances for all the cells in Table 24.

Table 26. Deviances for Table 24.

               Herd 1   Herd 2   Herd 3   Total
Not infected   0.19     0.01     0.24     0.44
Infected       0.67     0.03     0.84     1.54
Total          0.86     0.04     1.08     1.98

The working hypothesis will be that the herd prevalences are effectively the same. The experimental hypothesis is that there is some difference between herds. The test statistic is the total deviance, 1.98. As usual, this will have to be compared with a set of critical values which, in turn, depend on a quantity called degrees of freedom (df). For any table, this quantity is calculated as follows:

df = (number of rows - 1) x (number of columns - 1)

For Table 26: df = (2 - 1) x (3 - 1) = 1 x 2 = 2

The critical values of the test statistic, called the chi-square statistic, can be found in Table 27.

Table 27. Critical values of the chi-square statistic.

      Significance level
df    10%     5%      1%      0.1%
1     2.71    3.84    6.63    10.83
2     4.61    5.99    9.21    13.82
3     6.25    7.80    11.34   16.27
4     7.78    9.49    13.28   18.47
5     9.23    11.07   15.09   20.52
6     10.64   12.59   16.81   22.46

The value resulting from our contingency table is 1.98 with 2 degrees of freedom. If we consult the second row of Table 27, we see that 1.98 is smaller than the 10% value, 4.61, and conclude that there is not sufficient support in the data for the experimental hypothesis and that, until further data are obtained, we must assume that the herd prevalences could be equal. If the chi-square value had been between 5.99 and 9.21, for example, we would find that there was a difference in the herd prevalences at the 5% significance level. The test may not be valid if some of the expected values are rather small. A useful guideline is that the expected values for each of the cells should be at least 5.
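The whole procedure (expected frequencies, deviances, total and degrees of freedom) can be sketched as follows. The function name is hypothetical. Because the expected frequencies are not rounded to one decimal place here, the total comes out at about 1.98 to 1.99 rather than exactly the hand-calculated 1.98.

```python
def chi_square(observed):
    """Total deviance and degrees of freedom for a table of frequency counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total  # e_ij = r_i x c_j / N
            statistic += (obs - expected) ** 2 / expected           # deviance of the cell
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

table_24 = [[56, 41, 53],   # animals not infected, herds 1 to 3
            [12, 11, 20]]   # animals infected, herds 1 to 3
statistic, df = chi_square(table_24)
print(round(statistic, 2), df)
```

The statistic is then compared with the row of Table 27 for the calculated degrees of freedom.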

A similar analysis can be carried out on sample attack rates or any other rate or proportion based on simple random samples from different groups of animals. The problem with the chi-square test is that, if a difference is indicated, it is rather difficult to estimate the extent of the difference without the help of a statistician.

Let us test once again whether the two attack rates given in Table 22 are equal, using this time a chi-square test. Table 28 is a two-by-two contingency table based on Table 22. The figures in parentheses give the expected values for each cell.

Table 28. Two-by-two contingency table based on Table 22.

                              Herd 1      Herd 2      Total
No. of animals not infected   27 (30.6)   43 (39.4)   70
No. of animals infected       18 (14.4)   15 (18.6)   33
Total                         45          58          103

When the contingency table has only 2 rows and 2 columns, a slight modification has to be made in the calculation of the chi-square statistic. The deviance for each cell is calculated by finding the difference between the observed and expected value as before, but now always subtracting the smaller of these values from the larger. Before the difference is squared, it is reduced by 0.5. The remainder of the calculation is carried out exactly as before.

One point to note in a 2 x 2 table is that the difference between observed and expected frequency (ignoring signs) is the same for all four cells. In our example it is 3.6 for each cell. This has to be reduced by 0.5 i.e. 3.6 - 0.5 = 3.1. For each cell the reduced value is squared and divided by the expected value to obtain the deviance. The four deviances are then summed to give the value of the chi-square statistic thus:

3.1²/30.6 + 3.1²/14.4 + 3.1²/39.4 + 3.1²/18.6 = 1.74

Comparing this value with the first row of Table 27, we see that it is not significant and we reach the same conclusion as we did in Section 5.3.2, namely that the evidence does not give sufficient grounds to reject the hypothesis that the attack rates are equal in the two herds. In fact there is an exact correspondence between this chi-square test and the test carried out in Section 5.3.2. The value of Z we obtained there was 1.31 which is, apart from rounding error, the square root of 1.74. The value of Z which arises from that test will always be equal to the square root of the chi-square statistic based on the corresponding 2 x 2 contingency table. Furthermore, the values in the lower row of Table 21 are the square roots of the values in the first row of Table 27. As a result, the two tests are exactly the same.
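The modified calculation for a 2 x 2 table, and its link with the Z statistic, can be illustrated with the following sketch (the function name is hypothetical). Without rounding the expected values, the statistic is about 1.72 instead of 1.74, and its square root is about 1.31, the value of Z found in Section 5.3.2.

```python
import math

def chi_square_2x2_corrected(observed):
    """Chi-square statistic for a 2 x 2 table, reducing each |O - E| by 0.5."""
    (a, b), (c, d) = observed
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (abs(obs - expected) - 0.5) ** 2 / expected
    return statistic

table_28 = [[27, 43],   # animals not infected, herds 1 and 2
            [18, 15]]   # animals infected, herds 1 and 2
chi2 = chi_square_2x2_corrected(table_28)
z_equivalent = math.sqrt(chi2)
print(round(chi2, 2), round(z_equivalent, 2))
```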

5.3.6 Testing for differences in several means simultaneously

It is likewise possible to test the working hypothesis that several sample means are equal against the experimental hypothesis that there are some real differences. The technique is known as the analysis of variance (ANOVA) and can be found in most general statistical textbooks. A description of the technique is beyond the scope of this manual. The important point to realise is that it is not correct to compare the means of several different samples two at a time using the procedure described in Section 5.3.1.

5.4 Formulating and testing hypotheses in small-sized samples

All the procedures that have been recommended for comparing two groups depend on having a reasonably large sample size. The following points should be noted carefully:

i) When comparing two prevalences or attack rates, there must be at least five cases observed in both groups of animals for the test to be valid.

ii) When comparing ratios or proportions or rates across several groups by means of the chi-square test, all the expected values should be at least 5, as noted in Section 5.3.5.

iii) When comparing two means, the combined sample size should be greater than 40. If it is less than 40, the same calculations are carried out, but the value of the test statistic, usually called the t-statistic, should be compared with the critical values given in Table 29 and not with those in Table 21.

Table 29. Critical values of the t-statistic.

      One-sided test             Two-sided test
df    5%      1%      0.1%       5%      1%      0.1%
1     6.31    31.80   318.00     12.70   63.72   637.00
2     2.92    6.96    22.31      4.30    9.92    31.61
3     2.35    4.54    10.20      3.18    5.84    12.88
4     2.13    3.75    7.17       2.78    4.60    8.61
5     2.02    3.36    5.89       2.57    4.03    6.87
6     1.94    3.14    5.21       2.45    3.71    5.96
7     1.89    3.00    4.79       2.36    3.50    5.41
8     1.86    2.90    4.50       2.31    3.36    5.04
9     1.83    2.82    4.30       2.26    3.25    4.78
10    1.81    2.76    4.14       2.23    3.17    4.59
12    1.78    2.68    3.93       2.18    3.05    4.32
15    1.75    2.60    3.73       2.13    2.95    4.07
20    1.72    2.53    3.55       2.09    2.85    3.85
25    1.71    2.48    3.45       2.06    2.79    3.73
30    1.70    2.46    3.39       2.04    2.75    3.65
40    1.68    2.42    3.31       2.02    2.70    3.55

Like the chi-square statistic, the critical values of the t-statistic depend on the quantity known as "degrees of freedom" which, for this test, are calculated as the sum of the two sample sizes minus 2.

Example: Suppose that the experiment with anthelminthics had been carried out on two smaller groups comprising 23 treated and 19 untreated pigs. The mean weight gains for the two groups were 6.1 and 5.4 kg, respectively, and the sample standard deviations were 1.72 and 1.64. Then, using the formula already given, the standard deviation of the mean difference is:

SMD = 0.522

The difference in the two means is 6.1 - 5.4 = 0.7 kg. The test statistic is 0.7/0.522 = 1.34. The degrees of freedom are 23 + 19 - 2 = 40. We now compare the value of the test statistic, 1.34, with the last row of Table 29, and see that the weight gain is not significant at the 5% level. We cannot conclude, therefore, on such evidence that treatment by anthelminthics will cause a general weight increase. It could simply be that, by chance, naturally faster growing animals were chosen to receive the drugs.
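The figures in this example can be reproduced from the summary statistics alone. The sketch below assumes that the formula referred to is the pooled-variance form, since that form gives SMD = 0.522; the function name is hypothetical.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic from means, standard deviations and sample sizes."""
    df = n1 + n2 - 2
    # Assumption: the two sample variances are pooled, weighted by their df
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    smd = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return smd, (mean1 - mean2) / smd, df

smd, t, df = pooled_t(6.1, 1.72, 23, 5.4, 1.64, 19)
print(round(smd, 3), round(t, 2), df)
```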

5.5 Matched comparisons

The sensitivity of statistical hypothesis tests carried out to compare two treatments, or a treatment with a control, can be greatly increased if, instead of choosing two independent samples receiving different treatments, the two samples are chosen in pairs so that the two animals in each pair are as alike as possible. Consider again the study of the use of anthelminthics in pigs. This could have been carried out by matching pigs for sex, initial body weight etc. Let us suppose that this has been done for 10 pairs of pigs to give the results in Table 30.

Table 30. Weight gains in 10 matched pairs of pigs.

Pair                 Treated (Y)   Untreated (X)   Difference (d)
1                    6.1           5.7             0.4
2                    5.2           5.3             -0.1
3                    5.4           4.8             0.6
4                    5.9           5.2             0.7
5                    6.3           6.4             -0.1
6                    6.0           6.3             -0.3
7                    5.7           5.1             0.6
8                    5.1           4.8             0.3
9                    6.2           5.1             1.1
10                   5.9           5.0             0.9
Mean                 5.78          5.37            0.41
Standard deviation   0.4185        0.5774          0.4606

The analysis for such paired comparisons is carried out by considering the individual differences, d, between the two animals of each pair. The test statistic is calculated from the formula:

$$ t = \frac{\bar{d}}{S_d / \sqrt{n}} $$

where:

d̄ = sample mean of the differences,
Sd = sample standard deviation of the differences, and
n = number of pairs.

Note that when adding the differences to calculate d̄ it is important to take into account whether each difference is positive or negative. From Table 30 we see that d̄ = 0.41, Sd = 0.4606 and n = 10. Hence, the test statistic is:

$$ t = \frac{0.41}{0.4606 / \sqrt{10}} = 2.81 $$

with 10 - 1 = 9 degrees of freedom. If we now consult Table 29, we see that the calculated value far exceeds the 5% significance value for a one-sided test with 9 degrees of freedom (1.83) and is practically equal to the corresponding 1% value (2.82). There has been, therefore, a significantly higher weight gain in the treated animals.

If we had ignored the pairing and carried out the test described in Section 5.4, we would have obtained a value of t = 1.82 with 18 degrees of freedom, which is only just significant at the 5% level. Matching the animals has sufficiently increased the precision of the measurement of the difference in weight gain to affect the inference we make from the experiment.
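The contrast between the paired and unpaired analyses of Table 30 can be seen by computing both statistics from the raw weight gains. This is a minimal sketch using Python's standard `statistics` module; the unpaired version assumes a pooled variance, which reproduces the t = 1.82 quoted above.

```python
import math
import statistics

treated = [6.1, 5.2, 5.4, 5.9, 6.3, 6.0, 5.7, 5.1, 6.2, 5.9]
untreated = [5.7, 5.3, 4.8, 5.2, 6.4, 6.3, 5.1, 4.8, 5.1, 5.0]
n = len(treated)

# Paired analysis: work only with the within-pair differences (signs kept)
diffs = [y - x for y, x in zip(treated, untreated)]
t_paired = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Unpaired analysis: pooled variance of two equal-sized groups (assumption)
pooled_var = (statistics.variance(treated) + statistics.variance(untreated)) / 2
smd = math.sqrt(pooled_var * (2 / n))
t_unpaired = (statistics.mean(treated) - statistics.mean(untreated)) / smd

print(round(t_paired, 2), round(t_unpaired, 2))
```

The mean difference is the same in both analyses (0.41 kg); only the estimate of its variability changes, which is why matching raises the test statistic.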

Similar gains in precision can be obtained in case-control studies carried out to examine possible determinants of disease. Suppose that 100 cases and their paired controls were examined for the presence or absence of a suspected determinant, and this determinant was found to be:

- present in both the case and control individuals in 70 pairs;
- present in the control but absent in the case individuals in 5 pairs;
- absent in both the case and control individuals in 10 pairs;
- absent in the control but present in the case individuals in 15 pairs.

These results could be summarised in tabular form as was done in Table 31.

Table 31. Results of a paired case-control study of the effect of a suspected determinant on the occurrence of a disease.

                   Cases
Controls           Factor present   Factor absent   Total
Factor present     70               5               75
Factor absent      15               10              25
Total              85               15              100

It would be wrong to analyse this table as though it were a contingency table. An appropriate test would be McNemar's test, which can be carried out as follows:

· Find the difference, D, in the frequencies of the two categories for which the case and its control are not in agreement with respect to the factor. Thus, for Table 31, D = 15 - 5 = 10.

· Find the sum, S, of the same two frequencies. S = 15 + 5 = 20.

· The test statistic is (D - 1)²/S = 81/20 = 4.05, where D is taken as positive. This statistic should always be compared with the critical values of the chi-square statistic with one degree of freedom (see Table 27).

Since 4.05 is greater than 3.84, the result of the test is that there is a difference at the 5% significance level between the cases and the controls with respect to the presence or absence of the factor.
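The three steps of McNemar's test translate directly into code. The function and argument names below are hypothetical; only the discordant pairs (15 and 5 in Table 31) enter the calculation.

```python
def mcnemar_statistic(case_only, control_only):
    """McNemar's statistic from the two counts of discordant pairs."""
    d = abs(case_only - control_only)   # difference D, taken as positive
    s = case_only + control_only        # sum S of the discordant pairs
    return (d - 1) ** 2 / s

statistic = mcnemar_statistic(case_only=15, control_only=5)
significant_at_5_percent = statistic > 3.84   # chi-square critical value, 1 df (Table 27)
print(statistic, significant_at_5_percent)
```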

If the pairs are ignored, the data can be presented in a contingency table (Table 32).

Table 32. Two-by-two contingency table of the results of a case-control study.

           Factor present   Factor absent   Total
Cases      85               15              100
Controls   75               25              100
Total      160              40              200

Using the procedure given earlier for analysing such tables, the expected frequencies are as shown in Table 33.

The total deviance is 4.5²/80 + 4.5²/80 + 4.5²/20 + 4.5²/20 = 2.53 with one degree of freedom. This is not significant at the 5% level.

Table 33. Two-by-two contingency table of expected frequencies of disease, derived from Table 32.

           Factor present   Factor absent   Total
Cases      80               20              100
Controls   80               20              100
Total      160              40              200

5.6 A word of warning

There is no such thing as a working or null hypothesis that is exactly true. It is most unlikely, for example, that the use of anthelminthics in pigs bred in an environment where helminths are endemic will have no effect on weight gain whatsoever. The result of a hypothesis test will depend on:

· The extent to which the null hypothesis is incorrect;
· The natural variability in the population studied; and
· The size of the sample observed.

It is always possible to obtain a statistically significant result by choosing a large enough sample. Even if, on average, a prophylactic induced an extra weight gain of only one-tenth of a gram per year, a large enough sample would cause the null hypothesis of no gain to be rejected. It follows that no study is complete without giving some estimate of the magnitude of the effects it claims to have detected. Only then will it be possible to judge the economic value of a treatment, change in husbandry method etc.

5.7 Linear correlation and regression

In epidemiological studies we are very often interested in exploring a relationship between two variables. For example, selenium is an essential nutritional element in the ovine diet, and disorders arise as a result of selenium deficiency. It is therefore of interest to have some measure of blood selenium levels. Unfortunately, the direct assessment of selenium concentration is lengthy and requires expensive and unusual equipment. The whole-blood selenium concentration (gram atoms per million per litre) is closely related, however, to glutathione peroxidase activity (enzyme units per milligram of haemoglobin), as can be seen in Figure 8.

Figure 8. Plot of whole-blood selenium concentration against glutathione peroxidase activity in 10 randomly selected sheep.

The measured values which were used to construct this graph are given in Table 34.

Table 34. Whole-blood selenium concentration (Y) and glutathione peroxidase activity (X) in 10 randomly selected sheep.

Sheep   Y     X
1       2.6   22.1
2       3.1   32.8
3       1.3   10.1
4       3.2   35.4
5       2.0   21.2
6       0.4   4.8
7       2.7   21.2
8       3.8   37.9
9       1.2   8.3
10      3.6   35.1

The points in the graph have a suggestively linear form, and it is possible to draw a straight line which comes close to passing through them. We have drawn in this line in the figure. Before explaining how to calculate it, we will discuss a measure of the degree to which the relationship between two variables can be described by a straight-line graph. This measure is called the product-moment coefficient of linear correlation or, sometimes, Pearson's correlation coefficient (r).

To obtain this coefficient, we first have to calculate a quantity known as the sample covariance of the two variables X and Y from the formula:

$$ \mathrm{cov}(X,Y) = \frac{\sum XY - n\bar{X}\bar{Y}}{n - 1} $$

where: n = number of pairs (X,Y) studied,
X̄, Ȳ = the sample means of X and Y, and
ΣXY = the sum of products obtained by multiplying together the two observations of each pair and adding the products.

From Table 34 we have:

ΣXY = 22.1 x 2.6 + 32.8 x 3.1 + ... + 35.1 x 3.6 = 667.45

X̄ = 22.89

Ȳ = 2.39

n = 10

cov(X,Y) = (667.45 - 10 x 22.89 x 2.39) / 9 = 13.38

The correlation coefficient is then calculated as:

r = cov(X,Y) / (Sx x Sy)

where:

Sx = sample standard deviation of the observed values of X, and
Sy = sample standard deviation of the observed values of Y.

For this example, Sx = 12.20 and Sy= 1.13, so that:

r=13.38/(12.20 x 1.13) = 0.971

The value of r always lies between -1 and 1. A value close to 0 implies that the two variables are not linearly related, while a value close to 1 or -1 means that it is possible to draw a straight line in such a way that it will come close to the plotted data points, as in Figure 8.

A positive correlation implies that the variables X and Y tend to increase or decrease together, while a negative correlation implies that as one increases the other decreases. The value of r² gives the proportion of the variation in one variable which is due to variation in the other. However, a high statistical correlation between two variables does not necessarily mean that one is the cause of the other. The correlation between two variables may be due to the fact that they have a common cause rather than that they are directly related.

In our example, r² = 0.94 and we can say that 94% of the variation in enzyme activity is "due to" or is "explained by" the variation in blood selenium concentration in the observed animals. This suggests that it ought to be possible to get good information about blood selenium from the measurement of enzyme activity, a result already indicated by the rather good fit of the straight line to the sample points in Figure 8.
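The covariance, the two standard deviations and the correlation coefficient for Table 34 can be recomputed with the sketch below. The helper name `sample_cov` is hypothetical; it uses the sum-of-products form of the covariance, and the covariance of a variable with itself gives its sample variance.

```python
import math

x = [22.1, 32.8, 10.1, 35.4, 21.2, 4.8, 21.2, 37.9, 8.3, 35.1]  # enzyme activity
y = [2.6, 3.1, 1.3, 3.2, 2.0, 0.4, 2.7, 3.8, 1.2, 3.6]          # blood selenium

def sample_cov(u, v):
    """Sample covariance: (sum of products - n x mean_u x mean_v) / (n - 1)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    return (sum(a * b for a, b in zip(u, v)) - n * mean_u * mean_v) / (n - 1)

cov_xy = sample_cov(x, y)
s_x = math.sqrt(sample_cov(x, x))   # sample standard deviation of X
s_y = math.sqrt(sample_cov(y, y))   # sample standard deviation of Y
r = cov_xy / (s_x * s_y)
print(round(cov_xy, 2), round(s_x, 2), round(s_y, 2), round(r, 3), round(r ** 2, 2))
```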

Any straight line can be represented by the formula:

Y = a + bX where:

Y = the variable plotted on the vertical axis,
X = the variable plotted on the horizontal axis, and
a,b = constants which define a particular straight line.

In our case a is the value of Y when X = 0, i.e. the point where the line crosses the Y axis, and b describes the slope of the line. If there is an exact linear relationship between X and Y, all pairs of points will lie on a single line and there will be only one possible value for a and one for b. When the points do not lie exactly on a straight line, there are several possible ways to define what is meant by the "best-fitting line" or the line that runs "closest" to the points. The values of a and b, which give the line known as the least squares regression line, are usually calculated using the formulae:

$$ b = \frac{\mathrm{cov}(X,Y)}{S_x^2}, \qquad a = \bar{Y} - b\bar{X} $$

For the data in Table 34 we then have:

b = 13.38 / 12.20² = 0.09

a = 2.39 - 0.09 x 22.89 = 0.33

which gives the fitted regression line:

Y = 0.33 + 0.09X

Given any enzyme activity score (X) we can now estimate the corresponding value of the blood selenium concentration (Y) using the regression formula. For example, if a sheep has an enzyme activity of 32.8, we would predict that its blood selenium concentration is Y = 0.33 + 0.09 x 32.8 = 3.28. The observed concentration for an animal in this sample with this enzyme activity level was 3.10. The value 0.09 is the estimated slope or gradient of the regression line and indicates the change in selenium concentration which corresponds to a change of one unit of enzyme activity.
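A least squares fit to the data of Table 34 can be sketched as follows. The slope is computed from sums of squared deviations, which is equivalent to dividing the covariance by the variance of X; the function name `predict` is hypothetical.

```python
x = [22.1, 32.8, 10.1, 35.4, 21.2, 4.8, 21.2, 37.9, 8.3, 35.1]  # enzyme activity
y = [2.6, 3.1, 1.3, 3.2, 2.0, 0.4, 2.7, 3.8, 1.2, 3.6]          # blood selenium

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)

b = s_xy / s_xx            # slope: change in selenium per unit of enzyme activity
a = mean_y - b * mean_x    # intercept: value of Y when X = 0

def predict(activity):
    """Estimated blood selenium for a given enzyme activity."""
    return a + b * activity

print(round(a, 2), round(b, 2), round(predict(32.8), 2))
```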

As always, whenever we make an estimate, we would like to know how good that estimate may be. We can obtain a 95% confidence interval for the blood selenium of any animal with an enzyme activity X as follows:

$$ \hat{Y} \pm t \times S_r \times \sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{(n - 1)S_x^2}} $$

where:

· Ŷ = a + bX is the value predicted by the regression line;

· Sr = the residual standard deviation calculated by:

$$ S_r = \sqrt{\frac{n - 1}{n - 2}\, S_y^2 (1 - r^2)} $$

· The multiplier, t, is chosen from Table 29 with n-2 degrees of freedom.

In this example X = 32.8 and we can say with 95% confidence that the selenium content lies in the interval:

3.28 ± 2.31 x 0.29 x 1.083
i.e. 2.55 to 4.01.

This interval may seem too wide to be useful. Part of the problem is that the estimation of the regression line is based on observations of only 10 animals. If a regression line is to be used in this way, it ought to be based on a much larger sample.
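The interval can be recomputed from the raw data of Table 34 with the sketch below. It assumes the usual prediction-interval form for a single new animal, which reproduces the factor 1.083 used above; names such as `prediction_interval` are hypothetical. Because no intermediate value is rounded, the limits come out slightly narrower (roughly 2.57 to 3.99) than the hand-calculated 2.55 to 4.01.

```python
import math

x = [22.1, 32.8, 10.1, 35.4, 21.2, 4.8, 21.2, 37.9, 8.3, 35.1]  # enzyme activity
y = [2.6, 3.1, 1.3, 3.2, 2.0, 0.4, 2.7, 3.8, 1.2, 3.6]          # blood selenium
T_MULTIPLIER = 2.31   # Table 29, two-sided 5%, n - 2 = 8 degrees of freedom

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / s_xx
a = mean_y - b * mean_x

# Residual standard deviation: scatter of the points about the fitted line
residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_r = math.sqrt(residual_ss / (n - 2))

def prediction_interval(x0):
    """Approximate 95% interval for one animal with enzyme activity x0."""
    centre = a + b * x0
    half = T_MULTIPLIER * s_r * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / s_xx)
    return centre - half, centre + half

low, high = prediction_interval(32.8)
print(round(s_r, 2), round(low, 2), round(high, 2))
```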

5.8 Time series

An epidemiologist will frequently be interested in examining the manner in which certain variables vary over time.

Example: Table 35 gives hypothetical values of neonatal deaths per month in a large pig-breeding project over 9 years. At first glance it appears that there may have been a general increase in the number of deaths per month between the beginning of 1974 and the end of 1982, and that there were seasonal variations during the year.

Table 35. Hypothetical neonatal mortalities in piglets by month and year.

Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep    Oct   Nov   Dec
1974   359   361   363   455   472   545   598   729   874    587   483   380
1975   336   361   366   465   522   534   651   598   794    782   449   347
1976   308   329   354   391   467   633   846   950   989    830   676   531
1977   368   373   396   393   483   561   860   906   1095   780   764   543
1978   352   370   384   426   481   619   819   929   1090   805   711   559
1979   380   409   423   428   476   656   826   886   1058   803   725   543
1980   403   412   414   432   485   605   837   959   1152   773   784   515
1981   405   400   396   432   552   667   892   971   1076   821   789   570
1982   432   437   462   460   543   720   961   994   1042   890   780   573

A common approach to the analysis of such data is to try to examine separately the two major likely causes of variation - the gradual general increase or decrease (trend) from one year to another, and seasonal variations within each year. There are several different methods for doing this, but they will all give similar results to the method outlined below.

The first step is to estimate the linear trend. This can be done by fitting a linear regression line to the monthly means calculated over complete calendar years:

Year (X)   1       2       3       4       5       6       7       8       9
Mean (Y)   517.2   517.1   608.7   626.8   628.7   634.4   647.6   664.2   691.2

Note that the years 1974-1982 have been coded simply as 1, 2, etc.

We then calculate the least squares regression line of mean deaths on year number to get the trend line:

Y = 513.2 + 20.38X

The slope of the line, 20.38, tells us that the monthly deaths are increasing at an average rate of just over 20 each year. In other words, the number of deaths in a given month will be about 20 more than the number in the same month in the previous year. This does not necessarily imply that the death rate is increasing: the increase in the number of deaths could simply be a response to an increase in the total number of births.

Having obtained a measure of the rate of increase, the trend, it would now be useful to have some information about the magnitude of the seasonal effects. These can be estimated by considering the extent to which the observed deaths for each month differ from the corresponding value on the trend line.

The first step is to calculate the value of the trend line corresponding to each calendar month. We will exemplify the procedure by carrying out the calculations for all the months of January in the sample. Note first that the trend line was calculated using mid-year averages centered on the end of June each year. The value corresponding to each month should be centered in the middle of that month. For example, the middle of January 1974 is five and a half months or 5.5/12 = 0.46 years before the end of June 1974. Since the value "1.0 years" on the time axis corresponds to the end of June, 1.0 - 0.46 = 0.54 will correspond to mid-January, and the corresponding trend value will be:

Y = 513.2 + (20.38 x 0.54) = 524.2

The number of deaths in January 1974 was 359. The ratio of the observed number of deaths to the number predicted by the trend line in the middle of the month is called the specific seasonal, and its value for January 1974 is 359/524.2 = 0.68.

The point on the time axis corresponding to January 1975 is 2 - 0.46 = 1.54 and the corresponding trend value is:

Y = 513.2 + (20.38x 1.54) = 544.6

The number of deaths observed in January 1975 was 336 and the specific seasonal is 336/544.6 = 0.62. Proceeding in this manner, we can calculate the specific seasonals for any month. The specific seasonals for January in each of the study years are:

Year                1974   1975   1976   1977   1978   1979   1980   1981   1982
Specific seasonal   0.68   0.62   0.55   0.63   0.58   0.61   0.62   0.61   0.63

Averaging the specific seasonals for a given month over all the years in which it appears gives the typical seasonal for that month. The typical seasonal for January will be:

(0.68+0.62+0.55+0.63+0.58+0.61 +0.62+0.61 +0.63)/9= 0.61

The combined use of the typical seasonal and the trend line allows us to "predict" the number of deaths to be expected in January 1983. The trend line value will be:

Y= 513.2 + 20.38x9.54 = 707.6

The value of the seasonal tells us that the number of deaths in any January is only about 61% of the value suggested by the trend line. The prediction would be to expect about 707.6 x 0.61 = 432 deaths in January 1983. The accuracy of such a prediction depends on how stable both the trend and the seasonal effects are. The farther into the future we try to predict, the less faith we should have in the quality of the prediction.
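The whole procedure (yearly means, trend line, specific and typical seasonals, and a forecast) can be sketched as follows for the data of Table 35. Function names are hypothetical. The sketch keeps full precision throughout, so its January forecast lands in the low 430s rather than at exactly 432, which the text obtains after rounding the offset to 0.46 and the seasonal to 0.61.

```python
deaths = {
    1974: [359, 361, 363, 455, 472, 545, 598, 729, 874, 587, 483, 380],
    1975: [336, 361, 366, 465, 522, 534, 651, 598, 794, 782, 449, 347],
    1976: [308, 329, 354, 391, 467, 633, 846, 950, 989, 830, 676, 531],
    1977: [368, 373, 396, 393, 483, 561, 860, 906, 1095, 780, 764, 543],
    1978: [352, 370, 384, 426, 481, 619, 819, 929, 1090, 805, 711, 559],
    1979: [380, 409, 423, 428, 476, 656, 826, 886, 1058, 803, 725, 543],
    1980: [403, 412, 414, 432, 485, 605, 837, 959, 1152, 773, 784, 515],
    1981: [405, 400, 396, 432, 552, 667, 892, 971, 1076, 821, 789, 570],
    1982: [432, 437, 462, 460, 543, 720, 961, 994, 1042, 890, 780, 573],
}
years = sorted(deaths)
codes = list(range(1, len(years) + 1))              # 1974 -> 1, 1975 -> 2, ...
means = [sum(deaths[yr]) / 12 for yr in years]      # monthly mean for each year

# Least squares trend line of the yearly means on the year code
mean_x, mean_y = sum(codes) / len(codes), sum(means) / len(means)
slope = (sum((c - mean_x) * (m - mean_y) for c, m in zip(codes, means))
         / sum((c - mean_x) ** 2 for c in codes))
intercept = mean_y - slope * mean_x

def trend(code, month_index):
    """Trend value at the middle of a month (month_index 0 = January)."""
    # The middle of month m lies (m - 5.5)/12 years from the end of June
    return intercept + slope * (code + (month_index - 5.5) / 12)

def typical_seasonal(month_index):
    """Average over all years of observed deaths / trend value for one month."""
    ratios = [deaths[yr][month_index] / trend(c, month_index)
              for c, yr in zip(codes, years)]
    return sum(ratios) / len(ratios)

january = typical_seasonal(0)
forecast_jan_1983 = trend(len(years) + 1, 0) * january
print(round(intercept, 1), round(slope, 2), round(january, 2), round(forecast_jan_1983))
```

Passing another month index to `typical_seasonal` gives the seasonal for that month, so the same sketch covers the whole calendar year.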

