# 3. The use of descriptive statistics in the presentation of epidemiological data

3.1 Introduction
3.2 Tables and graphs
3.3 Bar and pie charts
3.4 Classification by variable
3.5 Quantification of disease events in populations
3.6 Methods of summarising numerical data

## 3.1 Introduction

Evidence of the presence, nature and severity of a disease will usually be contained in statistical data of some kind. These may take the form of counts of the numbers of diseased animals, physical measurements of a sample of animals, the measurement of one or more biological variables that are likely to be affected by the presence of the disease, and so on. Any report on the disease will have to include at least a descriptive presentation of the statistical evidence.

There are several basic methods and measures which are commonly used to display and summarise sets of data. The choice of technique used depends mainly on the kind of data involved. Data come in two main categories - categorical (discrete) and continuous (numerical) data. Categorical data are data that can be allocated to distinct categories, and normally take the form of counts. Categorical data found in epidemiology may take the form of dichotomous data i.e. data that can have only two values (e.g. diseased or non-diseased, infected or non-infected). Continuous data consist primarily of measurements, which, although they can be classified into defined categories, have the theoretical possibility of being infinitely subdividable. For example, the weight of a chicken could be 1.45 kg, 1.453 kg, 1.45327856 kg etc.

In this chapter we will be looking at some of the more common and useful methods for summarising both categorical and continuous data.

## 3.2 Tables and graphs

Table 1 consists of the liveweights of 150 chickens selected randomly in a large market during a day on which approximately 4000 chickens were sold.

Table 1. Weights (kg) of a sample of 150 chickens sold in a market.

 1.4 1.09 1.74 1.48 1.82 1.09 1.52 1.41 1.83 1.22 1.34 1.68 1.25 1.65 1.14 1.33 1.06 1.71 1.17 1.51 1.36 1.34 1.03 1.24 1.06 1.12 1.15 1.57 1.38 1.4 1.39 1.31 1.5 1.1 1.45 1.34 1.38 1.35 1.49 1.58 1.25 1.42 1.64 1.57 1.53 1.18 1.39 1.34 1.13 1.23 1.17 1.88 1.3 1.27 1.01 1.63 1.47 1.23 1.48 1.48 1.37 1.42 1.22 1.47 1.31 1.05 1.61 1.41 1.17 1.45 1.43 1.22 1.4 1.14 1.53 1.25 1.02 1.3 1.35 1.37 1.69 1.37 1.11 1.3 1.05 1.19 1.36 1.63 1.44 1.29 1.35 1.59 1.94 1.51 1.78 1.37 1.11 1.38 1.53 1.44 1.47 1.39 1.55 1.76 1.43 1.37 1.67 1.36 1.31 1.41 1.36 1.26 1.17 1.15 1.79 1.46 1.35 1.29 1.5 1.26 1.36 1.41 1.36 1.32 1.08 1.28 1.33 1.29 1.42 1.5 1.32 1.39 1.2 1.68 1.2 1.35 1.56 1.57 1.37 1.27 1.25 1.38 1.56 1.6 1.74 1.4 1.11 1.6 1.21 1.44

It is not easy to make sense of these figures displayed in this form. What can we do to make them more intelligible? Perhaps the first thing which will occur to most of us is to calculate the mean (i.e. sample average) by adding all these values and dividing by 150. Doing this, we find that the mean weight of chickens in the sample is 1.3824 kg. How useful is this number? By itself, not very useful. For example, it does not allow us to draw the conclusion that "most of the chickens weighed about 1.38 kg".

Adding the information that the lightest chicken weighed 1.01 kg and the heaviest 1.94 kg, we might say that the range of the sample was 0.93 kg (1.94 - 1.01), with a mean weight of 1.3824 kg. However, this does not rule out the possibility that the weights were evenly spread throughout the range, or indeed that about half were at the low end and the remainder at the upper end of the range. In other words, we would like to know precisely how the values were distributed throughout the range. The simplest way to do this is to draw up a frequency table (see Table 2).

Table 2. Frequency table of the individual weights of 150 chickens.

| Grouped interval of chicken weights (kg) | Frequency (a) | Relative frequency (%) | Cumulative frequency (b) | Relative cumulative frequency (%) |
|---|---|---|---|---|
| 1.00-1.09 | 10 | 6.7 | 10 | 6.7 |
| 1.10-1.19 | 16 | 10.7 | 26 | 17.3 |
| 1.20-1.29 | 21 | 14.0 | 47 | 31.3 |
| 1.30-1.39 | 39 | 26.0 | 86 | 57.3 |
| 1.40-1.49 | 26 | 17.3 | 112 | 74.7 |
| 1.50-1.59 | 17 | 11.3 | 129 | 86.0 |
| 1.60-1.69 | 11 | 7.3 | 140 | 93.3 |
| 1.70-1.79 | 6 | 4.0 | 146 | 97.3 |
| 1.80-1.89 | 3 | 2.0 | 149 | 99.3 |
| 1.90-1.99 | 1 | 0.7 | 150 | 100.0 |

a Number of values in each interval.
b Cumulative number of values up to the end of a particular interval.

The relative frequencies (column 3) were obtained by dividing the number of values in each interval by the total number of chickens in the sample and converting the result to a percentage. For example, the relative frequency of the first interval is:

(10/150) x 100 = 6.7%

Looking down the column of relative frequencies we see that 17.3% of the sampled chickens weighed between 1.40 and 1.49 kg, and over half (57.3%) weighed between 1.20 and 1.49 kg. The cumulative and relative cumulative frequencies also given in the table are useful in answering questions about the extremes or tails of the distribution. For example, 17.3% of chickens in the sample weighed less than 1.20 kg and 14% (100 - 86) weighed at least 1.60 kg.
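The arithmetic behind Table 2 can be sketched in a few lines of Python. This is only an illustration: the variable names are our own, and the class frequencies are copied from the table rather than recounted from Table 1.

```python
# Class frequencies from Table 2, for the intervals 1.00-1.09 ... 1.90-1.99 kg.
counts = [10, 16, 21, 39, 26, 17, 11, 6, 3, 1]
n = sum(counts)  # total sample size

# Relative frequency (%) of each interval.
relative = [round(100 * c / n, 1) for c in counts]

# Cumulative frequency: running total up to the end of each interval.
cumulative = []
running = 0
for c in counts:
    running += c
    cumulative.append(running)

# Relative cumulative frequency (%).
relative_cumulative = [round(100 * c / n, 1) for c in cumulative]

# Tails of the distribution: below 1.20 kg (first two intervals)
# and at least 1.60 kg (everything after the sixth interval).
below_1_20 = relative_cumulative[1]
at_least_1_60 = round(100 - relative_cumulative[5], 1)

print(relative)
print(relative_cumulative)
print(below_1_20, at_least_1_60)
```

Reading the tails from the relative cumulative column, as done here, avoids adding up several rounded percentages.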

The information in Table 2 can also be presented as a graph (Figure 2). Frequency tables are often presented as special types of graphs called histograms.

Figure 2. Histogram of the frequency distribution of chicken weights from Table 1.

The area of each block in the histogram should be proportional to the relative frequency of the corresponding interval. Only when the class intervals are all of equal size, as in this case, will the height of each block be proportional to the frequency.

Measured to the nearest hundredth of a kilogram, the chicken weights ranged from 1.01 to 1.94 kg, i.e. there were 94 possible values in the range. If we had measured the weights to the nearest gram, there would have been roughly ten times as many possible values (about 930) in the range. In order to draw up a frequency table like Table 2, it is necessary to collapse the data into classes defined by intervals on the scale of measurement. Sometimes data can take only a limited range of values, and then it may be neither necessary nor desirable to group different values into the same classes. An example is Table 3, which gives the frequency of different parturitions in a herd of 153 cows.

Table 3. Frequency of different parturitions in a herd of 153 cows.

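| Parturition number | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Number of cows | 26 | 38 | 47 | 24 | 18 |
| Relative frequency (proportion) | 0.17 | 0.25 | 0.31 | 0.16 | 0.12 |
| Cumulative relative frequency (proportion) | 0.17 | 0.42 | 0.73 | 0.88 | 1.00 |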

It does not make sense to try to draw a histogram of this data set. Other possible methods of graphical presentation will be suggested below, though, in this case, the table is by itself a clear method of presenting the data.

We could use the data to calculate the mean number of parturitions -

[(26x0) + (38x1) + (47x2) + (24x3) + (18x4)]/153= 1.80

- but this is unlikely to be a useful piece of information unless we wanted to compare two different herds. Even then, it would be better to give the complete sets of parturition data for both herds.
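For readers who want to check the calculation, the mean of a frequency table is a weighted mean: each value is multiplied by the number of times it occurs. A minimal sketch, using the counts from Table 3 and variable names of our own choosing:

```python
# Parturition numbers and the number of cows recorded at each (Table 3).
parturitions = [0, 1, 2, 3, 4]
cows = [26, 38, 47, 24, 18]

total_cows = sum(cows)

# Weighted mean: sum of (value x frequency) divided by the total frequency.
mean_parturitions = sum(p * c for p, c in zip(parturitions, cows)) / total_cows

print(total_cows, round(mean_parturitions, 2))
```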

## 3.3 Bar and pie charts

Categorical data that take only two possible values are often referred to as dichotomous, and we will be interested mainly in the proportions belonging to each category. Note that the use of numerical labels for categorical variables may sometimes be confusing, but it does not deprive the latter of their categorical status. The important question is whether the numerical labels still behave as numbers in the usual sense.

This may be demonstrated by the following example. Three common causes of death in chickens are salmonellosis, coccidiosis and Newcastle disease, and their frequencies in a sample of 59 dead birds are shown in Table 4. For convenience of data storage, the causes of death (including a fourth category, "other") were given the code numbers 1, 2, 3 and 4, as shown in the table. However, these are not numbers in the usual sense. For example, we cannot say that 2 (coccidiosis) is greater than 1 (salmonellosis), and so on. They are just simpler versions of the original labels. It would therefore be meaningless to try to work out the mean of these coded data; the most we can do is to give tables of frequencies or percentages.

Table 4. Frequencies of causes of death in a sample of 59 chickens.

| Cause | Code | No. of deaths | Relative frequency (proportion) |
|---|---|---|---|
| Salmonellosis | (1) | 12 | 0.20 |
| Coccidiosis | (2) | 7 | 0.12 |
| Newcastle disease | (3) | 30 | 0.51 |
| Other | (4) | 10 | 0.17 |

As was pointed out, a histogram would not be a suitable means of presenting the data in Table 3, and this also applies to Table 4. The data in these tables can be presented graphically in either a bar chart or a pie chart. Figure 3 is a bar chart showing the relative frequencies of the different parturition values given in Table 3.

Notice the differences between a bar chart and a histogram: there should be a gap between adjacent bars in the bar chart to emphasise that the data can take only the discrete values actually marked on the horizontal axis, and each bar should have exactly the same width, with the height proportional to the relative frequency of the value over which it is centred.

Figure 3. Bar chart of parturition data from Table 3.

The data on chicken pathology (Table 4) can also be displayed in a bar chart (Figure 4). However, unlike in Figure 3 where the different parity values have the usual, natural ordering, in Figure 4 the order of the different "values", i.e. diseases, on the horizontal axis is arbitrary. Remember, when there is a natural order, it must be adhered to; when the data are categorical, any ordering may be chosen.

Figure 4. Bar chart of data on causes of death in chickens from Table 4.

Frequently, it may be helpful to present categorical data in a decreasing order of frequency, as was done in Figure 5.

Figure 5. Alternative bar chart of data from Table 4.

For purely categorical data, the pie chart is a common alternative to the bar chart. The pie chart is a circle divided into as many sectors as there are categories. The area of each sector is made proportional to the relative frequency of the corresponding category by calculating the angle which the sector makes at the centre of the circle. As the total of all the angles is 360°, we need only to divide the 360° in the correct proportions among the various categories to obtain the corresponding areas.

From Table 4 we know, for example, that the relative frequency of salmonellosis is 0.20. The corresponding angle is 360 x 0.20 = 72°. Similarly, the angles corresponding to coccidiosis and Newcastle disease are 43° and 184°, respectively, rounded to the nearest degree. The resulting pie chart is shown in Figure 6.
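The angle calculation can be sketched as follows. As in the text, the sketch starts from the rounded relative frequencies of Table 4; starting from the raw counts would shift some angles by about a degree. The dictionary names are illustrative only.

```python
# Rounded relative frequencies from Table 4 (proportions of 59 deaths).
relative_frequency = {
    "Salmonellosis": 0.20,
    "Coccidiosis": 0.12,
    "Newcastle disease": 0.51,
    "Other": 0.17,
}

# Each sector's angle is its share of the full 360 degrees,
# rounded to the nearest degree.
angles = {cause: round(360 * f) for cause, f in relative_frequency.items()}

for cause, angle in angles.items():
    print(f"{cause}: {angle} degrees")
```

It is worth checking that the rounded angles still add up to 360°, since rounding each sector separately can leave the total a degree short or over.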

Note that in histograms, pie charts and bar charts the sample size should always be quoted.

## 3.4 Classification by variable

All the examples discussed so far have involved observations of a single variable in a single population of animals. However, we may wish to subdivide a population into several subgroups in order to investigate possible differences between them. For example, cattle may be classified by sex, breed, geographic location, disease status etc. In epidemiological investigations, the classificatory variables will usually be categorical and will frequently be referred to as factors or determinants.

True numerical variables can also be used as classifying factors, either in the form of the values of the variable, if it takes only a small number of values, or class intervals. For example, each animal that provided data for Table 3 could be classified by its number of parturitions, thus dividing the sample into five groups, while the chickens whose weights are given in Table 1 could be divided into 10 distinct weight groups, using the class intervals of Table 2 to define the different levels of the factor "liveweight".

Figure 6. Pie chart of relative frequencies of causes of death in 59 chickens based on Table 4.

The choice of factors and the number of levels of each factor will depend on the degree of prior knowledge of the population to be studied, the expected scientific significance of the factors, and the measures available to the investigator. Table 5 is a contrived table displaying counts of ascaris infections in pigs according to three factors: the management system (two levels; raised indoors or outdoors), the occurrence of ascaris eggs in a sample of faeces from each pig (two levels; present or absent), and the degree of whitespot observed in the liver of each pig after slaughter (three levels; absent, slight or severe).

Table 5. Contrived table based on evidence of ascaris infection in pigs: An example of a three-factor table with marginal totals.

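| Whitespot | Ascaris eggs | Management system: Indoors | Management system: Outdoors | Any system |
|---|---|---|---|---|
| Absent | Absent | 503* | 112* | 615 |
| | Present | 141* | 38* | 179 |
| | Total | 644 | 150 | 794 |
| Slight | Absent | 231* | 75* | 306 |
| | Present | 87* | 30* | 117 |
| | Total | 318 | 105 | 423 |
| Severe | Absent | 79* | 32* | 111 |
| | Present | 71* | 17* | 88 |
| | Total | 150 | 49 | 199 |
| Any whitespot condition | Absent | 813 | 219 | 1032 |
| | Present | 299 | 85 | 384 |
| | Total | 1112 | 304 | 1416 |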

* Recorded data.

In any table, it is often useful to give the marginal totals i.e. to sum the counts over all the levels of the different factors. This makes it easier to extract any subtables that may be of interest, and the marginal tables are needed anyway for the analysis of the data (see Chapter 5). On the other hand, marginal totals can greatly increase the size of a table. In Table 5, for instance, only the values marked with an asterisk are strictly necessary, while the remaining entries (24 out of 36) give supplementary information. The use of marginal totals is a matter of personal judgement: in general, if it is thought that the complete table might confuse rather than clarify the issues, then the totals are better left out.
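To illustrate that the marginal totals carry no extra information, the sketch below rebuilds them from the 12 recorded counts alone. The helper `marginal` is a hypothetical function written for this example, not part of any library.

```python
# Recorded counts from Table 5: (whitespot, ascaris eggs, management) -> pigs.
recorded = {
    ("Absent", "Absent", "Indoors"): 503, ("Absent", "Absent", "Outdoors"): 112,
    ("Absent", "Present", "Indoors"): 141, ("Absent", "Present", "Outdoors"): 38,
    ("Slight", "Absent", "Indoors"): 231, ("Slight", "Absent", "Outdoors"): 75,
    ("Slight", "Present", "Indoors"): 87, ("Slight", "Present", "Outdoors"): 30,
    ("Severe", "Absent", "Indoors"): 79, ("Severe", "Absent", "Outdoors"): 32,
    ("Severe", "Present", "Indoors"): 71, ("Severe", "Present", "Outdoors"): 17,
}


def marginal(whitespot=None, eggs=None, system=None):
    """Sum the counts over every factor whose level is left as None."""
    return sum(
        n
        for (w, e, s), n in recorded.items()
        if whitespot in (None, w) and eggs in (None, e) and system in (None, s)
    )


print(marginal())                                   # all pigs
print(marginal(whitespot="Severe"))                 # severe whitespot, any eggs, any system
print(marginal(eggs="Present", system="Outdoors"))  # eggs present, outdoors, any whitespot
```

Fixing two factors and summing over the third gives any of the two-factor subtables in the same way.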

Table 6 shows one of the two-factor tables that can be derived from Table 5.

Table 6. Two-factor table derived from Table 5.

| Whitespot | Ascaris eggs: Absent | Ascaris eggs: Present | Total |
|---|---|---|---|
| Absent | 615 (59) (a) | 179 (47) | 794 (56) |
| Slight | 306 (30) | 117 (30) | 423 (30) |
| Severe | 111 (11) | 88 (23) | 199 (14) |
| Total | 1032 (100) | 384 (100) | 1416 (100) |

a Figures in parentheses give the relative frequencies (%) of whitespot conditions.

With multi-factor tables there are always several options for presenting relative frequencies. In Table 6, for example, the relative frequency of the different whitespot conditions is given for each level of the ascaris egg factor. Alternatively, the frequency of each level of ascaris eggs could be given relative to the totals within each level of whitespot severity, or the frequency of each of the six possible whitespot-ascaris egg combinations could be calculated relative to the total number of pigs in the sample. The option chosen will depend on the point that one wants to make, but the table should make it clear which relative frequencies are given. In interpreting tables presented by other investigators care should be taken to clarify which relative frequencies are being presented or discussed.
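The three options can be compared side by side with a short sketch built on the counts of Table 6. The variable names are our own, and the unrounded percentages may differ in the last digit from the whole-number figures printed in the table.

```python
# Counts from Table 6: whitespot level -> ascaris egg level -> number of pigs.
counts = {
    "Absent": {"Absent": 615, "Present": 179},
    "Slight": {"Absent": 306, "Present": 117},
    "Severe": {"Absent": 111, "Present": 88},
}
egg_levels = ["Absent", "Present"]

grand_total = sum(sum(row.values()) for row in counts.values())
egg_totals = {e: sum(row[e] for row in counts.values()) for e in egg_levels}

# Option 1: whitespot conditions relative to each egg level (as in Table 6).
within_eggs = {
    w: {e: 100 * row[e] / egg_totals[e] for e in egg_levels}
    for w, row in counts.items()
}

# Option 2: egg levels relative to each whitespot condition.
within_whitespot = {
    w: {e: 100 * row[e] / sum(row.values()) for e in egg_levels}
    for w, row in counts.items()
}

# Option 3: each combination relative to all pigs in the sample.
of_all_pigs = {
    w: {e: 100 * row[e] / grand_total for e in egg_levels}
    for w, row in counts.items()
}

cell = ("Severe", "Present")
print(round(within_eggs[cell[0]][cell[1]], 1))
print(round(within_whitespot[cell[0]][cell[1]], 1))
print(round(of_all_pigs[cell[0]][cell[1]], 1))
```

The same cell (88 pigs with severe whitespot and eggs present) yields three quite different percentages, which is why a table must state the base used.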

## 3.5 Quantification of disease events in populations

Data used to quantify disease events in populations are often dichotomous in nature i.e. an animal can either be infected with a disease agent or not infected. Such data are frequently presented in the form of an epidemiological rate.

In epidemiology, a rate can be defined as the number of individuals having or acquiring a particular characteristic (normally an infection, a disease or a characteristic associated with a disease) during a period of observation, divided by the total number of individuals at risk of having or acquiring that characteristic during the observation period. The expression is then multiplied by a factor, normally a multiple of 10, to relate it to a specified unit of population.

Rates are commonly expressed as decimals, percentages, or events per standard units of population e.g. per 1000, 10000 animals etc. This produces a standardised measure of disease occurrence and therefore allows comparisons of disease frequencies over time to be made between or within populations. Note that in a rate, the numerator is always included in the denominator, while in a ratio it is not included. In an epidemiological rate, the period of observation should always be defined.

It is difficult to make valid comparisons of disease events between or within populations unless a denominator can be calculated. The use of "dangling numerators" to make comparisons is one of the biggest "crimes" that the epidemiologist can commit, and it should be avoided whenever possible.

For example, suppose we were interested in comparing the numbers of cases of infection with a particular disease agent over a particular time period in two herds of cattle of the same breed but under different management systems. We are told that in herd A the number of animals infected with the disease agent in question in the month of June 1983 was 25, while in herd B the number of animals infected with the same disease agent in the same month was 50. We might therefore conclude, erroneously, that the disease was a greater problem in herd B than in herd A. Note that we did not know the denominator, i.e. the population of animals at risk of being infected with the disease agent in each herd. Suppose we investigated further and found that the population at risk in herd A during the month of June was 100, while in herd B it was 500. Then, calculating a rate for each herd, we find that the rate of infection in herd A was 25/100 or 0.25 or 25% or 250 in 1000, while in herd B it was 50/500 or 0.10 or 10% or 100 in 1000. The true position, therefore, is that the disease was a greater problem in herd A!
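The herd comparison can be reproduced with a small helper. The function `rate` and its `per` argument are hypothetical names used only for this sketch.

```python
def rate(cases, population_at_risk, per=1):
    """Cases divided by the population at risk, scaled to `per` animals."""
    if population_at_risk <= 0:
        raise ValueError("population at risk must be positive")
    return per * cases / population_at_risk


# June 1983: 25 cases among 100 animals at risk in herd A,
# 50 cases among 500 animals at risk in herd B.
herd_a = rate(25, 100, per=1000)
herd_b = rate(50, 500, per=1000)

print(f"Herd A: {herd_a:.0f} per 1000; herd B: {herd_b:.0f} per 1000")
```

Refusing to compute a rate without a positive denominator is one simple guard against the "dangling numerator".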

The two main types of rates used in veterinary epidemiology are:

· Morbidity rates, which are used to measure the proportion of affected individuals in a population or the risk of an individual in a population of becoming affected.

· Mortality rates, which measure the proportion of animals dying in a population.

### Morbidity rates

Morbidity rates include incidence, attack, prevalence and proportional morbidity rates.

Incidence rate is the number of new cases of a disease occurring in a specified population during a specified time period, divided by the average number of individuals in that population during the specified time period.

For example, suppose that out of an average population of 4000 cattle in a quarantine camp, 600 animals developed symptoms of rinderpest during the month of June. The incidence of rinderpest in that quarantine camp for the month of June was 600/4000 = 0.15 or 15% or 150 new cases per 1000 animals.

The incidence rate is a way of measuring the risk that a susceptible individual in a population has of contracting a disease during a specified time period. Therefore, if a susceptible animal had been introduced into the quarantine camp on 1 June, it would have had a 15% chance of contracting rinderpest by the end of the month.

When calculating incidence rates, problems frequently arise in estimating the denominator. Because of births, deaths, sales, movements etc. livestock populations rarely remain stable over periods of time, and such fluctuations in the denominator will obviously affect the calculation of the incidence rate. There are various ways of estimating the denominator in incidence rate calculations. These normally involve measuring the population at various intervals during the study period and averaging the results.

For instance, suppose that in our previous example there were 4000 animals present at the beginning of June but that 100 animals died of the disease by the end of the second week and a further 300 by the end of the month. Assuming that no new animals were introduced or born, the animal population in the quarantine camp at the start of the observation period was therefore 4000, at the mid-period 3900 and at the end 3600. We might decide to calculate the denominator by taking the populations present at the beginning and end of the observation period and averaging them:

(4000 + 3600)/2 = 3800

The corresponding incidence rate would be 600/3800 = 0.158 or 15.8%.

Alternatively, we might take the populations present at the beginning, middle and end of the observation period and average them -

(4000 + 3900 + 3600)/3 = 3833

- and the incidence rate in this case would be 600/3833 = 0.157 or 15.7%.
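The effect of the choice of denominator can be checked with a brief sketch. The helper `average_population` is an illustrative name; the figures are those of the quarantine camp example.

```python
new_cases = 600

# Animals present at the start, middle and end of June.
population_counts = {"start": 4000, "middle": 3900, "end": 3600}


def average_population(counts):
    """Simple average of population counts taken during the study period."""
    return sum(counts) / len(counts)


# Method 1: average of the start and end populations.
start_end = average_population(
    [population_counts["start"], population_counts["end"]]
)

# Method 2: average of the start, middle and end populations.
start_mid_end = average_population(list(population_counts.values()))

incidence_start_end = new_cases / start_end
incidence_start_mid_end = new_cases / start_mid_end

print(f"{start_end:.0f} animals -> incidence {incidence_start_end:.4f}")
print(f"{start_mid_end:.0f} animals -> incidence {incidence_start_mid_end:.4f}")
```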

Note that the different methods of calculating the denominator have resulted in slightly differing estimates of incidence. Because of this, the method used in calculating the denominator should always be specified when comparisons of incidence are being made, and the same method should be used throughout. Due to difficulties in the calculation of the denominator in incidence rates, another form of morbidity rate, the attack rate, is sometimes used.

The attack rate is the total number of cases of a disease occurring in a specified population during a specified time period, divided by the total number of individuals in that population at the start of the specified time period. The denominator, therefore, remains constant throughout the period of observation. Thus, in our previous example, the attack rate would be 600/4000 = 15%.

Strictly speaking, the definition of the attack rate requires that all cases of disease, not just new cases, are included in the numerator. Attack rates are normally used, however, to quantify the progress of a disease during an outbreak. In most instances there would have been no cases of the disease in question prior to the onset of the outbreak, so that all the cases are, in fact, new cases, and the attack rate becomes a modified form of incidence rate, sometimes referred to as a cumulative incidence rate.

Prevalence rate is the total number of cases of a disease occurring in a specified population at a particular point in time, divided by the total number of individuals in that population present at that point in time.

For example, suppose that in a population of 4000 cattle held at a quarantine camp there were 60 cases of rinderpest when the population was examined on 18 June. The prevalence of rinderpest at that camp on 18 June would then be 60/4000 = 0.015 or 1.5% or 15 cases per 1000 animals.

Note that prevalence is a cross-sectional measure referring to the amount of disease present in a population at a particular point in time, hence the term point prevalence. However, when dealing with large populations, point prevalence becomes almost impossible to obtain, since it is not possible to examine all the individuals in that population at a particular point in time. In general, therefore, measurements of prevalence have to take place over a period of time, and this is known as period prevalence. Provided that the time taken to measure the prevalence remains reasonably short, this parameter retains a fair degree of precision. If, however, the time interval becomes too long, a significant number of new cases of the disease will have occurred since the start of the measurement period. The parameter then becomes a mixture of point prevalence and incidence and, as such, loses precision.

The terms incidence and prevalence are frequently confused and misused. Confusion normally arises due to a failure to define accurately the denominator i.e. the actual population being considered. This can result in the population at risk being either ignored or not considered in its entirety.

Examples of this can be found in reports from veterinary offices and laboratories, in which the term "incidence" is often used to express the number of diagnoses or isolations of a particular disease agent as a percentage of the total number of diagnoses or isolations performed. In this case the denominator is not the population of individuals at risk from the disease, and the rate calculated resembles a form of proportional morbidity rate.

A proportional morbidity rate is the number of cases of a specific disease in a specified population during a specified time period, divided by the total number of cases of all diseases in that population during that time period.

For example, suppose that an outbreak of contagious bovine pleuropneumonia (CBPP) occurs in a herd of cattle. During a 6-month period there are 45 cases of different diseases, including 18 cases of CBPP. The proportional morbidity rate for CBPP in that herd for the 6 months would then be 18/45 = 0.4 or 40% or 400 cases of CBPP in 1000 cases of all diseases.
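The morbidity measures worked through above differ only in what is placed in the numerator and the denominator. A minimal sketch that gathers the examples in one place (the dictionary layout is our own):

```python
# (numerator, denominator) for each worked example in this section.
examples = {
    # all cases in June / animals present at the start of June
    "attack rate (rinderpest)": (600, 4000),
    # cases found on 18 June / animals present on 18 June
    "point prevalence (rinderpest)": (60, 4000),
    # CBPP cases / cases of all diseases over 6 months
    "proportional morbidity (CBPP)": (18, 45),
}

# Express each measure per 1000 units of its own denominator.
per_1000 = {
    name: round(1000 * numerator / denominator)
    for name, (numerator, denominator) in examples.items()
}

for name, value in per_1000.items():
    print(f"{name}: {value} per 1000")
```

Note that the last figure is per 1000 cases of disease, not per 1000 animals, so it cannot be compared directly with the other two.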

### Mortality rates

The most commonly used mortality rates are crude death rate and cause-specific death rate.

Crude death rate is the total number of deaths occurring in a specified population during a specified time period, divided by the average number of individuals in that population during the specified time period.

The denominator for this rate can be estimated in the same ways as that for an incidence rate. Note, the method of calculating the denominator should always be defined and the same method used throughout to enable meaningful comparisons to be made.

Example: Suppose that in a herd of cattle there were 40 deaths in a year. The number of animals in the herd at the start of the year was 400, at mid-year 420, and at the end of the year 390. The average herd size could therefore be either

(400 + 390)/2 = 395
or
(400 + 420 + 390)/3 = 403

Depending on which method we used to calculate the denominator, the crude death rate would be either 40/395 = 0.101 (10.1%) or 40/403 = 0.099 (9.9%).

Cause-specific death rate is a useful mortality rate and can be defined as the total number of deaths occurring from a specified cause in a specified population during a specified time period, divided by the average number of individuals in that population during that time period. The denominator is calculated in the same way as for an incidence or crude death rate, and the same caveats apply in its calculation.

Example: Suppose that there were 20 deaths from babesiosis in the herd mentioned above. The death rate due to babesiosis in that herd would then be either 20/395 = 0.051 (5.1%) or 20/403 = 0.050 (5.0%).

### Other useful mortality rates

Proportional mortality rate is the total number of deaths occurring from a specified disease in a specified population during a specified time period, divided by the total number of deaths in that population during that time period.

Example: Suppose that out of 40 deaths in a herd 20 were from babesiosis, then the proportional mortality rate due to that disease would be 20/40 = 0.5 or 50%.

Case fatality rate is the number of deaths from a specified disease in a specified population during a specified period, divided by the number of cases of that disease in that population during that time period.

Example: Suppose there were 50 cases of babesiosis in the herd. The case fatality rate due to babesiosis would then be 20/50 = 0.4 or 40%.
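The four mortality rates of the babesiosis example can be set out together as a sketch. The herd average here uses the start, mid-year and end counts; this is one of the two methods shown above, chosen purely for illustration.

```python
deaths_all_causes = 40
deaths_babesiosis = 20
cases_babesiosis = 50
herd_counts = [400, 420, 390]  # start of year, mid-year, end of year

# Denominator for crude and cause-specific rates: average herd size.
average_herd = sum(herd_counts) / len(herd_counts)

crude_death_rate = deaths_all_causes / average_herd
cause_specific_death_rate = deaths_babesiosis / average_herd

# These two rates use counts of deaths or cases as the denominator instead.
proportional_mortality = deaths_babesiosis / deaths_all_causes
case_fatality = deaths_babesiosis / cases_babesiosis

print(f"Crude death rate:          {crude_death_rate:.3f}")
print(f"Cause-specific death rate: {cause_specific_death_rate:.3f}")
print(f"Proportional mortality:    {proportional_mortality:.2f}")
print(f"Case fatality rate:        {case_fatality:.2f}")
```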

The rates described above are those that are most likely to be used in epidemiological studies in Africa. Details of other rates, how to calculate them, and their potential uses can be found in Schwabe et al. (1977).

### The use of specific rates

In epidemiology, we are nearly always involved in studying the effects of determinants on the frequency of occurrence of disease. This often involves comparing some of the rates mentioned previously. The comparison may be made within the same population over time (normally before and after a determinant is added or removed) or between populations (with or without an added determinant, or with different frequencies of occurrence of the determinant). Comparisons between populations may be made either at the same point in time or over a period of time.

For such comparisons to be valid, the comparison groups should differ from one another only in the presence, absence, or frequency of occurrence of the particular determinant being studied. Since epidemiology usually involves the study of determinants under uncontrolled field conditions, these criteria are extremely difficult to fulfil. Nevertheless, if rates are expressed in such a form as to ignore the different characteristics which may be present within the disease agents or host populations being compared, there is a danger that such rates may give an oversimplified and even false impression of the actual situation.

Rates can be made more specific, and the comparisons between them more valid, by taking into account various different characteristics. Differences in subspecies and strains of disease agents can be accounted for by clearly defining the subspecies or strain being studied and by making sure that only those individuals affected by that particular subspecies or strain are included in the numerator. Differences in the characteristics of host populations due to age, breed and sex can be expressed by calculating rates which take these specific characteristics into consideration.

Thus, for example, one could calculate an age-specific incidence rate, which is defined as the number of new cases of a disease occurring among individuals of a specified age group in a specified population during a specified time period, divided by the average number of individuals in that specified age group in that population during that time period. Alternatively, one could calculate a breed-specific incidence rate, which is defined as the number of new cases of a disease occurring among individuals of a specified breed in a specified population during a specified time period, divided by the average number of individuals of that breed in that population during that time period. One could go even further and calculate an age-breed specific incidence rate, which is defined as the number of new cases of a disease occurring among individuals in a specified age group of a specified breed in a specified population during a specified time period, divided by the average number of individuals of that age group and breed in that population during that time period.

The same procedures can be applied to other morbidity and mortality rates. A large variety of specific rates can thus be calculated by using appropriate definitions of the numerator and the denominator. As a general principle, rates should be made as specific as the data allow, but not so specific as to make the numbers involved too small for statistical analysis. For analytical purposes there is little or no advantage in calculating and comparing age- or breed-specific rates if an age-breed specific rate can be calculated.

The following is an example illustrating the advantages of using specific rates in making comparisons. Suppose we wished to assess the efficiency of a tick control programme in two East Coast fever (ECF) endemic areas, where the level of disease challenge, the environmental conditions and the systems of management were approximately the same. In area A there was an average population of 10 000 head of cattle present during a 1-month study period, and 500 animals from that population developed symptoms of ECF during that period. In area B there was an average population of 15 000 head of which 1500 developed symptoms of the disease during the study period. The crude incidence rate of the disease in area A was 500/10 000 = 5 % and in area B 1500/15 000 = 10%. We might conclude, therefore, that the tick control programme in area A was more efficient than in area B.

Suppose we also found that the cattle population in area A was made up of 400 crossbred Holsteins and 9600 East African Shorthorned Zebus, while that in area B consisted of 4500 crossbred Holsteins and 10 500 East African Shorthorned Zebus. We are now able to calculate breed-specific incidence rates as indicated in Table 7.

Table 7. Breed-specific incidence rates of East Coast fever in two cattle populations.

| Area | Breed | Number of cattle | Number of new cases of ECF | Incidence (%) |
|---|---|---|---|---|
| A | Crossbred Holstein | 400 | 97 | 24.3 |
| | East African Shorthorned Zebu | 9 600 | 403 | 4.2 |
| | Total | 10 000 | 500 | 5.0 |
| B | Crossbred Holstein | 4 500 | 1 059 | 23.5 |
| | East African Shorthorned Zebu | 10 500 | 441 | 4.2 |
| | Total | 15 000 | 1 500 | 10.0 |

Note that whereas the crude incidence rates remain 5% and 10% respectively, there is no difference in the breed-specific incidence rates for East African Shorthorned Zebus between the two areas and the rate for crossbred Holsteins is, if anything, less in area B than it is in area A. The difference in the crude incidence rates between the two areas is due to the fact that the much more susceptible crossbreds make up only 4% of the cattle population in area A whereas in area B they represent 30% of the cattle population.
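The comparison of crude and breed-specific rates can be reproduced with a short Python sketch. The counts are those given in Table 7; the names `table7`, `incidence_percent` and `summarise` are illustrative choices only, not part of any standard package.

```python
# Cattle counts and new ECF cases by area and breed, taken from Table 7.
# Each entry is (number of cattle, number of new cases).
table7 = {
    "A": {"Crossbred Holstein": (400, 97),
          "East African Shorthorned Zebu": (9600, 403)},
    "B": {"Crossbred Holstein": (4500, 1059),
          "East African Shorthorned Zebu": (10500, 441)},
}


def incidence_percent(cases, population):
    """Incidence over the study period, expressed as a percentage."""
    return 100.0 * cases / population


def summarise(area_data):
    """Return (breed-specific rates, crude rate, breed shares) for one area."""
    total_pop = sum(pop for pop, _ in area_data.values())
    total_cases = sum(cases for _, cases in area_data.values())
    specific = {b: incidence_percent(c, p) for b, (p, c) in area_data.items()}
    shares = {b: 100.0 * p / total_pop for b, (p, _) in area_data.items()}
    return specific, incidence_percent(total_cases, total_pop), shares


for area, data in table7.items():
    specific, crude, shares = summarise(data)
    print(f"Area {area}: crude incidence {crude:.2f}%")
    for breed in data:
        print(f"  {breed}: {specific[breed]:.2f}% "
              f"({shares[breed]:.0f}% of the population)")
```

Printing the share of each breed alongside its specific rate shows directly why the crude rates differ: the breed-specific rates are similar in the two areas, but the proportion of crossbreds is not.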

## 3.6 Methods of summarising numerical data

We have already discussed the (arithmetic) mean and noted that, by itself, the mean gives no indication of how the data are dispersed about the mean value. We resolved this problem by drawing a histogram, but graphical presentation may not always be convenient, and we might like to be able to reduce a data set to a few meaningful values.

At this stage, it is necessary to introduce some simple algebraic notation to express a set of data values. For example, we could refer to the data in Table 1 as $X_1, X_2, \ldots, X_{150}$, where $X_1 = 1.40$ and $X_{150} = 1.44$. If we wanted to refer to a more general data set without fixing the total number of values it contains, we could write $X_1, X_2, \ldots, X_n$, and say that the data contain n different values or observations. We will not always use the letter X; when we want to refer to different data sets in the same context, we will use a different letter for each set. The arithmetic mean for a given data set will be expressed by the appropriate letter with a bar over it. For example:

$$\bar{X} = \frac{X_1 + X_2 + \ldots + X_n}{n}$$

In statistics it is common to add sets of numbers together, and we shall use a special symbol to denote that operation, namely:

$$\sum_{i=1}^{n} X_i$$

which means the sum of all X's from i = 1 to i = n, i.e.:

$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \ldots + X_n$$

or often we just write $\sum X$ or $\sum X_i$.
For example, we can write $\bar{X} = \frac{1}{n} \sum X$.

We now return to our problem of looking for a way to describe the "scatter" of values about the mean value $\bar{X}$. It turns out, for a variety of reasons, that a convenient value is the standard deviation (S), calculated as follows:

$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

This formula says: "Find the distance of each individual value X from the mean, square that distance, and then find the average squared distance; finish by taking the square root of the average". (Strictly, the sum of the squared distances is divided by n - 1 rather than by n, which makes little difference unless n is small.) Many different formulae can be found in elementary books on statistics for calculating the standard deviation. The best solution is probably to buy a cheap calculator with this calculation built in. Alternatively, the following formula can be used:

$$S = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}}$$

This formula gives the same answer as the previous one but is easier to manipulate on a calculator. Using this formula, the standard deviation for the data in Table 1 was calculated as 0.1931.
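The agreement between the two ways of computing the standard deviation can be checked with a minimal Python sketch. It assumes the divisor n - 1, which is also what Python's `statistics.stdev` uses, and it takes only the first ten weights from Table 1 to keep the example short, so its result is not the value for the full table. The function names `sd_definition` and `sd_calculator` are illustrative.

```python
import math
import statistics

# First ten weights (kg) from Table 1; a subset used only to keep the sketch short.
weights = [1.40, 1.09, 1.74, 1.48, 1.82, 1.09, 1.52, 1.41, 1.83, 1.22]


def sd_definition(xs):
    """Square root of the sum of squared distances from the mean over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


def sd_calculator(xs):
    """Same quantity from the running totals sum(X) and sum(X^2) only."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


print(sd_definition(weights))
print(sd_calculator(weights))
print(statistics.stdev(weights))  # library value, for comparison
```

The second function needs only two running totals, which is why that form suits a hand calculator; with data whose spread is very small relative to the mean it can lose precision through subtraction, so the first form is the safer choice on a computer.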

There is a point to be made here about suitable levels of accuracy. A calculator may give S = 0.1930736, but this number has too many decimal places to be intelligible. About four significant figures is the maximum that will be absorbed by most readers of a paper or report, and many will notice only the first two.
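As a small illustration of reporting to a fixed number of significant figures, the following Python sketch formats the calculator value quoted above. The variable name `s` is an arbitrary choice.

```python
s = 0.1930736  # calculator output quoted in the text

four_sig = f"{s:.4g}"  # four significant figures
two_sig = f"{s:.2g}"   # two significant figures

print(four_sig)  # 0.1931
print(two_sig)   # 0.19
```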

How to make use of the pair of numbers $\bar{X}$ and S to grasp the main features of a data set will be explained later. One problem with the mean as an indicator of the "centre" of the data is that its value can be markedly affected by the presence of a few extreme values. Suppose, to take an exaggerated case, there are 20 farmers living in a village of whom 19 earn US\$ 1000 per annum and the twentieth earns US\$ 1 000 000 per annum. The average (i.e. per caput) earnings of the 20 individuals is almost US\$ 51 000 per annum, which is very misleading. Data with a few very large or very small values as compared to the remainder of the set are said to be skewed.

An indicator of the "centre" of data which is not affected in this way and which is therefore more likely to give a value typical of the whole data set is the median (m). This is a number so chosen that at least half the data have a value not smaller than m, and, simultaneously, at least half the data have a value not greater than m. The median value of the data in Table 1 is 1.37 kg, while for Table 3 the median parturition is 2. Of course, to discover the "middle" value in a set of data one has to write all the values in the correct order, and this can be time consuming unless it is done automatically by using a (micro)computer. In most practical contexts it will make little difference which of the indicators is used, and the mean is the most frequently chosen.
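The effect of a single extreme value on the mean, and its lack of effect on the median, can be seen by working through the village example in Python. This is a sketch using the standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Annual earnings (US$) of the 20 farmers in the village example:
# 19 earn 1000 and one earns 1 000 000.
earnings = [1000] * 19 + [1_000_000]

mean_earnings = statistics.mean(earnings)      # pulled up by the one extreme value
median_earnings = statistics.median(earnings)  # middle of the sorted values

print(mean_earnings)    # 50950
print(median_earnings)  # 1000.0
```

The median of 1000 describes the typical farmer, whereas the mean of 50 950 describes none of them.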