
Modelling systems

Modelling: A review of systems and approaches for vector-transmitted and other parasitic diseases in developing countries
Modelling disease on a geographical surface
Statistical modelling of georeferenced data: Mapping tsetse distributions in Zimbabwe using climate and vegetation data

Modelling: A review of systems and approaches for vector-transmitted and other parasitic diseases in developing countries

G. Gettinby*, C.W. Revie+ and A.J. Forsyth+

* Department of Statistics and Modelling Science
University of Strathclyde
Glasgow, United Kingdom

+ Department of Information Science
University of Strathclyde
Glasgow, United Kingdom

Analytical models
Deterministic simulation models
Chemical resistance
Biological indices
Stochastic simulation models
East coast fever modelling
Weather characterization
Tools for simulation
Benefits of simulation
Data models
Information models


During the last 30 years, modelling has increasingly played an important role in improving the efficiency of agricultural production systems. In the case of vector-borne diseases, most of the approaches have focused around calculus methods to describe the dynamics of the interactions of host and parasite, or empirical simulation models to describe the parasite/vector behaviour in relation to environmental conditions. This review gives a brief tour of various modelling techniques and their attributes. The techniques include the 'spherical cow' approach, analytical methods, simulation and statistical modelling. Within the agricultural community there is an expectation that models will become available as 'products' for a wide range of decision-making and problem-solving tasks. If the expectation is to be realized, the context in which models will be used and delivered will be important. The opportunity for creating information models that combine modelling techniques with the heuristic knowledge of experts and relevant disease information will be examined.


Extract from Assessment of Animal Agriculture in Sub-Saharan Africa, Executive Summary, Winrock International, 1992.

'Between 1990 and 2025, enormous demographic and social changes will sweep sub-Saharan Africa.... If livestock production grows no faster than it did between 1962 and 1987, the region will face massive deficits in supplies of meat and milk by 2025.

'The highest priority for animal health research is to develop sustainable means to prevent and control the environmentally related diseases including trypanosomiasis, theileriosis, anaplasmosis, babesiosis, cowdriosis and dermatophilosis.'

Measurement has played a key role in improving the efficiency of agricultural production systems in the developed countries. It has given birth to quantitative biology and stimulated the current interest in mathematical modelling as a basis for decision making and planning. Existing models of host-parasite behaviour raise the question as to how best they can serve the functions of sustaining and increasing food production in developing countries where parasitic diseases are a major constraint. Currently, there is a need for a greater understanding of what models can offer and how they can serve research programs.

In the case of vector-borne diseases most of the modelling approaches have focused around calculus methods to describe the dynamics of the interactions of the parasite/vector behaviour. This review briefly discusses some modelling techniques and their attributes. The techniques range from the 'spherical cow' approach popularized by Harte (1988) as an approach for environmental modelling, to the information models that combine traditional modelling techniques and the heuristic knowledge of experts within a disease management framework.

Analytical models

The 'spherical cow' modelling approach was used by Harte (1988) to illustrate that a great number of real problems could be solved using very simple approaches. In particular, the spherical cow refers to the case of a group of academics who when invited by a farmer to investigate milk production in cows presented their findings on the basis of each cow being a sphere. The key feature was the reduction of the problem to simple components. For example, a planner that replaces a national dairy herd of 1 million indigenous cattle with exotic cattle that produce 25% more milk will only require a national herd of size 0.8 million to produce the same amount of milk. However, if the prevalences of vector-borne disease in indigenous and exotic cattle are 10% and 15% respectively the number of cases will rise from 100,000 to 120,000. If the exotic cattle are capable of increasing milk production by 60% a national herd of size 0.625 million will be needed and the number of cases of disease will drop to 93,750. The evaluation of the increase in milk production needed from exotic cattle to produce the same amount of milk and not increase the number of cases of disease can be modelled by two equations:

Nindig L = Nexo L (1 + m)


Pindig Nindig = Pexo Nexo

which solve to give

m = Pexo/Pindig - 1

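where Nindig and Nexo denote the number of cattle in indigenous and exotic herds, Pindig and Pexo the prevalence rates of disease in indigenous and exotic cattle, L the average amount of milk produced by an indigenous cow and m the proportional increase in milk production by an exotic cow (so that a 25% increase corresponds to m = 0.25). With prevalences of 10% and 15%, for example, m = 0.5: exotic cattle must yield 50% more milk before the number of cases stops rising.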
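The arithmetic of the herd replacement example can be checked with a few lines of code. The following is a minimal sketch; the function names are chosen here for illustration and the inputs are the figures quoted in the text.

```python
def herd_after_replacement(n_indig, p_indig, p_exo, m):
    """Return (exotic herd size, cases before, cases after).

    n_indig : size of the indigenous herd
    p_indig : disease prevalence in indigenous cattle (proportion)
    p_exo   : disease prevalence in exotic cattle (proportion)
    m       : proportional increase in milk yield of an exotic cow
    """
    # Same total milk: Nindig * L = Nexo * L * (1 + m)
    n_exo = n_indig / (1.0 + m)
    return n_exo, p_indig * n_indig, p_exo * n_exo


def break_even_increase(p_indig, p_exo):
    """Yield increase m at which the number of cases is unchanged."""
    return p_exo / p_indig - 1.0


for m in (0.25, 0.60):
    n_exo, before, after = herd_after_replacement(1_000_000, 0.10, 0.15, m)
    print(f"m={m:.2f}: herd={n_exo:,.0f}, cases {before:,.0f} -> {after:,.0f}")

print("break-even m:", round(break_even_increase(0.10, 0.15), 3))
```

The two runs correspond to the 25% and 60% scenarios in the text, and the break-even value is the solution of the pair of equations.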

From this simple model more precise models can be built to take account of lactation, management practice, welfare and the cost of change. Often for vector-borne diseases such models lead to analytical methods and in particular the use of differential equations which model the instantaneous rate of change of a population or how population numbers change from one period to another. These often can be rearranged algebraically to give a solution or alternatively solved using computer methods. On the other hand they make very general assumptions about the behaviour of the vector, parasite and host populations. Good examples of the differential equation approach exist for trypanosomiasis (Milligan and Baker, 1988; Rogers, 1988) and for theileriosis (Medley et al., 1993), and the difference equation method was successfully used in the Garki Project to model the control of malaria in West Africa (Molineaux and Gramiccia, 1980). The recent work of Anderson and May (1991) provides a comprehensive treatment of the role of differential equations as models for infectious diseases of humans.

Other analytical methods do exist. Leslie (1945) (see Williamson, 1972 for applications) pioneered the use of matrix theory and Lewis (1977) the use of networks as methods for describing changes in populations divided into stadia. Both of these methods have been used to model the life cycles of parasites of domestic livestock (Gettinby and McClean, 1979; Paton and Gettinby, 1985; Gettinby et al., 1988).
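As an illustration of the matrix approach, the sketch below advances a population divided into stadia by one time step in the manner of a Leslie matrix: the first stage is replenished by births and each later stage by survivors from the stage before. The three stages and all rates are invented for illustration and are not taken from any of the cited parasite models.

```python
def leslie_step(population, fecundity, survival):
    """Advance a stage-structured population by one time step.

    population : numbers in each stage, youngest first
    fecundity  : offspring produced per individual in each stage
    survival   : proportion passing from stage i to stage i + 1
                 (one entry fewer than the number of stages)
    """
    births = sum(f * n for f, n in zip(fecundity, population))
    survivors = [s * n for s, n in zip(survival, population[:-1])]
    return [births] + survivors


# Hypothetical three-stage life cycle (values are assumptions).
population = [100, 50, 10]
for step in range(5):
    population = leslie_step(population, fecundity=[0, 2, 5], survival=[0.5, 0.2])
    print(step + 1, [round(n, 1) for n in population])
```

Repeated application of the step is equivalent to repeated multiplication by the Leslie matrix, so long-run growth is governed by its dominant eigenvalue.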


Simulation is a technique whereby a physical process can be mimicked using a model which preserves the essential features of the process. The model can be analogue, but more often it is abstract and expressed in mathematical terms. The purpose of simulation is to obtain results on the behaviour of complex processes. In addressing agricultural problems, simulation models have the obvious advantage that they enable large-scale processes which involve the relationship between man, plants, animals and the environment to be modelled. Such models are referred to as system simulations (France and Thornley, 1984). Results are normally obtained once data inputs have been given to the models and so the findings are specific and not intended to provide generalizations. In the past, this has been one reason why simulation models have not been widely adopted.

Deterministic simulation models

Simulation models appear in numerous forms. As with most modelling approaches, a simple classification is into deterministic and stochastic models. Deterministic simulations deal with models which do not take direct account of uncertainty that may occur within the physical system. The resistance of a species to chemical treatment and the suitability of species to geographical sites have been the subject of deterministic simulation models.

Chemical resistance

In an investigation into the evolution of resistance to insecticide, Georghiou and Taylor (1977a, 1977b) simulated the survival behaviour of an insect population from one generation to the next using the logistic equation:

N(t+1) = N(t)exp[r(K-N(t))/K]

where N(t) is the number in the population in generation t and N(t+1) the number in generation t+1. K is the parameter denoting the maximum size the insect population can attain and r is the growth rate parameter. Assuming a single locus model with two alleles R (for resistant) and S (for susceptible), separate simulations can be undertaken for insects with genotypes RR, RS and SS. The total number of insects to reach adult stage in each generation is then


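N(t) = W_RR N_RR(t) + W_RS N_RS(t) + W_SS N_SS(t)

where N_RR(t), N_RS(t) and N_SS(t) are the numbers of each genotype in generation t and the Ws take account of the different survival rates of insects under insecticidal challenge.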

For data inputs r, K and Ws, the model provides results for the population size and the R gene frequency in each generation, mimicking the real life development of a population of insects exposed to insecticidal treatment. By varying the basic model, the simulations provide information on the consequences of immigration, refugia and various insecticidal control regimes. In particular the work demonstrated that refugia, whereby insects avoid contact with the chemical due to sequestration within the plant etc., has a profound effect on slowing the rate of evolution of the R allele.
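A simulation of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions and not the published Georghiou and Taylor program: it assumes random mating (Hardy-Weinberg genotype proportions each generation), applies the growth equation above before selection, and treats a refuge as a fixed fraction of insects that escape treatment. The parameter values are invented.

```python
import math


def simulate_resistance(n0, q0, r, K, w, generations, refuge=0.0):
    """Track population size and R allele frequency over generations.

    n0, q0 : initial population size and R allele frequency
    r, K   : growth rate and maximum population size
    w      : survival under treatment for genotypes 'RR', 'RS', 'SS'
    refuge : fraction of insects escaping treatment (assumption)
    """
    n, q = float(n0), float(q0)
    history = [(n, q)]
    for _ in range(generations):
        # Density-dependent growth: N(t+1) = N(t) exp[r (K - N(t)) / K]
        n_grown = n * math.exp(r * (K - n) / K)
        # Genotype proportions under random mating (assumption)
        freq = {"RR": q * q, "RS": 2 * q * (1 - q), "SS": (1 - q) ** 2}
        # Insects in the refuge survive; the rest survive with probability w
        survivors = {
            g: n_grown * freq[g] * (refuge + (1 - refuge) * w[g])
            for g in freq
        }
        n = sum(survivors.values())
        q = (survivors["RR"] + 0.5 * survivors["RS"]) / n
        history.append((n, q))
    return history


W = {"RR": 1.0, "RS": 0.5, "SS": 0.1}  # hypothetical survival rates
exposed = simulate_resistance(1000, 0.01, 0.5, 10000, W, 10)
sheltered = simulate_resistance(1000, 0.01, 0.5, 10000, W, 10, refuge=0.5)
print("R frequency, no refuge:  ", round(exposed[-1][1], 3))
print("R frequency, 50% refuge: ", round(sheltered[-1][1], 3))
```

Comparing the two runs shows the qualitative effect described above: with a refuge the R allele frequency rises more slowly.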

Biological indices

An index which reflects the state of a biological system is one of the basic types of simulation models. Climatic indices which provide measures of the likelihood of the presence of species which vector disease have been sought ever since the study of epidemiology began. MacLeod (1932) made one of the earliest references to this approach, when it was noted that tick activity appeared to be related to weekly mean maximum air temperature. However, an index based on temperature alone has never proved to be an adequate indicator of the incidence of tick-transmitted diseases.

More recently, there have been more successful attempts. The Ecoclimatic Index (EI), calculated by the Climex model (Sutherst and Maywald, 1985) has proved useful in the study of the distribution of African ticks (Lessard et al., 1990; Perry et al., 1990). The index can be calculated for different geographical locations and consequently is particularly well suited for computer mapping. The index is the product of a growth index and a survival probability and takes the form

EI = [100 Σ(j=1..52) GIj / 52] × (1 - CS) (1 - DS) (1 - HS) (1 - WS)

where GIj is the growth index in week j, CS is the cold stress index, DS is the dry stress index, HS is the hot stress index and WS is the wet stress index. The growth index is calculated from temperature, moisture and day-length data, and the survival probability is calculated from the values of stress estimated for the species at a particular site. The index is a measure of the propensity of a particular species to exist at a particular site based on environmental factors.
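The calculation can be sketched as follows, reading the index as the product described in the text: the mean weekly growth index scaled to 100, multiplied by a survival probability formed from the four stress indices. The sketch assumes that growth and stress indices lie between 0 and 1. It is not the Climex software, and the input values are invented.

```python
def ecoclimatic_index(weekly_growth, cold, dry, hot, wet):
    """Ecoclimatic Index from 52 weekly growth indices and four stresses.

    weekly_growth : 52 weekly growth indices GIj, each in [0, 1]
    cold, dry, hot, wet : annual stress indices, each in [0, 1]
    """
    if len(weekly_growth) != 52:
        raise ValueError("expected one growth index per week of the year")
    mean_growth = sum(weekly_growth) / 52
    survival = (1 - cold) * (1 - dry) * (1 - hot) * (1 - wet)
    return 100 * mean_growth * survival


# Hypothetical site: moderate growth all year, some cold and dry stress.
print(ecoclimatic_index([0.5] * 52, cold=0.2, dry=0.1, hot=0.0, wet=0.0))
```

Because the index is evaluated site by site from climate inputs, the same function can be applied to every cell of a map.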

Bioclim (Busby, 1986), is another example of an index which can be used to simulate the distribution of a species or vegetation type which is influenced by climate. Unlike the EI index, Bioclim works on an induction principle. Species prediction is based on matching climate with sites where the species is known to exist. The method requires detailed and reliable meteorological surfaces.
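The matching principle can be illustrated with a much simplified climate envelope: record the range of each climate variable over sites where the species is known to occur, then accept a candidate site only if all its values fall inside those ranges. This is a sketch of the general idea and not the Bioclim procedure itself; the variable names and values are invented.

```python
def climate_envelope(known_sites):
    """Return {variable: (minimum, maximum)} over sites with the species."""
    variables = known_sites[0].keys()
    return {
        v: (min(s[v] for s in known_sites), max(s[v] for s in known_sites))
        for v in variables
    }


def is_suitable(site, envelope):
    """True if every climate variable at the site lies within the envelope."""
    return all(low <= site[v] <= high for v, (low, high) in envelope.items())


# Hypothetical records: mean annual temperature (C) and annual rainfall (mm).
known = [
    {"temp": 18.0, "rain": 900},
    {"temp": 22.0, "rain": 1200},
    {"temp": 20.5, "rain": 750},
]
envelope = climate_envelope(known)
print(is_suitable({"temp": 19.0, "rain": 1000}, envelope))
print(is_suitable({"temp": 27.0, "rain": 1000}, envelope))
```

The quality of the prediction depends entirely on the climate values supplied for each site, which is why detailed and reliable meteorological surfaces are required.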

Stochastic simulation models

Statements about expected performance usually consist of a single number or point estimate without any measure of confidence. Random variation is at the very centre of biological systems. Consequently, stochastic models can be important. These models allow for the complex interactions between systems such as the random movements of vectors that transmit disease to animals or the random occurrence of environmental mishaps.

East coast fever modelling

ECFXPERT (Byrom and Gettinby, 1992) is a systems stochastic model which deals with the biological details relevant to the transmission of the disease East Coast fever (ECF) to cattle from ticks under different environmental conditions. It is stochastic and random numbers generated within the computer are used to simulate day-to-day variations in the transmission of the disease and also to mimic the changes in daily temperatures which control the development of the tick population. The computer model contains four simulation models: a tick model, an ECF model, a dipping model and a chemotherapy model. The tick model is for the investigation of tick populations alone. The ECF model investigates the incidence of disease in cattle by modelling tick, parasite and herd interactions. The dipping and chemotherapy models look at the effect of tick and parasite control on the incidence of disease. An important aspect of these simulation models is that they are based on empirical data and expert findings and opinions extracted from the ECF literature, covering 80 years of research. The model is designed to be used by people from different disciplines in helping to answer differently motivated questions concerning ticks and disease. The model can also be used as a planning tool to determine effective research programs by performing computer experiments to assess the impact of findings and to assist with the design of experiments.

ECFXPERT contains comprehensive data on ticks and ECF, providing a learning facility for users with limited knowledge of the disease. On request, help messages appear in windows on the screen, and dictionary key words on which more information is available are highlighted. In a similar way a bibliography containing references to relevant literature and scientific papers can be accessed.

The ECFXPERT model has been used to examine the effects of year-to-year variation in climate and the effect of changing trends in annual temperatures on the distribution of attached ticks and disease incidence over a 20-year period. Recent findings using ECFXPERT suggest that, in the absence of wild hosts and other reservoir hosts for ticks, the infection dies out within a herd of cattle after several years if infected cattle develop sterile immunity. In contrast, if infected cattle are carriers the infection can remain within the herd indefinitely.

Weather characterization

The presence and severity of many diseases of animals is greatly influenced by climate, or more precisely weather. Diseases such as malaria, schistosomiasis, leishmaniasis, filariasis and trypanosomiasis depend on vectors such as mosquitoes, molluscs and flies for their transmission. These vectors have life cycles with periods of development and activity which are regulated by temperature and rainfall. Weather conditions are therefore one of the most important factors in determining short-term patterns of disease. Much work has been done on analysing historical meteorological records in an attempt to find parsimonious representations of the data (Thornthwaite, 1948). Using these representations, simulation methods can be employed to generate synthetic weather data representing the pattern typical for season and region (Richardson, 1985).

In the UK, meteorological parameters are normally summarized over a standard average period of 35 years by the Meteorological Office. This period is considered suitable to provide adequate estimates of short- and long-term trends. For many areas, averages and standard deviations of the daily maximum and minimum screen temperatures for each month are usually sufficient, as daily temperatures can be generated using random numbers from appropriate probability distributions. Changing trends in temperature can be predicted using time series methods.

However, the most important natural resource in agricultural ecosystems is water. Compared with temperature, it is precipitation which is the most variable, particularly in the tropics. In a study of dry spells, Stern et al. (1982a, 1982b) identified the proportion of dry spells in the month of July as being very different at Sholapur and Hyderabad in India, yet both areas have the same average rainfall and a rainy season which occurs between June and October. Models of rainfall have generally focused on first determining a probability model for whether or not a day is wet or dry, and then determining a probability model for the amount of rain falling on each wet day. For the former, Markov models which predict wet days depending on whether the previous days were wet or dry have been widely adopted (Gabriel and Neumann, 1962), whereas probability distributions such as the gamma curve have been used as sampling distributions for the amount of rainfall.
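A synthetic weather generator along these lines can be sketched as follows. Wet and dry days follow a two-state, first-order Markov chain, rainfall on wet days is drawn from a gamma distribution, and daily temperatures are drawn from a normal distribution with a monthly mean and standard deviation. The choice of a first-order chain and a normal distribution, and all parameter values, are assumptions made for illustration.

```python
import random


def simulate_rainfall(days, p_wet_after_dry, p_wet_after_wet,
                      shape, scale, seed=None, start_wet=False):
    """Daily rainfall amounts (mm); 0.0 denotes a dry day."""
    rng = random.Random(seed)
    wet = start_wet
    series = []
    for _ in range(days):
        # Probability of rain depends only on yesterday's state
        p_wet = p_wet_after_wet if wet else p_wet_after_dry
        wet = rng.random() < p_wet
        # Amount on a wet day sampled from a gamma distribution
        series.append(rng.gammavariate(shape, scale) if wet else 0.0)
    return series


def simulate_temperatures(days, mean, sd, seed=None):
    """Daily temperatures from a monthly mean and standard deviation."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(days)]


# Hypothetical July: wet days tend to follow wet days.
rain = simulate_rainfall(31, p_wet_after_dry=0.2, p_wet_after_wet=0.6,
                         shape=0.8, scale=12.0, seed=1)
tmax = simulate_temperatures(31, mean=29.0, sd=2.5, seed=1)
print("wet days:", sum(1 for x in rain if x > 0))
print("total rainfall (mm):", round(sum(rain), 1))
print("mean maximum temperature:", round(sum(tmax) / len(tmax), 1))
```

Two sites with the same average rainfall can have very different dry spell patterns in such a model, because the transition probabilities control how wet days cluster.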

Tools for simulation

Historically, the development of simulation has been closely aligned with the development of computers. In the case of stochastic simulations of systems, Monte Carlo techniques were first proposed around the middle of this century. Much of the effort has been and still is concentrated on how computers can best produce pseudo-random numbers using random number generators. Genuine random numbers are difficult to produce and pseudo-random numbers must pass various tests before being accepted as suitable approximations. There are few comprehensive textbooks on the technical aspects of simulation. One of the earliest treatments of random number generation and systems simulation is given by Tocher (1963). More recently, there has been a renewed interest in the subject as reflected by undergraduate texts by Morgan (1984) and Ripley (1987).

Applications of simulation are still widely written in fundamental programming languages such as Fortran, Pascal and more recently C. A systems simulation developed using these languages can take time and is not flexible. There have been many attempts to produce computer languages specifically for the purpose of simulation. Most of these have focused on the simulation of events in discrete time but these are not particularly useful for models of agricultural systems. Examples are GPSS, SIMSCRIPT and GASP, which simplify the tedious task of writing program code. There have been some attempts at constructing software for systems simulations. CSMP and DYNAMO are Fortran-based languages which can be used to solve equations used to describe the relationships between components of the system. In contrast, STELLA allows the user to specify the model in graphic form using different letters to denote interrelationships. These graphical relationships are converted into equations which are then simulated. The philosophy adopted in STELLA means that the process of specifying the model is simplified and no programming experience is needed to run simulations. Other recent developments in simulation have focused on object-oriented programming and the use of icons within models to depict the behaviour of a system, and to improve the graphical presentation of results.

Benefits of simulation

Simulations usually depend on the computer generation of random numbers to mimic environmental variation relevant to vector populations. By repeating the simulations using different patterns of random numbers, different results but with a similar pattern are obtained. This enables confidence intervals to be placed on findings. The use of simulation models as a means of undertaking computer experiments has become important. Computer experiments make use of existing information and they provide an alternative to expensive and prohibitive field studies.
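The idea of repeating a stochastic simulation to obtain an interval can be sketched as below. The toy model, in which a vector population grows by a random factor each week, is invented for illustration. The interval is a simple percentile interval over the replicate results, which is one of several ways such intervals can be formed.

```python
import math
import random


def toy_season(rng, weeks=20, start=100.0):
    """Hypothetical model: weekly growth rate varies at random."""
    size = start
    for _ in range(weeks):
        size *= math.exp(rng.gauss(0.05, 0.2))
    return size


def replicate(simulate, runs, seed=0):
    """Run the simulation repeatedly, each time with its own random stream."""
    master = random.Random(seed)
    return [simulate(random.Random(master.getrandbits(32))) for _ in range(runs)]


def percentile_interval(values, level=0.95):
    """Central interval containing roughly `level` of the replicate results."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    low = ordered[int(tail * (len(ordered) - 1))]
    high = ordered[int(round((1 - tail) * (len(ordered) - 1)))]
    return low, high


results = replicate(toy_season, runs=200, seed=1)
low, high = percentile_interval(results)
print(f"final population: 95% of runs between {low:.0f} and {high:.0f}")
```

Fixing the seed makes a computer experiment repeatable, while changing it gives a fresh set of replicates with a similar pattern.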

Computer simulation models will often generate results that are intuitively obvious or, more realistically, generate facts that should have been obvious. However, counter-intuitive results are one of the important benefits of a simulation model. Obtaining results which appear puzzling or which are not consistent with expectation often leads to a thought model. The thought model is a simplification of the simulation model in order that the counter-intuitive results can be explained.

Data models

Until the 1900s, agricultural systems evolved very slowly. The selection of animals that could resist disease and increase food production was based on the subjective preferences of breeders. It was not until the scientific advances of this century that the benefits of detailed recording and analysis became apparent. Simple statistical analyses were to reveal that increased productivity could be achieved from cattle by selection on the basis of liveweight gain, which had a high heritability, and not calving intervals which had a low heritability. During the last 20 years, new technologies centred around the collection, analysis and use of data have continued to improve the efficiency of agricultural production systems. Spreadsheet models have become an effective tool for the management of production resources and databases can be used as models or to support analytical and simulation models. The ILCA Bio-Economic Herd Model for microcomputers (IBIEHM) is an example of a spreadsheet model which predicts costings associated with herd dynamics (von Kaufmann et al., 1990). The program is designed to interface factors such as milk offtake and liveweight gain of animals with costings and so produce balance sheets. In particular, it compares the performance of a herd over a period of years with the performance which might be expected with intervention. The modus operandi of the model requires any intervention to be expressed in terms of its potential effect on liveweight gain, milk production, etc. The model generates yearly predictions and requires a careful specification of all terms relevant to herd structure and details of all associated costs such as fodder production.

Databases represent an important source of knowledge and their use has become prominent due to inexpensive storage methods and ease of access. Geographical information systems of vector-borne diseases have been at the vanguard of database models. Traditionally, data have been considered the domain of researchers for the purpose of statistical analysis and the identification of 'significant' differences. However, statistical analysis is a form of model building which can and should provide a statistical model. The data reduction methods now widely available with General Linear Model procedures in statistical software packages make it possible to construct regression and logistic regression models that describe the relationship between states of disease and environmental factors. This approach has been the subject of analyses of the ILRAD Tick Unit database where there is evidence that linear models of animal and tick characteristics can help predict the pattern of disease within an animal once infection has taken place.
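To make the idea of a statistical model concrete, the sketch below fits a logistic regression relating a disease state (0 or 1) to a single environmental or animal factor. It uses plain gradient ascent on the log-likelihood instead of the routines found in statistical packages, and the data are invented for illustration; they are not from the ILRAD Tick Unit database.

```python
import math


def predict(b0, b1, x):
    """Probability of disease for predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def fit_logistic(xs, ys, rate=0.05, steps=20000):
    """Fit intercept b0 and slope b1 by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            error = y - predict(b0, b1, x)
            g0 += error
            g1 += error * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1


# Hypothetical records: predictor score and whether disease was observed.
score = [1, 2, 3, 4, 5, 6, 7, 8]
diseased = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(score, diseased)
print("slope:", round(b1, 2))
print("P(disease | score = 8):", round(predict(b0, b1, 8), 2))
```

A positive slope indicates that the probability of disease rises with the predictor, which is the kind of relationship such models are used to describe.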

Alternatively, databases can simply be interrogated and inferences based on observed frequencies of events. This makes the assumption that historical findings are a good predictor of future behaviour which is often not unreasonable for short-term predictions of disease patterns.

Information models

Vector-borne diseases are complex and their study occupies a large and diverse community of people. The development of models to serve the needs of planners and decision makers must recognize the cross-disciplinary nature of the issues involved. This means that results from any model must be interpreted in the context of the problem under consideration.

Trypanosomiasis is unique amongst vector-borne diseases because of its diversity and complexity. At least seven species of trypanosomes are known to be pathogenic to man, domestic animals and wildlife. These species can infect over 30 species and sub-species of flies and often a host or vector will have concurrent infections. The tsetse vector survives in habitats ranging from dry savanna to humid forest. Control of the disease in cattle can be attempted using trypanocidal drugs, trypanotolerant cattle or tsetse population reduction. Each method of control has environmental implications. A substantial body of research literature has been generated for trypanosomiasis and this desperately needs to be brought together in a fashion that can make it useful. A number of analytical mathematical models exist which attempt to provide insight into disease transmission and control (Milligan and Baker, 1988; Rogers, 1988).

As with most vector-borne diseases, people and text are the two traditional knowledge sources in trypanosomiasis. People are the original source and text is used to capture knowledge in a fixed reliable form which is widely available. The knowledge found in text and used by people to solve problems can be represented using the 'production rule' approach to expert systems whereby knowledge is expressed in a series of if . . . then rules.
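The production rule approach can be illustrated with a very small forward-chaining interpreter: each rule lists the facts that must hold and the conclusion that then follows, and rules are applied repeatedly until nothing new can be concluded. The rules below are hypothetical examples written for this sketch. They are not taken from any expert system described here and are not veterinary guidance.

```python
# Each rule: (set of conditions that must all hold, conclusion to add).
RULES = [
    ({"anaemia", "tsetse_area"}, "suspect_trypanosomiasis"),
    ({"suspect_trypanosomiasis", "parasites_in_blood_smear"},
     "trypanosomiasis_confirmed"),
    ({"trypanosomiasis_confirmed"}, "consider_drug_treatment"),
]


def forward_chain(facts, rules):
    """Apply if-then rules until no further conclusions can be drawn."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


observed = {"anaemia", "tsetse_area", "parasites_in_blood_smear"}
print(sorted(forward_chain(observed, RULES) - observed))
```

Keeping the rules as data, separate from the interpreter, is what allows knowledge to be added or revised without reprogramming.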

In addition, the information within text can be organized to be read in non-linear fashion using hypertext. Data found within literature forms the basis for mathematical models. By combining expert systems, hypertext and modelling as illustrated in Figure 1, there is the potential to construct information models which are more effective than the use of each approach in isolation.

This philosophy has been adopted in developing an experimental Trypanosomiasis Information System (Forsyth et al., 1992) for use on personal computers. As illustrated in Figure 2, the Trypanosomiasis Information System primarily consists of browse, literature sources and expert systems. The literature sources represent key papers which have hypertext links so that the user may easily move from one piece of text to another. This provides the user with a quick learning environment. Two expert systems are illustrated: one advises on which traps and targets might be appropriate for controlling tsetse populations, and the other advises on the diagnosis of trypanosomiasis in cattle and recommends viable drug treatment. The latter expert system includes the use of a geographical database for the Berenil Index. Although such a database does not yet exist, it serves to illustrate the way in which results from analytical and data models could be used within the problem-solving context. As illustrated in Figure 3, modelling could be delivered as an integral part of the information system software or alternatively external modelling routines could be called. Throughout the expert system's question-and-answer session, the hypertext links are fully accessible.

Figure 1. The use of traditional knowledge sources within information models.

Figure 2. User options within an experimental trypanosomiasis information system.

Figure 3. The potential role of mathematical models and databases within information models for vector-borne diseases. Source: After a model originally proposed in A.J. Forsyth, A Hybrid Information System for Animal Trypanosomiasis, MSc thesis, Department of Information Science, University of Strathclyde, 1991.

Models that serve a range of agricultural research areas are gradually becoming a reality. A UK register of agricultural models exists (Squire and Hamer, 1990) and the Australian Bureau of Rural Resources has recently devoted several issues of Agricultural Systems and Information Technology newsletters to special issues on livestock models (see Agricultural Systems and Information Technology on Sheep Industry Software 4 (No. 1), May 1992, and Animal Health 5, (No. 1), September 1993).


ANDERSON, R.M. and MAY, R.M. 1991. Infectious Diseases of Humans: Dynamics and Control. Oxford: Oxford University Press, 757 pp.

BUSBY, J.R. 1986. A bioclimatic analysis of Nothofagus cunninghamii (Hook.) Oerst. in southeastern Australia. Australian Journal of Ecology 11: 1-7.

BYROM, W. and GETTINBY, G. 1992. Using the computer model ECFXPERT to study the control of ticks and East Coast fever. Insect Science and its Applications 13: 527-535.

FORSYTH, A.J., GETTINBY, G. and REVIE, C.W. 1992. Integrating hypertext and expert systems: a hybrid information system for the domain of animal trypanosomiasis. In: Weckert, John and McDonald, Craig, eds. Intelligent Library Systems. Riverina: Centre for Information Studies, Charles Sturt University, pp. 175-197.

FRANCE, J. and THORNLEY, J.H.M. 1984. Mathematical Models in Agriculture. London: Butterworths, 335 pp.

GABRIEL, K.R. and NEUMANN, J. 1962. A Markov chain model for daily rainfall occurrence at Tel Aviv. Quarterly Journal of the Royal Meteorological Society 88: 90-95.

GEORGHIOU, G.P. and TAYLOR, C.E. 1977a. Genetic and biological influences in the evolution of insecticide resistance. Journal of Economic Entomology 70: 319-323.

GEORGHIOU, G.P. and TAYLOR, C.E. 1977b. Operational influences in the evolution of insecticide resistance. Journal of Economic Entomology 70: 653-658.

GETTINBY, G. and McCLEAN, S. 1979. A matrix formulation of the life cycle of liver fluke. Proceedings of the Royal Irish Academy 79B: 155-167.

GETTINBY, G., NEWSON, R.M., CALPIN, M.M.J. and PATON, G. 1988. A simulation model for genetic resistance to acaricides in the African brown ear tick, Rhipicephalus appendiculatus (Acarina: Ixodidae). Preventive Veterinary Medicine 6: 183-197.

HARTE, J. 1988. Consider a Spherical Cow: A Course in Environmental Problem Solving. Mill Valley, California: University Science Books, 283 pp.

LESLIE, P.H. 1945. On the use of matrices in certain population mathematics. Biometrika 33: 183-212.

LESSARD, P., L'EPLATTENIER, R., NORVAL, R.A.I., KUNDERT, K., DOLAN, T.T., CROZE, H., WALKER, B., IRVIN, A.D. and PERRY, B.D. 1990. Geographical information systems for studying the epidemiology of cattle diseases caused by Theileria parva. Veterinary Record 126: 255-262.

LEWIS, E.R. 1977. Network Models in Population Biology. Berlin: Springer-Verlag, 402 pp.

MacLEOD, J. 1932. The bionomics of Ixodes ricinus L., the 'sheep tick' of Scotland. Parasitology 24: 382-400.

MEDLEY, G.F., PERRY, B.D. and YOUNG, A.S. 1993. Preliminary analysis of the transmission dynamics of theileriosis in eastern Africa. Parasitology 106: 251-264.

MILLIGAN, P.J.M. and BAKER, R.D. 1988. A model of tsetse transmitted trypanosomiasis. Parasitology 96: 211-239.

MOLINEAUX, L. and GRAMICCIA, G. 1980. The Garki Project: Research on the Epidemiology and Control of Malaria in the Sudan Savanna of West Africa. Geneva: World Health Organisation, 311 pp.

MORGAN, B. 1984. Elements of Simulation. New York: Chapman and Hall, 351 pp.

PATON, G. and GETTINBY, G. 1985. Comparing control strategies for parasitic gastro-enteritis in lambs grazed on previously contaminated pasture: a network modelling approach. Preventive Veterinary Medicine 3: 301-310.

PERRY, B.D., LESSARD, P., NORVAL, R.A.I., KUNDERT, K. and KRUSKA, R. 1990. Climate, vegetation and the distribution of Rhipicephalus appendiculatus in Africa. Parasitology Today 6: 100-104.

RICHARDSON, C.W. 1985. Weather simulation for crop models. Transactions of American Society of Agricultural Engineering 28: 1602-1606.

RIPLEY, B. 1987. Stochastic Simulation. New York: Wiley, 237 pp.

ROGERS, D.J. 1988. A general model for the African trypanosomiasis. In: de Muynck, A. and Rogers, D.J., eds. Proceedings of Workshop on Modelling Sleeping Sickness Epidemiology and Control, Prince Leopold Institute of Tropical Medicine, Held in Antwerp, 25-29 January. Annales de la Societe Belge de Medecine Tropicale 69 (Supplement 1): 73-88.

SQUIRE, G.R. and HAMER, P.J.C. 1990. United Kingdom Register of Agricultural Models. Bedford: Agricultural and Food Research Council, Institute of Engineering Research, 92 pp.

STERN, R.D., DENNETT, M.D. and DALE, I.C. 1982a. Methods for analyzing daily rainfall measurements to give useful agronomic results. I. Direct methods. Experimental Agriculture 18: 223-236.

STERN, R.D., DENNETT, M.D. and DALE, I.C. 1982b. Methods for analyzing daily rainfall measurements to give useful agronomic results. II. A modelling approach. Experimental Agriculture 18: 237-253.

SUTHERST, R.W. and MAYWALD, G.F. 1985. A computerized system for matching climates in ecology. Agriculture, Ecosystems and Environment 13: 281-299.

THORNTHWAITE, C.W. 1948. An approach towards a rational classification of climate. Geographical Review 38: 55-94.

TOCHER, K.D. 1963. The Art of Simulation. London: English Universities Press Ltd., 184 pp.

von KAUFMANN, R., McINTYRE, J. and ITTY, R. 1990. ILCA Bio-Economic Herd Model (IBIEHM) for Microcomputers. Addis Ababa: International Livestock Centre for Africa.

WILLIAMSON, M. 1972. The Analysis of Biological Populations. London: Arnold, 180 pp.

Modelling disease on a geographical surface

R.S. Morris, R.L. Sanson, D.U. Pfeiffer, M.W. Stern and B.M. Butler

Department of Veterinary Clinical Sciences
Massey University
Palmerston North, New Zealand

Mechanisms of representing spatial aspects within a model
Example models
Potential application to African trypanosomiasis


Until recently, models of animal diseases concentrated almost exclusively on trends over time and on differences in model predictions between various categories of animals within the modelling process. Despite the obvious importance of spatial aspects in many diseases, representation of these issues was either absent, or included in some simplified form. The reason for this was simple enough. There were no satisfactory ways of representing geography in a model without (for example) constructing a matrix which represented a grid of locations of interest and defining a vector of variables for each grid point that contained the values which would control the spatial variability of the model. While we have used this approach successfully, it is cumbersome and such models would run far too slowly unless various tricks were used to reduce processing time. Moreover it was difficult to avoid the models being somewhat artificial in the way they handled spatial issues.

It is now becoming quite practical to represent these spatial aspects much more effectively, as a result of two developments. Firstly, computer hardware development has of course been extremely rapid. Arising in part from the hardware developments, the second step has been the growth of software capable of representing spatial issues in ways which enable modelling to incorporate geographical aspects of disease realistically without imposing unworkable demands either on data inputs or on processing time. Within the spectrum of such software developments, the highest level programs are the true geographical information systems (GIS), which not only represent physical relationships among different locations accurately (i.e. contain topology), but more importantly represent different attributes of each location within separate coverages or layers of the GIS in such a way that information about each feature of the landscape can be kept separate, yet interrelated in whatever ways may be required for modelling purposes. The two main methods of representation in a GIS, vector and raster (grid), each have their merits for modelling purposes, and we choose whichever suits our needs best for a particular model. The distinction is in any case gradually disappearing as the more advanced GIS programs take up characteristics of both systems. Examples of various forms of geographical modelling are provided in the paper.


Until recently, models of animal diseases concentrated almost exclusively on trends over time and on differences in model predictions between various categories of animals within the modelling process. Despite the obvious importance of spatial aspects in many diseases, representation of these issues was either absent, or included in some simplified form. The reason for this was clear enough. There were no satisfactory ways of representing geography in a model without stylizing the spatial information in some way, such as considering all points to lie on a regular grid and relating variables to each of the fixed grid positions. While we have used this approach successfully, it is cumbersome and such models would run far too slowly to be of practical value for large areas unless various tricks were used to reduce processing time. Moreover it was difficult to avoid the models being somewhat artificial in the way they handled spatial issues.

It is now becoming quite practical to represent these spatial aspects much more effectively, as a result of developments in computer hardware and software. Not only are processor speeds much faster and memory capacity far higher (so that data can be handled in RAM during a simulation rather than constantly writing to and from disk), but just as importantly the move to graphical screen management rather than text-based screen presentation has allowed far greater realism in handling spatial issues both within the model and in visual presentation of findings.

Arising in part from the hardware developments, there has also been a growth of software capable of representing spatial issues in ways which enable modelling to incorporate geographical aspects of disease realistically without imposing unworkable demands either on data inputs or on processing time.

Mechanisms of representing spatial aspects within a model

Within the spectrum of software developments, the highest level programs and the most useful are the true geographical information systems (GIS), which not only represent physical relationships among different locations accurately (i.e. contain topology), but more importantly represent different attributes of each location within separate coverages or layers of the GIS in such a way that information about each feature of the landscape can be kept separate, yet interrelated to other information about the same location in whatever ways may be required for modelling purposes. This makes it very efficient to link a dynamic modelling process to geographical information, drawing on location information as a factor in the dynamics of the disease process, and writing results back to the GIS so that it can represent them visually.

The two main methods of representation in a GIS, vector and raster (grid), each have their merits for modelling purposes, and we choose whichever suits our needs best for a particular model. The distinction is in any case gradually disappearing as the more advanced GIS programs take up characteristics of both systems. Vector representation within a GIS means that location data is stored as points, lines and polygons, and that any point or area can be given characteristics which accurately represent its status with regard to a set of variables of interest. Thus if it is necessary in a model to represent true geographical boundaries (between farms, provinces or countries, for example) then a vector approach is necessary. In a raster system information is stored in relation to units within a grid structure, and a single unit within the grid must have a single value for any particular feature. Although in some raster systems the grid unit can vary in size across a particular map area, the ability to accurately represent boundaries is sacrificed for simpler representation which reduces data storage and processing requirements. Many models are well suited to a raster representation either directly or with some adjustment, and the output can be mapped if necessary to a vector base map. Models which use a true GIS can in principle be transferred from one geographical area to any other area for which the required input values to the model are available.
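The contrast between the two representations can be illustrated with a minimal sketch. It assumes a hypothetical triangular farm boundary (coordinates in kilometres) and a 1 km grid; the function names are illustrative and do not belong to any particular GIS. The vector form keeps the exact boundary and answers "is this point inside?" directly, while the raster form assigns one value to each cell, here sampled at the cell centre.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon,
    given as a list of (x, y) vertices (the vector representation)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, n_cols, n_rows, cell_size):
    """Raster representation: each grid cell holds a single value,
    taken here from the position of the cell centre."""
    return [
        [1 if point_in_polygon((c + 0.5) * cell_size,
                               (r + 0.5) * cell_size, polygon) else 0
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Hypothetical farm boundary; the true area is 0.5 * 4.2 * 4.2 = 8.82 km^2.
farm = [(0.0, 0.0), (4.2, 0.0), (0.0, 4.2)]
grid = rasterize(farm, n_cols=4, n_rows=4, cell_size=1.0)
raster_area = sum(map(sum, grid)) * 1.0 ** 2  # area of cells flagged as inside
```

In this example the raster flags ten 1 km cells, so the diagonal boundary is only approximated, which is the trade-off between boundary accuracy and simpler storage described above.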

The choice of GIS software to use depends in part on the availability of a method of bridging between the geographical data and the modelling process, which is better developed in some systems than others. In certain cases it is even possible to model within the GIS, rather than by drawing upon data in the GIS to influence the operation of an external model.

Below the level of the full GIS, it is possible to buy or create programs which have sufficient of the attributes of a GIS to enable models to interact with them to create a geographic representation of a disease process without the high cost of buying a GIS. Such models do not however have inherent capacity to be 'moved' to different locations without the user personally capturing the necessary structural information for the new location.

Example models

The principles of geographical modelling can best be explained by describing three example models from our work, which between them span the spectrum of these characteristics.


This is a model of a mixed livestock grazing farm in a temperate climate which can be used to evaluate management and disease control strategies. Feed intake, metabolism and productivity of each species and category of livestock are modelled on a pasture-based grazing system, with pasture supply and regrowth being calculated in a sub-model. In order to do all of this the model must 'understand' each paddock, and have information on such items as its soil type, available soil nutrient levels, aspect and slope. To achieve this, a scanned image of the farm map is used to create a quasi-geographical representation within the program of each of the paddocks on the farm, so that the model can interpret a paddock rotation system effectively for that farm (or any other farm) and can manage the farm in a sensible fashion using information supplied about the pasture growth capacity of each area.

When a farmer creates the image of his farm in the model, he can then provide information about each paddock which can be stored with a link to the map 'paddock'. This is not a true GIS but it appears to the user to have geographical information in it, and the model takes proper account of the geography of the farm in conducting the simulation. However it does this by storing information for each paddock in a database which simply understands each paddock as a management unit, but does not understand that paddock 3 is north-west of paddock 1 and contiguous with it. Such information is not necessary for a model of this kind, and would only add superfluous detail.


PossPOP is a model of the Australian brushtail possum in New Zealand, where it has become a wildlife reservoir for bovine tuberculosis. The modelling work is part of a larger study of the epidemiology of the disease, which provides the model parameters. The model needs to have substantial ecological content in order to accurately represent the wildlife population, and will have information on the associated populations of farmed cattle and deer, to individual farm level.

It is still under development, and when complete will have a three-level nested structure. The lowest level is a single habitat type, and for New Zealand three habitat types have been identified as adequately representing the range of environments for possum population modelling at this micro-scale. As in any model formulation, this requires a compromise between the maximum achievable degree of realism and issues such as model execution time and availability of study findings which can be used to set parameters in the model. Within each of these habitat types a model of possum population dynamics and the epidemiology of disease can be run, varying the parameters to suit data for the specific environment, since each habitat supports different possum densities and influences the ecology of the particular possum populations.

Time patterns of ecological and epidemiological indices derived for each of the individual habitats will be used to predict behaviour in a larger habitat mosaic comprising a mix of habitats as derived from vegetation maps of the country, and at this level the model will be linked to data on farm boundaries, so that interactions between domestic livestock and the wildlife reservoir can be realistically considered in the modelling process. In addition, features which only become important at these larger scales, such as dispersal of older juvenile animals to distant locations, can be represented realistically at this meso-scale level of aggregation.

In order to predict the epidemiology of the disease at a regional level (thousands of square kilometres), output from representative habitat mosaic models will be fed to a macro-scale model in which major topographical features such as rivers and mountain ranges can influence the effectiveness of control policies. Within this format the cost-effectiveness of various control policies can be assessed in simulations covering 20 to 30 simulated years. Through the use of the hierarchical modelling approach the speed of the model can be kept quite fast while it maintains an adequate approximation to field reality.

The final version will have full linkage to the relevant GIS data on topography, vegetation type and other issues. Thus the model will be truly 'transportable' in that, if data on these features are available for another area, the model is designed to accept and operate with the new data. Model output will simply be treated as attribute data in a database file, equivalent to the physical attribute data used as input. It will therefore be possible to map expected infection prevalence or possum populations in space as well as in time, to show predicted trends under the influence of alternative control strategies.

The current version of PossPOP models the disease on a single habitat and also on a habitat mosaic covering 400 hectares, with the first model feeding data to the second. It is possible to examine tuberculosis control strategies at farm level using this approach, and it has allowed the solving of most of the major technical problems in formulating truly geographical models of populations. No insurmountable problems are seen in extending the approach to much larger areas, given the current capacity of GIS programs to provide efficient access to location and attribute data.


EpiMAN is a comprehensive decision support system designed for the emergency control of foot-and-mouth disease (FMD) and other exotic animal diseases, should they ever enter New Zealand (Morris et al., 1992). It comprises a database, geographical information system, expert system elements, and models for each of the major mechanisms of spread of foot-and-mouth disease. The model for airborne spread of FMD virus calculates the quantity of virus which would be produced by affected animals on an outbreak farm, then uses a meteorological air flow model to predict the concentration of virus at various distances downwind from the outbreak site, currently using a Gaussian dispersion model for the purpose. This is then overlaid on the GIS farm map and farms at risk of being exposed to virus are identified automatically by the GIS as those lying under the plume, differentiating those farms holding animals of various species which lie under a cattle-infective dose of virus from those which lie under the much higher sheep-infective dose. This ability to identify areas in one coverage (layer) of the GIS which are matched spatially to areas in other coverages and to the attributes of those polygons (for example, holdings of various livestock species) is one of the very powerful capabilities of a GIS which cannot be realistically replicated by any alternative approach. Allowances can also be made for a wider margin around the plume than the exact plume prediction would calculate.

Although this technique has proved in practice to have valuable predictive capacity for foot-and-mouth disease outbreaks, the nature of the Gaussian plume method of predicting airborne virus transmission means that the calculations do not take true account of the three-dimensional topography over which the plume is passing. Newer techniques such as Lagrangian puff models (Dr T. Mikkelsen and co-workers, Department of Meteorology and Wind Energy, Risø National Laboratory, Roskilde, Denmark) can take this into account and can also more precisely account for specific features of weather conditions which may affect virus dispersion. They can also make use of output data from numerical weather prediction models as an alternative to using data from specific local weather recording stations. Such improved mathematical techniques can greatly enhance the predictive power of geographical models of virus dispersion, although limits on the accuracy of the biological data which can be supplied to the meteorological model mean that their full power cannot always be captured for veterinary purposes. Nevertheless it appears that such models will progressively allow airborne spread of disease to be assessed in greater detail, as the importance of airborne spread of various disease agents achieves growing recognition.
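The plume-and-overlay idea can be sketched in a few lines. The example below uses the textbook steady-state Gaussian plume equation with ground reflection; it is not the EpiMAN implementation. The dispersion coefficients (a simple power law in downwind distance), the emission rate, the farm list and the species thresholds are all hypothetical values chosen for illustration.

```python
import math


def plume_concentration(x, y, q, u, h=0.0, z=0.0,
                        a_y=0.08, b_y=0.9, a_z=0.06, b_z=0.85):
    """Gaussian plume concentration at downwind distance x (m),
    crosswind offset y (m) and height z (m), for emission rate q
    (units/s), wind speed u (m/s) and release height h (m).
    sigma_y and sigma_z grow as a * x**b; the coefficients are
    illustrative assumptions, not calibrated values."""
    if x <= 0:
        return 0.0  # this simple model predicts nothing upwind of the source
    sigma_y = a_y * x ** b_y
    sigma_z = a_z * x ** b_z
    crosswind = math.exp(-y ** 2 / (2 * sigma_y ** 2))
    # The second term mirrors the source in the ground plane (reflection).
    vertical = (math.exp(-(z - h) ** 2 / (2 * sigma_z ** 2))
                + math.exp(-(z + h) ** 2 / (2 * sigma_z ** 2)))
    return q / (2 * math.pi * u * sigma_y * sigma_z) * crosswind * vertical


def farms_at_risk(farms, q, u, thresholds):
    """Overlay step: return the names of farms holding at least one
    species whose (hypothetical) infective threshold is exceeded."""
    at_risk = []
    for name, x, y, species in farms:
        c = plume_concentration(x, y, q, u)
        if any(c >= thresholds[s] for s in species):
            at_risk.append(name)
    return at_risk


# Hypothetical farms: (name, downwind m, crosswind m, species held).
farms = [("A", 1000.0, 0.0, ("cattle",)),
         ("B", 1000.0, 0.0, ("sheep",)),
         ("C", 1000.0, 200.0, ("cattle",))]
# Cattle are given a lower threshold than sheep, as described in the text.
thresholds = {"cattle": 1e-6, "sheep": 1e-3}
exposed = farms_at_risk(farms, q=1.0, u=5.0, thresholds=thresholds)
```

Because the formula depends only on downwind and crosswind distance, terrain plays no part in the result, which is the limitation of the Gaussian plume method noted above.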

EpiMAN also includes other geographical modelling features, such as a model of inter-farm spread by various mechanisms (Inter-Spread), which can be used both for real-time evaluation of likely outbreak development and as a training tool by creating realistic outbreak scenarios for transmission of the disease. A further model allows prospective evaluation to be carried out of various control options, such as ring vaccination and contact slaughter.

EpiMAN is not simply a series of models, but an integrated decision support system (DSS) which handles incoming data of many different types, links the data items to geographical locations through the GIS, guides management actions through knowledge-based priority setting systems, and provides up-to-date evaluations of progress in control procedures. Through the epidemiologist's workbench, it also offers a series of tools which can be applied to the data to compare expected with actual trends, and where necessary modify the operation of the DSS to take account of new findings.

Potential application to African trypanosomiasis

The trypanosomiasis-tsetse fly complex is an ideal (although very challenging) application to which geographical computer simulation could be applied, probably most usefully as part of the development of a decision support system for control of the disease. It would be a long-term DSS, rather than a real-time emergency system of the EpiMAN type. Because the ecology of the fly and hence the trypanosomes is very dependent on landscape and climatic factors as well as host-related issues, modelling of large-scale control options could be a very powerful tool, utilizing a hierarchical model with a habitat-mosaic approach to handling the landscape diversity of African countries.

Models which lack a spatial dimension will have continuing difficulty in handling the geographical reality of this disease and the emergence of techniques for modelling on geographical surfaces with access to climatic and other data offers an ideal starting point for applying modelling at a practical level in the epidemiological study of such complex diseases.


MORRIS, R.S., SANSON, R.L. and STERN, M.W. 1992. EpiMAN - A decision support system for managing a foot-and-mouth disease epidemic. In: Proceedings of the Fifth Annual Meeting of the Dutch Society for Veterinary Epidemiology and Economy. Wageningen, pp. 1-35.

Statistical modelling of georeferenced data: Mapping tsetse distributions in Zimbabwe using climate and vegetation data

B. Williams*, D. Rogers+, G. Staton++, B. Ripley++ and T. Booth§

* Tropical Health Epidemiology Unit
Department of Epidemiology and Population Sciences
London School of Hygiene and Tropical Medicine
Keppel Street, London WC1E 7HT, UK

+ Department of Zoology
Oxford University
South Parks Road, Oxford OX1 3PS, UK

++ Department of Statistics
Oxford University
South Parks Road, Oxford, OX1 3TG, UK

§ CSIRO Division of Forestry and Forest Products
P.O. Box 4008
QVT, Canberra ACT 2600, Australia

Sources of data


It is important to be able to predict the distribution and abundance of insect vectors of disease in order that intervention programs may be targeted at appropriate areas and control operations may be designed and executed in the most efficient manner. Climate and vegetation data are becoming more widely available, partly through the increased use of satellites for remote sensing. In this paper we compare and contrast a number of advanced statistical techniques that can be used to predict the distribution of tsetse flies in Zimbabwe. We consider the relative merits of the different techniques in helping us to make accurate predictions but also in helping us to understand the biological factors that determine the distributional limits of tsetse flies. The simpler methods, such as linear discriminant analysis and tree-based induction, tend to be less precise but easier to interpret biologically than the more sophisticated methods, such as non-linear discriminant analysis and neural networks.


It is important to know the distribution and abundance of insects, especially those that are the vectors of disease. Eventually, we would like to be able to produce risk maps that tell us how the risk of disease varies over space and time. Such maps would be valuable for planning intervention strategies in epidemic situations and for planning control strategies in endemic situations.

Among the most important determinants of the distribution and abundance of insects are climate and vegetation. Many insects are limited in their distribution by high or low temperatures or by dry-stress. Even for haematophagous insects, vegetation cover is often important for their survival. Furthermore, satellite-derived vegetation data may serve as a surrogate for climate data, when the latter are unavailable, since the vegetation might respond to the same climate variables as does the insect.

The general problem, then, is: given estimates of the distribution of an insect, for example, together with a set of climate- and satellite-derived data, all on a suitable raster grid, how can one best predict the distribution of insects? The problem appears to be fairly straightforward. One imagines a parameter space of several dimensions in which each axis corresponds to one environmental variable. A volume in this parameter space is then identified that encloses the values of the various parameters in which the insect vector or the disease, for example, occurs and excludes all those in which it does not occur. Unfortunately, this apparently straightforward procedure turns out to be difficult to handle with real data that do not satisfy the usual assumptions of normality and linearity that underlie most standard parametric statistical techniques. Furthermore, standard techniques, such as discriminant analysis, assume that the parameter space can be separated by a single linear function, an assumption that is rarely valid.

In recent years new mathematical techniques, including non-linear discriminant analyses, neural networks, decision tree induction methods and k-nearest neighbour analysis, have been developed to analyse multivariate data. In this paper we investigate the relative merits of these methods in helping us to identify the factors that determine the limits of tsetse fly distributions.

Sources of data

There are, unfortunately, few places for which reliable maps of the natural distribution of tsetse flies as well as good climatic and vegetational data are available. In some places the distribution of the flies has been altered as a consequence of human interventions and in others as a consequence of biological events such as the rinderpest pandemic that swept through Africa at the end of the last century destroying most of the favoured hosts of tsetse flies and eliminating the flies from much of the country (Ford, 1971). Fortunately, maps of the pre-rinderpest distribution of flies are available for Zimbabwe. Figure 1 shows the distribution of tsetse flies in 1896 as deduced by Fuller (1923), Jack (1914, 1933) and Curson (1932) and reported by Ford (1971).

Tsetse Flies in Zimbabwe

The two species of tsetse flies that are found in Zimbabwe, Glossina morsitans and G. pallidipes, are both savannah species preferring open woodland to forested areas. At the end of the last century G. morsitans was overwhelmingly the dominant species of fly in Zimbabwe and the distributions can be taken as referring to this species alone. One belt of flies extended across the north of the country (along the Zambezi River) and another across the south of the country (along the Limpopo River) as shown in Figure 1. In other parts of Africa the distribution is more patchy and we are applying the methods that we have developed to the distribution of G. pallidipes and G. morsitans in eastern Africa where the distribution of flies is more complex than it is in Zimbabwe.

Figure 1. Map of Zimbabwe showing areas in which tsetse flies are believed to have been present and absent before the rinderpest pandemic in 1896. Light areas - present, dark areas - absent. The small black region in the north west is Lake Kariba.

Climatic Data

The climatic data were assembled by Booth et al. (1990). The available meteorological data for Zimbabwe were interpolated on a 5' grid (about 10 x 10 km) using Laplacian smoothing splines (Hutchinson et al., 1984). For each month of the year and for each of 4999 grid cells the elevation, rainfall, evaporation, maximum, minimum and mean temperatures were estimated.

The climatic data set alone has 420,000 data points and the first priority was to reduce the data to manageable proportions. The elevation, annual rainfall, evaporation and the following temperature variables were used*:

1. Maximum-mean-maximum temperature or XMX (average value of the daily maximum temperature for the hottest month).

2. Maximum-mean-mean or XMM (average value of the daily mean temperature for the hottest month).

3. Mean-mean-mean or MMM (average temperature over the whole year).

4. Minimum-mean-mean or NMM (average value of the daily mean temperature for the coldest month).

5. Minimum-mean-minimum or NMN (average value of the daily minimum temperature for the coldest month).

* In the following definitions the last word refers to day, the last but one to month and the first to year. Maximum-mean-mean is therefore determined by taking the mean temperature for each day, calculating the mean value for each month and then taking the maximum of the resulting 12 values.

Temperatures 1 and 5 give extreme values while temperatures 2, 3 and 4 give the variation in the mean temperatures.
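The year-month-day naming convention can be made concrete with a short sketch. It assumes the input is a hypothetical list of 12 months, each holding one (minimum, mean, maximum) tuple per day; the data set described above was built from interpolated station records, so this only illustrates the definitions.

```python
from statistics import mean


def temperature_indices(daily):
    """daily: list of 12 months, each a list of (t_min, t_mean, t_max)
    tuples, one per day. Returns the five summary temperatures."""
    # Innermost step ("day" to "month"): monthly means of the daily values.
    mon_min = [mean(d[0] for d in month) for month in daily]
    mon_mean = [mean(d[1] for d in month) for month in daily]
    mon_max = [mean(d[2] for d in month) for month in daily]
    # Outermost step ("month" to "year"): maximum, mean or minimum of 12 values.
    return {
        "XMX": max(mon_max),    # maximum-mean-maximum
        "XMM": max(mon_mean),   # maximum-mean-mean
        "MMM": mean(mon_mean),  # mean-mean-mean
        "NMM": min(mon_mean),   # minimum-mean-mean
        "NMN": min(mon_min),    # minimum-mean-minimum
    }


# Synthetic two-day months, purely to exercise the function.
example = [[(10.0 + m, 15.0 + m, 20.0 + m), (12.0 + m, 17.0 + m, 22.0 + m)]
           for m in range(12)]
indices = temperature_indices(example)
```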

The elevation contours (Figure 2a) show the eastern highlands rising to 2000 metres and the high ground falling away to about 1000 metres from east to west and then falling away to about 500 metres to both the north and the south. The NMN temperatures (Figure 2b) almost reach freezing and the XMX temperatures (Figure 2c) reach 35 °C. The rainfall (Figure 2d) is very high in the eastern highlands, is lowest in the south and south-west of the country and does not vary greatly over the rest of the country. Typical values are about 500 mm per year. Evaporation (Figure 2e) is low in the eastern highlands and increases as one moves to the west which borders on the Kalahari sands.

Vegetation Data

The vegetation data are the monthly maximum value composites of the normalized difference vegetation index (NDVI) derived from the 8 km NOAA-AVHRR data for the years 1984 to 1989. The vegetation data were interpolated onto the same grid as the climate data and to reduce the size of the data set we used only the data for February and September which correspond to the highest (late wet season) and lowest (late dry season) values, respectively.

One Dimension

In one dimension it is easy to determine the optimal threshold value that divides places in which flies are present from those in which they are absent and the resulting threshold values with the corresponding number of correct predictions are given in Table 1.
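A minimal sketch of the one-dimensional search follows. It assumes that candidate thresholds are the midpoints between consecutive distinct observed values and that the score is the fraction of grid cells classified correctly; the function name and the toy data are hypothetical.

```python
def best_threshold(values, present, present_above=True):
    """Return (threshold, fraction correct) for the single cut that best
    separates cells with flies present from cells with flies absent.
    present_above=True predicts presence above the cut, False below it."""
    distinct = sorted(set(values))
    # Candidate cuts: midpoints between neighbouring distinct values.
    cuts = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_cut, best_correct = None, -1
    for cut in cuts:
        correct = 0
        for v, p in zip(values, present):
            predicted = (v > cut) if present_above else (v < cut)
            correct += (predicted == p)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut, best_correct / len(values)


# Toy data: a temperature-like variable (presence above the cut) and a
# rainfall-like variable (presence below the cut).
temp_cut = best_threshold([10, 12, 14, 16, 18, 20],
                          [False, False, False, True, True, True])
rain_cut = best_threshold([400, 500, 600, 700],
                          [True, True, False, False], present_above=False)
```

Multiplying the returned fraction by 100 gives a percentage comparable to P in Table 1.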

The five temperature variables give the best overall predictions with NMM being the best of all, indicating that low temperatures are the most important factor in limiting the distribution of flies in Zimbabwe. The best prediction based on evaporation (Figure 3a) excludes flies in areas in which the evaporation is less than 1930 mm per year. However, the boundaries of the tsetse fly distributions lie along lines that run roughly from west to east while the evaporation and rainfall contours lie along lines that run roughly north to south. Using the range of NDVI values gives a better prediction than that based on evaporation (Figure 3b) but produces a more speckled distribution due in part to the inherently noisy nature of NDVI measurements and the extensive changes in vegetation cover brought about by human intervention since 1896. Although evaporation and the range of NDVI produce predictors of similar quality when judged by the proportion of grid cells for which the prediction is correct, one might prefer the prediction based on the vegetation index which, if one overlooks the speckled nature of the prediction, gives a better overall shape. Any criterion of goodness of fit should include reference to the spatial properties of the fit and cannot rely on a single overall statistic such as the proportion of correct predictions.

Figure 2. a) Elevation contours at 100 m intervals.

Figure 2. b) Minimum-mean-minimum temperatures at intervals of 1 °C.

Figure 2. c) Maximum-mean-maximum temperatures at intervals of 1 °C.

Figure 2. d) Annual rainfall at intervals of 100 mm.

Figure 2. e) Annual pan evaporation at intervals of 100 mm.

Table 1. Threshold values and percentage of correct predictions, P, for each of the variables used in this analysis. For the rainfall and the September NDVI, the flies are predicted to be present below the threshold values. For the other variables the flies are predicted to be present above the threshold values.


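[Table 1: the columns are Variable, Threshold value, and Percentage correct, P. Only the row labels Range NDVI, September NDVI and February NDVI survive; the numerical entries are not recoverable.]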

Many Dimensions

In many dimensions the analysis becomes more difficult because a multi-dimensional space can be divided in many different ways: by a linear surface, a curvilinear surface or even one or more isolated volumes.

Linear Discrimination

Standard linear discriminant analysis allows us to determine a linear function that separates the parameter space into regions where the flies are present and absent. The function is chosen to maximize the ratio of the between groups variance to the within groups variance assuming that the probability that an observation belongs to a given class follows a multivariate normal distribution with the same covariance matrix for all classes (Green, 1978).
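For two predictor variables the calculation can be written out directly. The sketch below is a generic Fisher discriminant, assuming equal prior probabilities for the two classes and placing the cut midway between the projected class means; the function name and toy points are hypothetical, and it is not the software used for the analysis reported here.

```python
def fisher_lda(class0, class1):
    """class0, class1: lists of (x, y) points for 'absent' and 'present'.
    Returns (w, c); a point p is assigned to class1 when w . p > c."""
    def mean_vec(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in pts)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
        syy = sum((p[1] - m[1]) ** 2 for p in pts)
        return sxx, sxy, syy

    m0, m1 = mean_vec(class0), mean_vec(class1)
    # Pooled within-group scatter (same covariance assumed for both classes).
    sxx, sxy, syy = (a + b for a, b in zip(scatter(class0, m0),
                                           scatter(class1, m1)))
    det = sxx * syy - sxy ** 2
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = S_w^{-1} (m1 - m0), using the explicit inverse of a 2 x 2 matrix;
    # this direction maximizes between-group over within-group variance.
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    # Cut halfway between the projected class means.
    c = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return w, c


absent = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
present = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)]
w, c = fisher_lda(absent, present)
```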

Non-Linear Discrimination

Figure 3. a) Predicted distribution of tsetse flies with a threshold value of 1930 mm/year for the pan evaporation.

Figure 3. b) Predicted distribution of tsetse flies with a threshold value of 0.22 for the difference between the February and September NDVI values.

White areas - flies present, predicted present; light shading - flies present, predicted absent; dark shading - flies absent, predicted present; black areas - flies absent, predicted absent.

In addition to the assumption of normality, standard methods of discriminant analysis also depend on assumptions of linearity that may not be valid. These limitations are largely overcome by non-linear discrimination based on projection pursuit regression. A direction vector is chosen in the parameter space and the dependent variable (presence/absence) is plotted against the projection of the points in the parameter space onto this vector. The data are then smoothed using a numerical algorithm and this smoothing function is used to make the predictions. The direction of the vector is then varied and the smoothing process repeated until the best prediction is obtained. Another direction vector is then chosen and treated in the same way. The fraction of the unexplained variance in the data is used to determine the number of direction vectors to include in the regression.
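The search for a single direction vector can be sketched as follows. This is a deliberately simplified illustration in two dimensions: directions are taken from a fixed grid of angles rather than optimized numerically, the smoother is a Gaussian kernel average with a hypothetical bandwidth, and only the first term is fitted. A fuller implementation would fit each further direction to the residuals left by the earlier terms.

```python
import math


def smooth(proj, resp, bandwidth):
    """Kernel smoother (Nadaraya-Watson, Gaussian kernel): the fitted value
    at each projected point is a weighted mean of all responses."""
    fitted = []
    for t in proj:
        wts = [math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in proj]
        fitted.append(sum(w * r for w, r in zip(wts, resp)) / sum(wts))
    return fitted


def best_direction(points, resp, n_angles=36, bandwidth=0.5):
    """Try n_angles directions in the plane; for each, project the points,
    smooth the response against the projection, and keep the direction
    with the smallest residual sum of squares.
    Returns (rss, angle in radians, fitted values)."""
    best = None
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        d = (math.cos(theta), math.sin(theta))
        proj = [d[0] * x + d[1] * y for x, y in points]
        fitted = smooth(proj, resp, bandwidth)
        rss = sum((r - f) ** 2 for r, f in zip(resp, fitted))
        if best is None or rss < best[0]:
            best = (rss, theta, fitted)
    return best


# Toy data: presence depends only on the first variable.
points = [(float(x), float(y)) for x in range(6) for y in range(6)]
resp = [1.0 if x > 2.5 else 0.0 for x, y in points]
rss, theta, fitted = best_direction(points, resp)
```

On these toy data the selected direction lies close to the first axis, because projecting onto it separates the two groups most cleanly before smoothing.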

Figure 4. a) Predicted presence and absence of tsetse flies using linear discriminant analysis.

Figure 4. b) Predicted presence and absence of tsetse flies using the projection pursuit regression discussed in the text.

Figure 4. c) Predicted presence and absence of tsetse flies using the tree-based classification given in Figure 5.

Figure 4. d) Predicted presence and absence of tsetse flies using the k-nearest neighbour analysis with k equal to 1 and using only the NMM temperature, the XMX temperature, the rainfall and the evaporation.

Figure 4. e) Predicted presence and absence of tsetse flies using a neural network with 24 hidden neurons, starting with a seed of 2345. + indicates present, · indicates absent.

Decision Tree Induction

Decision tree induction is an extension of the optimal threshold predictor described above. Each predictor variable is tested to find which one gives the best classification. Each of the two resulting classes is then tested against each of the predictor variables to find the variable that gives the best discrimination within that class. The process continues until all observations are correctly predicted, and the tree is then pruned to provide a reliable classification.
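The first step of the induction, choosing the single variable and threshold that give the fewest mis-classifications, might be sketched as below. The function names and the choice of candidate thresholds (midpoints between consecutive distinct values) are assumptions made for illustration. Growing a full tree would repeat the same search within each of the two resulting groups.

```python
def best_threshold(values, labels):
    """Find the threshold on one variable with the fewest mis-classifications.

    Both orientations are tried: flies present above the threshold, and
    flies present at or below it. Returns (errors, threshold, present_above).
    """
    ordered = sorted(set(values))
    # Candidate thresholds lie midway between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for t in candidates:
        for present_above in (True, False):
            errors = sum(
                1
                for v, y in zip(values, labels)
                if ((v > t) == present_above) != (y == 1)
            )
            if best is None or errors < best[0]:
                best = (errors, t, present_above)
    return best


def best_variable(table, labels):
    """`table` maps a variable name to its list of values, one per grid cell.

    Returns (name, errors, threshold, present_above) for the best variable.
    """
    results = {name: best_threshold(col, labels) for name, col in table.items()}
    name = min(results, key=lambda key: results[key][0])
    return (name,) + results[name]
```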

k-Nearest Neighbour Analysis

In k-nearest neighbour analysis a small integer k and a set of points that will serve as the training set are chosen. For each new point the k points that are closest to the new point in the parameter space are then identified and the new point is assigned to the class which is most common among these k nearest neighbours.
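A minimal sketch of the classification rule is given below. It assumes Euclidean distance in the parameter space, which is a choice made for the illustration; in practice the variables would usually need to be put on comparable scales first. The function name is hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, labels, point, k=1):
    """Assign `point` to the most common class among its k nearest
    training points, using Euclidean distance (an assumption)."""
    order = sorted(
        range(len(training)), key=lambda i: math.dist(training[i], point)
    )
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

With k equal to 1 the rule reduces to copying the class of the single closest training point.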

Neural Networks

Neural networks are layers of connected nodes. The first layer comprises the input to the system and the last layer the output. The number of input nodes is equal to the number of parameters in the fitting procedure and, since we have only two possible values for the output (presence/absence), there is one output node that can be either on or off. The input to each node is tested against a threshold to produce an output of 0 or 1. The output from each node in each layer is multiplied by a weighting factor and fed to the nodes in the next layer. The sum of the inputs to each node is then tested against a threshold and the procedure is repeated. Points in the parameter space are presented in random order to the network and the output from the network is calculated and compared with the observed values (present or absent). If the prediction is wrong, the weights and thresholds are recalculated using a back propagation algorithm, a method of steepest descent, which minimizes the mean square error in the prediction. A separate factor controls the amount by which the weights are changed on each iteration; small values of this factor will cause the algorithm to converge slowly while large values will cause it to oscillate.
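The forward calculation through such a network can be sketched as follows. Only the prediction step is shown; training by back propagation is omitted. The layer representation and the example weights are hypothetical, chosen so that a network with two hidden nodes picks out a region that no single linear threshold could.

```python
def forward(layers, inputs):
    """Propagate inputs through layers of threshold nodes.

    `layers` is a list of (weights, thresholds) pairs, where weights[j]
    holds the weights on the connections into node j of that layer.
    Each node outputs 1 if its weighted input sum exceeds its threshold.
    """
    activations = list(inputs)
    for weights, thresholds in layers:
        activations = [
            1 if sum(w * a for w, a in zip(node_w, activations)) > t else 0
            for node_w, t in zip(weights, thresholds)
        ]
    return activations


# Hypothetical example: the output is on only when exactly one input is on.
example_layers = [
    ([[1, 1], [1, 1]], [0.5, 1.5]),  # hidden layer: "either" and "both"
    ([[1, -1]], [0.5]),              # output: "either" but not "both"
]
```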


From the 4999 grid cells for the data, a random sample of 1000 were used to train the various classifiers and the remaining cells were used to test the classifications.
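The split can be sketched as below. The seed and function name are hypothetical, and the sketch draws one random sample without replacement, which may differ in detail from the procedure actually used.

```python
import random


def split_cells(n_cells=4999, n_train=1000, seed=0):
    """Return (training indices, test indices) for a random split.

    The training cells are sampled without replacement; every remaining
    cell goes into the test set.
    """
    rng = random.Random(seed)
    train = set(rng.sample(range(n_cells), n_train))
    test = [i for i in range(n_cells) if i not in train]
    return sorted(train), test
```

With the default arguments this leaves 3999 cells for testing.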

Linear Discriminant Analysis

In addition to the variables given in Table 1 we used n15, the number of months for which the minimum value of the mean temperature was less than or equal to 15 °C.

The linear discriminant analysis gave the following model:

y = -2.04 - 0.45n15 + 0.11XMM - 4.85Sep + 1.58Feb

The flies are therefore excluded from cold areas in which many months have mean minimum temperatures less than 15 °C and the maximum value of the mean temperature is low, and from very wet areas in which the September (dry season) NDVI is high or dry areas in which the February (wet season) NDVI is low. For the training set 88% of the predictions were correct, for the test set 87% were correct. The predicted distribution of flies is given in Figure 4a. Although the overall shape is reasonably good, the boundaries are not picked out very precisely.
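As a worked illustration, the fitted function can be evaluated for a single grid cell. The sketch below assumes that a positive value of y indicates presence, which is consistent with the signs of the coefficients but is not stated explicitly, and that XMM is expressed in °C. The function names and the input values in the usage note are hypothetical.

```python
def discriminant_score(n15, xmm, sep_ndvi, feb_ndvi):
    """Evaluate y = -2.04 - 0.45 n15 + 0.11 XMM - 4.85 Sep + 1.58 Feb."""
    return -2.04 - 0.45 * n15 + 0.11 * xmm - 4.85 * sep_ndvi + 1.58 * feb_ndvi


def predicted_present(score):
    # Assumption: the decision boundary lies at y = 0.
    return score > 0
```

For a hypothetical warm cell with n15 = 0, XMM = 30, a September NDVI of 0.2 and a February NDVI of 0.5, the score is about 1.08. For a cold cell with n15 = 6 and XMM = 22 but the same NDVI values, it is about -2.50.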

Non-Linear Discrimination: Projection Pursuit Regression

Using projection pursuit regression, the fraction of the unexplained variance was used to determine the number of projection vectors to include, and the value chosen was 5. For the training set 95% of the predictions were correct; for the test set between 92 and 93% were correct.

The final model gives some weight to all of the variables and for each projection a nonlinear transformation of the projection axis is used. Although the non-linear discriminant technique gives a better classification than the linear discriminant technique, the model effectively contains about 70 parameters making interpretation very difficult. However, comparison with the linear discrimination provides an indication of the limitations imposed by the assumptions of normality and linearity. Figure 4b gives the predicted distribution based on projection pursuit regression and comparing this with Figure 4a shows that the non-linear method picks out the northern limits more precisely, picks out the Sabi River valley more precisely in the north-east part of the Limpopo fly belt (in the south) and allows the flies to occupy regions further to the west in the Limpopo fly belt.

Tree-Based Model

Figure 5 shows the pruned tree-based classification. The top node indicates that if we assume that the flies are present everywhere then, in the training set, 498 out of 1000 grid cells are mis-classified. The first discrimination corresponds to the NMM (low temperature) classifier given in Table 1 and indicates that after this condition is applied 128 out of 1000 cells are mis-classified so that 87% of the cells are correctly classified. Using evaporation to reclassify the areas in which the flies are present (on the first criterion) increases the proportion of correctly classified cells to 89%. Using the XMX temperature to reclassify the areas in which flies are absent (on the first criterion) does not improve the classification but excludes flies from 262 cells while leaving 272 cells free to be reclassified. The best classifier for these 272 cells is rainfall and this increases the proportion of correctly classified cells to 90%. Reclassifying the cells in which flies are present, using evaporation, and in which they are absent, using elevation, we get a final classification in which 92% of the cells are correctly classified. Applying this tree to the test set gives 91% correct predictions.

Figure 5. The pruned tree-based classification. Each ellipse indicates whether flies are present (1) or absent (0). The rectangles indicate the final classifications. The ratios indicate the proportion of mis-classifications at each node in the tree. The threshold values for successive nodes are indicated on the lines joining nodes.

The predicted distribution of the flies using the tree-based classification is shown in Figure 4c. The tree-based classification is, like the linear discriminant analysis, easy to interpret biologically. It indicates that the overwhelmingly dominant factor is the low temperature threshold. Small, but significant improvements in the classification can be obtained using a combination of rainfall and evaporation. Where the flies should be present, according to the low temperature limit, they should nevertheless be excluded in very wet areas where the evaporation is low. Where the flies should be absent, according to the low temperature limit, they may still be present if the rainfall is sufficiently high but not when the evaporation is very low.

k-Nearest Neighbour Analysis

After investigating the performance of the k-nearest neighbour analysis for different values of k it was found that the best error rate was obtained with k equal to 1. Each member of the test set is then put into the same class as its nearest neighbour (in the parameter space) in the training set. Using all of the available variables about 93% of the cells in the test set are classified correctly.

Using k-nearest neighbours the error rate on the test set was improved slightly by including only the environmental variables that were indicated as being the most important in the tree-based classification, namely the NMM temperature, the XMX temperature, the rainfall and the evaporation. The spatial distribution of the flies predicted on the basis of the k-nearest neighbour analysis is shown in Figure 4d. It gives the highest proportion of correct predictions and does even better than the non-linear projection pursuit regression. Like the latter, however, it does not, in itself, help us to interpret the data biologically.

Neural Network Analysis

Networks with a single hidden layer containing 6, 12 and 24 neurons were used. Random initial weights were chosen and for each network three runs were carried out using different random number seeds. Between 93 and 96% of the predictions were correct except for one run in which the network became trapped in a local minimum and only 86% of the predictions were correct. Figure 4e shows the predictions based on the neural network with 24 hidden neurons and the overall classification is very good. Unfortunately, the predictions of the neural networks are also difficult to interpret biologically.


We have a range of techniques that we can use to fit the distribution of tsetse flies to the environmental variables and these techniques can of course be used for any environmental variables and for any observed distribution. Eventually we hope to understand the biology of the organism whose distribution we are trying to explain sufficiently well that we can simply define criteria for the presence or absence of the organism and make predictions accordingly. We would then have a set of rules that we use to define a volume in our variable space of any complexity that we choose as has been done in the world-wide classification of vegetation types by Woodward and Williams (1987) and in identifying areas within Zimbabwe suitable for particular tree species (Booth et al., 1990). Before we reach that stage, however, we need to be able to use our data to help us to identify variables that are likely to be significant and to determine threshold values for such variables.

Probably the most useful way to begin is to consider each variable separately in order to decide which single variables are likely to be the most important and to decide if more than one threshold is suggested by the data for a particular variable.

Generally we will of course want to include more than one environmental variable, and the simplest way to proceed is to use a linear discriminant analysis. If a particular variable seems to require more than one threshold, it is likely that the discriminant analysis will have difficulty producing a good prediction using that variable. At this stage we might examine the spatial distribution of the relevant variable and, if necessary, fit different models in different environmental regions. For example, the highlands in the east of Zimbabwe are too wet for the flies while the Kalahari sands in the west of Zimbabwe are too dry for the flies. We might therefore consider dividing the country along a north-south line and then analysing the eastern and the western regions separately. The larger and the more diverse the geographical area under study, the more likely it is that splitting the region up will help. An advantage of the non-linear methods of analysis is that they should be able to deal directly with problems of this nature without having to fit different models in different places.

The next most useful analysis involves carrying out a tree-based classification. While the discriminant analysis divides the parameter space over a hyper-plane, the tree-based classification divides the space into a series of nested hyper-rectangles. This too is done in a forward stepwise manner including at each successive branching the variable that gives the greatest improvement in the number of correct predictions. This provides a useful contrast with the linear discriminant analysis. If the two methods identify quite different variables as being important one should examine the data and try to determine the reasons for this.

If the classification is not very good or if one is concerned about the validity of some of the assumptions, we can use a k-nearest neighbour analysis. This has the advantage of being simple to carry out and is likely to give a good fit to the data. Comparing the linear discriminant analysis and the tree-based classification with the k-nearest neighbour analysis should give some idea as to how much we are likely to be able to improve on the simple analysis schemes using more sophisticated techniques. The disadvantage of the k-nearest neighbour analysis is that it affords us no biological interpretation. However, if we are concerned to use our predictions simply as part of a management or planning operation, it may be that this is in fact the best method to use.

If we still feel that it should be possible to improve the prediction further, the next step would be to carry out a non-linear projection pursuit regression. Again one is unlikely to be able to use the fit to interpret the data biologically, but it does tell us whether the limitations imposed by the more restrictive assumptions of the simpler techniques are important. And again, if the purpose is simply to use the predictions in a management or planning context, this may be all that we need.

The final possibility is to use neural networks. These are very powerful but also the most demanding on computing time and the most difficult to execute and interpret. A neural network of sufficient complexity can pick out regions of parameter space of essentially any shape or form. However, the study of neural networks and their application to problems such as this is still in its infancy. The neural network took several orders of magnitude longer to converge than even the non-linear discriminant analysis. Many flexible methods of discrimination are currently under development and these will incorporate the best features of the currently available methods.


The NOAA-AVHRR NDVI data was provided by Dr. B. Hendriksen, courtesy of USAID/FEWS NASA GSFC NDVI, and was prepared by R. Kruska. Gareth Staton thanks the Science and Engineering Research Council and Brian Williams thanks the Royal Society and the Overseas Development Administration for financial support.


BOOTH, T.H., STEIN, J.A., HUTCHINSON, M.F. and NIX, H.A. 1990. Identifying areas within a country climatically suitable for particular tree-species: an example using Zimbabwe. The International Tree Crops Journal 6: 116.

CURSON, H.H. 1932. Distribution of Glossina in Bechuanaland Protectorate. 18th Report of the Director of Veterinary Services and Animal Industry, Onderstepoort, August 1932, Pretoria, South Africa.

FORD, J. 1971. The Role of the Trypanosomiases in African Ecology: A Study of the Tsetse Fly Problem. Oxford: Clarendon Press, 568 pp.

FULLER, C. 1923. Tsetse in the Transvaal and surrounding territories. An historical review. 9th and 10th Reports of the Director of Veterinary Education and Research, Pretoria, South Africa.

GREEN, P.E. 1978. Analyzing Multivariate Data. Hinsdale, Illinois: Dryden Press.

HUTCHINSON, M.F., BOOTH, T.H., McMAHON, J.P. and NIX, A. 1984. Estimating monthly mean values of daily total solar radiation for Australia. Solar Energy 32: 277-290.

JACK, R.W. 1914. Tsetse fly and big game in Southern Rhodesia. Bulletin of Entomological Research 5: 97.

JACK, R.W. 1933. The tsetse fly problem in Southern Rhodesia. Rhodesia Agricultural Journal 30: 365.

WOODWARD, F.I. and WILLIAMS, B.G. 1987. Climate and plant distribution at global and local scales. Vegetatio 69: 189-197.
