**Collection, collation, analysis and dissemination of data on vector-borne and other parasitic diseases**

*T.E. Carpenter*

Department of Epidemiology and Preventive Medicine

School of Veterinary Medicine

University of California

Davis, California 95616, USA

Abstract

Introduction

Collection

Collation

Analysis

Dissemination

References

Disease modelling has frequently been criticized because of the numerous assumptions that typically go into the construction of a model. This criticism is no doubt significantly linked to the paucity of appropriate data available to the modeller. Given this deficiency, how can international agencies such as the Food and Agriculture Organization (FAO) and the International Laboratory for Research on Animal Diseases (ILRAD) assist the modeller in obtaining data in a form necessary to construct, verify and validate a vector-borne or other parasitic disease model?

Before identifying the requisite data, it is necessary to establish the system being modelled, the hypothesis being tested and the intended audience. Models designed to mimic zoonotic infections require not only demographic information on human populations, but also frequently comparable data on wildlife and domestic animal populations. Such information would include - but not be limited to - density, movement, reproductive rate, death, prevalence and incidence or infestation rates and levels in the respective populations. Host-specific information such as age-, sex- or breed-specific risk or susceptibility needs to be identified and quantified where it exists. Once the host populations have been identified, similar demographic data as well as maturation rates are required for parasite and vector populations. In addition, climatic and control-specific susceptibility data are critical to permit modelling of these latter populations.

There is a need for long-term data collection if the modeller is to attempt a time series analysis that accounts for seasonal, cyclic or secular trends in the respective populations and infection patterns. Collection and presentation of information in a temporal form could be greatly enhanced by the addition of a spatial dimension, whether through satellite imaging of soil or vegetation types or through mapping of populations.

For both parasitic and vector-borne diseases, it is important to determine the level of aggregation or clustering of infection or infestation within the population of interest. Such data could be collected from both the laboratory and the field. Data such as those discussed above could be useful to the modeller, whether or not they have been analysed, if they are available either as an ASCII file or as a database.

Disease modelling frequently has been criticized due to the numerous assumptions that typically go into the construction of a model. This criticism is no doubt significantly linked to the paucity of appropriate data readily available to the modeller. Typically, a modeller will rely on data available in the literature for initial parameter estimates necessary to construct the model. Given this deficiency, how can international agencies and research institutes such as the Food and Agriculture Organization (FAO) and the International Laboratory for Research on Animal Diseases (ILRAD) assist the modeller in obtaining data in a form necessary to construct, verify and validate vector-borne or other parasitic disease models?

One view of an epidemiological model was given by Anderson (1976), who wrote that such models '... describe temporal changes in the number of susceptibles, infected and recovered or immune hosts within a population, and depend on a few parameters which specify the nature of the incubation and infectivity period and the rate of transmission of disease'. If it were as simple as the quotation implies, the data needed for model construction would be limited. However, as discussed below, it is often not that simple, and additional data must be collected in order to avoid misinterpretation of model results or misspecification of model parameters.

Lamenting inaccuracies of parasite models, Anderson went on to state that: 'Unfortunately... very few field or laboratory studies have yielded quantitative estimates of the population process such as infection rates, rate of host mortality caused by parasite infection or even survival rates of the various stages in the life cycle.... The paucity of our knowledge is no doubt due to the complexities of parasite life cycles, but the lack of experimental and field information is also a direct lack of the intimacy of the relationship between host and parasite'.

The purpose of this paper is to discuss the need for collection, collation, analysis and dissemination of appropriate data, i.e. data that are useful in modelling parasitic and vector-borne diseases. The focus will be on the collection and analysis of these data. The section on analysis gives a limited discussion of some of the appropriate techniques which could be used and highlights some of the errors encountered in the analysis of epidemiologic and production data.

Before identifying the requisite data, it is necessary to establish the system being modelled, the intended goals or objectives, the hypothesis being tested and the intended audience for which the model is being constructed. Models that are developed to better visualize a system may be satisfactorily completed when the modeller has constructed a basic flow diagram representing the system. Typically, however, the modeller constructs a model that is to be used either to understand the basic components of the system, or for predictive purposes.

Data needs may also be determined after the initial model has been constructed. That is, during the model verification and validation stages, it often becomes apparent that the model is highly responsive to changes in a given parameter, and additional data may be required to estimate that parameter more accurately so that the model performs more realistically.

Data may be in several forms: production (health, nutrition and other inputs and outputs); economic (border and farm-gate prices, demand and supply elasticities, and accounting or shadow prices); health (efficacy of vaccine, acaricide and chemotherapy); climatic (temperature, relative humidity, evaporation and rainfall); topographic (altitude, vegetation and land use); demographic (host, vector and parasite population dynamics, including spatial distribution and movement); and laboratory (serological and DNA fingerprinting).

Since data used to estimate model parameters are, out of necessity, often secondary data, i.e. collected by someone else or for another purpose, it is important that they be well understood by the modeller. An example of this need for a more critical understanding of the data is the interpretation of serological data. The interpretation of serological results depends on test sensitivity, test specificity and prevalence (Figure 1). As any of these parameters change, interpretation of test results will change. Specifically, a positive or negative test result may be either correct or incorrect. The probabilities that such results are correct are referred to as the predictive values of the test. For example, the predictive value (+) of a test is the probability of an individual being infected (D+), given that it has a positive test result (T+). This is expressed as a conditional probability, P(D+|T+). As can be seen in the figure, assuming a test sensitivity and specificity of 90% each, the predictive value (+) of a positive test result varies with the true prevalence, which may be calculated from the apparent, or serological, prevalence, P(T+). Given these test parameters, one can be confident in a positive result when the prevalence is moderate to high, e.g. a predictive value of at least 90% when the true prevalence is ≥ 50% and of approximately 80% or more when it is ≥ 30%. This predictive power falls off precipitously at lower prevalence, e.g. to only 50% when the true prevalence is 10%. The potential for misinterpretation of positive serological results is therefore serious at low prevalence levels.
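The dependence of predictive values on prevalence described above can be illustrated with a short calculation. The following is a minimal sketch applying Bayes' theorem, not a method taken from the paper; the function names are hypothetical, and the conversion from apparent to true prevalence assumes that sensitivity and specificity are known and that their sum exceeds 1.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PV+, PV-) for a test, given the true prevalence (Bayes' theorem)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    pv_pos = true_pos / (true_pos + false_pos)   # P(D+|T+)
    pv_neg = true_neg / (true_neg + false_neg)   # P(D-|T-)
    return pv_pos, pv_neg


def true_prevalence(apparent_prevalence, sensitivity, specificity):
    """Back-calculate true prevalence from the apparent prevalence P(T+).

    Assumes sensitivity + specificity > 1; the result is not truncated to [0, 1].
    """
    return (apparent_prevalence + specificity - 1.0) / (sensitivity + specificity - 1.0)


# Sensitivity and specificity of 90% each, as in Figure 1.
for prevalence in (0.5, 0.3, 0.1):
    pv_pos, pv_neg = predictive_values(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:.2f}  PV(+)={pv_pos:.2f}  PV(-)={pv_neg:.2f}")
```

With these inputs the predictive value (+) is 0.90, about 0.79 and 0.50 at true prevalences of 50%, 30% and 10%, respectively, matching the values quoted in the text.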

Data collection may be long or short term, longitudinal or cross-sectional, and retrospective or prospective in nature. Long-term data collection, although in some cases essential, is often prohibitively expensive in terms of both labour and monetary cost. It is nevertheless critical if the modeller is to attempt a time series analysis to account for seasonal, cyclic or secular trends in the respective populations and infection patterns. Time series analysis is also useful in quantifying parasite or vector development and maturation (Mullens and Lii, 1987).

Data needs also vary, depending on whether the system being modelled is direct or indirect (involving definitive as well as intermediate hosts), and on whether it concerns a parasitic infection or an infectious disease. These factors must be considered throughout the data collection and analysis process. Models designed to mimic zoonotic infections require not only demographic information on human populations, but also frequently comparable data on wildlife and domestic animal populations. Such information would include, but not be limited to, density, movement, and reproductive, death, prevalence and incidence or infestation rates and levels in the respective populations. Host-specific information, such as age-, sex- or breed-specific risk or susceptibility, needs to be identified and quantified where it exists. Once the host populations have been identified, similar demographic data, as well as maturation rates, are required for parasite and vector populations. In addition, climatic and control-specific susceptibility data are critical to permit modelling of these latter populations.

The design of the method in which data are to be selected is an important step in achieving an answer to the proposed questions. Several types of sampling methods have been used in veterinary epidemiology, including stratified, random, stratified random, cluster and biased. As will be discussed in the section on analysis, the type of sampling will also determine the type of analysis which is appropriate for use and the types of interpretations which may be drawn from the analysis.

**Figure 1. Predictive value positive (PV(+)) and predictive value negative (PV(-)), assuming sensitivity and specificity of 90% and a varying prevalence.**

Although computers have been available and heavily used in the developed world for several years, they have been in limited supply in many of the less developed countries. As a result, the data from these countries typically exist on paper. This problem is also typical of animal health data collected by organizations such as FAO. If modern statistical techniques are to be applied to these data, it is critical that they be coded and entered into appropriate digital databases. It is unrealistic to assume that data would be disseminated, much less evaluated by modellers, if they were not available in a computerized form.

In addition, to be useful, data should be in a form appropriate for statistical analysis. Appropriate forms would be either as ASCII, database, or spreadsheet files. In one of these forms, the contemporary modeller would have the capability of easily importing and analysing the data. Necessary data validation and checking for errors in the data should be performed both prior to and after data have been obtained by the modeller. Frequently, after data have been collected and coded into the computer, it may take weeks before the data are in a form amenable to analysis. It is important, therefore, that during the early stages of study design, data collectors and analysts get together and design proper questionnaires and databases, in order to minimize the task of organizing the data in a form in which it may be analysed.

Once the data have been computerized, the next step in the process is compiling it in the form necessary for analysis. This often includes transformations of the data, such as arcsin transformation of percentages, or other distribution transformations to conform to the assumption of normality, if that is a necessary assumption for the analysis. Once the transformations have been performed, the data are often re-categorized from continuous data to other forms such as categorical or nominal data. The classification, or cut points used to determine the classification, should have sound biologic criteria for making such selections.
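As an illustration of the transformation step mentioned above, the following minimal sketch applies the arcsine (angular) transformation to percentages. The function name and the example values are hypothetical and are not data from the paper.

```python
import math


def arcsine_sqrt(proportion):
    """Angular transformation arcsin(sqrt(p)), in radians, for p in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.asin(math.sqrt(proportion))


# Hypothetical prevalence percentages; divide by 100 before transforming.
percentages = [5.0, 25.0, 50.0, 75.0, 95.0]
transformed = [arcsine_sqrt(p / 100.0) for p in percentages]

for pct, value in zip(percentages, transformed):
    print(f"{pct:5.1f}% -> {value:.4f} rad")
```

The transformation stretches the scale near 0% and 100%, where the variance of a proportion is smallest, which is why it is commonly used before analyses that assume approximately constant variance.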

The type of analysis needed for epidemiological modelling will depend on the question of interest. Appropriate analyses could include case-control or cohort studies, or they may focus on the estimation and significance testing of specific population and disease dynamics parameters. Analyses may also be made in order to specify the appropriate distribution of these parameters to be used in a simulation model.

Analytical techniques which have probably been underutilized in epidemiological models include logistic regression and survival analysis. These techniques enable the analyst to determine relevant risk factors which should be included in the model (logistic regression) and to predict the time to the occurrence of a particular event, which may include the potential confounding of a time-dependent covariate (survival analysis). Among the parameters that may be estimated are: 1) population - birth, death, maturation and migration rates, and 2) disease - transmission, incubation, latent and infectious periods.

Additional tests that have been discussed in this workshop include time series and spatial statistics tests. Time series analyses would be important in modelling both the growth and the activity of the host, parasite and vector populations. For example, it is important to know the timing of the infective larval stage of a parasite in order to determine what action should be taken to ensure that newborn lambs are at reduced risk of infection.

Spatial statistical tests are important because the distribution of individuals (hosts, vectors and parasites) is rarely of the random form which is often assumed during at least the initial stages of model construction. Just as it was important to consider age and sex in order to avoid any potential confounding which may occur in an analysis, it is also necessary to consider potential confounding which may arise from assuming a spatial (as well as temporal) distribution. Several methods are available which could assess the level of dispersion, or clustering, in a population. These include tests for autocorrelation of areal and point data, using, among other techniques, nearest neighbour analyses.
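One simple nearest neighbour measure of dispersion is the Clark-Evans index, sketched below as one possible example of such an analysis; it is not necessarily the technique the author had in mind. The function name and the point coordinates are hypothetical, and no correction for edge effects is applied.

```python
import math


def clark_evans_index(points, area):
    """Ratio of observed to expected mean nearest-neighbour distance.

    Values near 1 suggest a random (Poisson) pattern, values below 1
    suggest clustering and values above 1 suggest regular spacing.
    No edge correction is applied in this sketch.
    """
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    nearest = []
    for i, (xi, yi) in enumerate(points):
        nearest.append(min(math.hypot(xi - xj, yi - yj)
                           for j, (xj, yj) in enumerate(points) if j != i))
    observed = sum(nearest) / n
    expected = 0.5 / math.sqrt(n / area)  # expectation under complete spatial randomness
    return observed / expected


# Hypothetical locations in a 4 x 4 study area.
regular = [(i + 0.5, j + 0.5) for i in range(4) for j in range(4)]
clumped = ([(1.0 + 0.01 * k, 1.0) for k in range(8)]
           + [(3.0 + 0.01 * k, 3.0) for k in range(8)])

print("regular grid:", clark_evans_index(regular, area=16.0))
print("two clumps:  ", clark_evans_index(clumped, area=16.0))
```

In practice the index would be accompanied by a significance test and an edge correction, both omitted here for brevity.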

Data should be analysed with respect to its proper distribution. Many statistical tests (parametric) assume data are normally distributed. However, this is often not the case for data used in estimating parameters in epidemiological models. Frequently these data are distributed as binomial, negative binomial, or Poisson random variables. Calculation of the means and variance of such variables would give erroneous model results. This would also negate the possibility of using the simpler deterministic type models instead of the more complex stochastic or Monte Carlo models, which depend on the ability to assign random distributions to numbers.
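A quick check of whether count data depart from a Poisson distribution is the variance-to-mean ratio, and for overdispersed (aggregated) counts a moment estimate of the negative binomial parameter k can be obtained. The sketch below assumes hypothetical parasite counts per host; the function name is illustrative, and moment estimation is only one of several ways of estimating k.

```python
from statistics import mean, variance


def dispersion_summary(counts):
    """Return (mean, variance, variance-to-mean ratio, moment estimate of k).

    A ratio near 1 is consistent with a Poisson distribution; a ratio
    well above 1 indicates aggregation. k is returned as None when the
    variance does not exceed the mean (no overdispersion).
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; dispersion is undefined")
    v = variance(counts)  # sample variance (n - 1 denominator)
    k = m * m / (v - m) if v > m else None
    return m, v, v / m, k


# Hypothetical parasite burdens in ten hosts: most carry few, one carries many.
burdens = [0, 0, 0, 0, 1, 1, 2, 3, 8, 25]
m, v, ratio, k = dispersion_summary(burdens)
print(f"mean={m:.2f} variance={v:.2f} ratio={ratio:.2f} k={k:.3f}")
```

A small k (well below 1 in this example) indicates that a few hosts carry most of the parasites, the pattern of aggregation that a model assuming a common mean burden would miss.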

Case- and cause-specific fatality and morbidity rates, as well as attributable risk for the infection should be calculated. These rates should all be adjusted for differences that exist in age, sex, breed, or species, of the host, vector or parasite.

Confounding

One of the most frequent explanations of the occurrence of bias in epidemiological studies is that of confounding. Martin *et al.* (1987) explained it as follows: 'As a working definition, a confounding variable is one associated with the independent variable and the dependent variable under study. Usually, confounding variables are themselves determinants of the disease under study, and such variables if ignored can distort the observed association. Preventing this bias is a major objective of the design and/or analysis of observational studies'. The most common sources of confounding include age, sex, breed and location. Cattle of a particular breed tend to be more trypanotolerant than others. Calves are more susceptible and will have a lower seroprevalence than adult cattle. Therefore, when doing a seroprevalence study or attempting to assess risk factors, it is important to take this biologic knowledge into consideration when designing a study as well as during the analysis. Three methods of dealing with confounding are exclusion, matching and analytic control, or stratification. Through these methods, the analyst may either control or adjust for potential confounding, or through exclusion focus on a single group, e.g. a single breed, and in that way avoid bias through confounding.
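Stratification can be illustrated with the Mantel-Haenszel summary odds ratio, a standard way of adjusting an association for a categorical confounder. The sketch below uses entirely hypothetical counts in which breed appears associated with seropositivity only because the two breeds were sampled at different ages; the function names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 table: a, b = exposed positive/negative;
    c, d = unexposed positive/negative."""
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """Summary odds ratio across strata, each given as a tuple (a, b, c, d)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical seroprevalence data: exposure = breed A, confounder = age.
calves = (5, 95, 1, 19)    # 5% seropositive in both breeds
adults = (10, 10, 50, 50)  # 50% seropositive in both breeds
strata = [calves, adults]

# Crude table ignores age by pooling the strata cell by cell.
crude = tuple(sum(cells) for cells in zip(*strata))

print("crude OR:       ", round(odds_ratio(*crude), 3))
print("age-adjusted OR:", round(mantel_haenszel_odds_ratio(strata), 3))
```

Here the crude odds ratio is well below 1, suggesting a protective breed effect, while the age-adjusted estimate is 1: the apparent effect is produced entirely by breed A having been sampled mostly as calves.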

Repeated Measures

Another statistical problem that may occur with data that have been serially collected, as mentioned above, is that of multicollinearity (Vågsholm, 1989). Presence of multicollinearity in the data will lead to an inflation bias of the variance of the parameter estimate.

In a multivariable analysis, this bias may result in either an increased or decreased estimate of the variance and hence a decreased or increased estimate of the statistical significance of the parameter coefficient. The result could therefore be incorrectly concluding that a parameter is statistically significant when it is not, or not statistically significant when in fact it is (Mousing *et al.,* 1988; Carpenter *et al.,* 1988). In either case, this potential bias resulting from autocorrelation should be considered and adjusted for when present.
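One common screening check for autocorrelation in serially collected data is the Durbin-Watson statistic computed on regression residuals. It is offered here only as an illustrative sketch and is not a method named in the paper; the function name and residual values are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered in time.

    Values near 2 suggest no first-order autocorrelation, values towards 0
    suggest positive autocorrelation and values towards 4 suggest negative
    autocorrelation.
    """
    if len(residuals) < 2:
        raise ValueError("at least two residuals are required")
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    return numerator / denominator


# Hypothetical residuals from repeated measurements on the same herd.
runs_of_same_sign = [1, 1, 1, -1, -1, -1]  # positive autocorrelation
alternating_sign = [1, -1, 1, -1]          # negative autocorrelation

print(durbin_watson(runs_of_same_sign))
print(durbin_watson(alternating_sign))
```

A statistic far from 2 would suggest that standard errors from an ordinary regression should not be taken at face value and that a method accounting for the repeated-measures structure is needed.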

Simultaneous Equation Bias

In estimating model parameters, it is essential that the analysis reflects the fact that the data, as with the model we may attempt to construct, are collected from and therefore intended to represent a system and are therefore not independent of that system (Working, 1927). One of the most commonly overlooked examples of this in veterinary medicine is simultaneous equation bias. The usual assumption is that our model, or system, consists of outcomes, referred to as endogenous variables, that are a function of explanatory variables, often referred to as exogenous variables. However, additional relationships may exist whereby not only do the traditional exogenous variables explain variability of the endogenous variable, but other variables, for example current or lagged endogenous variables, also significantly affect the system being analysed.

A simultaneous equation, and its associated bias may occur in one of three ways: 1) presence of true biological interactions, e.g. disease and production; 2) aggregation of observation periods, when data are collected; and 3) multiple-output production processes, e.g. milk and calves.

An example of simultaneous equation bias is seen with the relationship between a parasite population and an associated control measure, pesticide application. The primary objective of a model may be to evaluate the efficacy of pesticide application on a parasite-infested population. Classical regression would have a model specified as parasite population as a function of, among other things, pesticide application. However, inherent in this equation is the fact that pesticide application is a function of, among other things, the perceived risk of an individual or the population being infected. Therefore, two endogenous variables exist in this simple system, parasite population numbers and pesticide use. The bias arises because one of the endogenous variables, pesticide use, is not independent of the error term. The resulting bias will be in the estimate of the coefficient in the equation used to estimate parasite population. The bias may give an increased or decreased estimate of the risk. For example, we know biologically that the application of a pesticide, discounting drug resistance, will act to decrease the population size. However, through statistical analysis, it is likely that pesticide use will be associated with an increased risk of infestation, or an increased population size. This is because, instead of measuring the impact of the pesticide, we are measuring the decision makers' response to a problem.

Traditional analysis, e.g. ordinary least squares regression analysis, of the data of such a system could lead to serious misspecification of the model and consequently misinterpretation of the results (Vågsholm *et al.,* 1991). Alternative methods, e.g. three-stage least squares analysis, are available in biostatistical or econometric software packages and will eliminate such bias.
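The pesticide example can be illustrated with a small simulation. The sketch below is built entirely on assumptions: the data are simulated, the coefficients are arbitrary, and the variable `cost` (a hypothetical pesticide cost that influences use but not the parasite population directly) is introduced only to serve as an instrument. A simple single-instrument instrumental-variable estimator is used in place of the three-stage least squares mentioned in the text, which requires specialised software.

```python
import random


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Instrumental-variable slope of y on x, using z as the instrument."""
    return covariance(z, y) / covariance(z, x)


def simulate(n, seed=1):
    """Simulate hypothetical farms where pesticide use responds to perceived risk."""
    rng = random.Random(seed)
    cost, use, parasites = [], [], []
    for _ in range(n):
        risk = rng.gauss(0, 1)    # unobserved infestation pressure
        price = rng.gauss(0, 1)   # hypothetical instrument: pesticide cost
        applied = risk - price + rng.gauss(0, 0.5)
        # Assumed true effect of pesticide on the parasite population: -0.5.
        burden = 2.0 * risk - 0.5 * applied + rng.gauss(0, 0.5)
        cost.append(price)
        use.append(applied)
        parasites.append(burden)
    return cost, use, parasites


cost, use, parasites = simulate(5000)
ols_estimate = ols_slope(use, parasites)
iv_estimate = iv_slope(cost, use, parasites)
print(f"OLS estimate of pesticide effect: {ols_estimate:+.2f}")
print(f"IV estimate of pesticide effect:  {iv_estimate:+.2f}")
```

Under these assumptions the ordinary least squares estimate is positive, wrongly suggesting that pesticide increases the parasite population, while the instrumental-variable estimate recovers a negative effect close to the value built into the simulation.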

Once the appropriate data have been collected, collated and analysed, what are the appropriate forms in which they should be disseminated? Meaningful dissemination of these data and analytic results could occur in at least two ways. The first is to make the data available, in a collated form, to all interested scientific researchers. This method has recently been adopted by the National Animal Health Monitoring System, which is a livestock health surveillance system operated by the United States Department of Agriculture (USDA-APHIS/VS). The data are provided to interested researchers, primarily at US universities, on magnetic tapes in a SAS format. A second method of data dissemination would be through publications. These could include proceedings of conferences or workshops, refereed journals, books or annual reports of institutions such as ILRAD or FAO.

ANDERSON, R.M. 1976. Some simple models of the population dynamics of eucaryotic parasites. In: Levin, S., ed. *Lecture Notes in Biomathematics*. New York: Springer-Verlag, pp. 16-57.

CARPENTER, T.E., SNIPES, K.P., WALLIS, D. and McCAPES, R. 1988. Epidemiology and financial impact of fowl cholera in turkeys: a retrospective analysis. *Avian Diseases* 32: 16-23.

MARTIN, S.W., MEEK, A.H. and WILLEBERG, P. 1987. *Veterinary Epidemiology: Principles and Methods*. Ames, Iowa: Iowa State University Press, 343 pp.

MOUSING, J., VÅGSHOLM, I., CARPENTER, T.E., GARDNER, I.A. and HIRD, D.W. 1988. Financial impact of transmissible gastroenteritis in pigs. *Journal of the American Veterinary Medical Association* 192: 756-759.

MULLENS, B.A. and LII, K.-S. 1987. Larval population dynamics of *Culicoides variipennis* (Diptera: Ceratopogonidae) in Southern California. *Journal of Medical Entomology* 24: 566-574.

VÅGSHOLM, I. 1989. Repeated measure, a problem in animal health economics. In: *Proceedings of the 5th International Symposium of Veterinary Epidemiology and Economics, Acta Veterinaria Scandinavica Supplement* (Copenhagen, Denmark) 84: 374-376.

VÅGSHOLM, I., CARPENTER, T.E. and HOWITT, R.E. 1991. Simultaneous-equation bias of animal production systems. *Preventive Veterinary Medicine* 11: 37-54.

WORKING, E.J. 1927. What do statistical 'demand curves' show? *Quarterly Journal of Economics* 41: 212-235.