

EVALUATION OF DIAGNOSTIC TESTS: THE EPIDEMIOLOGICAL APPROACH[1] - DANIEL F. FEGAN

BIOTEC, Thailand

The use of diagnostic tests is widespread in studies of disease in aquatic animals. They are often used in isolation, and the results interpreted or applied to populations without sufficient regard to the wider implications. Much effort is devoted to understanding disease processes at the individual animal, organ, cellular and genetic levels, while the complex interplay between individuals, populations and the environment can be forgotten. At the population level, the use of diagnostic tests is further complicated by population effects such as the prevalence of the pathogen, the expression and impact of the disease in the population and the potential for pathogen spread, among others. The inadequacy of the Henle-Koch postulates for animal disease has long been recognised, as they do not deal well with multi-factorial causes of disease or the impact of predisposing factors. As a result, the familiar "epidemiological triad" concept (Host-Pathogen-Environment), illustrated in the famous diagram of Snieszko (1974), was introduced (Figure 1).

Figure 1: The epidemiological triad (Snieszko, 1974)

This neatly illustrates the complex interplay of factors which results in disease at the individual and population levels. The existence of multiple contributing factors to a disease outbreak is summarised in the epidemiological definition of a cause of disease as "an event, condition or characteristic that plays an essential role in producing an occurrence of the disease" (Baldock, 1996). This implies that the presence of a pathogen may not, in itself, be sufficient to cause disease in the absence of other factors, a concept expressed in the statement that a pathogen is a necessary but not sufficient cause of a particular disease. This is classically seen in epizootic ulcerative syndrome (EUS) of fish, where epidermal damage from a stressor such as lowered pH is required before infection with Aphanomyces sp., with its characteristic lesions, can occur.

Application of these concepts requires a different approach to the interpretation of diagnostic test results, particularly where they will be applied to a decision-making process. This paper is intended to briefly introduce the basic concepts of epidemiology as they relate specifically to diagnostic tests. [For a fuller treatment, the reader is referred to the textbooks of Thrusfield (1995) or Pfeiffer (1998).]

In veterinary medicine, a diagnosis is a statement of an animal's state of "normality" and represents an interpretation of one or several observations that form the basis for a decision on further action. The decision is based on a number of factors including factual knowledge, experience and intuition as well as clinical diagnostic tests and it is the correct use of all of these which increases the probability of correct diagnosis (Figure 2). This definition clearly identifies the uncertainty associated with diagnosis and the outcome of a given course of action taken as a result.

Figure 2: Factors influencing veterinary diagnoses (from Pfeiffer, 1998)

This differs somewhat from the classical concept of diagnosis under the Henle-Koch postulates, namely the consistent isolation and identification of a particular aetiological agent associated with a disease.

Definitions

Unfortunately, some of the terms used in veterinary epidemiology are the same as those used in clinical pathology but carry different definitions. The terms "sensitivity" and "specificity" in particular have been the cause of considerable confusion. Some definitions of terms used in veterinary diagnosis are given below.

Accuracy

The accuracy of a test refers to the level of agreement between the test result and the "true" clinical state.

Bias

Bias measures the systematic deviation from the "true" clinical state.

Precision

The degree of fluctuation of a series of measurements around their central value.

Sensitivity

Proportion of animals with the disease which test positive (i.e. proportion of true positives). This equates to the laboratory definition where it means the ability of an analytical method to detect very small amounts of the analyte (such as an antibody or antigen). Thus a test which is highly "sensitive" from a laboratory perspective is also likely to be "sensitive" from an epidemiological perspective.

Specificity

Proportion of animals without the disease which test negative (i.e. proportion of true negatives). This equates to the laboratory definition where it means the ability of the test to react only when the particular analyte is present and not react to the presence of other compounds. Thus a test which is highly "specific" from a laboratory perspective is also likely to be "specific" from an epidemiological perspective.

PPV

(Positive Predictive Value) The probability (or likelihood) that an animal which returns a positive test result actually has the disease in question.

NPV

(Negative Predictive Value) The probability (or likelihood) that an animal which returns a negative test result actually does not have the disease in question.

True prevalence

Proportion of animals in the population which really do have the disease in question regardless of their test result. From a test result point of view, it includes the "true" positives and the "false" negatives.

Apparent Prevalence

Proportion of animals in the population giving a positive test result regardless of their true status for the disease in question. From a test result point of view, it is all the test-positive animals, some of which will be "true" positives and some "false" positives.

Diagnostic testing

Diagnostic tests are more or less objective methods which reduce the uncertainty in diagnosis. They are often interpreted as a dichotomous outcome (normal/abnormal, diseased/healthy, treat/don't treat), which poses little difficulty when the test itself is dichotomous (presence or absence of a pathogen) but can cause considerable difficulty in interpretation when the underlying measurement is continuous (e.g. serum antibody levels or cell counts). In such cases, the selection of an appropriate cut-off point to separate 'positive' and 'negative' results introduces a level of uncertainty. In most diagnostic tests, false positives and false negatives occur. Some of the reasons for positive and negative results in serology, for example, are given in Table 1. Consequently, any diagnostic test which does not directly identify the presence of the infection can only produce an estimate of the apparent prevalence of a disease (i.e. the proportion of animals giving a positive test result), which does not equate to the prevalence of infection. Estimates of true prevalence, however, can be made by taking into account test sensitivity and specificity where these are known.

Positive results
    True positive: actual infection
    False positive: group cross-reaction; non-specific inhibitors; non-specific agglutinins

Negative results
    True negative: absence of infection
    False negative: natural/induced tolerance; improper timing; improper selection of test; non-specific inhibitors; toxic substances; antibiotic-induced immunoglobulin suppression; incomplete or blocking antibody; insensitive tests

Table 1: Reasons for positive and negative results from serological tests (from Stites et al., 1982).
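The trade-off introduced by the choice of cut-off point on a continuous test can be illustrated with a short simulation. The sketch below is purely illustrative (the titre distributions and cut-off values are invented for the example, not drawn from any data in this paper): diseased and healthy animals receive overlapping scores on a hypothetical antibody scale, and raising the cut-off lowers sensitivity while raising specificity.

    import random

    random.seed(1)

    # Hypothetical antibody titres: diseased animals tend to score higher,
    # but the distributions overlap, so any cut-off misclassifies some animals.
    diseased = [random.gauss(60, 15) for _ in range(1000)]
    healthy = [random.gauss(35, 15) for _ in range(1000)]

    for cutoff in (30, 40, 50, 60):
        se = sum(t >= cutoff for t in diseased) / len(diseased)  # true positive proportion
        sp = sum(t < cutoff for t in healthy) / len(healthy)     # true negative proportion
        print(f"cut-off {cutoff}: sensitivity {se:.2f}, specificity {sp:.2f}")

Running this shows sensitivity falling and specificity rising as the cut-off increases: the inverse relationship discussed below.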

Estimates of true prevalence, however, can be made taking into account test sensitivity and specificity:

true prevalence = (apparent prevalence + specificity - 1) / (sensitivity + specificity - 1)
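Expressed as code, this estimate is a one-line function. A minimal sketch (the function name is ours), checked against round figures from the oyster example discussed below (apparent prevalence 0.50, sensitivity 0.97, specificity 0.70):

    def true_prevalence(apparent_prev, se, sp):
        # Rearranges: apparent = true*se + (1 - true)*(1 - sp)
        return (apparent_prev + sp - 1) / (se + sp - 1)

    # Round figures from the oyster (MSX) example below:
    print(true_prevalence(0.50, 0.97, 0.70))  # ~0.30, close to the true prevalence of 29%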

Sensitivity and specificity are indicators of the validity of diagnostic tests (Thrusfield, 1995). When a cut-off point is used, sensitivity and specificity show an inverse relationship: as sensitivity increases, specificity decreases and vice versa. Estimation of sensitivity and specificity requires the testing of animals whose disease status is known, which in turn requires an appropriate unequivocal diagnostic method as a "gold standard". For example, in the case of the protistan oyster pathogen Haplosporidium nelsoni (MSX disease)[2], data on which were presented at the workshop, the histological demonstration of the disease may be used as the estimate of true status (the "gold standard"), and the PCR data evaluated against it by constructing the following simple table.

                True status
                +ve        -ve        Total
Test +ve        a          b          a + b
Test -ve        c          d          c + d
Total           a + c      b + d      n = a + b + c + d

In the table, "a" represents the true positives, "d" the true negatives, and "b" and "c" the false positives and false negatives respectively. The various epidemiological values can then be calculated simply:

Sensitivity = a/(a + c)
Specificity = d/(b + d)
PPV = a/(a + b)
NPV = d/(c + d)
Apparent prevalence = (a + b)/n
True prevalence = (a + c)/n

Substituting the MSX data:

                Histology
                +ve        -ve        Total
PCR +ve         74         55         129
PCR -ve         2          127        129
Total           76         182        258

Using the above formulae, the calculations are:

Sensitivity = 74/76 = 0.97 (97%)
Specificity = 127/182 = 0.70 (70%)
PPV = 74/129 = 0.57 (57%)
NPV = 127/129 = 0.98 (98%)
Apparent prevalence = 129/258 = 0.50 (50%)
True prevalence = 76/258 = 0.29 (29%)
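The same quantities can be computed directly from the table as a check on these figures (a small illustrative script; the variable names are ours):

    # Cells of the 2 x 2 table: a = true +, b = false +, c = false -, d = true -
    a, b, c, d = 74, 55, 2, 127
    n = a + b + c + d

    sensitivity = a / (a + c)        # 74/76   = 0.97
    specificity = d / (b + d)        # 127/182 = 0.70
    ppv = a / (a + b)                # 74/129  = 0.57
    npv = d / (c + d)                # 127/129 = 0.98
    apparent_prev = (a + b) / n      # 129/258 = 0.50
    true_prev = (a + c) / n          # 76/258  = 0.29

    print(f"Se {sensitivity:.2f}, Sp {specificity:.2f}, PPV {ppv:.2f}, NPV {npv:.2f}, "
          f"apparent {apparent_prev:.2f}, true {true_prev:.2f}")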

From the table, it appears that the PCR test has a high sensitivity but only a moderate specificity. In other words, 97% of animals with the disease test positive using PCR (a false negative rate of 3%) but only 70% of animals without the disease test negative with PCR (i.e. a false positive rate of 30%). Therefore, although the test would be useful for screening to reduce the possibility of introducing infected individuals into a population (for which false positives are not a major concern), it would not be sufficient on its own to make a definitive diagnosis of the disease due to the high false positive rate, and would certainly not be appropriate as the basis for a decision on action to be taken.

The selection of the appropriate levels of sensitivity and specificity often depends upon the particular need. When screening for a disease or pathogen (for example, testing animals in order to eliminate infected individuals), we require reliable negative results, with few false negatives, and can accept a reasonable number of false positives (within an economically justifiable level of rejection). This calls for a test with a high sensitivity and reasonable specificity; such a test would be used in a quarantine situation, for example, to reduce the risk of disease introduction, or when demonstrating absence of a disease to establish "disease-free" zones. On the other hand, if we need as few false positives as possible (e.g. to confirm a tentative diagnosis), a test with a high specificity and reasonable sensitivity is used. It is, however, important to note that the consequence of any diagnostic test with imperfect specificity (less than 100%) is that if a large number of tests is made on a single uninfected animal, there is a significant chance of finding a positive result.

Predictive values

For a diagnostic decision, it is also useful to make some estimate of the predictive value of a diagnostic test. The predictive value quantifies the probability that a positive test result for a particular animal or sample correctly identifies the presence of infection and a negative test result correctly identifies the absence of infection. This requires knowledge of not only the sensitivity and specificity of the test but the prevalence of the condition. The effect of prevalence on predictive values is considerable. As prevalence increases, Positive Predictive Value (PPV) increases and Negative Predictive Value (NPV) decreases.

Formulae for calculating predictive values are based on Bayes' theorem of conditional probability (Fleiss, 1981) and are as follows:

PPV = (Se x Prev) / [(Se x Prev) + (1 - Sp) x (1 - Prev)]

NPV = [Sp x (1 - Prev)] / [Sp x (1 - Prev) + (1 - Se) x Prev]

where Se = sensitivity, Sp = specificity and Prev = pre-test probability of disease (or true prevalence).
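In code, the two formulae can be written as follows (an illustrative sketch; the function names are ours). Substituting the oyster figures (Se = 0.97, Sp = 0.70, Prev = 0.29) reproduces the predictive values quoted later:

    def ppv(se, sp, prev):
        # P(diseased | test positive) by Bayes' theorem
        return (se * prev) / (se * prev + (1 - sp) * (1 - prev))

    def npv(se, sp, prev):
        # P(not diseased | test negative) by Bayes' theorem
        return (sp * (1 - prev)) / (sp * (1 - prev) + (1 - se) * prev)

    print(ppv(0.97, 0.70, 0.29))  # ~0.57
    print(npv(0.97, 0.70, 0.29))  # ~0.98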

Predictive values are functions of prevalence and the test characteristics of sensitivity and specificity. As prevalence declines so does positive predictive value. The converse is true for negative predictive value (see Table 2).

If the sensitivity and specificity of a diagnostic test are known for a particular target population, then predictive value graphs can be drawn for the range of all possible pre-test probabilities of disease from 0 to 1 (100%).

Table 2: Effect of prevalence on positive predictive value (PPV) with a hypothetical serological test (Se and Sp = 0.95)

Prevalence (%)   0.1    1      2      5      10     50     90     100
PPV (%)          1.9    16.1   27.9   50.0   67.8   95.0   99.4   100

The important point illustrated by this table is that, despite using a good test (Se and Sp = 0.95), most reactors are non-infected (false positives) when the disease is present in the population at a low prevalence. Of the two test properties, it can be shown that specificity exerts a greater influence on PPV than does sensitivity; conversely, sensitivity exerts the greater influence on negative predictive value (NPV).
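The dominance of specificity at low prevalence is easy to verify numerically. A minimal sketch (reusing the Bayes formula for PPV given earlier) which perturbs Se and Sp in turn at 1% prevalence:

    def ppv(se, sp, prev):
        # P(diseased | test positive) by Bayes' theorem
        return (se * prev) / (se * prev + (1 - sp) * (1 - prev))

    prev = 0.01
    print(ppv(0.95, 0.95, prev))  # baseline: ~0.161, as in Table 2
    print(ppv(0.99, 0.95, prev))  # better Se: ~0.167 (small gain)
    print(ppv(0.95, 0.99, prev))  # better Sp: ~0.490 (large gain)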

Again using the data for oysters, the PCR test has a good negative predictive value but poor positive predictive value (Proportion of PCR -ve animals which do not have disease = 98%; Proportion of PCR +ve animals which have the disease = 57%) where the true prevalence is 29%. In other words, the test is a poor predictor of disease occurrence and would be of limited use in confirming the existence of suspect disease. As a rule of thumb, highly specific tests should be used to confirm tentative diagnoses while highly sensitive tests should be used to rule out possible disease.

Finally, the impact of the test on estimated prevalence is clearly seen. Because of the low specificity and PPV, the prevalence of infection is overestimated considerably (apparent prevalence is 50% compared with the true prevalence of 29%). This would diminish the usefulness of PCR as a diagnostic tool in this particular case.

The PPV of a particular test can be improved by appropriate selection strategies (Baldock, 1996):

1. Testing of "high risk" groups (animals with clinical signs rather than normal animals)

2. Using a higher cut-off value for the same test (giving a higher specificity), or using a second test with a higher specificity

3. Use of multiple tests for interpretation of results.

Population level test interpretation

When testing a group of animals (such as a tank of shrimp postlarvae or a pond of fish) for disease, rather than an individual, some additional factors have to be taken into account. In addition to the sensitivity and specificity of the test, the number of animals from the group which are tested (the sample size), the true prevalence and the number of positives required to classify the group as infected are all important.

At the group level we require high sensitivity and high specificity in a test, the same as for individual level tests. It is important to note, however, that individual and group level test characteristics are not equivalent. At a group level, sensitivity and specificity are influenced by sample size (as the sample size increases, so does sensitivity) and the number of positives required (as the number of positives required increases, there is a corresponding increase in specificity). Again, as with individual level tests, sensitivity and specificity are inversely related.
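One common way to formalise these relationships is to treat the number of test-positive animals in a sample as binomially distributed. The sketch below rests on that assumption (independent sampling from a large group; the function and variable names are ours) and shows group-level sensitivity rising with sample size while group-level specificity falls, and specificity recovering as the number of positives required increases:

    from math import comb

    def p_at_least(k, n, p):
        # P(X >= k) for X ~ Binomial(n, p)
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    se, sp, prev = 0.95, 0.95, 0.10                      # per-animal test, within-group prevalence
    p_pos_infected = prev * se + (1 - prev) * (1 - sp)   # P(test +) per animal, infected group
    p_pos_clean = 1 - sp                                 # P(test +) per animal, uninfected group

    for n in (10, 30, 60):            # animals sampled from the group
        for cutoff in (1, 2):         # positives needed to call the group infected
            group_se = p_at_least(cutoff, n, p_pos_infected)
            group_sp = 1 - p_at_least(cutoff, n, p_pos_clean)
            print(f"n={n}, cut-off={cutoff}: group Se {group_se:.2f}, group Sp {group_sp:.2f}")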

It should be noted that even relatively good tests with high sensitivity and specificity will have a low predictive value at low levels of prevalence. For example, if a test with a sensitivity of 99% and a specificity of 99.9% were used at a high prevalence, say 10%, a single test conducted on 10 million animals would give 9,000 false positives and 990,000 true positives. On the other hand, if the prevalence were 0.1%, the same test would give 9,900 true positives and 9,990 false positives. This has important implications for eradication campaigns, quarantine screening and other situations where prevalence may change with time.
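The arithmetic behind these figures can be checked directly (an illustrative script; the variable names are ours):

    se, sp, n = 0.99, 0.999, 10_000_000

    for prev in (0.10, 0.001):          # 10% and 0.1% prevalence
        infected = n * prev
        clean = n - infected
        tp = se * infected              # true positives
        fp = (1 - sp) * clean           # false positives
        print(f"prevalence {prev:.1%}: {tp:,.0f} true positives, {fp:,.0f} false positives")

At 0.1% prevalence the false positives slightly outnumber the true positives, so the PPV falls to about 50% despite the excellent test characteristics.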

Evaluation of diagnostic techniques

As previously explained, evaluation of diagnostic techniques requires some independent, valid measure of the true condition of the animal (the 'gold standard'). The 'gold standard' may be a single unequivocal test (histological or post-mortem demonstration of the disease, for example) or a combination of alternative tests which, when simultaneously positive, identify animals which are true positives. The assessment or comparison of diagnostic tests requires their application, with the 'gold standard', to a sample of animals with a typical disease spectrum. The characteristics of the test are compared with the gold standard in terms of their sensitivity and specificity (see definitions).

Frequently, however, no 'gold standard' exists for a particular condition and it is necessary to evaluate the diagnosis by the level of agreement between different tests. This assumes that agreement between tests is evidence of validity, whereas disagreement suggests that the tests are not reliable. The kappa test can be used to measure the level of agreement beyond that which may be obtained by chance. The kappa statistic lies in the range -1 to +1.

The kappa test uses the same table as for calculation of epidemiological values with the observed agreement given by the formula: OA = (a + d)/(a + b + c + d).

This is compared to the expected agreement which would be obtained by chance which is given by the formula: EA = [{(a + b)/n} x {(a + c)/n}] + [{(c + d)/n} x {(b + d)/n}]

Kappa is the agreement in excess of that expected by chance, divided by the maximum possible excess:

kappa = (OA - EA)/(1 - EA)

The kappa values are evaluated according to arbitrary "benchmarks" as shown in Table 3.

Kappa value     Evaluation
> 0.81          Almost perfect agreement
0.61 - 0.80     Substantial agreement
0.41 - 0.60     Moderate agreement
0.21 - 0.40     Fair agreement
0.01 - 0.20     Slight agreement
0.00            Poor agreement

Table 3: Evaluation of the kappa statistic (Everitt, 1989)[3]

For example, again using the oyster data given previously:

OA = (74 + 127)/258 = 0.779

EA = [(129/258) x (76/258)] + [(129/258) x (182/258)]
   = (0.500 x 0.295) + (0.500 x 0.705)
   = 0.1475 + 0.3525
   = 0.500

The maximum possible agreement beyond chance = 1 - 0.500 = 0.500

k = (0.779 - 0.500)/0.500 = 0.279/0.500 = 0.558, indicating moderate agreement between the two tests.
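The calculation above can be wrapped in a short function for reuse (a sketch; the function name is ours):

    def kappa(a, b, c, d):
        n = a + b + c + d
        oa = (a + d) / n                                     # observed agreement
        ea = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # agreement expected by chance
        return (oa - ea) / (1 - ea)

    print(round(kappa(74, 55, 2, 127), 3))  # 0.558, moderate agreement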

It should be noted that the kappa value gives no indication of which test is better, and that a good agreement may mean that both tests are equally good or equally bad.

Another important characteristic of a test is its repeatability: the consistency of the test results over two or more replicates on the same animal. For a test whose outcome is either positive or negative, the level of agreement between replicates gives an indication of the reliability of the test result. The statistical tests used are outside the scope of this paper and can be found in Thrusfield (1995) or standard statistical texts; briefly, if the test is repeated twice, McNemar's chi-square test for related samples can be used, and for three or more replicates, Cochran's Q-test. If the proportions of positive and negative results differ significantly between replicates, the repeatability of the test may be low.
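For two replicates, McNemar's test uses only the discordant pairs (animals positive in one replicate and negative in the other). A minimal sketch with the usual continuity correction; the example counts are invented for illustration:

    def mcnemar_chi2(b, c):
        # b, c = counts of the two discordant outcomes (+/- and -/+) across
        # paired replicates; continuity-corrected statistic, compared against
        # chi-square with 1 degree of freedom (3.84 at p = 0.05)
        return (abs(b - c) - 1) ** 2 / (b + c)

    print(mcnemar_chi2(12, 3))  # 4.27 > 3.84: repeatability is questionable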

Selection of diagnostic tests

The selection of an appropriate diagnostic test depends upon the intended use of the results. If the intention is to rule out a disease, reliable negative results are required for which a test with high sensitivity (i.e. few false negatives) is used. If it is desired to confirm a diagnosis or find evidence of disease (i.e. to "rule in" the disease) we require a test with reliable positive results (i.e. high specificity). As a general rule of thumb, a test with at least 95% sensitivity and 75% specificity should be used to rule out a disease and one with at least 95% specificity and 75% sensitivity used to rule in a disease (Pfeiffer, 1998).

Conclusions

The interpretation of diagnostic tests depends upon the definition of clinical disease and its distinction from the presence of the pathogen. It is the case in most disease outbreaks that the presence of the pathogen is a necessary but not sufficient cause of disease. This is because there are often other factors involved in the expression of the disease condition, an important consideration when making a diagnosis for a population in which a decision has to be made. In studying disease outbreaks, especially in populations, we need to look at them from both a pathological and epidemiological standpoint. Ideally, a diagnostic test can be evaluated based on a clear relationship with an unequivocal "gold standard" diagnosis.

The relationship between the analytical sensitivity of a method and its epidemiological sensitivity at the population level can change as prevalence changes, as sample size increases, and depending upon the number of positive reactions we accept as sufficient on which to base a diagnosis. Highly sensitive (in the analytical sense) methods such as PCR may pick up early stages of a disease condition, and this will often manifest itself as a change in the number of apparent false positives over time. Thus a simplified interpretation of data taken at one point in time may represent only a snapshot; as data accumulate, it should be possible to establish a more accurate picture.

Pathologists and researchers involved in lab-based diagnostic work should consider the epidemiological approach required if such results are to be extrapolated to populations. The use of epidemiological methods in the planning and analysis of diagnosis, or better still, a greater co-operation between pathologists and epidemiologists, will assist greatly in the development and interpretation of better diagnostic tests.

References

Baldock, C. (1996). Course notes from the Australian Centre for International Agricultural Research Workshop on "Epidemiology in Tropical Aquaculture", Bangkok, 1-12 July, 1996.

Pfeiffer, D. (1998). Veterinary Epidemiology: An Introduction. Institute of Veterinary, Animal and Biomedical Sciences, Massey University, Palmerston North, New Zealand.

Snieszko, S.F. (1974). The effects of environmental stress on outbreaks of infectious diseases of fishes. Journal of Fish Biology 6, 197-208.

Stites, D.P., Stobo, J.D., Fudenberg, H.H. and Wells, J.V. (1982). Basic and Clinical Immunology, 4th Edition. Lange Medical Publications, Los Altos, USA.

Thrusfield, M. (1995). Veterinary Epidemiology, 2nd Edition. Blackwell Science Ltd., Oxford, UK.


[1] This paper draws heavily on the information on diagnostic testing in Chapter 17 of Thrusfield (1995). This should be referred to for a fuller explanation than is possible here.
[2] Data used with kind permission of Dr. Eugene Burreson, Virginia Institute of Marine Science.
[3] Note that these interpretations are relatively arbitrary and that other authors may use different values for the levels of agreement.
