# APPENDIX I: THE NEGATIVE BINOMIAL DISTRIBUTION

THEORETICAL JUSTIFICATION OF THE NEGATIVE BINOMIAL DISTRIBUTION

We have derived the Poisson Distribution from the Binomial Distribution, and the necessary condition for the Binomial Distribution to hold is that the probability, p, of an event E shall remain constant for all occurrences of its context-events. Thus, this condition must also hold for the Poisson Distribution.

If, however, it is known that p is not constant in its context-events, another distribution known as the Negative Binomial Distribution (N.B.D.) may provide an even closer “fit”.

Suppose we have a Binomial Distribution for which the variance $V(x) = s^2 = npq$ is greater than the mean $m = np$.

In such a case the following relations hold:

(i) $npq > np$, so that $q > 1$,

and (ii) since $p + q = 1$, $p$ must be negative, i.e. $p = 1 - q < 0$. But $np$ being positive, $n$ must be negative also (writing $n = -k$).

The trouble about this type of distribution lies in the interpretation, for we have defined probability in such a way that its measure must always be a number lying between 0 and 1 and so, essentially positive. Again, since n(= -k) is the number of context-events how can it possibly be negative?

It is often found that observed frequency distributions are well represented by Negative Binomial Distributions. This is theoretically justified when the variance of the frequency distribution is greater than its mean.

This often arises when the probability of an event E does not remain constant for all occurrences of its context-events.¹

¹ The concentration of units varies between different parts of the population (the units are non-randomly distributed throughout the whole population).

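From (ii) above we have $m = np = -kp$ and $q = 1 - p$, where substituting we get

$$p = -\frac{m}{k}, \qquad q = 1 + \frac{m}{k}.$$

The parameters of the distribution are the arithmetic mean (m) and the exponent k.

Since the variance of the population is $s^2 = npq = mq$, substituting we get

$$s^2 = m\left(1 + \frac{m}{k}\right) = m + \frac{m^2}{k} \qquad \text{(iii)}$$

The probability series of the N.B.D. is given by the expansion of $(q + p)^{-k}$, with $p$ and $q$ as above. The individual terms of $(q + p)^{-k}$ are given by

$$P(x) = \frac{k(k+1)\cdots(k+x-1)}{x!}\left(\frac{m}{m+k}\right)^{x}\left(1 + \frac{m}{k}\right)^{-k}, \qquad x = 0, 1, 2, \ldots$$

where $\frac{m}{m+k} = -\frac{p}{q}$. By using the recurrence formula

$$P(x) = P(x-1)\,\frac{k+x-1}{x}\cdot\frac{m}{m+k}$$

the individual terms of the series are

$$P(0) = q^{-k} = \left(1 + \frac{m}{k}\right)^{-k}, \qquad P(1) = k\,\frac{m}{m+k}\,P(0), \qquad P(2) = \frac{k+1}{2}\cdot\frac{m}{m+k}\,P(1),$$

and so on.

Note that k is no longer the maximum possible number of individuals a sampling unit could contain, but is related to the spatial distribution of the surveyed population (k is a measure of the heterogeneity of the distribution). Unlike the positive Binomial, k is not necessarily an integer in the Negative Binomial Distribution.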
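The recurrence described above can be sketched in Python. This is a minimal illustration, not part of the original appendix: it assumes the terms take the form $P(0) = (1 + m/k)^{-k}$ and $P(x) = P(x-1)\,\frac{k+x-1}{x}\cdot\frac{m}{m+k}$, and the function names `nbd_probabilities` and `nbd_term` are hypothetical. The closed-form version uses the log-gamma function so that k does not have to be an integer.

```python
import math


def nbd_probabilities(m, k, x_max):
    """Return [P(0), ..., P(x_max)] by the recurrence, for mean m and exponent k."""
    if m <= 0 or k <= 0:
        raise ValueError("m and k must be positive")
    ratio = m / (m + k)                 # equals -p/q in the appendix's notation
    probs = [(1.0 + m / k) ** (-k)]     # P(0) = q^(-k)
    for x in range(1, x_max + 1):
        # P(x) = P(x-1) * (k + x - 1) / x * m / (m + k)
        probs.append(probs[-1] * (k + x - 1) / x * ratio)
    return probs


def nbd_term(m, k, x):
    """Closed-form term P(x); log-gamma allows a non-integer k."""
    log_coeff = math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
    log_p = log_coeff + x * math.log(m / (m + k)) - k * math.log(1.0 + m / k)
    return math.exp(log_p)


if __name__ == "__main__":
    # Illustrative values only: a mean of 0.68 and an exponent of 3.58.
    for x, p in enumerate(nbd_probabilities(0.68, 3.58, 5)):
        print(x, round(p, 4), round(nbd_term(0.68, 3.58, x), 4))
```

Both functions should agree to within floating-point error, and the probabilities should sum to very nearly 1 when `x_max` is large.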

From (iii) above we have

$$\frac{1}{k} = \frac{s^2 - m}{m^2}$$

The above formula indicates that the reciprocal of the exponent k, i.e. $1/k$, is a measure of the excess of variance or clumping of the individuals in the population. Specifically, as $1/k$ approaches zero and k approaches infinity, the distribution converges to the Poisson series ($s^2 \to m$). Conversely, if clumping increases, $1/k$ approaches infinity ($k \to 0$) and the distribution converges to the Logarithmic Series.
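The Poisson limit can be checked in a few lines. The sketch below assumes the terms of the series are written as $P(0) = (1 + m/k)^{-k}$ and $P(x)/P(x-1) = \frac{k+x-1}{x}\cdot\frac{m}{m+k}$, and lets k grow with m held fixed:

```latex
\[
\begin{aligned}
\lim_{k\to\infty} P(0) &= \lim_{k\to\infty}\left(1 + \frac{m}{k}\right)^{-k} = e^{-m},\\
\lim_{k\to\infty} \frac{P(x)}{P(x-1)} &= \lim_{k\to\infty} \frac{k+x-1}{x}\cdot\frac{m}{m+k} = \frac{m}{x},\\
\text{hence}\quad \lim_{k\to\infty} P(x) &= e^{-m}\,\frac{m^{x}}{x!},
\qquad
\lim_{k\to\infty} s^{2} = \lim_{k\to\infty}\left(m + \frac{m^{2}}{k}\right) = m .
\end{aligned}
\]
```

The limiting terms are those of the Poisson series with mean m, for which the variance equals the mean.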

Example:

The Table below gives the number of aquatic invertebrates on the bottom in 400 square units. Fit a Negative Binomial Distribution to the empirical data.

| Number of aquatic invertebrates (x) | 0 | 1 | 2 | 3 | 4 | 5 | Total |
|---|---|---|---|---|---|---|---|
| Frequency (f) | 213 | 128 | 37 | 18 | 3 | 1 | 400 |

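Estimated mean: $m = \frac{\sum fx}{N} = \frac{273}{400} = 0.68$

Estimated variance: $s^2 = 0.81$

Calculated q: $s^2 = mq$, or $0.81 = 0.68q$, and $q = 1.19$

Calculated p: $p = 1 - q = -0.19$, and estimated k: $k = -\frac{m}{p} = \frac{0.68}{0.19} = 3.58$

Estimated probabilities:

Recurrence formula: $P(x) = P(x-1)\,\frac{k+x-1}{x}\cdot\left(-\frac{p}{q}\right)$, where $-\frac{p}{q} = \frac{m}{m+k}$

$P(x=0) = q^{-k} = 1.19^{-3.58} = 0.5365$

Therefore, $P(x=1) \approx 0.3065$, $P(x=2) \approx 0.1120$, $P(x=3) \approx 0.0332$, $P(x=4) \approx 0.0087$ and $P(x=5) \approx 0.0022$.

Estimated theoretical frequencies (N.B.D.):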
$N_{x=0} = 400 \times P(x=0) = 400 \times 0.5365 \approx 214$

$N_{x=1} = 400 \times P(x=1) = 400 \times 0.3065 \approx 123$

$N_{x=2} = 400 \times P(x=2) = 400 \times 0.1120 \approx 45$

$N_{x=3} = 400 \times P(x=3) = 400 \times 0.0332 \approx 13$

$N_{x=4} = 400 \times P(x=4) = 400 \times 0.0087 \approx 4$

$N_{x=5} = 400 \times P(x=5) = 400 \times 0.0022 \approx 1$

Testing goodness of fit:

A problem that arises frequently in statistical work is the testing of comparability of a set of observed (empirical) and theoretical (N.B.D.) frequencies.

To test the hypothesis of goodness of fit of the N.B.D. to the empirical frequency distribution we calculate the value of

$$X^2 = \sum_{i} \frac{(f_i - q_i)^2}{q_i}$$

where

$f_i$ = empirical frequencies
$q_i$ = theoretical frequencies

The estimated $X^2$ value is compared with the tabulated $\chi^2$ value.² The hypothesis is valid if $X^2 < \chi^2$; the hypothesis is discredited if $X^2 > \chi^2$.

² It should be noted that, since the $\chi^2$ curve is an approximation to the discrete $\chi^2$ frequency function, care must be exercised that the $\chi^2$ test is used only when the approximation is good. Experience and theoretical investigations indicate that the approximation is usually satisfactory provided that the frequencies of the class intervals are ≥ 5 and that the number of classes in the frequency distribution is ≥ 5.

The following Table gives the empirical and theoretical frequencies of the previous example and the estimated $X^2$ value.

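Table: $X^2$ test of goodness of fit of the N.B.D. to the spatial distribution of aquatic invertebrates

| Number of aquatic invert. (x) | Empirical frequencies ($f_i$), number of squares | Theoretical frequencies ($q_i$), number of squares | $(f_i - q_i)$ | $(f_i - q_i)^2 / q_i$ | Remarks |
|---|---|---|---|---|---|
| 0 | 213 | 214 | -1 | 0.0047 | |
| 1 | 128 | 123 | +5 | 0.2033 | |
| 2 | 37 | 45 | -8 | 1.4222 | |
| 3 | 18 | 13 | +5 | 1.9231 | |
| 4, 5 | 4 | 5 | -1 | 0.2000 | combined |
| | | | | $X^2 = 3.7533$ | |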

The tabulated $\chi^2_{0.05} = 5.991$ ($\nu$ degrees of freedom, $\nu$ = 5 classes - (2 estimated parameters + 1) = 2).

Since $X^2 < \chi^2_{0.05}$, i.e. 3.7533 < 5.991, the hypothesis of goodness of fit is valid.
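The fitting and testing steps of this example can be retraced with a short Python sketch. It is illustrative only and rests on stated assumptions: the function names are hypothetical, the variance uses the divisor N - 1 (the text does not say which divisor was used), k is obtained from relation (iii) as $m^2/(s^2 - m)$, and the critical value 5.991 is taken from the text rather than computed.

```python
def moment_estimates(counts, freqs):
    """Mean, variance and moment estimate of k from a frequency table."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(counts, freqs)) / n
    # Assumption: sample variance with divisor n - 1.
    var = sum(f * (x - mean) ** 2 for x, f in zip(counts, freqs)) / (n - 1)
    if var <= mean:
        raise ValueError("variance must exceed the mean for an N.B.D. fit")
    k = mean ** 2 / (var - mean)        # from s^2 = m + m^2 / k
    return mean, var, k


def theoretical_frequencies(mean, k, x_max, n):
    """Unrounded N.B.D. frequencies n * P(x) for x = 0..x_max, by recurrence."""
    ratio = mean / (mean + k)
    p = (1.0 + mean / k) ** (-k)        # P(0)
    freqs = [n * p]
    for x in range(1, x_max + 1):
        p *= (k + x - 1) / x * ratio    # P(x) from P(x-1)
        freqs.append(n * p)
    return freqs


def pool_last_two(values):
    """Combine the last two classes, as done for x = 4 and x = 5 in the table."""
    return list(values[:-2]) + [values[-2] + values[-1]]


def chi_square(observed, expected):
    """X^2 = sum of (f_i - q_i)^2 / q_i over the classes."""
    return sum((f - q) ** 2 / q for f, q in zip(observed, expected))


if __name__ == "__main__":
    counts = [0, 1, 2, 3, 4, 5]
    observed = [213, 128, 37, 18, 3, 1]
    critical = 5.991                    # tabulated value quoted in the text

    # 1. X^2 from the rounded theoretical frequencies given in the table.
    table_expected = [214, 123, 45, 13, 4, 1]
    x2_table = chi_square(pool_last_two(observed), pool_last_two(table_expected))
    print("X2 from table frequencies:", round(x2_table, 4))

    # 2. The same test with unrounded estimates and frequencies.
    mean, var, k = moment_estimates(counts, observed)
    expected = theoretical_frequencies(mean, k, 5, sum(observed))
    x2 = chi_square(pool_last_two(observed), pool_last_two(expected))
    print("m, s2, k:", round(mean, 4), round(var, 4), round(k, 2))
    print("X2 from unrounded frequencies:", round(x2, 4), "fit accepted:", x2 < critical)
```

Because the worked example rounds m, $s^2$ and the theoretical frequencies at each step, the unrounded calculation gives a slightly different $X^2$ from the tabulated 3.7533, while the first calculation reproduces the table's value up to rounding.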

Note: A second estimate of k

From (iii) above we have $s^2 = m + \frac{m^2}{k}$, and a second estimate of k is given by

$$k = \frac{m^2}{s^2 - m}$$

In the above example, $k = \frac{0.68^2}{0.81 - 0.68} = 3.56$.

| Original distribution | Special conditions | Estimated parameters | Transformation |
|---|---|---|---|
| 1. Poisson | 1. No counts less than 10 | | Replace x by … |
| | 2. Some counts less than 10 | | Replace x by … |
| 2. Negative Binomial | 1. k greater than 5 | - | Replace x by … |
| | 2. k between 2 and 5 | - | Replace x by y = log(x + k/2) |
| | 3. No zero counts | | Replace x by y = log x |
| | 4. Some zero counts | | Replace x by y = log(x + 1) |