

APPENDIX I: THE NEGATIVE BINOMIAL DISTRIBUTION

THEORETICAL JUSTIFICATION OF THE NEGATIVE BINOMIAL DISTRIBUTION

We have derived the Poisson Distribution from the Binomial Distribution, and the necessary condition for the Binomial Distribution to hold is that the probability, p, of an event E shall remain constant for all occurrences of its context-events. Thus, this condition must also hold for the Poisson Distribution.

If, however, it is known that p is not constant in its context-events, another distribution known as the Negative Binomial Distribution (N.B.D.) may provide an even closer “fit”.

Suppose we have a Binomial Distribution for which the variance V(x) = s² = npq is greater than the mean m = np.

In such a case the following inequalities hold:

(i) npq > np, i.e. q > 1,

and

(ii) since p + q = 1, p = 1 − q must be negative, i.e. p < 0.

But np = m being positive, n must be negative also (writing n = −k).

The difficulty with this type of distribution lies in its interpretation, for we have defined probability in such a way that its measure must always be a number lying between 0 and 1, and so is essentially positive. Again, since n (= −k) is the number of context-events, how can it possibly be negative?
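For concreteness, a short numerical illustration (not in the original text, using the figures from the worked example later in this appendix) of how a variance larger than the mean forces p and n to be negative, in LaTeX notation:

% With s^2 = 0.81 and m = 0.68 (see the worked example below):
\[
  q = \frac{s^2}{m} = \frac{0.81}{0.68} = 1.19 > 1,
  \qquad
  p = 1 - q = -0.19,
  \qquad
  n = \frac{m}{p} = \frac{0.68}{-0.19} \approx -3.58 = -k .
\]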

It is often found that observed frequency distributions are well represented by Negative Binomial Distributions. This is theoretically justified whenever the variance of the frequency distribution is greater than its mean.

This often arises when the probability of an event E does not remain constant for all occurrences of its context-events.1

1 That is, the concentration of units varies between different parts of the population (the units are not randomly distributed throughout the whole population).
From (ii) above we have

n = −k and m = np = −kp,

so that

p = −m/k and q = 1 − p = 1 + m/k;

substituting we get

(q + p)^n = [(1 + m/k) − (m/k)]^(−k).

The parameters of the distribution are the arithmetic mean (m) and the exponent k.

Since the variance of the population is

s² = npq,

substituting

n = −k, p = −m/k and q = 1 + m/k,

we get

(iii) s² = m(1 + m/k) = m + m²/k.

The probability series of the N.B.D. is given by the expansion of

(q − p)^(−k), where p = m/k and q = 1 + m/k (so that q − p = 1).

The individual terms of the expansion are given by

P(x) = [k(k + 1)(k + 2) … (k + x − 1)/x!] × p^x × q^(−(k + x)), x = 0, 1, 2, …

By using the recurrence formula the individual terms of the series are

P(0) = q^(−k)

and

P(x) = P(x − 1) × [(k + x − 1)/x] × (p/q), x = 1, 2, 3, …
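As an illustration (not part of the original text), a minimal Python sketch of the recurrence above; the function name and arguments are arbitrary.

# Probabilities of the N.B.D. from the recurrence
#   P(0) = q^(-k),  P(x) = P(x-1) * ((k + x - 1)/x) * (p/q),
# with p = m/k and q = 1 + m/k (mean m, exponent k).
def nbd_probabilities(m, k, x_max):
    p = m / k
    q = 1.0 + p
    probs = [q ** (-k)]                                   # P(0)
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * (k + x - 1) / x * (p / q))
    return probs

# With the parameters estimated in the worked example below (m = 0.68, k = 3.58)
# this gives approximately 0.537, 0.307, 0.112, 0.033, 0.009, 0.002 for x = 0, ..., 5.
print([round(P, 3) for P in nbd_probabilities(0.68, 3.58, 5)])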

Note that k is no longer the maximum possible number of individuals a sampling unit could contain, but is related to the spatial distribution of the surveyed population (k is a measure of the heterogeneity of the distribution). Unlike n in the positive Binomial, k is not necessarily an integer in the Negative Binomial Distribution.

From (iii) above we have

1/k = (s² − m)/m².

This formula indicates that the reciprocal of the exponent k, i.e. 1/k, is a measure of the excess variance, or clumping, of the individuals in the population. Specifically, as 1/k approaches zero and k approaches infinity, the distribution converges to the Poisson series (s² → m). Conversely, as clumping increases, 1/k approaches infinity (k → 0) and the distribution converges to the Logarithmic Series.
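A brief sketch of the Poisson limit mentioned above (added here for illustration, in LaTeX notation):

% Zero term of the N.B.D. as k -> infinity:
\[
  P(0) = q^{-k} = \left(1 + \frac{m}{k}\right)^{-k} \;\longrightarrow\; e^{-m}
  \qquad (k \to \infty),
\]
% and, for fixed x, the general term likewise tends to the Poisson term:
\[
  P(x) = \frac{k(k+1)\cdots(k+x-1)}{x!}
         \left(\frac{m}{m+k}\right)^{x}\left(\frac{k}{m+k}\right)^{k}
  \;\longrightarrow\; \frac{m^{x}e^{-m}}{x!} .
\]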

Example:

The Table below gives the numbers of aquatic invertebrates counted on the bottom in 400 square sampling units. Fit a Negative Binomial Distribution to the empirical data.

Number of aquatic invertebrates (x):     0     1     2     3     4     5   Total
Frequency (f):                         213   128    37    18     3     1     400


Estimated mean:

m = Σfx/Σf = 273/400 = 0.68

Estimated variance:

s² = [Σfx² − (Σfx)²/Σf]/(Σf − 1) = (511 − 186.32)/399 = 0.81

Calculated q:

s² = mq, or

0.81 = 0.68q, and q = 1.19

Calculated p:

p = q − 1 = 1.19 − 1 = 0.19 (and q − p = 1)

Estimated k:

k = m/p = 0.68/0.19 = 3.58
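A minimal Python sketch (not part of the original text) of the moment estimates above, computed directly from the observed frequency table:

# Observed frequency distribution (x = number of invertebrates per square).
counts      = [0, 1, 2, 3, 4, 5]
frequencies = [213, 128, 37, 18, 3, 1]

N  = sum(frequencies)                                    # 400 sampling units
fx = sum(f * x for f, x in zip(frequencies, counts))     # 273 individuals in total

m  = fx / N                                              # mean: approx. 0.6825 (0.68)
s2 = (sum(f * x * x for f, x in zip(frequencies, counts)) - fx * fx / N) / (N - 1)
                                                         # variance: approx. 0.8137 (0.81)
q = s2 / m                                               # approx. 1.19
p = q - 1                                                # approx. 0.19
k = m / p                                                # approx. 3.55 at full precision; rounding
                                                         # m and s2 to 0.68 and 0.81, as in the text,
                                                         # gives k = 0.68/0.19 = 3.58
print(round(m, 4), round(s2, 4), round(k, 2))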

Estimated probabilities:

Recurrence formula:

P(x = 0) = q^(−k) = 1.19^(−3.58) = 0.5365

P(x) = P(x − 1) × [(k + x − 1)/x] × (p/q), with p/q = 0.19/1.19

Therefore,

P(x = 1) = 0.3065, P(x = 2) = 0.1120, P(x = 3) = 0.0332, P(x = 4) = 0.0087, P(x = 5) = 0.0022

Estimated theoretical frequencies (N.B.D.):
Nx=0 = 400 × P(x=0) = 400 × 0.5365 = 214

Nx=1 = 400 × P(x=1) = 400 × 0.3065 = 123

Nx=2 = 400 × P(x=2) = 400 × 0.1120 = 45

Nx=3 = 400 × P(x=3) = 400 × 0.0332 = 13

Nx=4 = 400 × P(x=4) = 400 × 0.0087 = 4

Nx=5 = 400 × P(x=5) = 400 × 0.0022 = 1

Testing goodness of fit:

A problem that arises frequently in statistical work is testing the compatibility of a set of observed (empirical) frequencies with the corresponding theoretical (N.B.D.) frequencies.

To test the hypothesis of goodness of fit of the N.B.D. to the empirical frequency distribution we calculate the value of

X² = Σ (fi − qi)²/qi

where

fi = empirical frequencies
qi = theoretical frequencies

The estimated X² value is compared2 with the tabulated χ² value (χ²0.05 for the appropriate degrees of freedom). The hypothesis is valid if X² < χ²0.05; the hypothesis is discredited if X² > χ²0.05.
2 It should be noted that, since the χ² curve is an approximation to the discrete sampling distribution of X², care must be exercised that the χ² test is used only when the approximation is good. Experience and theoretical investigations indicate that the approximation is usually satisfactory provided that the frequencies in the class intervals are ≥ 5 and that the number of classes in the frequency distribution is ≥ 5.
The following Table gives the empirical and theoretical frequencies of the previous example and the estimated X² value.

Table: X² test of goodness of fit of the N.B.D. to the spatial distribution of aquatic invertebrates
(the empirical and theoretical frequencies are numbers of squares)

Number of aquatic    Empirical           Theoretical        (fi − qi)   (fi − qi)²/qi   Remarks
invert. (x)          frequencies (fi)    frequencies (qi)
-----------------------------------------------------------------------------------------------
0                    213                 214                   −1          0.0047
1                    128                 123                   +5          0.2033
2                     37                  45                   −8          1.4222
3                     18                  13                   +5          1.9231
4, 5                   4                   5                   −1          0.2000       combined
-----------------------------------------------------------------------------------------------
                                                                        X² = 3.7533



The tabulated value is χ²0.05 = 5.991 for ν = 2 degrees of freedom

(ν = degrees of freedom = 5 classes − (2 estimated parameters + 1) = 2).

Since 3.7533 < 5.991, the hypothesis of goodness of fit is valid.
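The calculation above can be checked with a short Python sketch (not part of the original text); scipy is assumed to be available for the tabulated χ² value.

from scipy.stats import chi2

observed = [213, 128, 37, 18, 4]     # classes x = 0, 1, 2, 3 and "4 and 5 combined"
expected = [214, 123, 45, 13, 5]     # theoretical N.B.D. frequencies from the Table above

x2 = sum((f - q) ** 2 / q for f, q in zip(observed, expected))

df = len(observed) - (2 + 1)         # 5 classes - (2 estimated parameters + 1) = 2
critical = chi2.ppf(0.95, df)        # tabulated chi-square(0.05, 2 d.f.) = 5.991

# x2 is approx. 3.7532 (the Table's 3.7533 sums the rounded column entries);
# since x2 < critical, the fit is accepted.
print(round(x2, 4), round(critical, 3))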

Note: A second estimate of k

From (iii) above we have

s² − m = m²/k, and 1/k = (s² − m)/m².

A second estimate of k is therefore given by

k = m²/(s² − m).

In the above example,

k = (0.68)²/(0.81 − 0.68) = 0.4624/0.13 = 3.56,

in agreement, apart from rounding, with the previous estimate.

(See also Appendix II)

Transformations

Analysis of variance, correlation analysis, hypothesis testing and other statistical methods of analysis associated with the normal distribution are performed on the transformed counts (see the Table below).

Transformations

Original distribution     Special conditions              Estimated parameters    Transformation
-------------------------------------------------------------------------------------------------
1. Poisson                1. No counts less than 10                               Replace x by …
                          2. Some counts less than 10                             Replace x by …
2. Negative Binomial      1. k greater than 5             -                       Replace x by …
                          2. k between 2 and 5            -                       Replace x by y = log(x + k/2)
                          3. No zero counts                                       Replace x by y = log x
                          4. Some zero counts                                     Replace x by y = log(x + 1)


When the statistical analyses are complete, the arithmetic mean of the transformed counts has to be transformed back to the original scale and thus becomes a derived mean.

As the derived mean is smaller than the arithmetic mean of the original counts before transformation, it is not comparable with the arithmetic mean obtained by direct averaging. Therefore small adjustments have to be made to the derived means (see Section 10.4.4).
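A minimal Python sketch (not part of the original text) of applying one of the transformations in the Table and back-transforming the mean. The logarithmic form is taken from the Table; the use of base-10 logarithms and the particular counts are assumptions made here for illustration.

import math

counts = [0, 1, 2, 3, 4, 5]          # hypothetical counts from one group of sampling units
k = 3.58                             # exponent estimated earlier for the N.B.D.

# Negative Binomial with k between 2 and 5: replace x by y = log(x + k/2).
y = [math.log10(x + k / 2) for x in counts]

mean_y = sum(y) / len(y)             # arithmetic mean on the transformed scale
derived_mean = 10 ** mean_y - k / 2  # back-transformed ("derived") mean
arithmetic_mean = sum(counts) / len(counts)

# The derived mean (approx. 2.1) is smaller than the arithmetic mean of the raw
# counts (2.5), which is why the adjustment mentioned above is needed.
print(round(derived_mean, 2), round(arithmetic_mean, 2))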

