1. STATISTICAL METHOD IN SCIENTIFIC RESEARCH

As in any other branch of science, forestry research is based on the scientific method, popularly known as the inductive-deductive approach. The scientific method entails the formulation of hypotheses from observed facts, followed by deduction and verification, repeated in a cyclical process. Facts are observations which are taken to be true. A hypothesis is a tentative conjecture regarding the phenomenon under consideration. Deductions are made from the hypotheses through logical arguments and are in turn verified through objective methods. The process of verification may lead to further hypotheses, deductions and verification in a long chain, in the course of which scientific theories, principles and laws emerge.

As an illustration, one may observe that trees on the borders of a plantation grow better than trees inside. A tentative hypothesis that could be formed from this fact is that the better growth of trees on the periphery is due to increased availability of light from the open sides. One may then deduce that by varying the spacing between trees, and thereby controlling the availability of light, the trees can be made to grow differently. This would lead to a spacing experiment wherein trees are planted at different espacements and their growth is observed. One may then observe that trees under the same espacement vary in their growth, and a second hypothesis would be that variation in soil fertility is the cause. Accordingly, a spacing cum fertilizer trial may follow. The further observation that trees under the same espacement and receiving the same fertilizer dose still differ in their growth may prompt the researcher to conduct a spacing cum fertilizer cum varietal trial. At the end of a series of experiments, one may realize that the law of limiting factors operates in such cases, which states that crop growth is constrained by the most limiting factor in the environment.

The two main features of the scientific method are its repeatability and objectivity. Although these are rigorously achieved in the case of many physical processes, biological phenomena are characterised by variation and uncertainty. Experiments repeated under similar conditions need not yield identical results, being subject to fluctuations of a random nature. Also, observing the complete set of individuals in a population is often out of the question, and inference frequently has to be made from a sample set of observations. The science of statistics is helpful in objectively selecting a sample, in making valid generalisations from the sample set of observations and in quantifying the degree of uncertainty in the conclusions made.

Two major practical aspects of scientific investigations are the collection of data and the interpretation of the collected data. The data may be generated through a sample survey on a naturally existing population or a designed experiment on a hypothetical population. The collected data are condensed and useful information is extracted through techniques of statistical inference. Apart from these, a method of considerable importance to forestry, which has gained wider acceptance in recent times with the advent of computers, is simulation. This is particularly useful in forestry because simulation techniques can replace large scale field experiments, which are extremely costly and time consuming. Mathematical models are developed which capture most of the relevant features of the system under consideration, after which experiments are conducted on a computer rather than with real life systems. A few additional features of these three approaches, viz., survey, experiment and simulation, are discussed here before describing the details of the techniques involved in later chapters.

In a broad sense, all in situ studies involving non-interfering observations on nature can be classed as surveys. These may be undertaken for a variety of reasons, such as estimation of population parameters, comparison of different populations, study of the distribution pattern of organisms or finding out the interrelations among several variables. Relationships observed in such studies are often not causal but will have predictive value. Studies in sciences like economics, ecology and wildlife biology generally belong to this category. The statistical theory of surveys relies on random sampling, which assigns a known probability of selection to each sampling unit in the population.

Experiments serve to test hypotheses under controlled conditions. Experiments in forestry are held in forests, nurseries and laboratories with pre-identified treatments on well defined experimental units. The basic principles of experimentation are randomization, replication and local control which are the prerequisites for obtaining a valid estimate of error and for reducing its magnitude. Random allocation of the experimental units to the different treatments ensures objectivity, replication of the observations increases the reliability of the conclusions and the principle of local control reduces the effect of extraneous factors on the treatment comparison. Silvicultural trials in plantations and nurseries and laboratory trials are typical examples of experiments in forestry.

Experimenting on the state of a system with a model over time is termed simulation. A system can be formally defined as a set of elements, also called components. A set of trees in a forest stand, or producers and consumers in an economic system, are examples of components. The elements (components) have certain characteristics or attributes, and these attributes have numerical or logical values. Relationships exist among the elements, and consequently the elements interact. The state of a system is determined by the numerical or logical values of the attributes of the system elements. The interrelations among the elements of a system are expressible through mathematical equations, and thus the state of the system under alternative conditions is predictable through mathematical models. Simulation amounts to tracing the time path of a system under alternative conditions.

While surveys, experiments and simulations are essential elements of any scientific research programme, they need to be embedded in some larger and more strategic framework if the programme as a whole is to be both efficient and effective. Increasingly, it has come to be recognized that systems analysis provides such a framework, designed to help decision makers choose a desirable course of action or predict the outcome of one or more courses of action that seem desirable. A more formal definition of systems analysis is the orderly and logical organisation of data and information into models, followed by the rigorous testing and exploration of these models necessary for their validation and improvement (Jeffers, 1978).

Research related to forests extends from the molecular level to the whole biosphere. The nature of the material dealt with largely determines the methods employed for making investigations. Many levels of organization in the natural hierarchy, such as micro-organisms or trees, are amenable to experimentation, but only passive observation and modelling are possible at certain other levels. Regardless of the objects dealt with, the logical framework of the scientific approach and of statistical inference remains the same. This manual essentially deals with various statistical methods used for objectively collecting data and making valid inferences from them.

2. BASIC STATISTICS

2.1. Concept of probability

The concept of probability is central to the science of statistics. As a subjective notion, probability can be interpreted as a degree of belief, in a continuous range between impossibility and certainty, about the occurrence of an event. Roughly speaking, the value p given by a person for the probability P(E) of an event E means the price that person is willing to pay for winning a fixed amount of money conditional on the event materializing. If the price the person is willing to pay is x units for winning y units of money, then the probability assigned is indicated by P(E) = x / (x + y). More objective measures of probability are based on equally likely outcomes and on relative frequency, both of which are described below. A rigorous axiomatic definition of probability is also available in statistical theory but is not dealt with here.

Classical definition of probability : Suppose an event E can happen in x ways out of a total of n possible equally likely ways. Then the probability of occurrence of the event (called its success) is denoted by

$p = P(E) = \frac{x}{n}$ (2.1)

The probability of non-occurrence of the event (called its failure) is denoted by

$q = P(\text{not } E) = \frac{n - x}{n} = 1 - \frac{x}{n}$ (2.2)

$= 1 - p = 1 - P(E)$ (2.3)

Thus p + q = 1, or P(E) + P(not E) = 1. The event ‘not E’ is sometimes denoted by $\bar{E}$.

As an example, let the colour of flowers in a particular plant species be governed by the presence of a dominant gene A in a single gene locus, the gametic combinations AA and Aa giving rise to red flowers and the combination aa giving white flowers. Let E be the event of getting red flowers in the progeny obtained through selfing of a heterozygote, Aa. Let us assume that the four gametic combinations AA, Aa, aA and aa are equally likely. Since the event E can occur in three of these ways, we have,

$p = P(E) = \frac{3}{4}$

The probability of getting white flowers in the progeny through selfing of the heterozygote Aa is

$q = 1 - \frac{3}{4} = \frac{1}{4}$

Note that the probability of an event is a number between 0 and 1. If the event cannot occur, its probability is 0. If it must occur, i.e., its occurrence is certain, its probability is 1. If p is the probability that an event will occur, the odds in favour of its happening are p:q (read ‘p to q’); the odds against its happening are q:p. Thus the odds in favour of red flowers in the above example are

$p : q = \frac{3}{4} : \frac{1}{4}$, i.e. 3 to 1.
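The counting argument above can be checked by simple enumeration. The following is a minimal Python sketch, assuming (as the text does) that the four gametic combinations are equally likely; the variable names are illustrative and not part of the manual.

```python
from fractions import Fraction
from itertools import product

# Gametes produced by the heterozygote Aa; selfing pairs one gamete from each side.
gametes = ["A", "a"]
combinations = [m + f for m, f in product(gametes, repeat=2)]  # AA, Aa, aA, aa

# Red flowers need at least one dominant allele A.
red = [c for c in combinations if "A" in c]

p = Fraction(len(red), len(combinations))  # P(E), classical definition x / n
q = 1 - p                                  # P(not E)
odds_in_favour = p / q                     # the ratio p : q as a single number

print("p =", p, " q =", q, " odds in favour =", odds_in_favour, "to 1")
```

Exact fractions are used rather than floating point so that the result can be compared directly with the hand calculation.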

Frequency interpretation of probability : The above definition of probability has a disadvantage in that the words ‘equally likely’ are vague. Since these words seem to be synonymous with ‘equally probable’, the definition is circular because we are essentially defining probability in terms of itself. For this reason, a statistical definition of probability has been advocated by some people. According to this, the estimated probability, or empirical probability, of an event is taken as the relative frequency of occurrence of the event when the number of observations is large. The probability itself is the limit of the relative frequency as the number of observations increases indefinitely. Symbolically, the probability of event E is,

$P(E) = \lim_{n \to \infty} f_n(E)$ (2.4)

where f_n (E) = (number of times E occurred)/(total number of observations)

For example, in a search for a particular endangered species, the following numbers of plants of that species were encountered in a survey in sequence.

x (number of plants of endangered species) : 1, 6, 62, 610

n (number of plants examined) : 1000, 10000, 100000, 1000000

p (proportion of endangered species) : 0.001, 0.00060, 0.00062, 0.00061

As n increases, the relative frequency seems to approach a certain limit. We call this empirical property the stability of the relative frequency.
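The relative frequencies in the survey example can be recomputed directly from the definition of f_n(E). This is only a small illustrative sketch using the four (x, n) pairs quoted above; the variable names are assumptions.

```python
# Survey counts quoted in the text above.
x = [1, 6, 62, 610]                      # plants of the endangered species found
n = [1_000, 10_000, 100_000, 1_000_000]  # plants examined

# f_n(E) = (number of times E occurred) / (total number of observations)
rel_freq = [xi / ni for xi, ni in zip(x, n)]

for ni, fi in zip(n, rel_freq):
    print(f"n = {ni:>9,d}   f_n(E) = {fi:.5f}")
```

The printed column shows the estimates settling down as n grows, which is what is meant by the stability of the relative frequency.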

Conditional probability, independent and dependent events : If E₁ and E₂ are two events, the probability that E₂ occurs given that E₁ has occurred is denoted by P(E₂/E₁) or P(E₂ given E₁) and is called the conditional probability of E₂ given that E₁ has occurred. If the occurrence or non-occurrence of E₁ does not affect the probability of occurrence of E₂, then P(E₂/E₁) = P(E₂) and we say that E₁ and E₂ are independent events; otherwise they are dependent events.

If we denote by E₁E₂ the event that ‘both E₁ and E₂ occur’, sometimes called a compound event, then

P(E₁E₂) = P(E₁)P(E₂/E₁) (2.5)

In particular, P(E₁E₂) = P(E₁)P(E₂) for independent events. (2.6)

For example, consider the joint segregation of two characters, viz., flower colour and shape of seed in a plant species, the characters being individually governed by the presence of dominant genes A and B respectively. Individually, the combinations AA and Aa give rise to red flowers and the combination aa gives white flowers; the combinations BB and Bb give round seeds and the combination bb produces wrinkled seeds.

Let E₁ and E₂ be the events of ‘getting plants with red flowers’ and ‘getting plants with round seeds’ in the progeny obtained through selfing of a heterozygote AaBb respectively. If E₁ and E₂ are independent events, i.e., there is no interaction between the two gene loci, the probability of getting plants with red flowers and round seeds in the selfed progeny is,

P(E₁E₂) = P(E₁)P(E₂) = (3/4) × (3/4) = 9/16

In general, if E₁, E₂, E₃, …, E_n are n independent events having respective probabilities p₁, p₂, p₃, …, p_n, then the probability of occurrence of E₁ and E₂ and E₃ and … E_n is p₁p₂p₃…p_n.
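The multiplication rule (2.6) for the two-locus example can be verified by listing all progeny of the selfed AaBb plant. The sketch below assumes the four gamete types AB, Ab, aB and ab are equally likely, which is what independence of the two loci implies; the helper names `is_red` and `is_round` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Gametes of the double heterozygote AaBb, assumed equally likely.
gametes = ["AB", "Ab", "aB", "ab"]
progeny = list(product(gametes, repeat=2))  # 16 equally likely gamete pairs

def is_red(pair):
    # E1: at least one dominant allele A among the two gametes
    return any("A" in g for g in pair)

def is_round(pair):
    # E2: at least one dominant allele B among the two gametes
    return any("B" in g for g in pair)

n = len(progeny)
p_red = Fraction(sum(is_red(z) for z in progeny), n)
p_round = Fraction(sum(is_round(z) for z in progeny), n)
p_both = Fraction(sum(is_red(z) and is_round(z) for z in progeny), n)

print(p_red, p_round, p_both, p_both == p_red * p_round)
```

Counting the joint event directly and comparing it with the product of the two marginal probabilities is a simple way to see what independence means in practice.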

2.2. Frequency distribution

Since the frequency interpretation of probability is highly useful in practice, preparing a frequency distribution is an often-used technique in statistical work for summarising large masses of raw data; it yields information on the pattern of occurrence of predefined classes of events. The raw data consist of measurements of some attribute on a collection of individuals. The measurement would have been made on one of the following scales, viz., nominal, ordinal, interval or ratio scale. Nominal scale refers to measurement at its weakest level, when numbers or other symbols are used simply to classify an object, person or characteristic, e.g., state of health (healthy, diseased). Ordinal scale is one wherein, given a group of equivalence classes, the relation ‘greater than’ holds for all pairs of classes so that a complete rank ordering of classes is possible, e.g., socio-economic status. When a scale has all the characteristics of an ordinal scale and, in addition, the distances between any two numbers on the scale are of known size, an interval scale is achieved, e.g., temperature scales like centigrade or Fahrenheit. An interval scale with a true zero point as its origin forms a ratio scale. In a ratio scale, the ratio of any two scale points is independent of the unit of measurement, e.g., height of trees. Reference may be made to Siegel (1956) for a detailed discussion on the different scales of measurement, their properties and admissible operations in each scale.

Regardless of the scale of measurement, a way to summarise data is to distribute it into classes or categories and to determine the number of individuals belonging to each class, called the class frequency. A tabular arrangement of data by classes together with the corresponding class frequencies is called a frequency distribution or frequency table. Table 2.1 is a frequency distribution of diameter at breast-height (dbh), recorded to the nearest cm, of 80 teak trees in a sample plot. The relative frequency of a class is the frequency of the class divided by the total frequency of all classes and is generally expressed as a percentage. For example, the relative frequency of the class 17-19 in Table 2.1 is (30/80) × 100 = 37.5 %. The sum of the relative frequencies of all classes is clearly 100 %.

Table 2.1. Frequency distribution of dbh of teak trees in a plot

Dbh class (cm)	Frequency (Number of trees)	Relative frequency (%)
11-13	11	13.8
14-16	20	25.0
17-19	30	37.5
20-22	15	18.8
23-25	4	5.0
Total	80	100.0

A symbol defining a class, such as 11-13 in the above table, is called a class interval. The end numbers, 11 and 13, are called class limits; the smaller number 11 is the lower class limit and the larger number 13 is the upper class limit. The terms class and class interval are often used interchangeably, although the class interval is actually a symbol for the class. A class interval which, at least theoretically, has either no upper class limit or no lower class limit indicated is called an open class interval. For example, the class interval ‘23 cm and over’ is an open class interval.

If dbh values are recorded to the nearest cm, the class interval 11-13 theoretically includes all measurements from 10.5 to 13.5 cm. These numbers are called class boundaries or true class limits; the smaller number 10.5 is the lower class boundary and the larger number 13.5 is the upper class boundary. In practice, the class boundaries are obtained by adding the upper limit of one class interval to the lower limit of the next higher class interval and dividing by 2.

Sometimes, class boundaries are used to symbolise classes. For example, the various classes in the first column of Table 2.1 could be indicated by 10.5-13.5, 13.5-16.5, etc. To avoid ambiguity in using such notation, class boundaries should not coincide with actual observations. Thus, if an observation were 13.5, it would not be possible to decide whether it belonged to the class interval 10.5-13.5 or 13.5-16.5. The size or width of a class interval is the difference between the lower and upper class boundaries and is also referred to as the class width. The class mark is the midpoint of the class interval and is obtained by adding the lower and upper class limits and dividing by two.
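The quantities defined in this section (relative frequency, class boundaries, class width and class mark) can be derived mechanically from the class limits and frequencies of Table 2.1. The following is a minimal sketch, assuming dbh is recorded to the nearest cm so that adjacent class limits differ by 1 cm; the dictionary keys are illustrative names only.

```python
# Class limits and frequencies from Table 2.1.
classes = [(11, 13), (14, 16), (17, 19), (20, 22), (23, 25)]
freq = [11, 20, 30, 15, 4]

total = sum(freq)
# Gap between the upper limit of one class and the lower limit of the next (1 cm here).
gap = classes[1][0] - classes[0][1]

rows = []
for (lower, upper), f in zip(classes, freq):
    lower_boundary = lower - gap / 2
    upper_boundary = upper + gap / 2
    rows.append({
        "class": f"{lower}-{upper}",
        "frequency": f,
        "rel_freq_pct": 100 * f / total,
        "lower_boundary": lower_boundary,
        "upper_boundary": upper_boundary,
        "width": upper_boundary - lower_boundary,
        "class_mark": (lower + upper) / 2,
    })

for r in rows:
    print(r)
```

For the class 17-19 this gives boundaries 16.5 and 19.5, a width of 3 cm, a class mark of 18 and a relative frequency of 37.5 %.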

Frequency distributions are often represented graphically by a histogram or a frequency polygon. A histogram consists of a set of rectangles having bases on a horizontal axis (the x axis), with centres at the class marks, widths equal to the class interval sizes and areas proportional to the class frequencies. If the class intervals all have equal size, the heights of the rectangles are proportional to the class frequencies and it is then customary to take the heights numerically equal to the class frequencies. If the class intervals do not have equal size, these heights must be adjusted. A frequency polygon is a line graph of class frequency plotted against class mark. It can be obtained by connecting the midpoints of the tops of the rectangles in the histogram.

Figure 2.1. Histogram showing the frequency distribution of dbh

Figure 2.2. Frequency polygon showing the frequency distribution of dbh

2.3. Properties of frequency distribution

Having prepared a frequency distribution, a number of measures can be generated out of it, which leads to further condensation of the data. These are measures of location, dispersion, skewness and kurtosis.

2.3.1. Measures of location

A frequency distribution can be located by its average value which is typical or representative of the set of data. Since such typical values tend to lie centrally within a set of data arranged according to magnitude, averages are also called measures of central tendency. Several types of averages can be defined, the most common being the arithmetic mean or briefly the mean, the median and the mode. Each has advantages and disadvantages depending on the data and the intended purpose.

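Arithmetic mean : The arithmetic mean or the mean of a set of N numbers x₁, x₂, x₃, …, x_N is denoted by $\bar{x}$ (read as ‘x bar’) and is defined as

$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N} = \frac{\sum_{j=1}^{N} x_j}{N}$ (2.7)

The symbol $\sum_{j=1}^{N} x_j$ denotes the sum of all the x_j’s from j = 1 to j = N.

For example, the arithmetic mean of the numbers 8, 3, 5, 12, 10 is

$\bar{x} = \frac{8 + 3 + 5 + 12 + 10}{5} = \frac{38}{5} = 7.6$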

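If the numbers x₁, x₂, …, x_K occur f₁, f₂, …, f_K times respectively (i.e., occur with frequencies f₁, f₂, …, f_K), the arithmetic mean is

$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_K x_K}{f_1 + f_2 + \dots + f_K} = \frac{\sum_{j=1}^{K} f_j x_j}{\sum_{j=1}^{K} f_j} = \frac{\sum fx}{N}$ (2.8)

where $N = \sum f$ is the total frequency, i.e., the total number of cases.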

The computation of mean from grouped data of Table 2.1 is illustrated below.

Step 1. Find the midpoints of the classes. For this purpose, add the lower and upper limits of the first class and divide by 2. For each subsequent class, add the class width to the midpoint of the preceding class.

Step 2. Multiply the midpoints of the classes by the corresponding frequencies, and add them up to get $\sum fx$.

The results in the above steps can be summarised as given in Table 2.2.

Table 2.2. Computation of mean from grouped data

Dbh class (cm)	Midpoint x	f	fx
11-13	12	11	132
14-16	15	20	300
17-19	18	30	540
20-22	21	15	315
23-25	24	4	96
Total		80	1383

Step 3. Substitute the values in the formula (2.8).

$\bar{x} = \frac{\sum fx}{N} = \frac{1383}{80} = 17.29$ cm
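Steps 1 to 3 above can be sketched in a few lines of Python. This is only an illustration of the grouped-mean formula (2.8) using the midpoints and frequencies of Table 2.2; the variable names are assumptions.

```python
# Midpoints (class marks) and frequencies from Table 2.2.
midpoints = [12, 15, 18, 21, 24]
freq = [11, 20, 30, 15, 4]

N = sum(freq)                                          # total frequency
sum_fx = sum(f * x for f, x in zip(freq, midpoints))   # sum of f * x over classes
mean = sum_fx / N                                      # grouped mean, Equation (2.8)

print(f"N = {N}, sum fx = {sum_fx}, mean = {mean:.2f} cm")
```

Because each tree is represented by its class midpoint, the grouped mean is an approximation to the mean of the original individual measurements.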

Median : The median of a set of numbers arranged in order of magnitude (i.e., in an array) is the middle value or the arithmetic mean of the two middle values.

For example, the set of numbers 3, 4, 4, 5, 6, 8, 8, 8, 10 has median 6. The set of numbers 5, 5, 7, 9, 11, 12, 15, 18 has median (9 + 11)/2 = 10.

For grouped data the median, obtained by interpolation, is given by

Median = $L_1 + \left( \frac{\frac{N}{2} - (\sum f)_1}{f_m} \right) c$ (2.9)

where L₁ = lower class boundary of the median class (i.e., the class containing the median)

N = number of items in the data (i.e., total frequency)

$(\sum f)_1$ = sum of frequencies of all classes lower than the median class

f_m = frequency of median class

c = size of median class interval.

Geometrically, the median is the value of x (abscissa) corresponding to that vertical line which divides a histogram into two parts having equal areas.

The computation of median from grouped data of Table 2.1 is illustrated below.

Step 1. Find the midpoints of the classes. For this purpose, add the lower and upper limits of the first class and divide by 2. For each subsequent class, add the class width to the midpoint of the preceding class.

Step 2. Write down the cumulative frequency and present the results as in Table 2.3.

Table 2.3. Computation of median from grouped data

Dbh class (cm)	Midpoint x	frequency f	Cumulative frequency
11-13	12	11	11
14-16	15	20	31
17-19	18	30	61
20-22	21	15	76
23-25	24	4	80
Total		80	

Step 3. Find the median class by locating the (N / 2)th item in the cumulative frequency column. In this example, N / 2 = 40. It falls in the class 17-19. Hence it is the median class.

Step 4. Use the formula (2.9) for calculating the median.

Median = $16.5 + \left( \frac{40 - 31}{30} \right) \times 3$

= 17.4
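The interpolation in Steps 3 and 4 can be written out as a short Python sketch. It assumes the interpolation form Median = L₁ + ((N/2 − sum of frequencies below the median class) / f_m) × c and class boundaries lying half a unit outside the class limits (dbh recorded to the nearest cm); the variable names are illustrative.

```python
# Class limits and frequencies from Table 2.3.
lower_limits = [11, 14, 17, 20, 23]
upper_limits = [13, 16, 19, 22, 25]
freq = [11, 20, 30, 15, 4]

N = sum(freq)
gap = lower_limits[1] - upper_limits[0]          # 1 cm between adjacent class limits
c = (upper_limits[0] - lower_limits[0]) + gap    # class width, 3 cm

# Locate the median class: the first class whose cumulative frequency reaches N / 2.
below = 0  # sum of frequencies of all classes lower than the median class
for m, f in enumerate(freq):
    if below + f >= N / 2:
        break
    below += f

L1 = lower_limits[m] - gap / 2   # lower class boundary of the median class
f_m = freq[m]                    # frequency of the median class
median = L1 + (N / 2 - below) / f_m * c

print(f"median class = {lower_limits[m]}-{upper_limits[m]}, median = {median:.1f}")
```

The loop plays the role of the cumulative frequency column: it stops at the class in which the (N / 2)th item falls.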

Mode : The mode of a set of numbers is that value which occurs with the greatest frequency, i.e., it is the most common value. The mode may not exist, and even if it does exist, it may not be unique.

The set of numbers 2, 2, 5, 7, 9, 9, 9, 10, 10, 11, 12, 18 has mode 9. The set 3, 5, 8, 10, 12, 15, 16 has no mode. The set 2, 3, 4, 4, 4, 5, 5, 7, 7, 7, 9 has two modes 4 and 7 and is called bimodal. A distribution having only one mode is called unimodal.

In the case of grouped data where a frequency curve has been constructed to fit the data, the mode will be the value (or values) of x corresponding to the maximum point (or points) on the curve.

From a frequency distribution or histogram, the mode can be obtained from the formula,

Mode = $L_1 + \left( \frac{f_2}{f_1 + f_2} \right) c$ (2.10)

where L₁ = Lower class boundary of modal class (i.e., the class containing the mode).

f₁ = Frequency of the class previous to the modal class.

f₂ = Frequency of the class just after the modal class.

c = Size of modal class interval.

The computation of mode from grouped data of Table 2.1. is illustrated below.

Step 1. Find the modal class. The modal class is the class with the maximum frequency. In our example, the maximum frequency is 30 and hence the modal class is 17-19.

Step 2. Use the formula (2.10) for computing mode

Mode = $16.5 + \left( \frac{15}{20 + 15} \right) \times 3$

= 17.79
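The two steps for the mode can be sketched as follows. The code assumes the interpolation Mode = L₁ + (f₂ / (f₁ + f₂)) × c, with f₁ and f₂ as defined above, which is the form that reproduces the worked value of 17.79; other textbooks use slightly different interpolation rules. Variable names are illustrative.

```python
# Class limits and frequencies from Table 2.1.
lower_limits = [11, 14, 17, 20, 23]
upper_limits = [13, 16, 19, 22, 25]
freq = [11, 20, 30, 15, 4]

gap = lower_limits[1] - upper_limits[0]          # 1 cm (dbh recorded to nearest cm)
c = (upper_limits[0] - lower_limits[0]) + gap    # class width, 3 cm

m = freq.index(max(freq))                        # index of the modal class
L1 = lower_limits[m] - gap / 2                   # lower class boundary of modal class
f1 = freq[m - 1] if m > 0 else 0                 # frequency of the class before it
f2 = freq[m + 1] if m < len(freq) - 1 else 0     # frequency of the class after it

# Assumes f1 + f2 > 0, i.e. the modal class has at least one non-empty neighbour.
mode = L1 + f2 / (f1 + f2) * c

print(f"modal class = {lower_limits[m]}-{upper_limits[m]}, mode = {mode:.2f}")
```

Note that `freq.index(max(freq))` returns only the first class with the maximum frequency, so a bimodal table would need separate handling.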

As general guidelines on the use of measures of location, the mean is best used with symmetric distributions (explained in Section 2.3.3) because it is greatly affected by extreme values in the data; the median has the distinct advantage of being computable even with open classes; and the mode is useful with multimodal distributions, as it identifies the most frequent observation in a data set.

2.3.2. Measures of dispersion

The degree to which numerical data tend to spread about an average value is called the variation or dispersion of the data. Various measures of dispersion or variation are available, like the range, mean deviation or semi-interquartile range but the most common is the standard deviation.

Standard deviation: The standard deviation of a set of N numbers x₁, x₂, …, x_N is defined by

$s = \sqrt{\frac{\sum_{j=1}^{N} (x_j - \bar{x})^2}{N}}$ (2.11)

where $\bar{x}$ represents the arithmetic mean.

Thus the standard deviation is the square root of the mean of the squares of the deviations of individual values from their mean or, as it is sometimes called, the root mean square deviation. For computation of the standard deviation, the following simpler form is often used.

$s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}}$ (2.12)

For example, the set of data given below represents diameters at breast-height of 10 randomly selected teak trees in a plot.

23.5, 11.3, 17.5, 16.7, 9.6, 10.6, 24.5, 21.0, 18.1, 20.7

Here N = 10, $\sum x^2$ = 3266.95 and $\sum x$ = 173.5. Hence,

$s = \sqrt{\frac{3266.95 - \frac{(173.5)^2}{10}}{10}} = 5.067$
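The definitional form and the computational shortcut for the standard deviation can be compared numerically on the ten dbh values. This is a minimal sketch, assuming the divisor N used in this section (the population form); `statistics.pstdev` from the Python standard library uses the same divisor and serves as a cross-check. For these ten values the sums work out to Σx = 173.5 and Σx² = 3266.95.

```python
import math
import statistics

# Diameters at breast-height (cm) of the 10 teak trees listed above.
dbh = [23.5, 11.3, 17.5, 16.7, 9.6, 10.6, 24.5, 21.0, 18.1, 20.7]

N = len(dbh)
mean = sum(dbh) / N

# Definitional form: root mean square deviation from the mean (divisor N).
s_definition = math.sqrt(sum((x - mean) ** 2 for x in dbh) / N)

# Computational form: uses only the sum and the sum of squares.
sum_x = sum(dbh)
sum_x2 = sum(x * x for x in dbh)
s_shortcut = math.sqrt((sum_x2 - sum_x ** 2 / N) / N)

# Library cross-check (population standard deviation, divisor N).
s_library = statistics.pstdev(dbh)

print(f"sum x = {sum_x:.1f}, sum x^2 = {sum_x2:.2f}")
print(f"s = {s_definition:.3f} (definition), {s_shortcut:.3f} (shortcut), {s_library:.3f} (pstdev)")
```

The shortcut avoids computing individual deviations, which mattered for hand calculation; in software the two forms agree up to rounding error.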

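If x₁, x₂, …, x_K occur with frequencies f₁, f₂, …, f_K respectively, the standard deviation can be computed as

$s = \sqrt{\frac{\sum_{j=1}^{K} f_j (x_j - \bar{x})^2}{N}}$ (2.13)

where $N = \sum_{j=1}^{K} f_j$ is the total frequency.

Equation (2.13) can be written in the equivalent form which is useful in computations, as

$s = \sqrt{\frac{\sum fx^2 - \frac{(\sum fx)^2}{N}}{N}}$ (2.14)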

The variance of a set of data is defined as the square of the standard deviation. The ratio of the standard deviation to the mean, expressed as a percentage, is called the coefficient of variation.

For illustration, we can use the data given in Table 2.1.

Step 1. Find the midpoints of the classes. For this purpose, add the lower and upper limits of the first class and divide by 2. For each subsequent class, add the class width to the midpoint of the preceding class.

Step 2. Multiply the midpoints of the classes by the corresponding frequencies, and add them up to get $\sum fx$.

Step 3. Multiply the squares of the midpoints of the classes by the corresponding frequencies and add them up to get $\sum fx^2$.

The above results can be summarised as in Table 2.4.

Table 2.4. Computation of standard deviation from grouped data

Dbh class (cm)	Midpoint x	Frequency f	fx	fx²
11-13	12	11	132	1584
14-16	15	20	300	4500
17-19	18	30	540	9720
20-22	21	15	315	6615
23-25	24	4	96	2304
Total		80	1383	24723

Step 4. Use the formula (2.14) for calculating the standard deviation and find out variance and coefficient of variation

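$s = \sqrt{\frac{24723 - \frac{(1383)^2}{80}}{80}} = 3.19$

Variance = (Standard deviation)² = (3.19)²

= 10.18

Coefficient of variation = $\frac{\text{Standard deviation}}{\text{Mean}} \times 100$

= $\frac{3.19}{17.29} \times 100$ = 18.45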
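Steps 1 to 4 for grouped data can be sketched in Python as below, using the computational form (2.14) with divisor N. The code keeps full precision throughout, so the coefficient of variation may differ in the second decimal place from the hand calculation, which uses the rounded values 3.19 and 17.29. Variable names are illustrative.

```python
import math

# Midpoints and frequencies from Table 2.4.
midpoints = [12, 15, 18, 21, 24]
freq = [11, 20, 30, 15, 4]

N = sum(freq)
sum_fx = sum(f * x for f, x in zip(freq, midpoints))        # sum of f * x
sum_fx2 = sum(f * x * x for f, x in zip(freq, midpoints))   # sum of f * x^2

mean = sum_fx / N
variance = (sum_fx2 - sum_fx ** 2 / N) / N   # square of Equation (2.14)
sd = math.sqrt(variance)
cv = 100 * sd / mean                         # coefficient of variation, in percent

print(f"sum fx = {sum_fx}, sum fx^2 = {sum_fx2}")
print(f"sd = {sd:.2f}, variance = {variance:.2f}, CV = {cv:.2f} %")
```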

Both the standard deviation and the mean carry units of measurement, whereas the coefficient of variation has no such units and hence is useful for comparing the extent of variation in characters which differ in their units of measurement. It is also useful for comparing the variation in two sets of numbers which differ in their means. For instance, suppose that the variation in height of seedlings and that of older trees of a species are to be compared. Let the respective means and standard deviations be,

Mean height of seedlings = 50 cm, Standard deviation of height of seedlings = 10 cm.

Mean height of trees = 500 cm, Standard deviation of height of trees = 100 cm.

By the absolute value of the standard deviation, one may tend to judge that variation is more in the case of trees but the relative variation as indicated by the coefficient of variation (20 %) is the same in both the sets.

2.3.3. Measures of skewness

Skewness is the degree of asymmetry, or departure from symmetry, of a distribution. If the frequency curve (smoothed frequency polygon) of a distribution has a longer ‘tail’ to the right of the central maximum than to the left, the distribution is said to be skewed to the right or to have positive skewness. If the reverse is true, it is said to be skewed to the left or to have negative skewness. An important measure of skewness expressed in dimensionless form is given by

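Moment coefficient of skewness = $\beta_1 = \frac{\mu_3^2}{\mu_2^3}$ (2.15)

where $\mu_2$ and $\mu_3$ are the second and third central moments, the r-th central moment being defined by the formula,

$\mu_r = \frac{\sum_{j=1}^{N} (x_j - \bar{x})^r}{N}$ (2.16)

For grouped data, the above moments are given by

$\mu_r = \frac{\sum_{j=1}^{K} f_j (x_j - \bar{x})^r}{N}$ (2.17)

For a symmetrical distribution, $\beta_1$ = 0. Skewness is positive or negative depending upon whether $\mu_3$ is positive or negative.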

The data given in Table 2.1 are used for illustrating the steps for computing the measure of skewness.

Step 1. Calculate the mean.

Mean = $\frac{\sum fx}{N} = \frac{1383}{80}$ = 17.29

Step 2. Compute f_j (x_j - $\bar{x}$)², f_j (x_j - $\bar{x}$)³ and their sums as summarised in Table 2.5.

Table 2.5. Steps for computing coefficient of skewness from grouped data

Dbh class (cm)	Midpoint x	f	x_j - $\bar{x}$	f_j(x_j - $\bar{x}$)²	f_j(x_j - $\bar{x}$)³	f_j(x_j - $\bar{x}$)⁴
11-13	12	11	-5.29	307.83	-1628.39	8614.21
14-16	15	20	-2.29	104.88	-240.18	550.01
17-19	18	30	0.71	15.12	10.74	7.62
20-22	21	15	3.71	206.46	765.97	2841.76
23-25	24	4	6.71	180.10	1208.45	8108.68
Total		80	3.55	814.39	116.58	20122.28

Step 3. Compute $\mu_2$ and $\mu_3$ using the formula (2.17).

$\mu_2 = \frac{814.39}{80}$ = 10.18

$\mu_3 = \frac{116.58}{80}$ = 1.46

Step 4. Compute the measure of skewness using the formula (2.15).

Moment coefficient of skewness = $\beta_1 = \frac{(1.46)^2}{(10.18)^3}$

= 0.002.

Since $\beta_1$ = 0.002, the distribution is only very slightly skewed, i.e., the skewness is negligible. It is positively skewed since $\mu_3$ is positive.
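The skewness computation can be sketched in Python using the grouped central moments $\mu_r = \sum f_j (x_j - \bar{x})^r / N$ and the coefficient $\beta_1 = \mu_3^2 / \mu_2^3$. The helper `central_moment` is a hypothetical name. The code uses the unrounded mean, so its third moment differs slightly from the hand-worked figure, which is based on the mean rounded to 17.29; the coefficient still rounds to 0.002.

```python
# Midpoints and frequencies from Table 2.5.
midpoints = [12, 15, 18, 21, 24]
freq = [11, 20, 30, 15, 4]

N = sum(freq)
mean = sum(f * x for f, x in zip(freq, midpoints)) / N

def central_moment(r):
    """r-th central moment of the grouped data (divisor N)."""
    return sum(f * (x - mean) ** r for f, x in zip(freq, midpoints)) / N

mu2 = central_moment(2)
mu3 = central_moment(3)
beta1 = mu3 ** 2 / mu2 ** 3   # moment coefficient of skewness

direction = "positive" if mu3 > 0 else "negative" if mu3 < 0 else "none"
print(f"mu2 = {mu2:.2f}, mu3 = {mu3:.2f}, beta1 = {beta1:.3f}, skew: {direction}")
```

Since $\beta_1$ is a ratio of squares and cubes of moments it is never negative, so the sign of $\mu_3$ is what indicates the direction of skewness.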

2.3.4. Kurtosis

Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal distribution. A distribution having a relatively high peak is called leptokurtic, while the curve which is flat-topped is called platykurtic. A bell shaped curve which is not very peaked or very flat-topped is called mesokurtic.

One measure of kurtosis, expressed in dimensionless form, is given by

Moment coefficient of kurtosis = $\beta_2 = \frac{\mu_4}{\mu_2^2}$ (2.18)

where $\mu_4$ and $\mu_2$ can be obtained from the formula (2.16) for ungrouped data and from the formula (2.17) for grouped data. The distribution is mesokurtic, like the normal distribution, if $\beta_2$ = 3. When $\beta_2$ is more than 3, the distribution is said to be leptokurtic. If $\beta_2$ is less than 3, the distribution is said to be platykurtic.

For example, the data in Table 2.1 is utilised for computing the moment coefficient of kurtosis.

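Step 1. Compute the mean.

Mean = $\frac{\sum fx}{N} = \frac{1383}{80}$ = 17.29

Step 2. Compute f_j (x_j - $\bar{x}$)², f_j (x_j - $\bar{x}$)⁴ and their sums as summarised in Table 2.5.

Step 3. Compute $\mu_2$ and $\mu_4$ using the formula (2.17).

$\mu_2 = \frac{814.39}{80}$ = 10.18

$\mu_4 = \frac{20122.28}{80}$ = 251.53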

Step 4. Compute the measure of kurtosis using the formula (2.18).

Moment coefficient of kurtosis = $\beta_2 = \frac{251.53}{(10.18)^2}$

= 2.43.

The value of $\beta_2$ is 2.43, which is less than 3. Hence the distribution is platykurtic.
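The kurtosis calculation follows the same pattern as that for skewness. The sketch below assumes $\beta_2 = \mu_4 / \mu_2^2$ with grouped central moments taken about the unrounded mean; the classification thresholds simply restate the rule given in the text, and the function name `central_moment` is illustrative.

```python
# Midpoints and frequencies from Table 2.5.
midpoints = [12, 15, 18, 21, 24]
freq = [11, 20, 30, 15, 4]

N = sum(freq)
mean = sum(f * x for f, x in zip(freq, midpoints)) / N

def central_moment(r):
    """r-th central moment of the grouped data (divisor N)."""
    return sum(f * (x - mean) ** r for f, x in zip(freq, midpoints)) / N

mu2 = central_moment(2)
mu4 = central_moment(4)
beta2 = mu4 / mu2 ** 2   # moment coefficient of kurtosis

if beta2 > 3:
    shape = "leptokurtic"
elif beta2 < 3:
    shape = "platykurtic"
else:
    shape = "mesokurtic"

print(f"mu2 = {mu2:.2f}, mu4 = {mu4:.2f}, beta2 = {beta2:.2f} ({shape})")
```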

2.4. Discrete theoretical distributions

If a variable X can assume a discrete set of values x₁, x₂, …, x_K with respective probabilities p₁, p₂, …, p_K where p₁ + p₂ + … + p_K = 1, we say that a discrete probability distribution for X has been defined. The function p(x) which has the respective values p₁, p₂, …, p_K for x = x₁, x₂, …, x_K, is called the probability function or frequency function of X. Because X can assume certain values with given probabilities, it is often called a discrete random variable.

For example, let a pair of fair dice be tossed and let X denote the sum of the points obtained. Then the probability distribution is given by the following table.

X	2	3	4	5	6	7	8	9	10	11	12
p(x)	1/36	2/36	3/36	4/36	5/36	6/36	5/36	4/36	3/36	2/36	1/36

The probability of getting sum 5 is 4/36 = 1/9. Thus in 900 tosses of the dice, we would expect 100 tosses to give the sum 5.

Note that this is analogous to a relative frequency distribution with probabilities replacing relative frequencies. Thus we can think of probability distributions as theoretical or ideal limiting forms of relative frequency distributions when the number of observations is made very large. For this reason, we can think of probability distributions as being distributions for populations, whereas relative frequency distributions are distributions of samples drawn from this population.

When the values of x can be ordered as in the case where they are real numbers, we can define the cumulative distribution function,

$F(x) = P(X \le x) = \sum_{x_i \le x} p(x_i)$ for all x (2.19)

F(x) is the probability that X will take on some value less than or equal to x.
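The dice example and the cumulative distribution function can be tabulated together by enumerating the 36 equally likely outcomes of two fair dice. This is an illustrative sketch; the names `pmf` and `cdf` are assumptions, and exact fractions are used so that the values match the table above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 outcomes give each sum of points.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability function p(x) for x = 2, ..., 12.
pmf = {x: Fraction(counts[x], 36) for x in range(2, 13)}

def cdf(x):
    """F(x) = P(X <= x): sum of p(x_i) over all values x_i not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)

expected_fives = 900 * pmf[5]   # expected number of tosses giving the sum 5

for x in range(2, 13):
    print(x, pmf[x], cdf(x))
```

The last printed value of F(x) is 1, since the probabilities of all possible sums add up to one.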

Two important discrete distributions which are encountered frequently in research investigations in forestry are mentioned here for purposes of future reference.

2.4.1. Binomial distribution

A binomial distribution arises from a set of n independent trials with outcome of a single trial being dichotomous such as ‘success’ or ‘failure’. A binomial distribution applies if the probability of getting x successes out of n trials is given by the function,

$p(x) = \binom{n}{x} p^x (1 - p)^{n - x}$, x = 0, 1, 2, …, n (2.20)

where n is a positive integer and 0<p<1. The constants n and p are the parameters of the binomial distribution. As indicated, the value of x ranges from 0 to n.

For example, if a silviculturist is observing mortality of seedlings in plots in a plantation where 100 seedlings were planted in each plot and records live plants as ‘successes’ and dead plants as ‘failures’, then the variable ‘number of live plants in a plot’ may follow a binomial distribution.

Binomial distribution has mean np and standard deviation $\sqrt{np(1-p)}$. The value of p is estimated from a sample by

$\hat{p} = \dfrac{x}{n}$ (2.21)

where x is the number of successes in the sample and n is the total number of cases examined.

As an example, suppose that an entomologist picks up at random 5 plots each of size 10 m x 10 m from a plantation with seedlings planted at 2 m x 2 m espacement. Let the observed number of plants affected by termites in the five plots containing 25 seedlings each be (4, 7, 7, 4, 3). The pooled estimate of p from the five plots would be,

$\hat{p} = \dfrac{4 + 7 + 7 + 4 + 3}{5 \times 25} = \dfrac{25}{125} = 0.2$

Further, if he picks up a plot of the same size at random from the plantation, the probability of that plot containing a specified number of plants infested with termites can be obtained by Equation (2.20), provided the infestation by termites follows a binomial distribution. For instance, the probability of getting a plot uninfested by termites is

$p(0) = \binom{25}{0} (0.2)^{0} (0.8)^{25}$ = 0.0038
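The termite example can be reproduced in a few lines. The sketch below assumes the usual binomial probability function, n choose x times p^x (1 - p)^(n - x), and uses only the Python standard library; the function name binomial_pmf is illustrative.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """Probability of x successes in n independent trials (Equation 2.20)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

infested = [4, 7, 7, 4, 3]   # infested seedlings in each of the 5 plots
plot_size = 25               # seedlings per 10 m x 10 m plot

# Pooled estimate of p: total infested over total seedlings examined.
p_hat = sum(infested) / (plot_size * len(infested))

# Probability that a new plot of 25 seedlings has no infested plants.
p_zero = binomial_pmf(0, plot_size, p_hat)

# Mean and standard deviation of the number infested per plot.
mean = plot_size * p_hat
sd = sqrt(plot_size * p_hat * (1 - p_hat))

print(p_hat, round(p_zero, 4), mean, sd)
```

The same function gives the probability of any other count, for example binomial_pmf(5, 25, p_hat) for exactly five infested seedlings.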

2.4.2. The Poisson distribution

A discrete random variable X is said to have a Poisson distribution if the probability of assuming specific value x is given by

$p(x) = \dfrac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$ (2.22)

where λ > 0. The variable X ranges from 0 to ∞.

In ecological studies, certain sparsely occurring organisms are found to be distributed randomly over space. In such instances, observations on the number of organisms found in small sampling units are found to follow a Poisson distribution. Poisson distribution has the single parameter λ, which is the mean and also the variance of the distribution. Accordingly, the standard deviation is $\sqrt{\lambda}$. From samples, the value of λ is estimated as

$\hat{\lambda} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$ (2.23)

where x_i’s are the number of cases detected in a sampling unit and n is the number of sampling units observed.

For instance, a biologist observes the number of leeches found in 100 samples taken from a fresh water lake. Let the total number of leeches caught be 80 so that the mean number per sample is calculated as,

$\hat{\lambda} = \dfrac{80}{100} = 0.8$

If the variable follows Poisson distribution, the probability of getting at least one leech in a fresh sample can be calculated as 1 - p(0) which is,

$1 - \dfrac{(0.8)^{0} e^{-0.8}}{0!} = 1 - e^{-0.8}$ = 0.5507
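The leech example can be checked numerically. This is a minimal sketch assuming the usual Poisson probability function, with the mean raised to the power x, times e to the minus mean, divided by x factorial; the function name poisson_pmf is illustrative.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of observing x cases when the mean is lam (Equation 2.22)."""
    return lam**x * exp(-lam) / factorial(x)

total_leeches = 80
n_samples = 100

# Estimate of the Poisson parameter: mean number of leeches per sample.
lam_hat = total_leeches / n_samples

# Probability of at least one leech in a fresh sample, 1 - p(0).
p_at_least_one = 1 - poisson_pmf(0, lam_hat)

print(lam_hat, round(p_at_least_one, 4))
```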

2.5. Continuous theoretical distributions

The idea of a discrete distribution can be extended to the case where the variable X may assume a continuous set of values. The relative frequency polygon of a sample becomes, in the theoretical or limiting case of a population, a continuous curve as shown in Figure 2.3, whose equation is y = p(x).

Figure 2.3. Graph of continuous distribution

The total area under this curve bounded by the X axis is equal to one, and the area under the curve between lines X = a and X = b (shaded in the figure) gives the probability that X lies between a and b, which can be denoted by P(a<X<b). We call p(x) a probability density function, or briefly a density function, and when such a function is given, we say that a continuous probability distribution for X has been defined. The variable X is then called a continuous random variable.

Cumulative distribution function for a continuous random variable is

$F(x) = \int_{-\infty}^{x} p(t)\,dt$ (2.24)

The symbol ∫ indicates integration, which is in a way equivalent to summation in the discrete case. As in the discrete case, F(x) gives the probability that the variable X will assume a value less than or equal to x. A useful property of the cumulative distribution function is that

$P(a < X < b) = F(b) - F(a)$ (2.25)

Two cases of continuous theoretical distributions which frequently occur in forestry research are discussed here mainly for future references.

2.5.1. Normal distribution

Normal distribution is defined by the probability density function,

$p(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left[-\dfrac{(x-\mu)^2}{2\sigma^2}\right], \quad -\infty < x < \infty$ (2.26)

where μ is a location parameter and σ is a scale parameter. The range of the variable X is from -∞ to +∞. The parameter μ also varies from -∞ to +∞ but σ is always positive. The parameters μ and σ are not related. Equation (2.26) is a symmetrical function around μ, as can be seen from Figure 2.4, which shows a normal curve for μ = 0 and σ = 1. When μ = 0 and σ = 1, the distribution is called the standard normal distribution.

Figure 2.4. Graph of a normal distribution for μ = 0 and σ = 1

If the total area bounded by the curve and the axis in Figure 2.4 is taken as unity, the area under the curve between two ordinates X = a and X = b, where a<b, represents the probability that X lies between a and b, denoted by P(a<X<b). Appendix 1 gives the areas under this curve that lie outside +z and -z.

Normal distribution has mean μ and standard deviation σ. The distribution satisfies the following area properties. Taking the total area under the curve as unity, μ ± σ covers 68.27% of the area, μ ± 2σ covers 95.45% and μ ± 3σ covers 99.73% of the total area. For instance, let the mean height of trees in a large plantation of a particular age be 10 m and the standard deviation be 1 m. Consider the deviation of height of individual trees from the population mean. If these deviations are normally distributed, we can expect about 68% of the trees to have their deviations from the mean within 1 m, around 95% of the trees to have deviations lying within 2 m, and about 99.7% of the trees to show deviations within 3 m.
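The area properties can be verified numerically for the tree height example. The sketch below assumes the standard expression of the normal cumulative distribution function in terms of the error function, which is available in the Python standard library as math.erf; the function name normal_cdf is illustrative.

```python
from math import erf, sqrt

def normal_cdf(x, mean=0.0, sd=1.0):
    """F(x) for a normal distribution, computed via the error function."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

mean_height, sd_height = 10.0, 1.0   # metres, as in the example above

areas = {}
for k in (1, 2, 3):
    # Probability that a height lies within k standard deviations
    # of the mean, obtained as F(b) - F(a).
    upper = normal_cdf(mean_height + k * sd_height, mean_height, sd_height)
    lower = normal_cdf(mean_height - k * sd_height, mean_height, sd_height)
    areas[k] = upper - lower
    print(k, round(100 * areas[k], 2))
```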

Although the normal distribution was originally proposed as a measurement error model, it was found to be the basis of variation in a large number of biometrical characters. Normal distribution is supposed to arise from additive effects of a large number of independent causative random variables.

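The estimates of μ and σ from sample observations are

$\hat{\mu} = \bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$ (2.27)

$\hat{\sigma} = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$ (2.28)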

where x_i, i = 1, …, n are n independent observations from the population.
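As an illustration, the two estimates can be computed with the Python standard library. The heights below are hypothetical values chosen only for the example, and the sketch assumes the sample standard deviation with divisor n - 1, which is what statistics.stdev computes (statistics.pstdev uses the divisor n instead).

```python
from statistics import mean, stdev

# Hypothetical tree heights in metres (illustrative values, not from the text).
heights = [9.2, 10.1, 11.3, 8.7, 10.4, 9.9, 10.8, 9.6]

mean_hat = mean(heights)   # estimate of the location parameter
sd_hat = stdev(heights)    # estimate of the scale parameter, divisor n - 1

print(round(mean_hat, 3), round(sd_hat, 3))
```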

2.5.2. Lognormal distribution

Let X be a random variable. Consider the transformation from X to Y by Y = ln X. If the transformed variable Y is distributed according to a normal model, X is said to be a ‘lognormal’ random variable. The probability density function of lognormal distribution is given by

$p(x) = \dfrac{1}{x\sigma\sqrt{2\pi}} \exp\left[-\dfrac{(\ln x-\mu)^2}{2\sigma^2}\right], \quad x > 0$ (2.29)

In this case, e^μ is a scale parameter and σ is a shape parameter. The shape of the log-normal distribution is highly flexible, as can be seen from Figure 2.5, which plots Equation (2.29) for different values of σ when μ = 0.

Figure 2.5. Graph of lognormal distribution for μ = 0 and different values of σ.

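The mean and standard deviation of log-normal distribution are complex functions of the parameters μ and σ. The mean and standard deviation are given by,

$\text{Mean} = e^{\mu + \sigma^2/2}$ (2.30)

$\text{Standard deviation} = \sqrt{e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right)}$ (2.31)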
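A short sketch can make these expressions concrete. It assumes the standard results for the log-normal distribution, namely a mean of exp(location + scale²/2) and a variance of exp(2·location + scale²)·(exp(scale²) - 1), where location and scale are the mean and standard deviation of ln X. The function name and the parameter values are illustrative.

```python
from math import exp, sqrt

def lognormal_mean_sd(mu, sigma):
    """Mean and standard deviation of X when ln X is normal(mu, sigma)."""
    mean = exp(mu + sigma**2 / 2)
    sd = sqrt(exp(2 * mu + sigma**2) * (exp(sigma**2) - 1))
    return mean, sd

# Location fixed at 0; the shape parameter is varied.
for sigma in (0.25, 0.5, 1.0):
    mean, sd = lognormal_mean_sd(0.0, sigma)
    print(sigma, round(mean, 4), round(sd, 4))
```

Both quantities increase with the shape parameter, and the ratio of standard deviation to mean depends on the shape parameter alone.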

Unlike the normal distribution, the mean and standard deviation of this distribution are not independent. This distribution arises when a large number of independent causes combine multiplicatively rather than additively. For instance, if the data are obtained by pooling height of trees from plantations of differing age groups, it may show a log-normal distribution, the age having a compounding effect on variability among trees. Accordingly, trees of smaller age group may show low variation but trees of older age group are likely to exhibit large variation because of their interaction with the environment for a larger span of time.

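For log-normal distribution, the estimates of the parameters μ and σ are obtained by

$\hat{\mu} = \dfrac{1}{n} \sum_{i=1}^{n} \ln x_i$ (2.32)

$\hat{\sigma} = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (\ln x_i - \hat{\mu})^2}$ (2.33)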

where x_i, i = 1, …, n are n independent observations from the population.
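In practice the estimation amounts to taking logarithms and then proceeding as for a normal distribution. The sketch below uses hypothetical pooled heights and assumes the divisor n - 1 for the standard deviation of the log values.

```python
from math import exp, log
from statistics import mean, stdev

# Hypothetical tree heights in metres pooled across age groups
# (illustrative values, not from the text).
heights = [2.1, 3.4, 4.8, 6.5, 9.7, 12.3, 15.8, 21.4]

# Transform to Y = ln X, then estimate the parameters from Y.
log_heights = [log(x) for x in heights]
mu_hat = mean(log_heights)
sigma_hat = stdev(log_heights)   # divisor n - 1

# exp(mu_hat) is the geometric mean of the original observations.
print(round(mu_hat, 4), round(sigma_hat, 4), round(exp(mu_hat), 4))
```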

A more elaborate discussion, including several solved problems and computational exercises on the topics mentioned in this chapter, can be found in Spiegel and Boxer (1972).

Dbh class (cm)	Midpoint x_j	f_j	x_j - x̄	f_j(x_j - x̄)²	f_j(x_j - x̄)³	f_j(x_j - x̄)⁴
11-13	12	11	-5.29	307.83	-1628.39	8614.21
14-16	15	20	-2.29	104.88	-240.18	550.01
17-19	18	30	0.71	15.12	10.74	7.62
20-22	21	15	3.71	206.46	765.97	2841.76
23-25	24	4	6.71	180.10	1208.45	8108.68
Total		80	3.55	814.39	116.58	20122.28