2. STATISTICS: REGRESSION AND CORRELATION

Introduction

The work conducted by fishery biologists generally requires a fair amount of statistical analysis and most courses in fishery biology therefore include elementary statistics, at least.

Most often, however, lack of practice causes one to forget what was learnt, which results in a very valuable tool remaining underutilized.

This note aims to briefly recall two very powerful statistical techniques - regression and correlation analysis - and to indicate some of their most common fields of application in fishery biology.

Linear Regression

Put simply, linear regression is a technique for quantifying the relationship that can be seen when a scatter diagram involving two variables is drawn (Figure 1a). This relationship is summarized by a “best fitting” equation of the form:

y = a + bx

(1)

In this equation, y represents the coordinate values along the vertical axis of the graph (ordinate), while x represents the coordinate values along the horizontal axis (abscissa). The value of a (which can be negative, positive or equal to zero) is called the intercept, while the value of b (which can be negative or positive) is called the slope or regression coefficient.

Table 1
Data set for calculating a regression (a and b) and correlation coefficient (r)
Number	x-values	y-values	Number	x-values	y-values
1	9.0	0.50	7	6.7	1.00
2	9.4	0.50	8	8.4	0.50
3	7.4	1.23	9	8.0	0.50
4	9.7	1.00	10	10.0	0.50
5	10.4	0.30	11	9.2	0.50
6	5.0	1.50	12	6.2	1.00
			13	7.7	0.50

The procedure to obtain values of a and b for a given set of y and x data pairs (such as in Figure 1 and/or Table 1) is as follows:

Step 1	Compute, for each pair of x, y values, the quantities x², y² and x·y.

Step 2	Compute the sums (Σ) of these quantities for all x, y data pairs, along with the sums of the x and y values. The results of Steps 1 and 2 should look similar to this:

Number of data pairs	x	x²	y	y²	x·y
1	…	…	…	…	…
2	…	…	…	…	…
3	…	…	…	…	…
.
.
.
n	…	…	…	…	…
Names of sums	Σx	Σx²	Σy	Σy²	Σx·y

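Step 3	Estimate the slope (b) by means of the relationship

b = [Σx·y - (Σx)(Σy)/n] / [Σx² - (Σx)²/n]

(2)

where n is the number of data pairs.

Step 4	Estimate the intercept (a) by means of the relationship

a = (Σy)/n - b · (Σx)/n

(3)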

Using the values of a and b obtained by means of Equations 2 and 3, one can then draw the best fitting straight line through the points of a scatter diagram and visually assess whether the points are well “explained” by the line (Figure 1b).
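Steps 1 to 4 can be sketched in Python as follows. This is a minimal illustration, assuming the usual least-squares expressions for the slope and intercept written in terms of the sums of Step 2. The function names `regression_sums` and `fit_line` are hypothetical and chosen only for this example; the data are those of Table 1.

```python
# Table 1 data, in the order of the "Number" column
x = [9.0, 9.4, 7.4, 9.7, 10.4, 5.0, 6.7, 8.4, 8.0, 10.0, 9.2, 6.2, 7.7]
y = [0.50, 0.50, 1.23, 1.00, 0.30, 1.50, 1.00, 0.50, 0.50, 0.50, 0.50, 1.00, 0.50]


def regression_sums(x, y):
    """Steps 1 and 2: the number of pairs and the five sums."""
    return {
        "n": len(x),
        "sum_x": sum(x),
        "sum_x2": sum(xi * xi for xi in x),
        "sum_y": sum(y),
        "sum_y2": sum(yi * yi for yi in y),
        "sum_xy": sum(xi * yi for xi, yi in zip(x, y)),
    }


def fit_line(x, y):
    """Steps 3 and 4: slope b and intercept a of y = a + bx."""
    s = regression_sums(x, y)
    n = s["n"]
    # Slope (assumed least-squares form, built from the sums only)
    b = (s["sum_xy"] - s["sum_x"] * s["sum_y"] / n) / (
        s["sum_x2"] - s["sum_x"] ** 2 / n
    )
    # Intercept: mean of y minus b times mean of x
    a = s["sum_y"] / n - b * s["sum_x"] / n
    return a, b


a, b = fit_line(x, y)
print(f"y = {a:.2f} + ({b:.3f})x")  # prints: y = 2.16 + (-0.173)x
```

The rounded values agree with the regression line quoted in the caption of Figure 1b.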

Correlation

Correlation analysis is closely related to regression analysis, and both can in fact be viewed as two aspects of the same thing.

The correlation between two variables is, again put in the simplest terms, their degree of association. This degree of association is expressed by a single value called a correlation coefficient (r), which can take values ranging between -1 and +1. When r is negative, one variable (either x or y) tends to decrease as the other increases - there is a “negative correlation” (corresponding to a negative value of b in regression analysis). When r is positive, on the other hand, one variable tends to increase with the other (which corresponds to a positive value of b in regression analysis).

Figure 1a A scatter diagram (or scattergram) of x, y values. Note that y generally decreases as x increases, suggesting negative regression and correlation coefficients (based on Table 1)

Figure 1b Same data as in 1a, but fitted with the regression y = 2.16 - 0.173x, with r = -0.756

Values of r are easily computed for a set of x, y data pairs, using the same table and sums as shown in Step 2 of the “regression” section of this note. Thus r can be obtained - indirectly - from the relationship

r² = [Σx·y - (Σx)(Σy)/n]² / {[Σx² - (Σx)²/n] · [Σy² - (Σy)²/n]}

(4)

which provides a value of the “coefficient of determination” (= r²). All we then need is to compute

r = √(r²)

(5)

that is, to take the square root of the coefficient of determination to obtain the (absolute) value of r, and then to add the sign (+ or -) depending on whether the correlation is positive or negative (which can be assessed by visual inspection of a scattergram, or by computing the b value of the corresponding regression and using for r the sign of b).
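The computation of r² and r from the sums of Step 2 can be sketched as below. The sketch assumes the usual product-moment form, in which r² is the squared corrected sum of products divided by the product of the corrected sums of squares of x and y. The function name `correlation` is hypothetical. The sign of r is taken from the corrected sum of products, which always has the same sign as the slope b.

```python
import math

# Table 1 data, in the order of the "Number" column
x = [9.0, 9.4, 7.4, 9.7, 10.4, 5.0, 6.7, 8.4, 8.0, 10.0, 9.2, 6.2, 7.7]
y = [0.50, 0.50, 1.23, 1.00, 0.30, 1.50, 1.00, 0.50, 0.50, 0.50, 0.50, 1.00, 0.50]


def correlation(x, y):
    """Return (r, r2) computed from the sums used for the regression."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Corrected sums of squares and products
    sxx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    syy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    r2 = sxy ** 2 / (sxx * syy)               # coefficient of determination
    r = math.copysign(math.sqrt(r2), sxy)     # square root, with the sign of b
    return r, r2


r, r2 = correlation(x, y)
print(f"r2 = {r2:.3f}, r = {r:.3f}")  # prints: r2 = 0.572, r = -0.756
```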

When we compute values of r, however, we would also like to know whether the correlation that was identified could have arisen by chance alone. This can be established by testing whether the computed value of r is “significant”, that is, whether the (absolute) value of r is higher than, or equal to, a “critical” value of r as given in a statistical table (see table of critical values of r in Appendix 1).
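The comparison with a critical value can be sketched as follows. Two points in this sketch are assumptions not stated in the text above: that the table is entered with n - 2 degrees of freedom, and that the critical values for 11 degrees of freedom are 0.553 (P = 0.05) and 0.684 (P = 0.01), as found in commonly published two-tailed tables. These values should be checked against Appendix 1 before use. The function name `is_significant` is hypothetical, and r is the value quoted in the caption of Figure 1b.

```python
def is_significant(r, r_critical):
    """True when the absolute value of r reaches or exceeds the critical value."""
    return abs(r) >= r_critical


n = 13          # number of data pairs in Table 1
r = -0.756      # value quoted in the caption of Figure 1b
df = n - 2      # ASSUMPTION: degrees of freedom used to enter the table

# ASSUMPTION: two-tailed critical values of r for df = 11, taken from
# commonly published tables; verify against Appendix 1.
critical_r = {0.05: 0.553, 0.01: 0.684}

for p, r_crit in critical_r.items():
    verdict = "significant" if is_significant(r, r_crit) else "not significant"
    print(f"P = {p}: |r| = {abs(r):.3f}, critical r = {r_crit} (df = {df}): {verdict}")
```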

Exercise:

Compute a, b and r for the data given in Table 1 and test, by means of the table in Appendix 1, whether the computed value of r is significant at P = 0.01 and P = 0.05.

Linearizing Transformation in Regression Analysis

Both regression and correlation analysis, as outlined above, are based on the assumption of a “linear” relationship between the two variables involved (meaning that the best fitting line is straight). There are many cases in fishery biology, however, where the relationship between two variables is non-linear. A well-known example is the length-weight relationship:

W = α · L^b

(6)

where the weight (W) is proportional to a certain power (b) of the length (L) (see Figure 2a).

Length-weight data can, however, be fitted with a (linear) regression if logarithms are taken of both sides, resulting in

log₁₀ W = a + b log₁₀L

(7)

in which a = log₁₀ α. As may be seen from Figure 2b, the logarithms of the length and weight values are fitted extremely well by a linear regression, where

y = log₁₀W

(8a)

and

x = log₁₀L

(8b)

Thus, fitting a length-weight relationship of the form given in Equation 6 to a set of length/weight data (such as given in Table 2) consists of the following steps:

Table 2
Data for the estimation of a length-weight relationship in the threadfin bream *Nemipterus marginatus*¹
Number	TL (cm)	W (g)	Log₁₀ L (=x)	Log₁₀ W (=y)
1	8.1	6.3	0.908	0.799
2	9.1	9.6	0.959	0.982
3	10.2	11.6	1.009	1.064
4	11.9	18.5	1.076	1.267
5	12.2	26.2	1.086	1.425
6	13.8	36.1	1.140	1.558
7	14.8	40.1	1.170	1.603
8	15.7	47.3	1.196	1.675
9	16.6	65.6	1.220	1.817
10	17.7	69.4	1.248	1.841
11	18.7	76.4	1.272	1.883
12	19.0	82.5	1.279	1.916
13	20.6	106.6	1.314	2.028
14	21.9	119.8	1.340	2.078
15	22.9	169.2	1.360	2.228
16	23.5	173.3	1.371	2.239

¹ From the southern tip of the South China Sea. Original.

Step 1	Take the logarithm of the length and weight values.
Step 2	Compute the sums given in the regression section, with x and y values as defined in 8a and 8b.
Step 3	Compute a and b using Equations 2 and 3.
Step 4	Take the antilogarithm of a to obtain α in Equation 6.
Step 5	Write your version of Equation 6.
Step 6	Using the sums computed in Step 2, compute the value of r² and r, and check significance.
Exercise:	(a) Perform Steps 1 to 6 (with P = 0.01) for the length-weight data given in Table 2.
	(b) List other linearizing transformations, and give examples of their use in fishery biology.
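Steps 1 to 6 can be sketched in Python as below, using the TL and W columns of Table 2. This is an illustration under stated assumptions, not a prescribed method: the slope, intercept and r are computed with the usual least-squares and product-moment expressions, and the function name `fit_length_weight` is hypothetical. The logarithms are recomputed from the raw values, so results may differ slightly from a hand calculation based on the rounded logarithms printed in Table 2.

```python
import math

# Table 2: total length (cm) and weight (g) of Nemipterus marginatus
length_cm = [8.1, 9.1, 10.2, 11.9, 12.2, 13.8, 14.8, 15.7,
             16.6, 17.7, 18.7, 19.0, 20.6, 21.9, 22.9, 23.5]
weight_g = [6.3, 9.6, 11.6, 18.5, 26.2, 36.1, 40.1, 47.3,
            65.6, 69.4, 76.4, 82.5, 106.6, 119.8, 169.2, 173.3]


def fit_length_weight(lengths, weights):
    """Fit W = alpha * L**b by linear regression on base 10 logarithms."""
    # Step 1: x = log10(L), y = log10(W)
    x = [math.log10(v) for v in lengths]
    y = [math.log10(v) for v in weights]
    # Step 2: sums, reduced to corrected sums of squares and products
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    syy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    # Step 3: slope and intercept of the log-log regression
    b = sxy / sxx
    a = sum_y / n - b * sum_x / n
    # Step 4: the antilogarithm of a gives alpha
    alpha = 10 ** a
    # Step 6: correlation coefficient, carrying the sign of the slope
    r = math.copysign(math.sqrt(sxy ** 2 / (sxx * syy)), sxy)
    return alpha, b, r


alpha, b, r = fit_length_weight(length_cm, weight_g)
# Step 5: write out the fitted version of Equation 6
print(f"W = {alpha:.4f} * L^{b:.2f}   (r = {r:.3f}, n = {len(length_cm)})")
```

The significance check of Step 6 still requires the critical value of r from Appendix 1 for the chosen P.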

Figure 2a Length-weight relationship of Nemipterus marginatus in the South China Sea (based on data in Table 2)

Figure 2b The same data converted to base 10 logarithms