Previous Page Table of Contents Next Page


15. Regression analysis


As indicated in the previous chapter (from page 59 onwards) querying of data grids in GIS can give you a first indication of relations between different factors. To get a more precise idea you have to carry out a statistical analysison the available data. Regression analysis is a simple statistical tool to look at correlations between two or more types of data and is the most commonly used statistical technique in fisheries biology. First a refresher of the basic mathematics of regression analysis.

15.1. Linear regression

Linear regression is a technique to quantify the relationship, which can be seen in a graph made between two variables. For example, Figure 15.1 presents the relationship between the number of the fishers and the number of gillnets in the different villages around Lake Kadim. It shows you a relationship; when the number of fishers increases, the number of gillnets in the villages increases too, this is called a positive relationship.

FIGURE 15.1
The relation between the number of fishers and the number
of gillnets in the different villages around Lake Kadim

FIGURE 15.2
Relation between the number of fishers and their annual catch of carps

However, if you look at the annual carp catches of the individual fishers (Figure 15.2) you see a negative relationship where the catch decreases as the number of fishers increases.

It is nice to know that there is a positive linear relationship between the number of fishers and the number of gillnets at Lake Kadim, but our goal is to describe this relationship by a mathematical model or equation:

y = a+bx

In this equation y, the number of gillnets, is the variable on the vertical axis of the graph or the dependent variable, while x, the number of fishers, represents the variable on the horizontal axis or the independent variable. The value a (which can be negative, positive or zero) is called the intercept, while the value b (which can be positive or negative) is called ‘slope’ or ‘coefficient of regression’. The question is how to calculate the values of a and b. You will not be bothered with the details but in all statistical textbooks you will see that a and b can be calculated with the following equations:

and

Whereby;

x and y are the values of the different x and y pairs, n is the number of pairs, is the average value of y, and is the average value of x.

Have a look at the following data presented in Table 15.1.

TABLE 15.1
Data for the calculation of the regression between the number of fishers and the number of gillnets at Lake Kadim

Number of fishers
x

No of gill nets
y

x2

y2

x * y

12

90

144

8 100

1 080

33

260

1 089

67 600

8 580

140

1 800

19 600

3 240 000

252 000

160

1 600

25 600

2 560 000

256 000

45

230

2 025

52 900

10 350

111

700

12 321

490 000

77 700

87

600

7 569

360 000

52 200

75

700

5 625

490 000

52 500

66

600

4 356

360 000

39 600

122

700

14 884

490 000

85 400

11

120

121

14 400

1 320

44

100

1 936

10 000

4 400

52

260

2 704

67 600

13 520

56

420

3 136

176 400

23 520

32

210

1 024

44 100

6 720

12

75

144

5 625

900

45

286

2 025

81 796

12 870

78

500

6 084

250 000

39 000

90

585

8 100

342 225

52 650

23

140

529

19 600

3 220

13

82

169

6 724

1 066

n = 21

Some basic calculations of means and sums are provided at the bottom of the table. Using these values and the formulas above, we can calculate our regression slope and intercept parameters:

and a = 479-9.7412 * 62 = -127.321

From this we can describe the relation; y = -127.3 + 9.74x or in words:

Number of gillnets = -127.3 + 9.74 * Number of fishers.

Notice that the slope of the line (9.74) is a positive number, indicating that this is a positive relationship. This agrees with our visual interpretation of Figure 15.1.

In Table 15.2 the data for the number of fishers and their annual catch of carp (CPUE) is provided. Calculate the regression relation

TABLE 15.2
Relation between the number of fishers and the CPUE of carps in Lake Kadim

Fishers

CPUE

x2

y2

x * y

12

1 400

144

1 960 000

16 800

33

900

1 089

810 000

29 700

140

250

19 600

62 500

35 000

160

110

25 600

12 100

17 600

45

1 400

2 025

1 960 000

63 000

111

520

12 321

270 400

57 720

87

800

7 569

640 000

69 600

75

900

5 625

810 000

67 500

66

960

4 356

921 600

63 360

122

250

14 884

62 500

30 500

11

1 320

121

1 742 400

14 520

44

1 000

1 936

1 000 000

44 000

52

1 050

2 704

1 102 500

54 600

56

980

3 136

960 400

54 880

32

1 150

1 024

1 322 500

36 800

12

1 390

144

1 932 100

16 680

45

1 100

2 025

1 210 000

49 500

78

700

6 084

490 000

54 600

90

680

8 100

462 400

61 200

23

1 200

529

1 440 000

27 600

13

1 400

169

1 960 000

18 200

n = 21

Result of the regression analysis: y = 1 465 - 8.6627x.

Nowadays regression analyses have become easier as they are included in all spreadsheet programmes such as Lotus 1-2-3 and Microsoft Excel. In Microsoft Excel regression analysis is carried out in graphs made with the datasets.

Lets do our Lake Kadim example in Microsoft Excel:

1. Start Microsoft Excel, Open the spreadsheet ‘Lake Kadim regression analysis.xls’, from the folder ‘15_Lake_Kad_regr’. You see the data set with the number of fishers, their CPUE, and two graphs.

2. Activate one graph by clicking on it.

3. Go to Chart/Add Trendline via the menu bar. The Add Trendline window will popup (Figure 15.3) and you select linear by checking its box. Then click the Options tab in the Add Trendline window and check Display equation on chart and Display R-squared value on chart (Figure 15.4). Click OK.

FIGURE 15.3
The Add Trendline window

FIGURE 15.4
Checking Display equation on chart and Display R-squared on chart

The Chart of the graph is displayed again. Now with a straight line (the calculated regression) and the relation y = -8.6627x + 1465.8, which you calculated previously displayed in the chart (Figure 15.5).

FIGURE 15.5
The Microsoft Excel regression between the number of fishers and their CPUE in Lake Kadim

In the chart you also see another value: R2 = 0.9164. R-squared, or the coefficient of determination, is the square of the correlation coefficient R. It is a measure of the linear association between two data sets and reflects the amount of variation in the dependent variable which can be explained by variation in the independent variable. R-square values value ranges between 0 (reflecting absolutely no linear relationship between the variables) and 1 (indicating a perfect correlation). For example if R-squared = 0.25, we could say that the variance in the independent variable explains 25 percent of the variation of the dependent variable. The closer to 1 the higher the correlation between the two variables will be. R-square is calculated as (Figure 15.6):

FIGURE 15.6
The equation to calculate R-squared

However a high value of R does not mean that the regression line is always statistically valid. The official way to look at this is to carry out a t-test on the regression coefficient b and test the calculated t-value, or to carry out an ANOVA (or Analysis Of Variance) and test the calculated value of the F-statistic. This can be done in any statistical software package.

15.2. Regression with an Avenue script in ArcView

Unlike Microsoft Excel, regression analysis is not an integral part of ArcView. Fortunately, ArcView provides an internal coding language which allows us to write these type of functions. Many custom functions, including regression analysis, have already been written by ArcView users and are available for public use (for example see ArcScripts at http://gis.esri.com/arcscripts/index.cfm).

A sample regression script is included with the data on the CD which allows for linear regression analysis. The available method is rather basic, can only be applied on shapefiles and has a limited scatterplot function. This application will be shown to you using the example of Lake Kadim and carrying out a regression analysis between the number of fishers and the number of Gillnets in villages around Lake Kadim.

1. Start ArcView, Open a New Project, and a New View. Add to the View the Themes (from the ‘15_Lake_Kad_regr’ folder): ‘Pais pesca country.shp’, ‘Lake kadim boundary.shp’, ‘Lake Kadim data.shp’ and ‘Fishing village lake kadim.shp’.

2. Check the projection and the working directory.

3. First you have to add the script. Close the View and open a new script in the Project window (Figure 15.7).

FIGURE 15.7
Opening a new script

4. You arrive in the Script window, where you have to open the avenue script ‘bvreg.ave’ This is a text file and for your convenience it is placed in the same folder as the Theme file, ‘15_Lake_Kad_regr’. Go to Script/Load Text File... (Figure 15.8).

5. The load script window will appear. Go to the correct subdirectory and select the ‘bvreg.ave’ script and click OK (Figure 15.9).

FIGURE 15.8
Loading a script text file

FIGURE 15.9
Selection of the regression script

For running the script you have to know some tricks. First you must compile the script. ‘Compiling’ simply means that ArcView checks the code text for errors and then converts the code into a format that ArcView can run directly. After you click the ‘Compile button’(Figure 15.10) you will see the button next to it with the running person becomes active. This is the button for running the script. If you click on this button you will get an error message: ‘A(n) project does not recognize request GetActiveThemes’(Figure 15.11).

FIGURE 15.10
Compiling a script

FIGURE 15.11
Regression error message

The basic reason we get this error message is that ArcView does not know where to look for the data to perform the analysis. The simplest way to tell it is to open the View, find the Theme which contains the data for the regression analysis, and make that Theme active. The way to do this seems complicated, but do not worry, it works and once you have done it several times you know the trick. Basically you have to tile the window so that the View and the Script window are visible. Then you can switch from one to the other.

6. Go in the script window to Window/Tile via the menu bar (Figure 15.12). You will get two windows on your screen. Open the view (Figure 15.13).

FIGURE 15.12
Tiling in the script view

FIGURE 15.13
Tiled script and view window

7. After opening your View, tile the windows again via Window/Tile in the menu bar. You should now see three open windows in your project, and your View window should be your active window. You can see this because the window bar of this window is blue. Your View should have four Themes listed. Click on the words ‘Fishing village lake kadim.shp’ (not on the checkbox) to make the Theme active. Now ArcView will know which Theme contains the data for regression analysis (Figure 15.14).

8. Go back to the script window by clicking somewhere inside it, and then click on the Run script button (Figure 15.15).

FIGURE 15.14
Activating the data theme for regression

FIGURE 15.15
Running the script

9. The Bivariate regression window will appear. First you have to indicate the independent variable (X): ‘Fishermen’ (Figure 15.16), after selecting ‘Fishermen’ click OK. In the next window you indicate the dependent variable (Y), ‘Gill_nets’ (Figure 15.17), and Click OK.

FIGURE 15.16
Selecting the independent variable

FIGURE 15.17
Selecting the dependent variable

10. The Bivariate regression results window appears providing the results of the regression. In this case the results are similar to the calculation we made before, click OK. In the next window you will be asked if you want to create a scatter plot. Click Yes and the scatterplot with the calculated regression appears (Figure 15.19).

FIGURE 15.18
Bivariate regression results

FIGURE 15.19
The scatterplot of the calculated regression

15.2.1. Regression analysis of Lake Kadim data using an avenue script

Make the regression analysis between the number of fishers and the CPUE with data from the theme ‘Fishing villages of Lake Kadim’ and compare them with the results of the analysis carried out in Excel.

Make a number of regression analyses with the data from the Theme Lake Kadim data and fill in Table 15.3.

TABLE 15.3
Results of regression analysis of the raw data of Lake Kadim

Parameters

a

b

R-square

Water depth - carp larvae




Water depth - clupeid larvae




Water depth - adult carp




Water temperature - carp larvae




Water temperature - clupeid larvae




Water temperature - adult carp




Chlorophyll - carp larvae




Chlorophyll - clupeid larvae




Chlorophyll - adult carp




From this exercise you see that this script has limitations for the number of points in a scatter plot. The script works well if you want to make a quick regression between a small number of data. However, for a more profound analysis with large data sets it is easier to import the attribute table of the Theme into Excel and or into statistical software[24] such as SPSS, Minitab, Sysstat, or others, and carry out the analysis with these programs.

15.3. Regression analysis between grids with the Grid Regression tool

Regression between grids can be very interesting, especially if we think about the application of the Munro and Thomson plot (1983b) for surplus production models in GIS as demonstrated by Corsi (2000a, a copy of the article can be found on the CD, folder ‘Corsi_article’). These developments are very interesting, and will be discussed in the chapter: A Corsi type analysis of Hake data in the Mediterranean on page 130. However, as with the earlier example, there is no standard tool within ArcView to perform regressions between grids. The internet, at the time of writing this manual, also did not provide such a tool, so this tool called Grid Regression[25] was developed[26].

The Grid Regression extension carries out a regression analysis between pixels at the same location in two different grids. First each pixel at the same location in both grids is given the same Identification number and then the values of the pixels are attached to this ID number. Once this is done regression analysis becomes straightforward as we have an ID number each with a data pair (Figure 15.20).

FIGURE 15.20
Two grids with values for a regression

The Grid Regression extension has two modes of operation;

The results of both modes are similar. The Grid based mode tends to go faster than the vector mode, while the vector mode produces more of the intermediate data such as seen in Table 15.1 and Table 15.2 on pages 78 and 79.

Install the extension by copying the file ‘grid_regression.avx’ from CD1 to your ArcView folder: ‘..AVGIS_30\ARCVIEW\ext32\’ directory. Then, after you started ArcView, go to File/Extensions... via the menu bar and check the one called ‘Grid Regression’. In your View screen, you should see now a new icon that looks a little like a blue regression line.

Grid regression of data from Lake Kadim with the Grid Regression extension

1. Copy the file ‘grid_regression.avx’ from the Extension folder on your CD to your ArcView ‘..AVGIS_30\ARCVIEW\ext32\’directory.

2. Open a New Project, New View. Check the projection (Equal-Area Cylindrical), working directory and properties settings (Distance Units: Meters).

3. Add the following Themes from the ‘16_Lake_Kad_regr_tool’ folder from CD2 (with as Data Source Type: Grid Data Source): ‘Kadimbnd’, ‘adultcarp’, ‘carplarvae’, ‘chlorophyll’, ‘clupeidlarvae’, ‘secchidepth’, ‘waterlevel’, and ‘watertemp’. Also Add the Themes (with Data Source Type: Feature Data Source): ‘fishing village lake kadim.shp’, ‘lake kadim boundary.shp’, ‘lake kadim data.shp’, and ‘pais pesca country.shp’. Legends of all Grid Themes can also be loaded (from the same folder).

4. Go to File/Extensions... via the menu bar and activate the Grid Regression extension.

5. You see that all these Themes are the same as the ones you generated during the exercise: Protection of fish stocks and the creation of protected areas in Lake Kadim, Pais Pesca, on page 70. By querying the different grids you got an idea about the mechanisms behind fish distribution in Lake Kadim. Now check these ideas with the Grid Regression extension.

6. First the relation between water depth and Clupeid abundance. Open the Grid Regression extension by clicking the icon.

7. The Regression Options window appears (Figure 15.21). First you select the independent variable Waterlevel, then the dependent variable Clupeidlarvae. Do not use a boundary or barriers for your analysis, so in the - POLYGON THEME - box, select - Don’t Use Polygons -.

8. Carry out a vector based analysis only, Check all the boxes as indicated in Figure 15.21, because all different parameters and grids need to be calculated and click OK.

FIGURE 15.21
The Regression Options window

At the bottom of the screen a regression analysis progress bar will appear (Figure 15.22).

FIGURE 15.22
The regression analysis progress bar

9. You will be asked where to save the files and give it a name. Do this and click OK.

FIGURE 15.23
The analysis progress window

FIGURE 15.24
The model and ANOVA table, and the scatterplot of the analysis

After this the regression analysis progress window will appear (Figure 15.23). Here you can see which calculations are made and the progress of the analysis. The analysis will take some time, depending on the size of the grids. Different Themes will be automatically added to the View and the regression is finished once you see on your screen Model and ANOVA Table and the scatterplot of the analysis (Figure 15.24).

The results of the regression analysis are summarized in the Model and ANOVA table (Figure 15.25). It first indicates the grids used, and in your case you see (if you continued to work from page 70 onwards) that there was a grid used as a mask; ‘Kadimbnd’. You used this grid as a mask during the interpolation of the different grids of Lake Kadim. In the present analysis you forgot[27] to remove this mask from the Analysis Properties.

Then the table presents the descriptive statistics of both grids used in the regression, and you see that the regression was carried out over 21 945 data pairs. Then follows the results of the linear regression: y = 3.19x + 8.44 or Clupeid abundance = 3.19*Waterlevel + 8.44 and the regression has an R-squared of 0.64, which is significant with P<0.000001.

FIGURE 15.25
The Model and ANOVA table

If you close the Model and ANOVA table you see the scatterplot of the 21 000 pairs of depth and clupeid density used in the analysis (Figure 15.26). From this figure you see immediately there is a positive relation. The higher the waterdepth, the higher the density of clupeid larvae. The red line represents the values of the calculated regression. It fits the values reasonably but from the scatterplot and the calculated regression you see also that a more S-like regression fit would provide better results as the highest values of the clupeid density tends to reach an asymptote. In principle your fit overestimates somewhat all low waterdepths and underestimates somewhat at medium waterdepth around 10 metres.

The Grid Regression extension only performs a linear regression. The data set is saved in your project and can be imported in any statistical or curvefit software. For example, a simple logarithmic plot carried out in Excel improves R-square already somewhat (Figure 15.27).

FIGURE 15.26
Scatterplot between Depth and Clupeid made with the Grid Regression extension

FIGURE 15.27
Scatterplot between Depth and Clupeids made in Microsoft Excel

If you open the View again you see that the Grid Regression extension has added 3 themes to the View (Figure 15.28); the ‘Grid_depclu.shp (VECTOR - based)’ (this name depends on the name you have given the Grid values shapefile in the beginning), the ‘Regression grid (Vector based)’and the ‘Residuals grid (Vector based)’.

FIGURE 15.28
Themes added by the Grid Regression extension to View

The ‘Grid_depclu.shp (VECTOR - based)’ - Theme consists of a set of points located at the cell centres of the grids, and contains all the data from the grids. If you open the attribute table of this Theme (Figure 15.29) you see the different columns with the data for each pixel pair; ‘Independent’ which in this case is the water depth; ‘Dependent’, the Clupeid density; ‘model’, which is the expected Clupeid density value, given the regression model that was calculated by the Grid Regression extension, and ‘Residuals’, which is the difference between the expected Clupeid density and the actual Clupeid density.

FIGURE 15.29
The attribute table of the data used in ‘Grid Regression’

‘Regression grid (vector based)’ contains the calculated values of the Clupeid density for each pixel and the ‘residuals grid (vector based)’ contains the residual for each pixels. Both grids and the scatterplot allow you to get an idea of the reliability of the calculated regression. The original data (clupeid density), water depth, the calculated values, and the residuals are plotted in Figure 15.30, Figure 15.31, Figure 15.32, and Figure 15.33. From the results you see that the calculated distribution follows the real distribution reasonably well. From the residuals you see that the major difference is found in the medium water depth (6-12 metres), where the density is underestimated which you saw already from the scatter plot (Figure 15.26).

FIGURE 15.30
Clupeid density in Lake Kadim (original measurements)

FIGURE 15.31
Water depth of Lake Kadim

FIGURE 15.32
Calculated clupeid density (Model)

FIGURE 15.33
Residual plot between the original and calculated densities

15.3.1. Grid regression of data from the fish survey at Lake Kadim

Make regression analysis between the different data grids of Lake Kadim and fill in Table 15.4 Results of the Grid regression between the different data grids of Lake Kadim.

TABLE 15.4
Results of the Grid regression between the different data grids of Lake Kadim

Parameters

a

b

R-square

Acceptable

Water depth - carp larvae





Water depth - clupeid larvae

8.47

3.19

0.634

Yes

Water depth - adult carp





Water temperature - carp larvae





Water temperature - clupeid larvae





Water temperature - adult carp





Chlorophyll - carp larvae





Chlorophyll - clupeid larvae





Chlorophyll - adult carp





Attention!: A warning about grid regression:

Although the above examples illustrate the usefulness and utility of grid regression, it should be noted that some aspects of it may violate some of the assumptions of basic regression statistics. The end results of these violations would likely be that your estimated parameters (i.e. your slope, y-intercept and R-square values) are probably a little bit off, and in particular your R-square value is likely to be slightly less than the calculated value. This may not be a problem in many cases because it is still a good method for identifying relationships between our independent variable and our predictor variables, and therefore helps us to predict what our independent variable will likely be doing in different areas based on our predictor variables. We do, however, have to be careful to report that there is some uncertainty about our model because of these violations, and be cautious when our R-squared value is near the limits of what we consider to be significant.

In particular, the violations are:

Additional Reading:

For those students who would like to learn about regression in depth, there are many texts available that cover it thoroughly. Two such texts that the authors recommend are:

For those students who would like to learn more about spatial autocorrelation, some of the classic references are:


[24] Free statistical software can be downloaded at http://members.aol.com/johnp71/javasta2.html#General.
[25] The Grid regression tool can be downloaded at http://www.jennessent.com/arcview/arcview_extensions.htm.
[26] by Jeff Jenness, Jenness Enterprises, http://www.jennessent.com.
[27] Or we forgot to instruct you to remove this mask.

Previous Page Top of Page Next Page