3. REVIEW OF EXISTING GEOREFERENCED POPULATION DATASETS

The previous chapter reviewed definitions of urban and rural areas, and analysed what statistical data are available and what georeferenced datasets could be used as potential inputs into models of population distribution. This chapter reviews the two most widely known and used georeferenced global population distribution databases that have been developed from these sources, and describes several recent efforts to model population distribution that take urban and rural areas explicitly into account.

The Gridded Population of the World (GPW), originally developed at the National Center for Geographic Information and Analysis (NCGIA) and subsequently updated by the Center for International Earth Science Information Network (CIESIN) at Columbia University, attributes population to the lowest subnational administrative units for which population counts are available. In GPW the population count for each administrative unit is distributed uniformly across all the grid cells of the unit, without considering whether a grid cell belongs to an urban or a rural area.

The LandScan Global Population Database, produced by the Oak Ridge National Laboratory (ORNL), distributes national populations by land cover category, according to a model with assumed coefficients for population occurrence in each type of land cover.

General information about how each database was produced is given below, along with the main advantages and disadvantages of each. In both cases, the primary sources of population are data from censuses and surveys compiled for political or administrative units. The term global is used to indicate that there is no explicit reference to urban or rural areas, and only overall total population counts and densities are given. As there is more than one global database available, each being produced by different methods, the most suitable database should be chosen largely on the basis of the type of application for which it is to be used.

3.1 GRIDDED POPULATION OF THE WORLD

The GPW project was the first major attempt to generate a consistent global georeferenced population dataset. It was originally produced at the National Center for Geographic Information and Analysis (NCGIA) in 1995 (Tobler et al., 1995), and subsequently updated by CIESIN in 2000 (Deichmann et al., 2001) and in 2004 (Balk and Yetman, 2004).

GPW was the first global rasterized dataset of population totals based solely on administrative boundary data and population estimates associated with those administrative units. In the original version, two datasets at 2.5 arc-minutes were produced with the data for the year 1990:

unsmoothed, where the gridding algorithm assigned population in grid cells with multiple input polygons by a straight majority rule, and
smoothed, where population was distributed based on a smoothing method called pycnophylactic interpolation (Tobler et al., 1995), which assumes that grid cells close to administrative units with higher population density tend to contain more people than those close to low density units.
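
The smoothing idea can be illustrated with a minimal one-dimensional sketch in Python. It is not the code used for GPW: the three-cell moving average, the multiplicative rescaling and all names and figures are assumptions chosen for illustration. Each pass smooths the surface and then rescales the cells of every unit so that the unit total is preserved, which is the defining property of pycnophylactic interpolation.

```python
def pycnophylactic_1d(zone_of_cell, zone_population, iterations=20):
    """Smooth a 1-D population surface while preserving every zone total."""
    # Start from the uniform allocation: zone total divided by its cell count.
    counts = {}
    for zone in zone_of_cell:
        counts[zone] = counts.get(zone, 0) + 1
    surface = [zone_population[z] / counts[z] for z in zone_of_cell]

    for _ in range(iterations):
        # Step 1: replace each cell by the mean of itself and its neighbours.
        smoothed = []
        for i in range(len(surface)):
            window = surface[max(0, i - 1):i + 2]
            smoothed.append(sum(window) / len(window))
        # Step 2: rescale each zone so that it keeps its original total.
        zone_sums = {}
        for zone, value in zip(zone_of_cell, smoothed):
            zone_sums[zone] = zone_sums.get(zone, 0.0) + value
        surface = [
            value * zone_population[zone] / zone_sums[zone] if zone_sums[zone] else 0.0
            for zone, value in zip(zone_of_cell, smoothed)
        ]
    return surface


# Hypothetical example: a sparsely populated unit next to a dense one.
zone_of_cell = ["low", "low", "low", "high", "high", "high"]
zone_population = {"low": 300, "high": 3000}
smooth_surface = pycnophylactic_1d(zone_of_cell, zone_population)
```

In this example the cell of the low-density unit that borders the dense unit receives more people than the cell farthest from it, while both unit totals remain unchanged.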

Since that first release, higher resolution population data sets have been compiled for various regions of the world. In 2000 CIESIN released an updated second version of GPW. GPWv2 is based on more detailed administrative units, resulting in an improved median resolution. The median resolution is defined as the ratio of total area of the country to number of administrative units; a lower number indicates a larger number of administrative units, and therefore a more spatially refined dataset. Nonetheless, no effort was made to model population distribution, and no ancillary data were used to predict population distribution or revise the population estimates. The only assumption made was that population is uniformly distributed within each administrative unit. The latest version, GPWv3 (Web site ref. 13), is based on the same assumptions as the previous version but relies on more recent data at higher resolutions (see Map 3.1). In particular, the number of administrative units has increased from approximately 128 000 in GPWv2 to more than 375 000 in GPWv3, and consequently the average median resolution has dropped from 33 in GPWv2 to 18 in GPWv3. This new version contains unadjusted population data for the years 1990, 1995 and 2000, as well as data for those years adjusted to match United Nations population estimates. Data about land area and population density are also included. In order to avoid mismatches at the borders between countries, most country boundaries have been matched to standard sources, namely Seamless Administrative Boundaries of Europe (SABE, Web site ref. 14) and DCW.

The main advantages in using GPW are that it relies on a very simple area-weighting scheme for reallocation, and on the best possible census and administrative data available. GPW also provides updates every five years, allowing for a (short) time series analysis. Its main drawbacks are its coarse resolution of 2.5 arc-minutes, which corresponds to approximately 5 kilometres at the equator, and the lack of any modelling of population distribution within administrative units, causing population to be evenly distributed across any given administrative unit. This is unlikely to represent a realistic population distribution, especially within large units with significant variation in land cover characteristics.
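
The uniform allocation described above can be sketched in a few lines of Python. This is a simplified illustration, not the GPW gridding code: it assumes that all cells have the same area and that each cell lies entirely within one administrative unit, and the grid and population figures are hypothetical.

```python
def allocate_uniform(unit_grid, unit_population):
    """Spread each unit's population evenly over the cells of that unit."""
    # Count the cells that belong to each administrative unit.
    cell_counts = {}
    for row in unit_grid:
        for unit_id in row:
            cell_counts[unit_id] = cell_counts.get(unit_id, 0) + 1
    # Every cell receives the same share of its unit's total.
    return [
        [unit_population[unit_id] / cell_counts[unit_id] for unit_id in row]
        for row in unit_grid
    ]


# Hypothetical 3 x 4 grid: each cell holds the id of its administrative unit.
unit_grid = [
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
    ["A", "A", "A", "B"],
]
unit_population = {"A": 7000, "B": 500}
population_grid = allocate_uniform(unit_grid, unit_population)
```

Every cell of unit A receives the same count regardless of land cover, which is the limitation noted above for large and heterogeneous units.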

MAP 3.1
Population density in 2000 from GPWv3 adjusted to UN totals

Source: Center for International Earth Science Information Network (CIESIN), Columbia University and Centro Internacional de Agricultura Tropical (CIAT)

3.2 LANDSCAN GLOBAL POPULATION DATABASE

The Oak Ridge National Laboratory developed LandScan (Web site ref. 15) in 1998 (Dobson et al., 2000) in order to overcome the limitations of GPW, and originally in response to a demand for distributed population data that would show emergency workers where populations were likely to be concentrated in the event of a disaster. It was subsequently updated in 2000, 2001 and 2002. LandScan was conceived as an effort to capture ambient population, rather than decennial population counts. The difference between ambient and resident population is not significant as the results are quite coarse in all available population density maps.

LandScan 2003 was released shortly before this report went to press. In this FAO study, a modified version of LandScan 2002 (LandScan-a) was used, as explained in section 4.2.1 (see Map 3.2).

The sources used for the LandScan version released in 1998 included DCW, Nighttime Lights, GLCC, high-resolution aerial photography and satellite imagery. The methodology was subsequently updated and the input layers improved.

In the 2000 version of LandScan, the major improvement was the use of VMap1 (see section 2.3.1) with its superior identification of road networks, populated places and water bodies. In the 2001 version, the major improvement was better information about second order administrative boundaries for population distribution outside the United States and, within the United States, newly available high-resolution (30 metre) land cover data products. In 2002, refinements were made to the algorithms of the population models and the MODIS land cover database was used as an input data source.

The LandScan methodology consists of an automated procedure to allocate population data to 30 arc-second cells, which correspond to approximately 1 square kilometre at the equator. The population estimates used as inputs are based primarily on aggregate data for second order administrative units compiled by the International Programs Center of the US Bureau of the Census and represent the most recent census information for each country. These population counts are allocated to the individual 30 arc-second cells through a ‘smart’ interpolation method that assesses the relative likelihood of population occurrence in cells on the basis of road proximity, slope, land cover, and Nighttime Lights. Probability coefficients are assigned to every value of each input variable, and a composite probability coefficient is calculated for each LandScan cell. The coefficients for all regions are based on the following factors:

Roads, weighted by distance from major roads.
Elevation, weighted by favourability of slope categories.
Land cover, weighted by type with exclusions for certain types.
Nighttime Lights of the World, weighted by frequency.

The resulting coefficients are weighted values, independent of census data, which can then be used to apportion shares of actual population counts within any particular area of interest. Coefficients vary considerably from country to country, and even between different regions of the same country.

MAP 3.2
LandScan Global Population Database, adjusted to UN figures for the year 2000

Source: Oak Ridge National Laboratory (ORNL), Tennessee, USA

Control totals can be based on any administrative unit (whether nation, province, district or minor civil division) or on any arbitrary polygon for which census data are available. The resulting population distribution is normalized and compared with appropriate control totals to ensure that aggregate distributions are consistent with census control totals.
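
The apportionment of a control total by composite coefficients can be sketched as follows. The LandScan coefficients and the way they are combined have not been published, so everything specific in this Python example is an assumption: the category names, the coefficient values and the choice of multiplying the four factors together are hypothetical and serve only to show how weights independent of census data can share out a census count.

```python
# Hypothetical coefficients; the actual LandScan values are not published.
ROAD_WEIGHT = {"near": 1.0, "far": 0.25}
SLOPE_WEIGHT = {"flat": 1.0, "steep": 0.5}
LAND_COVER_WEIGHT = {"built": 1.0, "cropland": 0.5, "water": 0.0}  # water excluded
LIGHT_WEIGHT = {"lit": 1.0, "dark": 0.5}


def composite_weight(cell):
    """Combine the four factor weights (assumed here to be a simple product)."""
    return (
        ROAD_WEIGHT[cell["road"]]
        * SLOPE_WEIGHT[cell["slope"]]
        * LAND_COVER_WEIGHT[cell["cover"]]
        * LIGHT_WEIGHT[cell["lights"]]
    )


def apportion(cells, control_total):
    """Share a control total among cells in proportion to their weights."""
    weights = [composite_weight(cell) for cell in cells]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("no cell in this area can receive population")
    # Normalizing by the sum of weights keeps the result equal to the control total.
    return [control_total * weight / total_weight for weight in weights]


# Hypothetical area of interest made of three cells.
cells = [
    {"road": "near", "slope": "flat", "cover": "built", "lights": "lit"},
    {"road": "near", "slope": "flat", "cover": "cropland", "lights": "dark"},
    {"road": "near", "slope": "flat", "cover": "water", "lights": "dark"},
]
cell_population = apportion(cells, control_total=5000)
```

Because the weights are normalized within the area, the same function applies whether the control total refers to a nation, a province or an arbitrary polygon.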

The advantages of LandScan, as compared with GPW, include its better output resolution of 30 arc-seconds, as opposed to 2.5 arc-minutes, and the use of an extensive model to predict population distribution within administrative units. Although LandScan takes urban areas into account, it does not distinguish urban and rural populations in the database. However, the input layers are such that urban areas can be inferred by analysing the population density.

One problem with LandScan concerns the roads database. The model processes the input layers country by country without taking into consideration the spatial continuity of road networks across borders, resulting in discontinuities in population density at country boundaries. Another problem is that, owing to the way in which the LandScan processing methods evolved, population comparisons between the available revisions of the database are not possible. Although each revision of LandScan represents the adjusted midyear (July) population estimates for its year, the available 1998, 2000, 2001, 2002 and 2003 releases do not form a time series that can be used for pixel-by-pixel analyses or comparisons (see also Dooley, 2005). In addition, the underlying models have not been published, so the assumptions employed by LandScan to distribute population counts to pixels are not known.

3.3 GLOBAL RURAL URBAN MAPPING PROJECT

In a recent project, CIESIN and partners such as the International Food Policy Research Institute (IFPRI), the World Bank and the Centro Internacional de Agricultura Tropical (CIAT) developed a model for redistributing population within administrative units by combining data from several sources. The description of the method and the datasets, in the box, draws on the working paper available at the GPW Web site (Balk et al., 2004a).

BOX 3.1
GLOBAL RURAL URBAN MAPPING PROJECT (GRUMP) DATASET
What does the GRUMP dataset contain?
Human settlements database of about 55 000 settlement points that have a population of 1 000 or more	A global database of cities and towns (points). Each point, represented as a latitude/longitude pair, has associated tabular information on its population and data sources. Population data were gathered primarily from official statistical offices (census data) and secondarily from other sources, such as Gazetteer and City Population. Based on the data available and applying UN growth rates, population was estimated for the years 1990, 1995 and 2000. When the records for cities and towns did not include latitude and longitude coordinates, those were taken from the NIMA database, based on a match of city name and administrative unit. As mentioned earlier, owing to uncertainties in the positional accuracy of the NIMA coordinates, some of the cities and towns might not be accurately geolocated.
Urban extent database of over 21 000 areas	The GRUMP urban mask represents an attempt to delineate extents associated with human settlements globally. The physical extents of settlements are derived from both raster and vector datasets. In particular, the team used the Nighttime Lights dataset for the period 1994–1995 (Elvidge et al., 1997, 2001), DCW Populated Places, and cities from the Tactical Pilotage Charts (TPCs; standard charts produced by the Australian Defense Imagery and Geospatial Organization, at a scale of 1:500 000) for selected countries in Africa. All the sources of urban extent (night-lights, DCW polygons and TPCs) were combined in order to obtain the maximum possible coverage for each country. The population values are assigned to the physical extents from points within a three-kilometre buffer. For points that are not within the three-kilometre buffer of an extent, circles were created based on the relationship between population size and areal extent for the points with known parameters. These newly created circles were added to the existing extents to create a complete coverage of urban extents with population information for each country.
Urban-rural population grid, with an output resolution of 30 arc-seconds	The urban-rural population grid was created by using a mass-conserving algorithm called GRUMPe (Global Rural Urban Mapping Programme), developed by CIESIN, that reallocates people into urban areas within each administrative unit. In particular, the data inputs are the administrative polygons, containing the total population for each administrative unit, and the populated urban extents. The reallocation process works iteratively so that the output urban and rural proportions match, when possible, the UN ones. Although the UN totals are useful as a benchmark, in some cases the GRUMP output proportions have not been matched to the UN ones (for example, when CIESIN's data include many more small settlements than those corresponding to the urban threshold given by the country).
What are GRUMP's main advantages?
The main advantage of GRUMP is that it uses population data from the census, rather than predicting it based on probability coefficients or lighted areas. Also, it makes use of other GIS data to identify urban areas, compensating for the small settlements in poor countries that are not detected by the Nighttime Lights. The resulting grid is a dataset at moderate resolution that represents a more accurate distribution of human population than the existing datasets and that makes explicit reference to urban and rural areas.
What are GRUMP's main limitations?
The lights are known to overestimate the actual extents of urban areas (Elvidge et al., 2004), but, as previously discussed, applying a threshold would remove many small settlements that are not frequently lit, particularly in developing countries. Given the complexity of finding a single threshold that could work globally (Small et al., 2005), no light threshold was applied, resulting in an overestimation of the urban extents in some parts of the world. Although population is estimated for three time periods (1990, 1995 and 2000), users need to remember that the lights refer to one point in time only (the 1994/1995 time period), so it would not be advisable to use these extents for any analysis of change in urban areas.
These data provide the first systematic assessment of the world's urban land area (nearly three percent; Balk et al., 2004a) and show that population distributions differ dramatically between ecosystems. Coastal zones are the most urban of all systems, and sustain the highest population densities, not only in the urban areas, but in the rural ones as well. The GRUMP grid is one of the key input datasets in the Millennium Ecosystem Assessment (McGranahan et al., 2005).
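
The mass-conserving principle behind the urban-rural grid in Box 3.1 can be illustrated with a short Python sketch. It is not the GRUMPe algorithm, whose iterative matching to UN proportions is not reproduced here. The sketch assumes a single administrative unit, equal-area cells and an even spread of people within the urban and the rural part; the function name and all figures are hypothetical.

```python
def reallocate_unit(unit_total, urban_population, is_urban):
    """Split one unit's population between its urban and rural cells."""
    n_urban = sum(is_urban)
    n_rural = len(is_urban) - n_urban
    # Cap the urban population so that the unit total is never exceeded.
    urban = min(urban_population, unit_total) if n_urban else 0.0
    rural = unit_total - urban
    if n_rural == 0 and n_urban > 0:
        # No rural cells: the whole unit total goes to the urban extent.
        urban, rural = unit_total, 0.0
    per_urban_cell = urban / n_urban if n_urban else 0.0
    per_rural_cell = rural / n_rural if n_rural else 0.0
    return [per_urban_cell if flag else per_rural_cell for flag in is_urban]


# Hypothetical unit of six cells, two of which fall inside an urban extent
# whose settlement points add up to 6 000 people.
is_urban = [True, True, False, False, False, False]
unit_cells = reallocate_unit(unit_total=10000, urban_population=6000, is_urban=is_urban)
```

The sum over the cells equals the unit total, so the census count for the administrative unit is preserved while the urban cells receive a higher density than the rural ones.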

MAP 3.3
Population density in 2000 from GRUMP adjusted to UN totals

Source: Center for International Earth Science Information Network (CIESIN), Columbia University; International Food Policy Research Institute (IFPRI), the World Bank and Centro Internacional de Agricultura Tropical (CIAT)

3.4 POPULATION DATABASES FOR AFRICA, ASIA AND LATIN AMERICA

Population databases for Africa, Asia and Latin America, compiled by the United Nations Environment Programme (UNEP) and partners (CIAT and CIESIN), build on the GPW tradition but take road networks and populated places into account in the redistribution of population (Web site ref. 16).

As described in the documentation (Deichmann, 1996a; Hyman et al., 2000; Nelson, 2004), a model was created in the following stages. First, information about the transportation network and urban centres was collected. The transportation network included roads, railroads and navigable rivers using data from DCW, the World Boundary Databank II, and Michelin paper maps, while information about urban centres consisted of the location and size of towns and cities from the human settlements database of GRUMP. This information was then used to compute a simple measure of accessibility for each node in the network. This measure is the so-called population potential, which is the sum of the population of towns in the vicinity of a given node weighted by a function of distance, using network distances rather than straight-line distances. The computed accessibility estimates for each node were subsequently interpolated onto a regular raster surface. A simple inverse distance interpolation procedure was used, which resulted in a relatively smooth surface. Raster data for inland water bodies (lakes and glaciers), protected areas and altitude were then used to adjust the accessibility surface heuristically. Finally, the population totals estimated for each administrative unit were distributed in proportion to the accessibility index measures estimated for each grid cell. The input administrative units, with corresponding population numbers, are the same as those of GPW. The output resolution, as for GPW, is 2.5 arc-minutes.
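
Two of these stages, the population potential of a node and the final proportional distribution, can be sketched in Python. The sketch makes several assumptions that are not taken from the documentation: the distance-decay function is a negative exponential with a hypothetical 50 kilometre scale, network distances are supplied as a ready-made table, and the interpolation and heuristic adjustment stages are omitted.

```python
import math


def population_potential(town_population, network_distance_km, decay_km=50.0):
    """Sum town populations weighted by an assumed exponential distance decay."""
    potential = 0.0
    for town, population in town_population.items():
        # Network distance from the node to the town, not straight-line distance.
        distance = network_distance_km[town]
        potential += population * math.exp(-distance / decay_km)
    return potential


def distribute_by_accessibility(unit_total, accessibility):
    """Share a unit's population among cells in proportion to accessibility."""
    total_accessibility = sum(accessibility)
    return [unit_total * value / total_accessibility for value in accessibility]


# Hypothetical node located in town T1 and 50 km by road from town T2.
town_population = {"T1": 100000, "T2": 20000}
network_distance_km = {"T1": 0.0, "T2": 50.0}
node_potential = population_potential(town_population, network_distance_km)

# Hypothetical unit of two cells, one three times as accessible as the other.
cell_population = distribute_by_accessibility(8000, [3.0, 1.0])
```

As in GPW, the population total of each administrative unit is preserved; only its distribution among the cells of the unit changes.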

This model undoubtedly represents an improvement upon GPW, in that it takes into account road networks and populated places to achieve a better reallocation of population within administrative units. Unlike LandScan, only roads and populated places are used, and there is no explicit effort to capture the ambient quality of the LandScan approach. The resolution might still be too coarse for detailed studies at the local/national level, but it provides consistent population distributions across continents, allowing analysis at the regional scale.

3.5 OTHER RESEARCH EFFORTS TO MAP URBAN POPULATION

In this section, other recent attempts to model population distribution are described. The first two use GPW as base population input and additional georeferenced datasets, while for the third the starting point is country-level demographic statistics. The first is work in progress, and is not available publicly.

The first one was conducted by CIESIN, in a parallel effort to the GRUMP database. CIESIN pursued a method for improving on the GPW by using the Nighttime Lights dataset to identify urban areas (Pozzi et al., 2003). The project aimed to overcome some of the limitations of LandScan (extensive modelling), GPW (lack of modelling) and GRUMP (extensive data collection) by developing a simple model to redistribute population within administrative units according to human settlements. Human settlements are identified by the Nighttime Lights dataset produced for the year 1994/1995 (Elvidge et al., 1997, 2001). The reallocation of population within administrative units is based on a function derived from the relationship between the population density and Nighttime Light frequency for a sample of regions of the world with spatially detailed administrative areas. The result is spatial refinement in areas or countries with relatively large populations but poor spatial detail for administrative boundaries. As the identification of urban areas is based solely on Nighttime Lights, the reallocation may not be very accurate in countries with poor lights coverage (for instance in Africa).

The second effort was conducted at the Department of Geography and Center of Remote Sensing at Boston University, as part of a larger project to map global land cover from MODIS data (Web site ref. 17). The authors present a method for mapping urban land cover at a spatial resolution of one kilometre by fusing multiple sources of coarse resolution data (Schneider et al., 2003). The objective was to determine the boundaries and the extents of urban areas more accurately. Population density data were used as one of the sources for determining the probable location of urban areas, but no effort was made to actually estimate urban population counts. Two major tasks were involved in this study. First, a supervised decision tree classification method was developed by fusing one-kilometre MODIS data and two ancillary sources: the Nighttime Lights data (Elvidge et al., 1999) and population density data (GPW, see Tobler et al., 1995; Deichmann et al., 2001). The second task was to establish the best means for evaluating the accuracy of urban land cover maps produced over large regions, an issue that is especially problematic when the class of interest is a small fraction of the total area mapped. For most parts of the world, multiple data sources were fused to achieve the results. The fusion of these three data types improves urban classification results by resolving confusion between urban and other classes that occurs when any one of the data sets is used by itself.

For Africa, the ancillary data were too problematic, and Africa was successfully mapped with MODIS data alone. Any city around the globe larger than a few square kilometres should be represented, barring those areas (such as the majority of the Congo basin) that have continuous cloud cover. However, the scale of cities in developing countries is quite different from the rest of the world, so that most small cities in Africa, India and China (which might only be one pixel) are not represented (Schneider, personal communication).

The third project is part of the World Water Development Report II Indicators for World Water Assessment Programme (Web site ref. 18). The University of New Hampshire Water Systems Analysis Group has developed a compendium of Earth System and socio-economic databases describing the current state of global water resources, including associated human interactions and pressures. Global population fields were constructed for the year 2000 using country-level demographic statistics contained in the World Resources Institute (WRI) Earth Trends database. The urban and rural population data sets were developed by spatially distributing the WRI 2000 country-level urban population data among DMSP-OLS nighttime stable-lights imagery (Elvidge et al., 1997a) and ESRI Digital Chart of the World populated places points. Country-level urban population was evenly distributed among the cells of the DMSP-OLS city lights data set, at one-kilometre grid cell resolution, with detectable lights in at least ten percent of the cloud-free observations (Elvidge et al., 1997b). Where available, the spatial extents of major city locations with known demographic data (Tobler et al., 1995) were superimposed on the DMSP-OLS city lights data set to enhance the accuracy of the urban population distribution. Rural population was spatially distributed equally among the DCW populated places points falling outside of the DMSP-OLS city lights extent. Total population is simply the sum of urban and rural population data sets gridded to the 30 minute simulated topological river network (STN-30) (Fekete et al., 2001).
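
The allocation rule of this third project can be sketched in Python as follows. The sketch is a simplified illustration under stated assumptions: cells are listed in one dimension, each cell holds at most one populated place point, the superimposed city extents and the final aggregation to the river network grid are omitted, and all names and figures are hypothetical.

```python
LIGHT_THRESHOLD = 10.0  # percent of cloud-free observations with detectable lights


def distribute_country(urban_total, rural_total, light_frequency, has_place):
    """Return per-cell urban and rural population lists for one country."""
    lit = [frequency >= LIGHT_THRESHOLD for frequency in light_frequency]
    # Rural population goes only to populated places outside the lit extent.
    rural_cell = [place and not is_lit for place, is_lit in zip(has_place, lit)]
    n_lit = sum(lit)
    n_rural = sum(rural_cell)
    urban_share = urban_total / n_lit if n_lit else 0.0
    rural_share = rural_total / n_rural if n_rural else 0.0
    urban = [urban_share if is_lit else 0.0 for is_lit in lit]
    rural = [rural_share if flag else 0.0 for flag in rural_cell]
    return urban, rural


# Hypothetical country of five cells.
light_frequency = [80.0, 15.0, 5.0, 0.0, 0.0]
has_place = [True, False, True, True, False]
urban, rural = distribute_country(900, 200, light_frequency, has_place)
total = [u + r for u, r in zip(urban, rural)]
```

Cells that are neither lit nor marked by a populated place point receive no population, which follows directly from the rule described above.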