2. SOURCES FOR URBAN AND RURAL POPULATION DATASETS

The primary sources for population data are national censuses and other demographic surveys. The publicly available data include population totals at the country level and by administrative units, as well as population data for cities - usually above a certain size. In this chapter, the ways in which urban and rural areas have been defined are examined, and the statistical sources of population data and geospatial databases which are used to produce some of the georeferenced population datasets are described.

2.1 DEFINITIONS

The task of defining urban population has always been particularly challenging. The United Nations itself recognizes the difficulty of defining urban areas globally, stating that, “because of national differences in the characteristics that distinguish urban from rural areas, the distinction between urban and rural population is not amenable to a single definition that would be applicable to all countries” (UN, 1998). Rural areas are usually defined as “what is not urban” (UN, 1998 and 2004), and so inconsistencies in the definition of what is urban lead to inconsistencies in characterizing what is rural.

Each country defines the term ‘urban’ in its own way, although this is often only in terms of other labels; for example, ‘urban centres’, ‘major cities’, ‘administrative centres’ or ‘municipalities’. Sometimes the administrative boundaries of human settlements such as cities, towns and villages are available and are used to distinguish urban from rural; the populations within these administrative units are classified as urban. When definitions are based on quantitative thresholds, the minimum population for a place to be considered urban varies greatly. For instance, in several countries in Latin America and West Africa, the threshold is a population of 2 000, whereas it is 200 in Iceland, and 10 000 in countries like Italy and Benin. Alternatively, the definition of an urban population can be very complex, involving the socio-economic characteristics of the population or community (UN, 2004).
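To illustrate how much the threshold-based definitions diverge, the sketch below classifies one hypothetical settlement under the three minimum-population thresholds quoted above. The function names, the labels in the table and the sample population of 5 000 are assumptions made for illustration only. Real national definitions often combine a threshold with administrative or socio-economic criteria.

```python
# Minimum population for a place to count as urban, as quoted in the text.
# The keys are illustrative labels, not official national definitions.
URBAN_THRESHOLDS = {
    "Iceland": 200,
    "several Latin American and West African countries": 2_000,
    "Italy, Benin": 10_000,
}


def classify(population, threshold):
    """Return 'urban' if the population meets the threshold, else 'rural'.

    Treating the threshold as inclusive (>=) is an assumption.
    """
    return "urban" if population >= threshold else "rural"


def classify_everywhere(population):
    """Classify one settlement under every threshold in the table."""
    return {
        label: classify(population, threshold)
        for label, threshold in URBAN_THRESHOLDS.items()
    }


if __name__ == "__main__":
    # A hypothetical settlement of 5 000 inhabitants.
    for label, result in classify_everywhere(5_000).items():
        print(f"{label}: {result}")
```

The same settlement counts as urban under the two lower thresholds and as rural under the highest, which is why urban totals built on national definitions are not directly comparable.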

An urban agglomeration is generally easier to define. The United Nations describes it as a place that “comprises a city or town proper and also the suburban fringe or thickly settled territory lying outside, but adjacent to, its boundaries. A single large urban agglomeration may comprise several cities or towns and their suburban fringes” (UN, 1998). Nonetheless, the spatial boundaries of the agglomeration or the cities included are usually not provided, yielding great uncertainty about its characterization.

This lack of commonly accepted definitions makes it extremely difficult to find a global basis for defining urban areas.

As will be seen in the description of relevant geospatial datasets, remote sensing can be a helpful tool in characterizing urban areas consistently on a large scale. With the development and refinement of spatial techniques for defining urban boundaries and modelling spatial distribution of population, the choice of whether to use UN definitions and urban/rural population counts that conform to national usage but are not consistent across countries, or to use spatial information about human settlements and urban area boundaries derived from satellite imagery, will depend on the objectives of specific research applications.

2.2 STATISTICAL SOURCES AND DATABASES

Recognized sources for internationally comparable population data are the UN Population Division (Web site ref. 1) and the United States Bureau of the Census International Programs Center (IPC) (Web site ref. 2).

Other widely used sources of population data, such as the Web sites of Gazetteer (Web site ref. 3) or City Population (Web site ref. 4), supplement data from the previous two sources with information that they obtain from other official in-country sources and local survey data for urban agglomerations, cities and towns.

2.2.1 Internationally-recognized country-by-country population databases

The primary sources of the UN and US Census Bureau are national censuses and other demographic surveys. These are not conducted annually, and the census or survey dates vary from one country to another. Statisticians in both the UN and the US Census Bureau collect these data and interpolate from one survey to the next to create data series.
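The text does not say which interpolation method the two agencies use. As a minimal sketch of the general idea, the example below fills the years between two census counts by assuming a constant annual growth rate (geometric interpolation). The function names and the census figures are hypothetical.

```python
def interpolate_population(year, year0, pop0, year1, pop1):
    """Estimate the population in `year` from two census counts.

    Assumes a constant annual growth rate between the two censuses
    (geometric interpolation). This is one common choice, not
    necessarily the method used by the UN or the US Census Bureau.
    """
    if year0 >= year1:
        raise ValueError("year0 must precede year1")
    if pop0 <= 0 or pop1 <= 0:
        raise ValueError("population counts must be positive")
    fraction = (year - year0) / (year1 - year0)
    return pop0 * (pop1 / pop0) ** fraction


def annual_series(year0, pop0, year1, pop1):
    """Build a year-by-year series covering both census years."""
    return {
        year: interpolate_population(year, year0, pop0, year1, pop1)
        for year in range(year0, year1 + 1)
    }


if __name__ == "__main__":
    # Hypothetical censuses: 1 000 000 in 1990 and 1 300 000 in 2000.
    for year, population in annual_series(1990, 1_000_000, 2000, 1_300_000).items():
        print(year, round(population))
```

Because census dates differ between countries, a step of this kind is needed before figures for a common reference year can be compared.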

The International Data Base (IDB), created in the US Census Bureau's IPC, is a computerized source of demographic and socio-economic statistics for 227 countries and areas of the world. The IDB combines data from country sources (especially censuses and surveys) with IPC's estimates and projections to provide information dating back as far as 1950 and as far ahead as 2050. Because the IDB is maintained at IPC as a research tool in response to the requirements of its sponsors, the amount of information available for each country may vary.

Through its Demographic Yearbook system, the UN Statistics Division (UNSD) has collected country-by-country population data from national statistical authorities since 1948, through a set of questionnaires dispatched annually to over 230 national statistical offices. UNSD's annual Demographic Yearbooks provide the latest available statistics on population size and composition, fertility, adult mortality, infant and foetal mortality, marriage and divorce, as well as special topic issues. The 26 tables of the Demographic Yearbook 2002, together with the technical notes, are available electronically.

UNSD also provides data on the population of capital cities and cities of 100 000 and more inhabitants for the latest available year. In this database the population data are given for the city proper and for the urban agglomeration, including the suburban fringe adjacent to the city boundaries.

The UN Population Division uses UNSD data as the basis for preparing current demographic estimates, standardized time series starting from 1950, and projections to 2050 for total population, urban population and rural population for all countries and areas of the world. Standard demographic techniques are used to estimate the population by age and sex for the current year; these estimates then serve as the base for the projections. International and rural/urban migration, total fertility, life expectancy at birth, infant, child and maternal mortality and increased adult mortality in some regions, as well as the demographic impact of AIDS, are among the factors taken into account. The results, published annually in World Population Prospects, serve as the standard and consistent set of population figures for use throughout the United Nations system. The entire time series is available online.

The UN Population Division also publishes a biennial report, World Urbanization Prospects, which contains summary tables by country and region and also reports the sources of data and the definition of urban and rural when available, for each country.

In the most recent revision of this report (UN, 2004), it is stated that the world's urban population reached 2.9 billion in 2000, corresponding to 48 percent of the total population. Much of the population growth that occurred in the past 50 years, and most of what will occur in the next 30 years, concerns urban areas. The majority of the urban population growth is expected to occur in developing countries, where it is projected to increase by 2.3 percent per year between 2000 and 2030, as opposed to an increase of only 0.5 percent in the more developed countries.
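To give a sense of scale, the sketch below compounds these rates over the 30 years from 2000 to 2030. It assumes that each rate stays constant over the whole period and reads the 0.5 percent figure as an annual rate as well. Both are simplifying assumptions, not statements from the UN report, and the function name is illustrative.

```python
def growth_factor(annual_rate_percent, years):
    """Factor by which a population grows at a constant annual rate."""
    return (1 + annual_rate_percent / 100) ** years


if __name__ == "__main__":
    years = 2030 - 2000
    developing = growth_factor(2.3, years)
    more_developed = growth_factor(0.5, years)
    print(f"Developing countries:     x{developing:.2f}")
    print(f"More developed countries: x{more_developed:.2f}")
```

Under these assumptions the urban population of developing countries would nearly double over the period, while that of the more developed countries would grow by roughly one sixth.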

The UN report highlights the differences in urbanization rates and numbers of urban dwellers by region, as well as the size of cities expected to absorb most of the population growth in the next 15 to 30 years. For example, the proportion of people living in megacities (with population greater than 10 million) across the world is still fairly small, amounting to 4.1 percent in 2000 - a figure expected to rise to 5 percent by 2015. Overall, by 2015 it is expected that only 8.7 percent of the world population will live in cities with 5 million inhabitants or more, as opposed to the 27.2 percent expected to be living in urban settlements with fewer than 500 000 inhabitants.

World Urbanization Prospects also gives estimates and projections of the population of urban agglomerations with 750 000 inhabitants or more for the period 1950 to 2015.

In both the UNSD and the UN Population Division databases, the geographic information of latitude and longitude to identify the locations of human settlement is not reported, although in most instances these can be derived from other sources (see sections 2.2.2 and 2.3.1).

2.2.2 Other widely known sources of population data

Statistical offices and gazetteers are other widely used sources of population data. There are several Web sites listing populated places, usually derived from primary sources, but few contain population estimates for those named places. Two Web sites that offer good information about current populations of countries, their administrative divisions, cities and towns are Gazetteer and City Population. Both report information about population, gathered from official census data, the UN and IDB databases and other official and non-official sources; they also produce estimates for the current year, if these are not already available from official sources.

In some cases it is difficult to obtain population figures for cities and towns because statistical registration in a country is not very accurate, owing to civil war and/or poverty. Gazetteer admits that the figures presented on its Web site are far from being official, but states that they are calculated carefully and revised manually if necessary. One issue highlighted in the documentation of both Web sites concerns the definition of cities and urban agglomerations. Metropolitan areas are important to define, as they are indicators of a country's urbanization and economy. However, for many metropolitan areas it is difficult to specify an exact population figure, especially for the fast-growing agglomerations in developing countries, because they are continuously incorporating surrounding cities and urbanizing areas, and their definitions are often not comparable around the world. Most countries do not specify whether their data are for the city or for the urban agglomeration, and some even have data available for both types of place (see Web site ref. 3). Both the Gazetteer and City Population Web sites simply take what is provided by the countries in terms of type of settlement. Therefore, the lists provided are to be considered only as a rough reference table for the world's largest agglomerations and, as with other sources, the figures are not necessarily comparable across countries.

2.2.3 Concluding remarks

The main problem with the statistics available concerns the reliability of the population figures used. If the city population figures are not very accurate, the urban and rural proportions calculated from them will also be inaccurate. Furthermore, although some databases that give city population estimates also include geographic information such as the latitude and longitude coordinates for points or the centroids of polygons representing the locations of cities, they rarely contain information about the extent of each urban area.

It is well known that the world population reached 6 billion in 2000 and is projected to grow to 8.9 billion by 2050 (UN, 2003). Even though urban population is about half of the total population, the percentage of land occupied by urban areas is only about three percent (Balk et al., 2004a). Current urban population densities are already putting pressure on the environment in many parts of the world, and this pressure is likely to increase in fast-growing urban areas. Similarly, high population density, environmental degradation and increasing poverty are also major issues in traditionally agricultural rural areas. This indicates the importance of understanding the spatial distribution of population, in addition to having accurate information about the urban/rural proportions.
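The contrast between population share and land share can be made concrete with a short calculation. The sketch below treats "about half" as 0.5 and "about three percent" as 0.03, and treats all remaining land as rural. That last step is a deliberate simplification, since much non-urban land is uninhabited. The function name is illustrative.

```python
def density_ratio(urban_pop_share, urban_land_share):
    """Ratio of average urban density to average rural density.

    Both arguments are fractions strictly between 0 and 1. All land
    that is not urban is treated as rural (a simplifying assumption).
    """
    if not 0 < urban_pop_share < 1 or not 0 < urban_land_share < 1:
        raise ValueError("shares must lie strictly between 0 and 1")
    urban_density = urban_pop_share / urban_land_share
    rural_density = (1 - urban_pop_share) / (1 - urban_land_share)
    return urban_density / rural_density


if __name__ == "__main__":
    ratio = density_ratio(urban_pop_share=0.5, urban_land_share=0.03)
    print(f"Average urban density is about {ratio:.0f} times the rural average")
```

With these round figures the average urban density comes out at roughly 32 times the average density of the remaining land, which is why the way population is distributed in space matters as much as the urban/rural split itself.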

2.3 GEOSPATIAL DATASETS

Two important georeferenced datasets were developed during the 1990s to overcome the limitations of statistical data for spatial studies of population. The initial breakthrough in the global mapping of urban areas came with the release of the first Digital Chart of the World (DCW) in 1992 (Danko, 1992). This was a set of computerized global maps, created for the most part by scanning and digitizing paper sources. Through DCW, georeferenced datasets for settlements, country boundaries and other layers of information were made available by country. The populated places layer of the DCW has evolved into another important source of spatial information about cities and towns. This is the National Imagery and Mapping Agency (NIMA) points database, which holds the geographic coordinates of several million human settlements, together with their names and the administrative units to which they belong, if known.

The other major source of information about urban areas globally is the Nighttime Lights dataset from the National Oceanic and Atmospheric Administration (NOAA). Although it has been under development since the early seventies, it is only since 1997 that this dataset has been used to derive a global image map showing light sources, including human settlements. All other global georeferenced datasets that include an urban layer rely on either DCW and its subsequent refinements or Nighttime Lights as primary source.

2.3.1 Populated places

The DCW was developed originally in 1992 by the Environmental Systems Research Institute, Inc. (ESRI) on commission for the US Defense Mapping Agency (DMA) (Web site ref. 5). The DCW is a vector basemap of the world at a scale of 1:1 000 000. The primary sources for this database were the Operational Navigation Chart (ONC) series, co-produced by the military mapping authorities of Australia, Canada, the United Kingdom and the United States; and the Jet Navigation Charts (JNCs) for the region of Antarctica. Collateral sources were used to add extra information, for example on road and railroad connectivity through selected urbanized areas; these include the Digital Aeronautical Flight Information File (DAFIF) for the airport data contained in the aeronautical layer, and the Advanced Very High Resolution Radiometer (AVHRR) dataset for the data in the vegetation layer.

The DCW database is organized into 16 thematic layers and one data quality layer.

BOX 2.1
THEMATIC LAYERS IN DCW
Political/Ocean	Hypsography supplemental
Populated places	Land cover
Railroads	Ocean features
Roads	Physiography
Utilities	Aeronautical
Drainage	Cultural landmarks
Supplemental drainage	Transportation structure
Hypsography	Vegetation

The DMA subsequently merged with several other agencies to form the National Imagery and Mapping Agency (NIMA), later renamed as National Geospatial-Intelligence Agency (NGA). NIMA released an updated and improved version of the DCW database, called Vector Smart Map level 0 (VMap0) in 1997 (Web site ref. 6). VMap0 includes major road and rail networks, hydrologic drainage systems, utility networks (cross-country pipelines and communication lines), major airports, elevation contours, coastlines, international boundaries and populated places.

The more recent versions of this database, VMap1 (Web site ref. 7) and VMap2, are not yet completely available in the public domain. The greatest improvement in VMap1 is its 1:250 000 map scale, four times larger than that of VMap0. The structure is quite similar, with the data content held in ten thematic layers. VMap1 data are divided into a rather complex global mosaic of 234 geographic zones. However, at the present time NGA is releasing only 55 selected zones from the VMap1 dataset.

The populated places layer in DCW and VMap contains points and polygons that represent human settlements. The points dataset is a collection of latitude/longitude references associated with known locations of human settlements. The polygons dataset identifies the urbanized (or built-up) areas of the world that can be represented at 1:1 000 000 scale. Their shapes are drawn as they would appear from the air, and their outlines do not necessarily conform to political boundaries.

The populated places layer of VMap, i.e. the NIMA points database, constitutes perhaps the most comprehensive georeferenced cities database available. In the 1997 version of VMap0, the populated places remained essentially unchanged from DCW. An updated version of VMap0, released in 2000, added the names for most of the unnamed points and polygons, with the result that the database now contains nearly 5 000 000 named settlement points or polygons. The database is available through the GEOnet Names Server (GNS) of the NGA (Web site ref. 8).

Both points and polygons are good sources of information, but they also have certain limitations. One drawback is that this points database does not provide population information. The polygons tend to be conservative measures of the urban extents; often they do not correctly represent the extent of urban agglomeration, and prove to be inconsistent globally. In many instances where there are multiple settlements with the same name, it is not possible to ascertain which point and coordinates correspond to the city of interest. Finally, the points do not locate the geographical positions of settlements very precisely, partly due to imprecision of the source information and partly due to lack of consistent standards for selecting the point within an urban extent that should represent its location.

2.3.2 The Nighttime Lights of the World

The Nighttime Lights dataset has been created from data collected by the United States Air Force Defense Meteorological Satellite Program (DMSP) Operational Linescan System (OLS). This instrument has a low-light imaging capability, designed for the observation of clouds illuminated by moonlight in two spectral bands (visible/near infra-red and thermal infra-red). In addition to detecting moonlit clouds, the instrument can be used to detect light sources present at the earth's surface. Time series data from the DMSP-OLS were used to derive a global dataset and map image showing light sources observed during a six-month period spanning 1994–1995 (Elvidge et al., 2001). The 1994–1995 dataset was released in 1997 and has since been used extensively to map urban areas globally (Elvidge et al., 1997; Sutton, 1997). More recently, the DMSP group at the National Geophysical Data Center (NGDC) of the National Oceanic and Atmospheric Administration (NOAA) released version one of a pair of DMSP-OLS ‘Nighttime Lights of the World’ images and related databases, processed specifically to detect change, covering the years 1992–93 and 2000 (Web site ref. 9). Map 2.1 shows a segment of the dataset, centred on Italy and superimposed on bathymetry.

The OLS detects lights from human settlements, fires, gas flares, and heavily lit boats (primarily squid-fishing boats). These four types of lights have been distinguished on the basis of location, brightness/persistence, and visual appearance. Four different datasets are available as a result: human settlements (cities, towns, villages and industrial sites), gas flares, fires, and heavily lit fishing boats. These products are usually available as frequency of detection (0–100 percent) over cloud-free observations during the time period considered. In addition to the percent frequency products, NOAA also provides the number of valid coverages, the number of cloud-free coverages, the number of cloud-free light detections, and the average digital numbers (DN) of the detected lights. The processing for the 1992–1993 and 2000 sets of data included automatic cloud detection and a modified light detection algorithm designed to capture dim lighting, but the final products are not radiance-calibrated owing to the lack of on-board calibration and uncertainty in the gain settings. The resolution of all the lights datasets is 30 arc-seconds (nominally one kilometre at the equator).
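Two of the quantities mentioned above can be illustrated with simple arithmetic: the percent frequency of detection for a pixel, and the ground size of a 30 arc-second cell. The sketch below is not NOAA's processing chain. The function names are hypothetical, and the cell-size calculation assumes a spherical earth with a mean radius of 6 371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation (assumption)


def percent_frequency(light_detections, cloud_free_coverages):
    """Percent of cloud-free observations in which a light was detected.

    Returns None when the pixel has no cloud-free observation at all.
    """
    if cloud_free_coverages == 0:
        return None
    if not 0 <= light_detections <= cloud_free_coverages:
        raise ValueError("detections must lie between 0 and cloud-free coverages")
    return 100.0 * light_detections / cloud_free_coverages


def cell_size_km(latitude_deg, cell_arcsec=30.0):
    """Approximate (east-west, north-south) size of a grid cell in km."""
    cell_rad = math.radians(cell_arcsec / 3600.0)
    north_south = EARTH_RADIUS_KM * cell_rad
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return east_west, north_south


if __name__ == "__main__":
    print(percent_frequency(light_detections=5, cloud_free_coverages=20))
    for latitude in (0, 40, 60):
        east_west, north_south = cell_size_km(latitude)
        print(f"lat {latitude}: {east_west:.2f} km x {north_south:.2f} km")
```

The east-west dimension shrinks with the cosine of latitude, so the "one kilometre" figure is only nominal away from the equator.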

There are widely known problems with the data, possibly the most significant of which concerns the blooming effect. The blooming effect is an overestimation of the actual extent of urban areas, owing to intrinsic characteristics of the sensor (Elvidge et al., 1997, 2004). There have been attempts to impose a threshold on the lights in order to reduce this effect (Imhoff et al., 1997), but doing so results in the loss of small settlements that are not frequently lit. The difficulty of finding a unique threshold that could work globally has been explored in a recent publication (Small et al., 2005). One other problem concerns fires and gas flares. Using OLS alone, it is difficult to separate gas flares adequately from human settlements. As a matter of fact, NOAA releases a stable lights dataset (human settlements and fires) as well as its human settlements dataset (from which fires have been removed). The separation of city lights and fires is not entirely clear in some parts of the world where extensive fires are frequent. The other major problem, encountered in the northern hemisphere above approximately 40 degrees latitude, is the effect of snow on the extent and brightness of the lights. Various techniques are being explored to minimize these problems for forthcoming releases. A new generation of annual global OLS Nighttime Lights is currently in production for the 1992–2003 time period using the Visible Infrared Imaging Radiometer Suite (VIIRS) instrument. This series is expected to present substantial improvements in calibration, spatial resolution and level of quantization (Lee et al., 2004).
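The trade-off involved in thresholding can be seen on a toy grid. In the sketch below the percent-frequency values, the grid layout and the two threshold values are all hypothetical; they are not taken from Imhoff et al. (1997) or from any NOAA product. A large city sits in the centre with a dim fringe of bloomed pixels, and a small, infrequently lit settlement sits in the bottom right corner.

```python
# Hypothetical percent-frequency values (0-100) for a 5 x 6 pixel window.
GRID = [
    [0, 10, 20, 10, 0, 0],
    [10, 60, 90, 60, 10, 0],
    [20, 90, 100, 90, 20, 0],
    [10, 60, 90, 60, 10, 0],
    [0, 10, 20, 10, 0, 15],
]


def apply_threshold(freq_grid, threshold_percent):
    """Mark as lit every pixel whose frequency reaches the threshold."""
    return [[value >= threshold_percent for value in row] for row in freq_grid]


def lit_area(mask):
    """Number of pixels marked as lit."""
    return sum(cell for row in mask for cell in row)


if __name__ == "__main__":
    for threshold in (1, 50):
        mask = apply_threshold(GRID, threshold)
        small_settlement_kept = mask[4][5]
        print(
            f"threshold {threshold}%: {lit_area(mask)} lit pixels, "
            f"small settlement kept: {small_settlement_kept}"
        )
```

Raising the threshold trims the dim fringe around the city, but the same step removes the small settlement entirely, which is the loss described above.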

2.3.3 Global Land Cover

In the past decade, several efforts have been made to map land cover globally or at continent-wide scales using remotely sensed data. One of the most widely used is the Global Land Cover Characteristics (GLCC) dataset (Web site ref. 10), generated by the United States Geological Survey (USGS), the University of Nebraska-Lincoln (UNL), and the European Commission's Joint Research Centre (JRC). The GLCC database was developed on a continent-by-continent basis, based on Advanced Very High Resolution Radiometer (AVHRR) data for the year from April 1992 to March 1993. The resolution of the product is a nominal one kilometre. The dataset was originally released to the public in 1997 and has subsequently been updated based on feedback from users.

MAP 2.1
The Nighttime Lights of the World superimposed on bathymetry (segment)

Source: National Oceanic and Atmospheric Administration (NOAA)

More recently, the Global Vegetation Monitoring Unit of JRC, in collaboration with more than 30 research teams, has developed a global land cover product (GLC2000) for the year 2000 (Web site ref. 11). The GLC2000 is based on SPOT-VEGETATION data, at one kilometre resolution, and on a Land Cover Classification System (LCCS) developed by FAO and the United Nations Environment Programme (UNEP). The hierarchical classification system allowed the different partners to choose land cover classes which best described their region, whilst also providing the possibility to translate regional classes to a more generalized global legend (see Map 2.2).

Both of these global land cover datasets include a land cover class for built-up areas: in the case of GLCC this comes from DCW; GLC2000 bases its built-up area class on Nighttime Lights 1994/95. As these are derived data layers, there is no advantage in using them rather than the original sources. However, the classifications of agricultural and forested areas are unique to the land cover datasets, and can be used as a basis for distributing rural population across pixels, if the average population density for different types of land cover is known or can be estimated.
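The idea of distributing rural population across pixels by land cover can be sketched as a simple proportional allocation. In the example below each pixel receives a share of a known rural total in proportion to the average density assumed for its land cover class. The class names, the density values and the function name are hypothetical, and all pixels are assumed to have the same area.

```python
def distribute_population(total_population, pixel_classes, density_by_class):
    """Allocate a population total to pixels in proportion to class density.

    pixel_classes    -- land cover class of each pixel (equal-area pixels assumed)
    density_by_class -- relative or absolute average density for each class;
                        classes missing from the mapping receive no population
    """
    weights = [density_by_class.get(cover, 0.0) for cover in pixel_classes]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("no pixel belongs to a populated land cover class")
    return [total_population * weight / total_weight for weight in weights]


if __name__ == "__main__":
    # Hypothetical administrative unit with four pixels and 900 rural inhabitants.
    classes = ["cropland", "cropland", "forest", "water"]
    densities = {"cropland": 40.0, "forest": 10.0, "water": 0.0}
    for cover, people in zip(classes, distribute_population(900, classes, densities)):
        print(f"{cover}: {people:.0f}")
```

Because the allocation is proportional, only the ratios between class densities matter, and the pixel values always sum back to the known rural total for the unit.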

FAO definitions of and statistics on agricultural areas have been used to estimate total agricultural area in the world and to verify the accuracy of the area estimates given in the global land cover datasets (FAO, 2003).

Another global land cover dataset has been developed by the International Institute for Applied Systems Analysis (IIASA), in a collaborative effort with FAO. To create this dataset, agro-ecological conditions in each pixel have been evaluated for their suitability for different types of crop production and for pasture, and the results matched with research data and agricultural statistics (Fischer et al., 2002). In this dataset, the land cover class for artificial surfaces and built-up areas is based on GLC2000, with some adjustments to account for the presence of buildings and infrastructure in rural areas.

Since December 1999, a new generation of satellite images has been produced by MODIS (Moderate Resolution Imaging Spectroradiometer). This instrument detects 17 land cover types, including 11 categories of vegetation and various non-vegetated surfaces, including bare soil, water and urban areas (Web site ref. 12). The instrument operates from NASA satellites (Terra and Aqua). The frequency and sophistication of the MODIS images offer the prospect of significant future advances in land cover analysis.

MAP 2.2
The Global Land Cover, 2000

Source: Global Vegetation Monitoring Unit of the European Commission's Joint Research Centre (JRC)