Urban Sprawl Data

Data and replication files for 'Causes of sprawl: A portrait from space'

by Marcy Burchfield, Henry G. Overman, Diego Puga, and Matthew A. Turner

This site distributes and documents the urban sprawl dataset created by Marcy Burchfield, Henry G. Overman, Diego Puga, and Matthew A. Turner and used in their article 'Causes of sprawl: A portrait from space', published in the Quarterly Journal of Economics 121(2), May 2006: 587-633, as well as the computer code required to replicate their results. Users of this dataset are asked to cite the Quarterly Journal of Economics article as the source. We would also appreciate it if you let us know the details of any paper in which you use the data by sending an email to Diego Puga (diego.puga@cemfi.es).

This dataset has two main components, each documented below: the metropolitan-level sprawl data and the supplementary land use and land cover data.

Metropolitan-Level Sprawl Data

The article 'Causes of sprawl: A portrait from space' is concerned with a crucial aspect of sprawl, the scatteredness of development. We measure this aspect of sprawl as carefully as possible and study what causes differences across U.S. metropolitan areas in the extent to which urban development is sprawling or compact.

In order to study detailed spatial patterns of urban development, we construct a new data set by merging high-altitude photographs from around 1976 with satellite images from 1992. From these data, described in some detail below, we know for each square cell of 30×30 meters whether land was developed or not around 1976 and in 1992, as well as details about the type of developed or undeveloped land. Our data set consists of 8.7 billion such 30×30 meter cells on a grid covering the entire conterminous United States. A poster (11×17in) with a map based on these data, showing urban development across the continental United States 1976-1992, is available as a pdf file (4,276 Kb).

To measure the extent of sprawl, for each 30-meter cell of residential development, we calculate the percentage of open space in the immediate square kilometer. We then average across all residential development in each metropolitan area to compute an index of sprawl. For instance, to calculate a sprawl index for the new development that took place between 1976 and 1992 in each metropolitan area, we identify 30-meter cells that were not developed in 1976 but were subject to residential development between 1976 and 1992, calculate the percentage of land not developed by 1992 in the square kilometer containing each of these 30-meter cells, and average across all such newly developed cells in the metropolitan area. We perform similar calculations to obtain a sprawl index for the stock of development in 1976 and in 1992. This provides a very intuitive index of sprawl: the percentage of undeveloped land in the square kilometer surrounding an average residential development.
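As an illustration only, the index described above can be sketched in a few lines of Python. This is not the code used for the article. The function name `sprawl_index`, the in-memory list-of-lists grids, the approximation of the "immediate square kilometer" by a 33×33 window of 30-meter cells centered on each cell, and the clipping of windows at the grid edge are all assumptions made for the sketch.

```python
def sprawl_index(developed, residential, half_width=16):
    """Average percentage of undeveloped cells around residential cells.

    developed[r][c]   -- True if the cell is developed (any urban use)
    residential[r][c] -- True if the cell should count toward the index
    half_width        -- the window reaches this many cells in each direction;
                         16 gives a 33x33 window, roughly 1 km at 30 m cells
                         (an assumption, see the text above)
    """
    rows, cols = len(developed), len(developed[0])

    # Summed-area table: sat[r][c] holds the number of developed cells
    # in rows < r and columns < c, so any window sum takes four lookups.
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = (int(developed[r][c]) + sat[r][c + 1]
                                 + sat[r + 1][c] - sat[r][c])

    shares = []
    for r in range(rows):
        for c in range(cols):
            if not residential[r][c]:
                continue
            # Window bounds, clipped at the edge of the grid.
            r0, r1 = max(0, r - half_width), min(rows, r + half_width + 1)
            c0, c1 = max(0, c - half_width), min(cols, c + half_width + 1)
            n_cells = (r1 - r0) * (c1 - c0)
            n_dev = sat[r1][c1] - sat[r0][c1] - sat[r1][c0] + sat[r0][c0]
            shares.append(100.0 * (n_cells - n_dev) / n_cells)

    # Average across all qualifying residential cells.
    return sum(shares) / len(shares) if shares else float("nan")
```

Under these assumptions, an index for new development would pass the cells developed by 1992 as `developed` and the cells newly put to residential use between 1976 and 1992 as `residential`. An index for the stock of development in a given year would pass all residential cells of that year instead.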

The spatial units of observation in the metropolitan-level sprawl data are individual metropolitan areas (although, obviously, our calculations of the sprawl indices and various explanatory variables still need to use the full spatial resolution of our land use and land cover data). We use the Metropolitan Statistical Area and Consolidated Metropolitan Statistical Area definitions (New England County Metropolitan Area definitions for New England). Since these are county-based definitions, care is needed when measuring the initial characteristics of areas where new development might take place. This is particularly important in the western part of the country, where counties are sometimes very large and consequently metropolitan area boundaries are often drawn much less tightly around the developed portion of metropolitan areas than in the eastern part of the country. We therefore restrict calculations for geographical variables to the "urban fringe", defined as those parts of the metropolitan area that were mostly undeveloped in 1976 but are located within 20 kilometers of areas that were mostly developed in 1976 (by mostly developed we mean areas where over 50 percent of the immediate square kilometer was developed in 1976). The choice of 20 kilometers as a threshold was guided by visual inspection of maps showing the evolution of land use in all metropolitan areas (a buffer of 20 kilometers around areas that were already mostly developed in 1976 includes 98 percent of 1976 residential development and 99 percent of subsequent residential development in metropolitan areas). Given that we isolate the urban fringe in this manner, it makes sense to start with fairly wide metropolitan area boundaries before we cut out areas far away from initial development. We therefore use 1999 definitions. We include all 275 metropolitan areas in the conterminous United States in our regressions.
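The urban fringe definition can be illustrated with a small Python sketch. It is a brute-force illustration rather than the procedure used for the article. It assumes a grid already clipped to one metropolitan area, a precomputed share of developed land in the immediate square kilometer for each cell, and distances measured between cell centers. The function and parameter names are hypothetical.

```python
import math


def urban_fringe(dev_share_1976, cell_size_m=30.0, buffer_m=20_000.0):
    """Flag cells that belong to the urban fringe.

    dev_share_1976[r][c] -- share (0 to 1) of the immediate square kilometer
                            around the cell that was developed in 1976
    A cell is "mostly developed" when that share exceeds 0.5. The fringe is
    every cell that is not mostly developed but lies within buffer_m of a
    mostly developed cell (center-to-center distance, an assumption).
    """
    rows, cols = len(dev_share_1976), len(dev_share_1976[0])
    core = [(r, c) for r in range(rows) for c in range(cols)
            if dev_share_1976[r][c] > 0.5]
    max_cells = buffer_m / cell_size_m  # buffer expressed in cell units

    fringe = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if dev_share_1976[r][c] > 0.5:
                continue  # mostly developed cells are not part of the fringe
            fringe[r][c] = any(math.hypot(r - cr, c - cc) <= max_cells
                               for cr, cc in core)
    return fringe
```

The nested loops keep the logic easy to follow but scale poorly. On grids of realistic size, a distance transform or a GIS buffer operation would be the usual substitute.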

The metropolitan-level sprawl data includes the following variables:

This list includes all variables required to run the regressions in Table IV of the article 'Causes of sprawl: A portrait from space' plus several urban fringe variables re-calculated for the entire metropolitan area, in case these are useful for other purposes.

The metropolitan-level sprawl data is freely available for download from this site as a zip file: sprawl_msa.zip (59 Kb). This contains:

Supplementary Land Use and Land Cover Data

The sprawl indices used in the article 'Causes of sprawl: A portrait from space' were constructed from two fine-resolution data sets describing land cover and land use across the conterminous United States for the mid-1970s and the early 1990s.

The more recent data set, the 1992 National Land Cover Data, classifies the land area circa 1992 into different land cover categories mainly on the basis of Landsat 5 Thematic Mapper satellite imagery. These data are the result of a collaboration between the U.S. Geological Survey (USGS) and the U.S. Environmental Protection Agency (EPA). The USGS makes them freely available for download as a set of 49 raster files (one for each state except Alaska and Hawaii and one for the District of Columbia) from http://edcftp.cr.usgs.gov/pub/data/landcover/states/ (see the metadata documentation for Massachusetts).

The earlier data set, the Land Use and Land Cover GIRAS Spatial Data, derives mainly from high-altitude aerial photographs taken mostly in the mid-1970s (1976 being the median and modal year). These data were collected by USGS and converted to ArcInfo format by the EPA. The EPA makes the data freely available for download as a set of 469 ArcInfo vector coverages (one for each 1:250,000-scale USGS quadrangle) from http://www.epa.gov/ngispgm3/spdata/EPAGIRAS/ (see the metadata documentation). An alternative version is available also from the EPA through their BASINS program from http://www.epa.gov/waterscience/basins/ (see the metadata documentation).

These digital versions of the 1970s Land Use and Land Cover GIRAS Spatial Data from 1:250,000-scale maps lack data for three areas: a thirty by sixty minute rectangle in the map for Albuquerque, another thirty by sixty minute rectangle in the map for Cedar City, and a one degree by one degree square in the map for Tampa.

Fortunately, in addition to producing 1:250,000-scale maps (covering quadrangles of one degree by two degrees) for the conterminous United States, the USGS produced 1:100,000-scale maps (covering quadrangles of 30 minutes by 60 minutes) for some parts of the nation. For Albuquerque and Cedar City, the USGS had digitized data from the 1:100,000-scale maps corresponding exactly to the rectangles with missing data (Chaco Mesa in the case of Albuquerque, and Kanab in the case of Cedar City). We obtained the original digital data for these 1:100,000-scale maps from the USGS and processed them with the same computer code used by the EPA for the rest of the nation to completely fill the gaps. This involved:

For Tampa, the missing data were not available digitally but could be found in the corresponding 1:250,000-scale paper map distributed by the USGS. We obtained a copy from the University of Toronto Map Library and digitized this to the same format specifications as the rest of the data.

The 1970s land use and land cover data for these three areas (missing from the 1:250,000-scale Land Use and Land Cover GIRAS Data) are freely available from this site in ArcInfo shapefile format, zipped into one file per area:

The projection details for these three shapefiles are the following:
Map projection: Albers Conical Equal Area
Standard Parallel: 29.5
Standard Parallel: 45.5
Longitude of Central Meridian: -96.0
Latitude of Projection Origin: 23.0
Horizontal Datum: North American Datum 1983
Ellipsoid: Geodetic Reference System 80
Semi-major Axis: 6378137
Denominator of Flattening Ratio: 298.257
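
To show how these parameters fit together, the sketch below implements the standard forward equations of the ellipsoidal Albers equal-area conic projection in plain Python. It is an illustration, not code distributed with the data. It assumes output units of meters and zero false easting and northing, neither of which is stated in the list above. The function name `albers_forward` is hypothetical.

```python
import math

A = 6378137.0          # semi-major axis, in meters
F = 1.0 / 298.257      # flattening (inverse of the denominator listed above)
E2 = 2.0 * F - F * F   # squared eccentricity of the ellipsoid
E = math.sqrt(E2)

LAT_1, LAT_2 = math.radians(29.5), math.radians(45.5)   # standard parallels
LAT_0, LON_0 = math.radians(23.0), math.radians(-96.0)  # projection origin


def _m(lat):
    s = math.sin(lat)
    return math.cos(lat) / math.sqrt(1.0 - E2 * s * s)


def _q(lat):
    s = math.sin(lat)
    return (1.0 - E2) * (s / (1.0 - E2 * s * s)
                         - math.log((1.0 - E * s) / (1.0 + E * s)) / (2.0 * E))


# Cone constants, which depend only on the standard parallels and the origin.
N = (_m(LAT_1) ** 2 - _m(LAT_2) ** 2) / (_q(LAT_2) - _q(LAT_1))
C = _m(LAT_1) ** 2 + N * _q(LAT_1)
RHO_0 = A * math.sqrt(C - N * _q(LAT_0)) / N


def albers_forward(lat_deg, lon_deg):
    """Project geographic coordinates (degrees) to Albers x, y (meters).

    Assumes zero false easting and northing.
    """
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    rho = A * math.sqrt(C - N * _q(lat)) / N
    theta = N * (lon - LON_0)
    return rho * math.sin(theta), RHO_0 - rho * math.cos(theta)
```

The projection origin maps to (0, 0), and points east of the central meridian have positive x. For actual work with the shapefiles, an established GIS or projection library is the safer choice, since it also handles datum details that this sketch ignores.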

The land use and land cover codes (LUCODE attribute field) correspond to the same Anderson level 2 classification used in the USGS/EPA 1970s data. See the metadata documentation included with each shapefile for additional details.

Users interested in obtaining the same disaggregate data that served as the basis for the article 'Causes of sprawl: A portrait from space' will need three sets of files:

While there are many similarities between the 1970s and the 1990s data, there are a few important differences that users should be aware of.

First, the 1990s data are stored in raster format (assigning a code to each cell on a regular grid) while the 1970s data are stored in vector format (assigning a code and providing coordinates for irregular polygons). They also have different geographical projections. Thus, one needs to convert both data sets to a common projection and data model. For our analysis, we converted the 1970s data to the same projection and data model as the 1990s data, by breaking up each polygon into the 30-meter cells it contains.
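
The idea of breaking polygons into grid cells can be sketched as follows. This is a simplified illustration, not the conversion procedure actually used. It assumes simple polygons without holes, coordinates already expressed in the target projection, and a rule that assigns to each cell the code of the first polygon containing the cell center. All names are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: is the point (x, y) inside the closed ring?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygons, x_min, y_max, n_rows, n_cols, cell_size=30.0, nodata=0):
    """Assign to each cell the code of the polygon containing its center.

    polygons -- list of (code, ring) pairs, where ring is a list of (x, y)
    x_min, y_max -- coordinates of the upper-left corner of the grid
    """
    grid = [[nodata] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        y = y_max - (r + 0.5) * cell_size      # center of row r
        for c in range(n_cols):
            x = x_min + (c + 0.5) * cell_size  # center of column c
            for code, ring in polygons:
                if point_in_polygon(x, y, ring):
                    grid[r][c] = code
                    break
    return grid
```

For example, a 60 by 60 meter square polygon with code 11 laid over a 3×3 grid of 30-meter cells fills the four cells whose centers fall inside it and leaves the rest at the no-data value.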

The second difference is that the data are categorized using classifications with different degrees of detail. For our analysis, we worked with two urban codes that can be defined as aggregates of codes available in both years: residential; and commercial, industrial, and transportation networks.

The third and most important difference arises from some subtle, but relevant, differences in the thresholds used to classify an area as developed in the 1970s and in the 1990s data. Given this, we believe one should not compare the data directly. Instead, one can take advantage of the fact that, while land is often redeveloped, it is almost never undeveloped. At the national level, according to the U.S. Department of Agriculture's National Resources Inventory, less than 0.8% of developed land was converted from urban to non-urban uses over the 15-year period 1982-1997. With virtually no undevelopment taking place, we can base our analysis on the 1990s data and use the 1970s data to figure out whether each development that existed in 1992 was built before or after the 1970s. Thus, we define old development as land that was classified as urban in both the 1990s and 1970s. We define new development as land that was classified as urban in the 1990s, but was not urban in the 1970s. We also use the 1970s data to account for any conversion between residential and commercial uses.
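
The classification rule for old and new development can be written down in a few lines of Python. The sketch assumes each cell has already been reduced to a true/false "urban" flag in each period. The function names and the category labels are hypothetical.

```python
from collections import Counter


def classify_cell(urban_1990s, urban_1970s):
    """Classify one cell, taking the 1990s data as the base.

    Cells that are not urban in the 1990s are treated as not developed,
    whatever the 1970s data say, since undevelopment is assumed to be
    negligible.
    """
    if not urban_1990s:
        return "not developed"
    return "old development" if urban_1970s else "new development"


def tabulate(cells_1990s, cells_1970s):
    """Count cells in each category for two aligned sequences of flags."""
    return Counter(classify_cell(a, b)
                   for a, b in zip(cells_1990s, cells_1970s))
```

Note that a cell flagged as urban only in the 1970s falls into "not developed" here, which reflects the decision to take the 1990s data as the base rather than to compare the two data sets directly.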

Users interested in performing analysis at the level of metropolitan areas need not reproduce our aggregation procedure, and can work directly with the metropolitan-level data that we describe and make available above.


References

Anderson, James R., Ernest E. Hardy, John T. Roach, and Richard E. Witmer. 1976. A Land Use and Land Cover Classification System for Use with Remote Sensor Data. U.S. Geological Survey Professional Paper 964.

Burchfield, Marcy, Henry G. Overman, Diego Puga, and Matthew A. Turner. 2006. Causes of sprawl: A portrait from space. Quarterly Journal of Economics 121(2):587-633.

Cutler, David M., Edward L. Glaeser, and Jacob L. Vigdor. 1999. The rise and decline of the American ghetto. Journal of Political Economy 107(3):455-506.

GeoLytics. 2000. CensusCD 1980, Version 2. East Brunswick, NJ: GeoLytics, Inc.

Glaeser, Edward L. and Matthew Kahn. 2001. Decentralized employment and the transformation of the American city. Brookings-Wharton Papers on Urban Affairs:1-47.

Horan, Patrick M. and Peggy G. Hargis. 1995. County Longitudinal Template, 1840-1990. Ann Arbor, MI: Inter-university Consortium for Political and Social Research (ICPSR 6576).

Riley, Shawn J., Stephen D. DeGloria, and Robert Elliot. 1999. A terrain ruggedness index that quantifies topographic heterogeneity. Intermountain Journal of Sciences 5(1-4):23-27.

U.S. Bureau of the Census. 1974. County and City Data Book, 1972. Ann Arbor, MI: Inter-university Consortium for Political and Social Research (ICPSR 0061).

U.S. Bureau of the Census. 1999. County Business Patterns, 1977. Ann Arbor, MI: Inter-university Consortium for Political and Social Research (ICPSR 8464).

U.S. Department of Agriculture. 2000. Summary Report: 1997 National Resources Inventory (revised December 2000). Washington, DC, and Ames, IA: United States Department of Agriculture, Natural Resources Conservation Service, and Statistical Laboratory Iowa State University.

U.S. Environmental Protection Agency. 1994. 1:250,000-scale Quadrangles of Landuse/Landcover GIRAS Spatial Data in the Conterminous United States. Washington, DC: United States Environmental Protection Agency, Office of Information Resources Management.

U.S. Geological Survey. 1990. Land Use and Land Cover Digital Data from 1:250,000- and 1:100,000-scale Maps: Data User Guide 4. Reston, VA: United States Geological Survey.

U.S. Geological Survey. 1994. 1:250,000-scale Digital Elevation Models. Reston, VA: United States Geological Survey.

U.S. Geological Survey. 2000. Ground Water Atlas of the United States. Reston, VA: United States Geological Survey.

U.S. Geological Survey. 2003. Principal Aquifers of the 48 Conterminous United States, Hawaii, Puerto Rico, and the U.S. Virgin Islands. Madison, WI: United States Geological Survey.

U.S. National Climatic Data Center. 2002. Climate Atlas of the United States, Version 2. Asheville, NC: United States National Climatic Data Center.

Vogelmann, James E., Stephen M. Howard, Limin Yang, Charles R. Larson, Bruce K. Wylie, and Nick Van Driel. 2001. Completion of the 1990s National Land Cover data set for the conterminous United States from Landsat Thematic Mapper data and ancillary data sources. Photogrammetric Engineering & Remote Sensing 67(6):650-684.