Data and replication files for 'Urban growth and its aggregate implications'

by Gilles Duranton and Diego Puga

This page distributes and documents computer programs and data to replicate the results obtained by Gilles Duranton and Diego Puga in their article 'Urban growth and its aggregate implications,' published in Econometrica 91(6), November 2023: 2219-2259.

The replication files

The full replication package is available for download from this site as a zip file: urbangrowth_replication.zip (18.12 Gb).

This replication package contains all the required code and data except for the restricted-access files with the residence (county) of respondents in the us National Longitudinal Survey of Youth 1979 (nlsy79) and the restricted-access file with the residence (block group) of households in the 2009 us National Household Travel Survey (nhts).

For researchers not intending to replicate the construction of the geographic variables from raw sources, a smaller replication package is also available for download as a zip file: urbangrowth_replication_nogis.zip (0.35 Gb). This still replicates all the results, but relies on intermediate data files from our own run of the relevant Python scripts. The only difference with respect to the full replication package is that the very large geographic data sources contained under data/src/gis/ in the full package are not included (with the exception of block-group and metropolitan area boundaries).

Obtaining the restricted-access location data for the NLSY79 and 2009 NHTS

Fully replicating the results of the published article requires, in addition to the code and data files provided here, access to the restricted-access geocode files with the residence (county) of respondents in the nlsy79 and the restricted-access file with the residence (block group) of households in the 2009 nhts.

See Using the code without the restricted-access location data for the NLSY79 and 2009 NHTS below for information on how to run a partial replication without these restricted-access data.

Regarding the restricted-access nlsy79 data, only employees and students of us universities, employees of us federally-funded research centers, and employees of eligible us government institutions and non-profits can request access to the nlsy79 geocode data. The us Bureau of Labor Statistics (bls) has no provisions for accessing the nlsy79 geocode data from outside the United States.

At the time of writing, one can find the application for obtaining access to the geocode nlsy79 data and information about the process at https://www.bls.gov/nls/geocodeapp.htm. In the application, the researcher must describe the project's research objectives in a few paragraphs. If the application is approved, the bls will send the researcher a Letter of Agreement to be signed by an official institution signatory. The researcher must sign additional agreements, and in the case of students, their research advisor must be the signatory. Data access agreements are between the bls and the recipient institution, not between bls and individual researchers. All geocode data access must occur on the recipient institution's physical premises.

When we requested access to the data for this research project, the us Bureau of Labor Statistics (bls) sent authorised users of the geocode nlsy79 data a cd-rom with a series of additional files containing the restricted-access location data for nlsy79 respondents. Shortly before the article's publication, the bls transitioned its mode of provision of nlsy79 geocode data to a virtual data enclave (vde). In this managed environment, researchers can analyze the geocode data. Statistical software available for use in the VDE includes Stata. Researchers can bring external files (such as the replication files for this project) and extract analysis results from the vde, following a bls approval process.

Regarding the restricted-access 2009 nhts data, the Census block group of households surveyed for the 2009 nhts is only provided to researchers by the Federal Highway Administration after explaining why their project involves legitimate research that could not otherwise be accomplished and signing a confidentiality agreement. To initiate the process, researchers should contact the Data User Support person for the nhts (at the time of writing, the contact details are available at https://nhts.ornl.gov/contactUs.shtml).

Instructions and overview of the replication files

These are the steps to construct the data and replicate the results of the Econometrica article:

The Stata script code/_hcgrowth_run.do first runs code/1_hcgrowth_builddata.do to perform the data construction, creating the processed data files used for the analysis (described under Processed data below) and placing them in the data/processed/ directory. Next, the Stata script code/_hcgrowth_run.do automatically runs code/2_hcgrowth_analysis.do to perform the analysis of the processed data and stores all the results (described under Results below) in the results/ directory.

Figures 1 and 2, which illustrate the theoretical model rather than empirical results, are produced by running Mathematica notebooks code/analysis/hcgrowth_fig1.nb and code/analysis/hcgrowth_fig2.nb (in the second notebook, specifying first Year = 1980; and then Year = 2010; in the first line to produce the two panels).

Using the code without the restricted-access location data for the NLSY79 and 2009 NHTS

While it is not possible to fully replicate the results of the Econometrica article without the restricted-access location data for the nlsy79 and the 2009 nhts, researchers can run the code without these additional data under two scenarios.

Researchers who wish to perform a partial replication can produce, without the restricted-access data, all of the figures in the paper, as well as table 1 (except for column 1), table 3, and table C.2. To do this, simply leave the flags global NLSYGeocodeUnavailable = 1 and global NHTSBGUnavailable = 1 in code/_hcgrowth_run.do, as provided. Replication of table 2 and table C.1 will be skipped.

Researchers without the restricted-access location data for the nlsy79 and the 2009 nhts who wish to check that the replication code runs smoothly can edit code/_hcgrowth_run.do and set the flags global NLSYGenerateFakeLocations = 1 and global NHTSGenerateFakeLocations = 1, while leaving the flags global NLSYGeocodeUnavailable = 1 and global NHTSBGUnavailable = 1. This adjustment will randomly generate a fake location history for each nlsy79 and nhts respondent, allowing the code to run but generating meaningless results in column (1) of table 1, table 2, and table C.1, with the same format as the actual results in the article but different values.

Software and hardware notes

The results and figures in the Econometrica article have been produced in Stata version 18, Python version 3.9.15, and Mathematica version 13.2 using the code and data provided.

The code is highly portable; nevertheless, one should keep in mind the following considerations:

Data sources and treatments

City definitions: Our empirical and quantitative analysis focuses on the conterminous United States during the period 1950–2010. To define cities, we use Metropolitan Statistical Area and Consolidated Metropolitan Statistical Area (msa) definitions outside of New England and New England County Metropolitan Area (necma) definitions in New England, as set by the Office of Management and Budget on 30 June 1999. This defines 275 metropolitan areas.

Population: We use county-level population data from the us decennial censuses for 1850, 1920, 1950, 1980, and 2010, that we aggregate to the 1999 msa/necma level. The sources are Schroeder (2016) for 1850 and 1920, Forstall (1996) for 1950 and 1980, and Manson, Schroeder, Riper, Kugler, and Ruggles (2021) for 2010.

City centre and city periphery: We define the city centre as the location indicated by Google Maps for the core city of the metropolitan area.

In addition to defining centres, we need a measure for the spatial extent of the city, corresponding to $x_{it}$ in the model. Since, in practice, cities cover two dimensions, there will be different distances between the city periphery and the city centre depending on the direction we follow. When cities have irregular shapes, using the maximum distance to the centre or a very high percentile of the distribution of distances to the centre can be problematic. Also, since metropolitan area definitions are county-based and some urban counties, particularly in the West of the country, extend well into rural areas, a few scattered dwellings very far away from the centre in a county that is part of a city can increase the measured distance between the city periphery and the city centre artificially. To address all of these difficulties, we implement a consistent definition of the city periphery. We take the city periphery to be the longest distance from the city centre within the metropolitan area boundaries that is within the 95th percentile of dwelling distances and has at least 500 dwelling units per square mile in the 2012 American Community Survey data described below.
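The periphery rule above can be illustrated with a minimal Python sketch. It is not the replication code: the `BlockGroup` class, the function name, and the choice to compute the 95th percentile by weighting block-group distances by their dwelling counts are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    # Hypothetical container; field names are illustrative assumptions.
    distance_km: float    # distance from the block-group centroid to the city centre
    dwellings: int        # number of dwelling units in the block group
    area_sq_miles: float  # land area of the block group


def city_periphery(block_groups, percentile=0.95, min_density=500.0):
    """Longest distance to the centre that is within the given percentile of
    dwelling distances and meets the minimum dwelling density, or None."""
    ordered = sorted(block_groups, key=lambda b: b.distance_km)
    total = sum(b.dwellings for b in ordered)
    if total == 0:
        return None
    # Distance below which `percentile` of all dwellings are located
    # (assumption: dwellings sit at their block-group centroid).
    cutoff = ordered[-1].distance_km
    cumulative = 0
    for b in ordered:
        cumulative += b.dwellings
        if cumulative >= percentile * total:
            cutoff = b.distance_km
            break
    eligible = [
        b.distance_km
        for b in ordered
        if b.distance_km <= cutoff and b.dwellings / b.area_sq_miles >= min_density
    ]
    return max(eligible) if eligible else None
```

In this sketch, a sparse block group far from the centre is excluded twice over: it falls beyond the percentile cutoff and below the density threshold, so it cannot stretch the measured extent of the city.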

Urban fringe: We use the term urban fringe for the area where the city would likely expand next. This is the area where we measure agricultural land prices when calculating replacement costs for housing at the city periphery and where we measure geographical constraints to urban expansion when relating these to the strictness of current planning regulations. The urban fringe is defined as the area within 20km of land developed at medium or high intensity in 2011 that is undeveloped or developed at low intensity, based on land cover data from the 2011 slice of the 2019 version of the National Land Cover Database. The nlcd2019 (Dewitz and us Geological Survey, 2021) offers land cover for years 2001, 2003, 2006, 2008, 2011, 2013, 2016, 2019. We use the 2011 slice of the nlcd2019, since our estimations are all centred around 2010.

Geographical constraints to urban expansion: To obtain an empirical counterpart to our model’s $z_i$, we calculate the share of the area within 30 kilometres of the centre of each city that is geographically unconstrained. This corresponds to $1/z_i$. We characterise this area with a 30-metre resolution. Each 30-metre cell is classified as geographically unconstrained if it is not covered by slopes steeper than 15%, water, wetlands, or land permanently protected from land cover conversion with a mandate to conserve its natural state, and it does not belong to a foreign country. Slope is calculated on the basis of 1 arc-second Digital Elevation Models from the 3d Elevation Program of the us Geological Survey (2018). Water and wetlands cover is based on the 2011 slice of the nlcd2019. Protected land is identified based on the Protected Areas Database of the United States (us Geological Survey, 2020). This database maps protected areas and assigns them a gap status code as a measure of intent to permanently protect their natural state. We use gap status codes 1 and 2 to isolate land permanently protected from land cover conversion with a mandate to conserve its natural state. As to foreign land, this is identified from the official boundary files of Statistics Canada and Mexico’s Instituto Nacional de Estadística y Geografía.
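As a rough illustration of the cell classification just described, the following Python sketch computes the unconstrained share and the implied $z_i$ from a list of cells. The dictionary keys and function names are hypothetical; the actual package works on 30-metre rasters rather than Python lists.

```python
def unconstrained_share(cells, max_slope_pct=15.0):
    """Share of cells that are geographically unconstrained.

    Each cell is a dict with the illustrative keys 'slope_pct', 'water',
    'wetland', 'protected' (gap status 1 or 2) and 'foreign'.
    """
    cells = list(cells)
    if not cells:
        raise ValueError("no cells within the search radius")
    free = sum(
        1
        for c in cells
        if c["slope_pct"] <= max_slope_pct
        and not (c["water"] or c["wetland"] or c["protected"] or c["foreign"])
    )
    return free / len(cells)


def implied_z(share):
    """The unconstrained share corresponds to 1 / z_i, so z_i = 1 / share."""
    return 1.0 / share
```

A city where half of the cells within the radius are steep, wet, protected, or foreign would have a share of 0.5 and hence $z_i = 2$ under this reading.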

When calculating an empirical counterpart to $z_i$, we are interested in geographical constraints on the long-term expansion of a city over the course of its history. However, when thinking about geographical constraints to urban expansion as a determinant of the strictness of current planning regulations in panel c of figure 5, it is more appropriate to focus on the area where the city would likely expand next. This is the urban fringe defined above, so we also calculate the share of the urban fringe that is geographically unconstrained for the same set of geographical constraints.

Illustrating the equilibrium with the urban system of the United States: Panels a and b of figure 2 depict the allocation of population across us cities and rural areas in 1980 and 2010 as an equilibrium of the model. To draw this figure, we use parameter values estimated or calibrated in section 5 (γ = 0.07, θ = 0.04, σ = 0.04, β = 0.04, and λ = 0.18), the actual population in each us metropolitan area and outside metropolitan areas in each year to assign values to $N_{it}$ and $N_{rt}$, and the share of the area within 30 kilometres of the centre of each city that is geographically unconstrained as the empirical counterpart to $1/z_i$. We normalise $\tau_t = 1$ in 1980 (which amounts to a choice of numéraire). We set $\tau_t$ in 2010 so that population-weighted average growth in $y_{it}$ in the model matches the actual growth in average Gross Domestic Product per person in the United States 1980–2010. For this purpose, we use equation (21) to obtain $\rho^{\sigma} A_{it} (h_t)^{1+\sigma}$ for each city as a function of $\tau_t$ from its values of $N_{it}$, $z_i$, and parameters. Substituting this into equation (8) then yields $y_{it}$ for each city as a function of $\tau_t$. We obtain $y_{rt}$ by equating income in rural areas with income in the marginal populated city, where the latter is given by equation (22). We find that increasing $\tau_t$ from 1 to 1.569 between 1980 and 2010 makes output per person in the model increase by a factor of 1.658, which matches the ratio of 2010 to 1980 Gross Domestic Product per person in the United States. Note that the numerical computation of $\tau_t$ is straightforward since every value of $y_{it}$ is proportional to $\tau_t$ (equations 8 and 21).
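Because every income value is proportional to $\tau_t$, the 2010 value can be found in closed form once incomes have been computed per unit of $\tau_t$. The Python sketch below shows that step under stated assumptions: the function names are hypothetical, the inputs are incomes evaluated at $\tau_t = 1$ (cities plus the rural area), and the target is read as the ratio of population-weighted mean income in the two years.

```python
def weighted_mean(values, weights):
    """Population-weighted mean of a list of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def solve_tau(y_unit_base, pop_base, y_unit_end, pop_end, target_ratio, tau_base=1.0):
    """Value of tau in the end year that makes mean income grow by target_ratio.

    y_unit_base and y_unit_end hold income per person evaluated at tau = 1 in
    each year; since income is proportional to tau, no root-finding is needed.
    """
    mean_base = tau_base * weighted_mean(y_unit_base, pop_base)
    mean_end_per_unit_tau = weighted_mean(y_unit_end, pop_end)
    return target_ratio * mean_base / mean_end_per_unit_tau
```

For example, with the reported target ratio of 1.658, a model whose mean income at $\tau_t = 1$ were 5.67% higher in 2010 than in 1980 would need $\tau_t \approx 1.569$ in 2010, the figure reported above; the 5.67% input is back-calculated from the reported values and is not taken from the replication data.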

In the figure, horizontal axis length is total us population, $N_t$. Total urban population $\sum_i N_{it}$ can be read as the horizontal distance to the left-side axes origin and rural population, $N_{rt} = N_t - \sum_i N_{it}$, can be read as the distance to the right-side axes origin. The thick horizontal segments in the figure represent equilibrium consumption for incumbents in each city, $c_{it}$ (segment height), obtained from equation (40), and population $N_{it}$ (segment length). The thin curves tangent to each thick segment plot consumption for incumbents in each city when population differs from its equilibrium level, as given by equation (41). Incumbents set permitting costs at $p_{it} = c_{it} - c_t$ to achieve the consumption at the maximum of the curve for their city while keeping newcomers indifferent. Rural consumption as a function of rural population is given by the smooth long curve, corresponding to equation (40), where we obtain the value of $A_{rt}$ by equating $c_{rt} = c_{it}$ for the marginal populated city.

Current Population Survey: Figure 3 plots the evolution of the share of population aged 25–64 who hold a college degree in metropolitan areas of different sizes over the period 1986–2016 in the United States. It uses data from the Annual Social and Economic (asec) supplement of the Current Population Survey (cps), obtained from the ipums-cps project (Flood, King, Rodgers, Ruggles, and Warren, 2018).

We assign individual observations to a specific metropolitan area based on their county of residence, when available, which we then match to the corresponding 1999 msa/necma; when the county of residence is unavailable, the state of residence is outside of New England, and the cps source data contains the 1999 msa of residence, we use this; alternatively, we use a purposely-built crosswalk (available with the replication code for this paper) between alternative metropolitan area codes contained in the cps source data and 1999 msa/necma codes. We then group metropolitan areas into three population size categories based on their 2010 population (below 1 million, between 1 and 2.5 million, and above 2.5 million), so that each line in the figure corresponds to the same set of metropolitan areas throughout.

About one-third of the individual observations for residents in metropolitan areas cannot be assigned to a specific area. We assign these observations to the same three population size categories based first on the metropolitan area size variable and next on the core-based statistical area size variable in the cps. The downside of this procedure relative to being able to assign individual-year observations to a specific metropolitan area is that some observations may be assigned to different curves over time despite corresponding to the same metropolitan area if the population of this area crosses the 1 million or the 2.5 million thresholds.

Up until 1991, the cps contains information on the years of college completed but not on whether the individual has obtained a bachelor’s degree, so we classify individuals as having a college degree if they have completed at least 4 years of college. From 1992 onwards, we use the information on whether they have a bachelor’s degree or higher. We plot the figure using the asec person-level weights.

The cps is also the source for the annual growth rate in the average years of schooling between 1950 and 2010 provided in Section 8. We base the calculation on Table A-1 in us Bureau of the Census (2023) (Years of School Completed by People 25 Years and Over, by Age and Sex: Selected Years 1940 to 2022) and assign to each category of years of school completed the number of years suggested for the United States by De la Fuente and Doménech (2014).

National Household Travel Survey and Census: Column (1) of table 1 estimates γ as the elasticity of the distance travelled by a household with respect to the distance between its dwelling and the city centre. Data on household travel behaviour come from the 2009 us National Household Travel Survey (nhts). The survey is sponsored by various agencies at the us Department of Transportation. For a nationally-representative sample of households, the nhts provides a travel diary kept by every member of each sampled household where we observe the distance, duration, mode, purpose, and start time for each trip taken on a randomly-assigned travel day. It also includes household and individual demographics.

Household miles travelled are measured using the best estimate of household annual miles computed by the survey administrators, which is their preferred measure. We regress the log of household miles travelled on the log of distance between the household’s block-group of residence and the city centre, controls for household and block-group characteristics, and metropolitan area fixed effects. We measure the distance to the centre as the haversine distance between the centroid of each block-group and the centre of each metropolitan area. For consistency with the specifications using housing data, we use all block groups from all metropolitan areas except for college towns, defined as the 46 metropolitan areas with under one million inhabitants in 2010 where at least 10% of them are college students, since the high concentrations of students make housing markets in such college towns very distinct.
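For readers unfamiliar with the haversine distance mentioned above, here is a minimal Python sketch of the standard formula. The function name and the mean Earth radius constant are assumptions for illustration; the replication scripts may use a different radius or implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Applied to a block-group centroid and the city-centre coordinates, the result would then enter the regression in logs.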

The controls for household characteristics, all based on the same nhts data, are the log of the household size, the log of the number of drivers in the household, the share of drivers that are male, and indicators for a single-person household, for the presence of small children, for the household respondent being Hispanic, White, Black and Asian, and for being a renter.

The controls for block-group characteristics are the percentages of Hispanic, Black, and Asian population (based on 2000 Census data, obtained from the ipums-nhgis project, Manson, Schroeder, Riper, Kugler, and Ruggles, 2021, since the 2009 nhts records block groups using 2000 boundaries), the performance in standardised tests of the closest public school relative to the city average (from De la Roca, Gould Ellen, and O’Regan, 2014, with variation at the tract level), an indicator for waterfront location (constructed by combining the 2000 block-group boundaries provided in the ipums-nhgis 2000 Census data with the coastline shapefiles from the National Hydrography Dataset and the Great Lakes and watersheds shapefiles from the Great Lakes Restoration Initiative of the us Geological Survey), an indicator for riverfront location (constructed by combining the same block-group boundaries with the major rivers within the United States shapefile included with Esri Data & Maps), and terrain ruggedness (measured by the Terrain Ruggedness Index of Riley, DeGloria, and Elliot, 1999, calculated on the basis of 1 arc-second Digital Elevation Models from the 3d Elevation Program of the us Geological Survey, 2018, and then averaged at the block-group level).

The measure of travel speed in each city included as a control in the regression in column (3) of table 1 and as the dependent variable in column (5) of table 1 is based on the same nhts data. We keep data on trips in a household vehicle, where this vehicle is a car, van, suv, or pick-up, and is driven by the survey respondent. Following Couture, Duranton, and Turner (2018), we exclude all trips by households where either the respondent does not recall if they were the driver, or they report one or more trips in the top or bottom 0.5% of all trips by distance, time or speed. As they note, removing all trips by the affected household and not only the odd ones is important to avoid biasing the calculations. Since speed varies very substantially depending on trip, individual and household characteristics, we need a minimum number of trips to compute a reliable measure of speed. We restrict our sample to the 182 cities where we have at least 100 trips recorded. We first calculate the speed of individual trips dividing trip miles by trip duration. We then regress the log of travel speed for individual trips on metropolitan area fixed effects, controls for trip characteristics, the same controls for household and block-group characteristics as in the regression in column (1) of table 1, and the log of distance between the household’s block-group of residence and the city centre. The controls for trip characteristics, all based on the same nhts data, are the log of trip distance and indicators for day of the week, departure time in 30-minute intervals, and trip purpose. We use the estimated regression coefficients to predict, for each city, the speed of a 15km commuting trip on a Tuesday at 8:00am by a driver with average characteristics.
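The household-level trimming rule can be sketched in Python as follows. This is an illustration under assumptions, not the replication code: the dictionary keys and function name are hypothetical, the tails are taken as the `int(tail * n)` smallest and largest trips for each measure, and the separate exclusion for respondents who do not recall whether they drove is omitted.

```python
def keep_reliable_households(trips, tail=0.005):
    """Drop every trip of any household with a trip in the top or bottom tail
    of the distance, duration, or speed distribution.

    trips: list of dicts with the illustrative keys 'household', 'miles', 'minutes'.
    """
    measures = {
        "miles": [t["miles"] for t in trips],
        "minutes": [t["minutes"] for t in trips],
        # Speed in miles per hour: trip miles divided by trip duration in hours.
        "speed": [t["miles"] / (t["minutes"] / 60.0) for t in trips],
    }
    k = int(tail * len(trips))  # number of trips flagged in each tail
    flagged = set()
    for values in measures.values():
        order = sorted(range(len(trips)), key=values.__getitem__)
        extremes = order[:k] + order[len(order) - k:] if k else []
        flagged.update(trips[i]["household"] for i in extremes)
    # Remove all trips of flagged households, not only the extreme trips.
    return [t for t in trips if t["household"] not in flagged]
```

Filtering on the household rather than the trip mirrors the point made above: a household that reports one implausible trip is treated as unreliable for all of its trips.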

To validate the self-reported trip duration estimates of nhts respondents, we turn to data from Akbar, Couture, Duranton, and Storeygard (2023). They query Google Maps over an extended time period about the duration of a trip with the same origin, destination, day of the week, and departure time as each trip reported by nhts respondents. Using this alternative trip duration, they recompute an alternative measure of speed that we use in column (6) of table 1.

American Community Survey: All our estimations regarding housing rental prices and values use 5-year 2008–2012 data from the 2012 American Community Survey (acs), obtained from the ipums-nhgis project (Manson, Schroeder, Riper, Kugler, and Ruggles, 2021). The unit of observation is the block group. We use all block groups between the centre and the periphery of every metropolitan area except for college towns, as defined above, given their distinct housing markets.

All block-group housing regressions use the same controls for housing and block-group characteristics. The controls for housing characteristics are the percentage of dwellings in the block group by type of structure, by number of bedrooms, and by construction decade, all based on the same 2008–2012 acs data. The controls for block-group characteristics are the same as in the travel regressions, but re-computed for 2012 acs block groups: the percentages of Hispanic, Black, and Asian population, the performance in standardised tests of the closest public school relative to the city average, an indicator for waterfront location, an indicator for riverfront location, and terrain ruggedness.

Column (2) of table 1 estimates γ based on variation in house prices across locations within a city as a function of distance to the city centre. The dependent variable is the log of the difference between the median rent in the most expensive block group in the city and the median rent in the block group under consideration, from the 2008–2012 acs data. We regress this on the log of the distance between the block group and the centre of its metropolitan area, city fixed effects, and the dwelling and neighbourhood characteristics described above.

Column (3) of table 1 estimates γ based on variation in house prices at the centre of cities as a function of the spatial extent of the city. The first component of the dependent variable for this regression is estimated from an auxiliary regression at the block group level of the log of the median monthly contract rent on city indicators, a third-degree polynomial of distance between the block-group centroid and the city centre, and the aforementioned controls for housing and block-group characteristics. We use this regression to predict the rental price of a national-reference house for city-average neighbourhood characteristics at the centre of each city, i.e. when $x_{ij} = 0$. This corresponds to $\hat{P}_i$ on the left-hand side of the empirical specification of equation (35). On the right-hand side of that expression, we have the spatial extent of the city, $x_{it}$, and travel speed $\hat{\tau}_i$. We measure $x_{it}$ using the distance between the centre and the periphery of each city as defined above, i.e. the longest distance from the city centre within the metropolitan area boundaries that is within the 95th percentile of dwelling distances and has at least 500 dwelling units per square mile in the 2012 acs data. Our estimate of speed is the predicted speed of a 15km commuting trip on a Tuesday at 8:00am by a driver with average characteristics in each city, using nhts data as described above.

The final component of equation (35) that we need to measure is $c_t$. Equation (24) tells us this should be proportional to the price of housing at the centre of the cheapest city. Unfortunately, the proportionality constant is itself a function of our key parameter of interest, γ. Since γ appears on both sides of equation (35), we estimate this iteratively. Given a starting value of γ, the values of θ, σ and β obtained below, and the estimated city-centre house price in the cheapest city $\hat{P}$, we obtain a value for $c(\gamma) = \frac{\gamma + \theta - \sigma - \beta}{(\sigma + \beta)(\gamma + 1)} \hat{P}$. This value allows us to compute our dependent variable in regression (35), $\ln[\hat{P}_i + c(\gamma)]$. Estimating this regression by ordinary least squares yields an updated value of γ, which allows recomputing $c(\gamma)$ and thus $\ln[\hat{P}_i + c(\gamma)]$. We then re-estimate regression (35), and so on until convergence is achieved.
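The iteration described above is a fixed-point search on γ. A minimal Python sketch of that loop follows; it is not the Stata implementation, and the `update` callable, tolerance, and iteration cap are assumptions. In a real run, `update` would recompute the shift term from the current γ, rebuild the dependent variable, re-estimate regression (35) by ordinary least squares, and return the implied γ.

```python
def iterate_gamma(update, gamma_start, tol=1e-10, max_iter=1000):
    """Repeat gamma -> update(gamma) until successive values differ by less than tol.

    update: hypothetical callable mapping the current gamma to the re-estimated gamma.
    """
    gamma = gamma_start
    for _ in range(max_iter):
        gamma_new = update(gamma)
        if abs(gamma_new - gamma) < tol:
            return gamma_new
        gamma = gamma_new
    raise RuntimeError("gamma did not converge within max_iter iterations")


# Toy stand-in for the re-estimation step, used only to show the mechanics:
# a contraction whose fixed point is 0.07.
toy_update = lambda g: 0.5 * g + 0.035
gamma_hat = iterate_gamma(toy_update, gamma_start=0.5)
```

The toy update is purely illustrative; whether the actual re-estimation map converges, and how quickly, depends on the data.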

Figure 4 plots housing price gradients for five us cities. We predict the monthly rent of a dwelling with average national characteristics in a neighbourhood with average city characteristics as a function of distance to the city centre with a semilinear regression at the block-group level for each city using Yatchew’s (1998) difference estimator. The dependent variable is the median contract rent in the block group. The linear component includes the same dwelling and neighbourhood controls as column (2) of table 1 while distance to the city centre is treated nonparametrically.

Panel a of figure 5 plots the city-periphery monthly rent against 2010 city population. City-periphery monthly rent is the monthly rent of a dwelling with average national characteristics in a neighbourhood with average city characteristics located at the city periphery. This is estimated from the same regression used to estimate the city-centre monthly rent used in column (3) of table 1, but valued at a distance from the centre corresponding to the periphery of each city instead of at a distance zero.

Planning regulations: The strictness of planning regulations in each metropolitan area plotted in panels b, c, and d of figure 5 is measured using the Wharton Residential Land Use Regulatory Index (wrluri). This index is constructed by Gyourko, Saiz, and Summers (2008) applying factor analysis to responses from a 2006 nationwide survey of residential planning regulations in over 2,600 communities across the United States. Gyourko, Hartley, and Krimmel (2021) construct an updated index based on a 2018 survey with some differences with respect to the 2006 survey, both in terms of questions and responding communities. To aggregate the 2006 index to the level of metropolitan areas, we retain data on the 1896 responding communities that are part of a 1999 msa/necma, average their index to the level of primary metropolitan statistical areas weighting by their population with a correction for community response probability provided by Gyourko, Saiz, and Summers (2008), and then average these values to the level of metropolitan areas weighting by population. To aggregate the 2018 index to the level of metropolitan areas, we retain data on the 1877 responding communities that are part of a 1999 msa/necma, and then average their index to the level of metropolitan areas using the weights provided for large metropolitan areas by Gyourko, Hartley, and Krimmel (2021) and population weights for the rest. Finally, we interpolate the 2006 and 2018 values of the index to obtain a value for 2010 to match the timing of our other data.
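The final interpolation step can be sketched as below. The text does not state the interpolation method, so linear interpolation in time is an assumption here, as is the function name.

```python
def interpolate_index(value_2006, value_2018, year=2010):
    """Linearly interpolate a metropolitan-area index value between 2006 and 2018."""
    weight = (year - 2006) / (2018 - 2006)
    return value_2006 + weight * (value_2018 - value_2006)
```

Under this assumption, the 2010 value places a weight of two-thirds on the 2006 index and one-third on the 2018 index.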

Housing replacement costs and price-cost wedges: The periphery house price-cost wedge plotted against the strictness of planning regulations in panel d of figure 5 is the difference between the value of a house and its replacement cost in the periphery of the city. The house value corresponds to a four-bedroom single-family detached house built 2000–2009 in a neighbourhood with average city characteristics located at the city periphery. This is estimated based on a regression of the log of the median house value in the block group on a third-degree polynomial of distance to the city centre, and the same dwelling and neighbourhood controls as column (2) of table 1 using 2008–2012 American Community Survey (acs) data. City periphery is again defined as the longest distance from the city centre that is within the 95th percentile of dwelling distances and has at least 500 dwelling units per square mile in the block group.

The replacement costs are the sum of city-specific construction costs for an economy-quality single-family detached house of 2000 square feet and the price of a quarter-acre vacant plot of land used for agriculture at the urban fringe. Construction costs are based on RSMeans data for 2010 obtained from Glaeser and Gyourko (2018). The urban fringe is defined as the area within 20km of land developed at medium or high intensity in 2011 that is undeveloped or developed at low intensity, based on land cover data from the 2011 slice of the nlcd2019. Within the urban fringe, we isolate land devoted to agricultural use based on the same 2011 slice of the nlcd2019. We then calculate the average price of vacant plots used for agriculture at the urban fringe of each city using gridded land value data for vacant plots from Nolte (2020), derived from parcel sales 2000–2019. All prices are converted to 2012 dollars using the Consumer Price Index for all urban consumers from us Bureau of Labor Statistics (2023).

Building permits: Data about the number of building permits plotted in figure C.1 and used for our counterfactuals are from the us Department of Housing and Urban Development (hud). The source data is at the county level and we aggregate this up to the 1999 msa/necma level. The variable annual permits relative to housing stock on the vertical axis of figure C.1 divides for each city the total number of residential construction permits during the period 2008–2012 (to match the timing of the acs housing data) by the total number of housing units in the city for that period as recorded in the acs data.

National Longitudinal Survey of Youth: Our estimation of the parameters governing agglomeration economies in table 2 uses panel data from the “cross-sectional sample” of the National Longitudinal Survey of Youth 1979 (nlsy79). The survey, conducted by the us Department of Labor’s Bureau of Labor Statistics, follows a nationally representative sample of 6,111 men and women who were 14–22 years old when they were first surveyed in 1979. These individuals were interviewed annually through 1994 and have been interviewed on a biennial basis since 1996. We use data for the period 1979–2012. The nlsy79 contains information on a rich set of personal characteristics and tracks individuals’ labour market activities. Our starting panel is the same as in De la Roca, Ottaviano, and Puga (2023) and we refer the reader to that paper for further details. For each respondent, the confidential geocoded portion of the nlsy79 reports the county and state where they were located at birth, at age 14, and at each interview date since 1979. We use that location information both to record the 1999 msa/necma where each worker is currently employed and to split work experience accumulated until then into work experience in cities with populations equal to or greater than 5 million, in cities with populations equal to or greater than 2 million but below 5 million, and elsewhere. Since we need a reasonable number of observations to estimate city fixed effects, we include indicators for all metropolitan areas with a population above 2 million and additional indicators for groups of similar-size metropolitan areas with a population below 2 million. In particular, we have a common indicator for cities in groups that start at 75,000 people in increments of 25,000 until 600,000, then in increments of 50,000 people until 800,000, and then in increments of 100,000 people until 2 million. This aggregates the 261 metropolitan areas included in the panel into 63 groups.
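The grouping of cities by size and the three-way split of work experience can be sketched in Python as follows. This is an illustration under stated assumptions rather than the package's own code: the function names and labels are hypothetical, a city with exactly 2 million people is given its own indicator, and cities below 75,000 are left unassigned because the text does not say how they are handled.

```python
import bisect


def size_group_edges():
    """Lower bounds of the size groups, plus the final 2 million upper bound."""
    edges = list(range(75_000, 600_000, 25_000))           # steps of 25,000
    edges += list(range(600_000, 800_000, 50_000))         # steps of 50,000
    edges += list(range(800_000, 2_000_000 + 1, 100_000))  # steps of 100,000
    return edges


def size_group(population, msa_code, edges):
    """Label of the fixed-effect group for a city (hypothetical labels)."""
    if population >= 2_000_000:
        return f"msa_{msa_code}"  # large cities get their own indicator
    if population < edges[0]:
        return None               # not covered by the description
    i = bisect.bisect_right(edges, population) - 1
    return f"size_{edges[i]}_{edges[i + 1]}"


def experience_bucket(population):
    """City-size class used to split accumulated work experience."""
    if population >= 5_000_000:
        return "5m_plus"
    if population >= 2_000_000:
        return "2m_to_5m"
    return "elsewhere"


edges = size_group_edges()
print(len(edges) - 1)                       # number of size bins below 2 million
print(size_group(610_000, "0000", edges))   # hypothetical city code
print(experience_bucket(3_000_000))
```

The edges above define 37 size bins below 2 million. The 63 groups reported in the text also include the individual indicators for large cities, and some bins may contain no city in the panel.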

In the tsls estimation of column (2) in table 2, we instrument the log of city size with the percentage of the area in a 30-kilometre radius around the city centre that has slopes greater than 15% and the percentage covered by wetlands (both computed as in our geographical constraints to urban expansion), the inverse hyperbolic sine of the city’s population in 1850 and 1920 (from Schroeder, 2016), the inverse hyperbolic sine of the distance to the Eastern Seaboard (computed using coastline shapefiles from the National Hydrography Dataset of the us Geological Survey), and heating degree days (from Burchfield, Overman, Puga, and Turner, 2006).
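For readers unfamiliar with the inverse hyperbolic sine transformation applied to two of the instruments, the short sketch below shows its behaviour using Python's standard library. The input values are made up; the sketch only illustrates the general properties of the function, which is defined at zero and is close to log(2x) for large x.

```python
import math


def ihs(x):
    """Inverse hyperbolic sine: log(x + sqrt(x**2 + 1))."""
    return math.asinh(x)


# Made-up values, not taken from the data.
print(ihs(0.0))                                      # defined at zero, unlike log
print(ihs(1_000_000.0), math.log(2 * 1_000_000.0))   # nearly equal for large x
```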

Processed data

The Stata script code/_hcgrowth_run.do first runs code/1_hcgrowth_builddata.do to perform the data construction, creating the processed data files used for the analysis and placing them in the data/processed/ directory. The processed data consist of the following files and variables:

Results

The main Stata script code/_hcgrowth_run.do, after running code/1_hcgrowth_builddata.do to create the data files used for the analysis, automatically runs code/2_hcgrowth_analysis.do to perform the analysis of the processed data. All the results are placed in the results/ directory.

Once the code runs, the researcher must compile in LaTeX the file results/hcgrowth_tables.tex to produce a PDF file with all the tables.

All of the numbers mentioned in the text that are not directly available in the tables are also automatically produced by code/2_hcgrowth_analysis.do by calling the Stata script code/analysis/hcgrowth_text_results.do and saved as a text file results/hcgrowth_text_results.txt that includes all the relevant sentences in the paper.

Figures are saved in Encapsulated PostScript format as results/hcgrowth_fig3.eps (figure 3); results/hcgrowth_fig4.eps (figure 4); results/hcgrowth_fig5a.eps, results/hcgrowth_fig5b.eps, results/hcgrowth_fig5c.eps, and results/hcgrowth_fig5d.eps (the four panels of figure 5); and results/hcgrowth_figc1.eps (appendix figure C.1). They are also saved in Portable Network Graphics (png) format with the same file names and extension .png.

Figures 1 and 2 illustrate the theoretical model rather than empirical results. The curves in those figures are produced by running Mathematica notebooks code/analysis/hcgrowth_fig1.nb and code/analysis/hcgrowth_fig2.nb (in the second notebook, specifying first Year = 1980; and then Year = 2010; in the first line to produce the two panels), which save figures 1 and 2 in Encapsulated PostScript format as results/hcgrowth_fig1.eps (figure 1); results/hcgrowth_fig2_1980.eps and results/hcgrowth_fig2_2010.eps (the two panels of figure 2). These figures are also saved in Portable Network Graphics (png) format with the same file names and extension .png.

References

Akbar, Prottoy, Victor Couture, Gilles Duranton, and Adam Storeygard. 2023. The fast, the slow, and the congested: Urban transportation in rich and poor countries. Preprint, University of Pennsylvania.

Baum, Christopher F., Mark E. Schaffer, and Steven Stillman. 2022. ivreg2: Stata module for extended instrumental variables/2sls, gmm and ac/hac, liml and k-class regression.

Burchfield, Marcy, Henry G. Overman, Diego Puga, and Matthew A. Turner. 2006. Causes of sprawl: A portrait from space. Quarterly Journal of Economics 121(2): 587–633.

Correia, Sergio. 2018. ivreghdfe: Stata module for extended instrumental variable regressions with multiple levels of fixed effects.

Correia, Sergio. 2019a. ftools: Stata module to provide alternatives to common Stata commands optimized for large datasets.

Correia, Sergio. 2019b. reghdfe: Stata module to perform linear or instrumental-variable regression absorbing any number of high-dimensional fixed effects.

Couture, Victor, Gilles Duranton, and Matthew A. Turner. 2018. Speed. Review of Economics and Statistics 100(4): 725–739.

Crow, Kevin. 2015. shp2dta: Stata module to converts shape boundary files to Stata datasets.

De la Fuente, Ángel and Rafael Doménech. 2014. Educational attainment in the oecd 1960–2010 (version 3.1). Working Paper 2014-14, Fundación de Estudios de Economía Aplicada.

De la Roca, Jorge, Gianmarco I. P. Ottaviano, and Diego Puga. 2023. City of dreams. Journal of the European Economic Association 21(2): 690–726.

De la Roca, Jorge, Ingrid Gould Ellen, and Katherine M. O’Regan. 2014. Race and neighborhoods in the 21st century: What does segregation mean today? Regional Science and Urban Economics 47: 138–151.

Dewitz, Jon and us Geological Survey. 2021. National Land Cover Database (nlcd) 2019 Products: Version 2.0, June 2021. Sioux Falls, sd: United States Geological Survey.

Duranton, Gilles and Diego Puga. 2023. Urban growth and its aggregate implications. Econometrica 91(6): 2219–2259.

Flood, Sarah, Miriam King, Renae Rodgers, Steven Ruggles, and J. Robert Warren. 2018. Integrated Public Use Microdata Series, Current Population Survey: Version 6.0. Minneapolis: University of Minnesota.

Forstall, Richard L. 1996. Population of States and Counties of the United States: 1790 to 1990. Washington, dc: us Bureau of the Census.

Glaeser, Edward L. and Joseph Gyourko. 2018. The economic implications of housing supply. Journal of Economic Perspectives 32(1): 3–30.

Gyourko, Joseph, Jonathan S. Hartley, and Jacob Krimmel. 2021. The local residential land use regulatory environment across us housing markets: Evidence from a new Wharton index. Journal of Urban Economics 124: 103337.

Gyourko, Joseph, Albert Saiz, and Anita A. Summers. 2008. A new measure of the local regulatory environment for housing markets: The Wharton Residential Land Use Regulatory Index. Urban Studies 45(3): 693–729.

Jann, Ben. 2023. estout: Stata module to export estimation results from estimates table.

Jann, Ben. 2020. grstyle: Stata module to customize the overall look of graphs.

Jann, Ben. 2022. palettes: Stata module providing color palettes, symbol palettes, and line pattern palettes.

Kleibergen, Frank, Mark E. Schaffer, and Frank Windmeijer. 2020. ranktest: Stata module to test the rank of a matrix.

Lokshin, Michael. 2006. Semi-parametric difference-based estimation of partial linear regression models. Stata Journal 6(3): 377–383.

Manson, Steven, Jonathan Schroeder, David Van Riper, Tracy Kugler, and Steven Ruggles. 2021. Integrated Public Use Microdata Series, National Historical Geographic Information System: Version 16.0. Minneapolis: ipums.

Nolte, Christoph. 2020. High-resolution land value maps reveal underestimation of conservation costs in the United States. Proceedings of the National Academy of Sciences 117(47): 29577–29583.

Reif, Julian. 2020. regsave: Stata module to save regression results to a Stata-formatted dataset.

Riley, Shawn J., Stephen D. DeGloria, and Robert Elliot. 1999. A terrain ruggedness index that quantifies topographic heterogeneity. Intermountain Journal of Sciences 5(1–4): 23–27.

Schroeder, Jonathan P. 2016. Historical Population Estimates for 2010 us States, Counties and Metro/Micro Areas, 1790–2010. Minneapolis: University of Minnesota.

us Bureau of Labor Statistics. 2023. Consumer Price Index for all urban consumers: All Items in us City Average. Washington, dc: United States Bureau of Labor Statistics. Retrieved from fred, Federal Reserve Bank of St. Louis.

us Bureau of the Census. 2023. cps Historical Time Series Tables. Washington, dc: United States Bureau of the Census.

us Geological Survey. 2018. 1 Arc-second Digital Elevation Models – usgs National Map 3dep Downloadable Data Collection. Reston, va: United States Geological Survey.

us Geological Survey. 2020. Protected Areas Database of the United States (pad-us): Version 2.1, December 2020. Reston, va: United States Geological Survey.

Yatchew, Adonis. 1998. Nonparametric regression techniques in Economics. Journal of Economic Literature 36(2): 669–721.