Code for National Longitudinal Survey of Youth 1979 Geocode Data

Data and replication files for 'City of dreams'

by Jorge De la Roca, Gianmarco I. P. Ottaviano, and Diego Puga

This page distributes and documents computer programs and data to replicate the results obtained by Jorge De la Roca, Gianmarco I. P. Ottaviano, and Diego Puga in their article 'City of dreams,' to be published in the Journal of the European Economic Association.

The primary data source is the National Longitudinal Survey of Youth 1979 (NLSY79). This survey by the US Bureau of Labor Statistics (BLS) follows a nationally representative sample of men and women who were 14-22 years old in 1979. NLSY79 data users can also use the computer programs on this page to generate a general-purpose individual-level panel.

Bigger cities offer more valuable experience and opportunities in exchange for higher housing costs. While higher-ability workers benefit more from bigger cities, they are not more likely to move to one. The article proposes a model of urban sorting by workers with heterogeneous self-confidence and ability, which suggests that flawed self-assessment is partly to blame. Workers who misjudge their ability at an early career stage make location decisions they would not have made had they known their ability. By the time they learn enough about their actual ability, those early decisions have had a lasting impact, reducing their incentives to move and affecting their lifetime earnings.

Analysis of NLSY79 data shows that, in line with our model predictions, the location choices of young workers are guided by self-confidence rather than ability. Thus, some overconfident young workers start their career in a big city, whereas with better self-assessment they would have chosen a small city. That initial misjudged decision then becomes self-validating: having incurred a steep cost to gain more valuable experience, they find they might as well take advantage of it by remaining in the big city.

Conversely, some underconfident young workers spend their lives in a small city, even though a correct initial ability assessment would have made them self-select into a big city. Workers who severely underestimate their ability may nevertheless relocate from a small to a big city once labour market experience provides them with better information about their true capabilities. Young workers who are confident enough in their abilities locate in bigger cities to pursue their dreams, but those dreams do not come true for everyone.

The replication files

The full replication package is available for download from this site as a zip file (11.06 MB).

This replication package contains all the required data and code except for the restricted-access geocode files with the location (county) of respondents at birth, at age 14, and in every survey wave.

Obtaining access to the geocode NLSY79 data

Replicating the results of the published article requires, in addition to the code and data files provided here, access to the NLSY79 geocode data.

Only employees and students of US universities, employees of US federally funded research centers, and employees of eligible US government institutions and non-profits can request access to the NLSY79 geocode data. The US Bureau of Labor Statistics (BLS) has no provisions for accessing the NLSY79 geocode data from outside the United States.

At the time of writing, the application for access to the NLSY79 geocode data, along with information about the process, is available from the BLS. In the application, the researcher must describe the project's research objectives in a few paragraphs. If the application is approved, the BLS will send the researcher a Letter of Agreement to be signed by an official institutional signatory. The researcher must sign additional agreements, and in the case of students, their research advisor must be the signatory. Data access agreements are between the BLS and the recipient institution, not between the BLS and individual researchers. All geocode data access must occur on the recipient institution's physical premises.

When we produced the results of this research, the US Bureau of Labor Statistics (BLS) sent authorised users of the NLSY79 geocode data a CD-ROM with a series of additional files containing the restricted-access location data for NLSY79 respondents. Shortly before the article's publication, the BLS transitioned its mode of provision of NLSY79 geocode data to a virtual data enclave (VDE). In this managed environment, researchers can analyse the geocode data. Statistical software available for use in the VDE includes Stata. Researchers can bring external files (such as the replication files for this project) into the VDE and extract analysis results from it, following a BLS approval process.

Instructions and overview of the replication files

These are the steps to construct the panel and replicate the results of the Journal of the European Economic Association article:

The Stata script code/ first runs code/ to perform the data construction. This code uses the data files described under Source data below, located in the directories data/src/cbsa, data/src/nlsy, data/src/nlsy/cpi_ind_occ, and data/src/nlsy/geocode. The script code/ first runs code/builddata/ to build a general-purpose panel that it saves as data/processed/nlsy_panel.dta. Next, it runs code/builddata/ to prepare the panel for our estimations, adding core-based statistical area (CBSA) codes for locations, defining the junior and senior periods, constructing controls, and defining the final sample. This panel is saved as data/processed/nlsy_panel_sample.dta.
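As a rough illustration of the location step described above, the following Stata sketch attaches CBSA codes to county-level locations and flags big cities using the two-million 2010 population threshold reported in the results text further down this page. It is not the authors' code: the variable names (county, cbsa, cbsa_pop2010, bigcity), the toy values, and the treatment of unmatched counties are all assumptions made for the example.

```stata
* Illustrative sketch only: all names and values below are made up.
clear
input county cbsa cbsa_pop2010
101 9001 3000000
102 9001 3000000
201 9002  500000
end
tempfile crosswalk
save `crosswalk'

* Toy respondent-year panel with a county of residence in each wave.
clear
input id year county
1 1985 101
1 1986 201
2 1985 102
2 1986 999
end

* Attach CBSA codes; county 999 has no match and keeps missing CBSA values.
merge m:1 county using `crosswalk', keep(master match) nogenerate

* Big city = CBSA with a 2010 population above two million.
* Assumption: respondents without a matched CBSA are left missing here.
generate byte bigcity = cbsa_pop2010 > 2000000 if !missing(cbsa_pop2010)

sort id year
list id year county cbsa bigcity, sepby(id)
```

In this toy panel, respondent 1 moves from a big city to a small one between waves, which is the kind of transition the junior and senior period definitions are built to capture.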

After the Stata script code/ creates the data file used for the analysis and places it in data/processed/nlsy_panel_sample.dta, the Stata script code/ automatically runs code/ to perform the analysis of the processed data (described under Processed data below) and stores all the results (described under Results below) in the results/ directory.

Using the code without the geocode NLSY79 data

While it is not possible to replicate the results of the Journal of the European Economic Association article without the restricted-access NLSY79 geocode data, researchers can run the code without these additional data under two scenarios.

Researchers who wish to use the public-use NLSY79 data for other projects that do not require the location of respondents can use our code as a starting point. To do this, edit code/ and set the flag global NLSYDisableGeocode = 1. This adjustment will produce a general-purpose panel (described under Processed data below but without location data) saved as data/processed/nlsy_panel.dta. However, with this setting the code will not produce any tables or figures with results.
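For instance, the flag could be set near the top of the script as in the sketch below. Only the flag name and value come from the description above; the guard that reports what is skipped is a hypothetical illustration of how such a flag might be consulted.

```stata
* Public-use run: skip all steps that need the restricted geocode files.
global NLSYDisableGeocode = 1

* Hypothetical guard, shown only to illustrate how the flag might be used.
if $NLSYDisableGeocode == 1 {
    display "Geocode steps disabled: building the general-purpose panel only."
}
```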

Researchers without access to the NLSY79 geocode data who wish to check that the replication code runs smoothly can edit code/ and set the flag global NLSYGenerateFakeLocations = 1. This adjustment will randomly generate a fake location history for each respondent, allowing the code to run but generating meaningless results that have the same format as the actual results in the article but different values.
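To give a sense of what such a placeholder location history might look like, the sketch below draws a random county code for each respondent and wave. It is only an illustration under assumed names and ranges (id, year, fake_county, codes 1 to 50, an arbitrary seed); the replication code may generate fake locations differently.

```stata
* Flag named in the text above.
global NLSYGenerateFakeLocations = 1

* Illustrative sketch only: toy panel of 3 respondents observed in 4 waves.
clear
set seed 12345
set obs 3
generate id = _n
expand 4
bysort id: generate year = 1978 + _n

* Draw a made-up county code independently for every respondent-year.
generate fake_county = runiformint(1, 50)

list id year fake_county, sepby(id)
```

Fixing the seed makes the fake histories reproducible across runs, which helps when the goal is only to confirm that the code executes from start to finish.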

Software and hardware notes

The results and figures in the Journal of the European Economic Association article were produced in Stata version 17 using the code and data provided here.

The code is highly portable; nevertheless, one should keep in mind the following considerations:

Source data

The primary source data combine public-use and restricted-access geocode data from the National Longitudinal Survey of Youth 1979 (NLSY79). The replication code reads all source NLSY79 data files from the data/src/nlsy directory and its data/src/nlsy/geocode child directory. The required public-use NLSY79 data files, which are included with the replication file in the data/src/nlsy directory, are the following:

If desired, the researcher can re-download these files from the BLS (for instance, to obtain additional variables). The files raw_data.NLSY79 and raw_data.NLSY79 are saved tagsets that make it easy to select the required variables for download in the NLSY Investigator platform.

The restricted-access NLSY79 geocode data files are not included with the replication file. The researcher must request them from the US Bureau of Labor Statistics (BLS) and place them in the data/src/nlsy/geocode directory, as explained above. The required restricted-access NLSY79 geocode data files are the following:

In addition, complementary data files are needed to deflate nominal wages and create standardised time-consistent codes of occupation and sector. The researcher can find these files in the data/src/nlsy/cpi_ind_occ directory. Furthermore, researchers with access to the NLSY79 geocode data can assign counties to metropolitan areas using the data files in the data/src/cbsa directory. We describe these auxiliary data files below:

We combine these source files to construct an annual panel from 1979 to 1994 and a biennial panel from 1994 to 2012.
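The resulting time structure can be sketched as follows. This is a hedged illustration rather than the authors' code: the variable names are hypothetical, and it assumes that the biennial waves after 1994 fall in even years.

```stata
* Illustrative sketch only: build the list of survey-wave years.
clear
set obs 34
generate year = 1978 + _n                     // calendar years 1979 to 2012

* Annual waves through 1994, then (assumed) even-year waves through 2012.
generate byte wave_year = year <= 1994 | mod(year, 2) == 0

keep if wave_year
list year, clean noobs
```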

Processed data

The replication code fully recreates the processed data from the original sources and performs the data analysis. The processed data consist of the following files and variables:


Results

After running code/ to create the data file used for the analysis, the Stata script code/ automatically runs code/ to perform the analysis of the processed data. Specifically, code/ runs in sequence the Stata scripts code/analysis/, code/analysis/, code/analysis/, code/analysis/, and code/analysis/ to produce the LaTeX code for the tables in the article. Subsequently, it runs the Stata scripts code/analysis/ and code/analysis/ to produce the figures, and code/analysis/ to calculate various numbers mentioned in the text.

All the results are placed in the results/ directory.

After running code/, the researcher must compile in LaTeX the file results/dreams_tables.tex to produce a PDF file with all the tables.

Figures are saved in Encapsulated PostScript format as results/dreams_fig1.eps (figure 1); results/dreams_fig2a.eps, results/dreams_fig2b.eps, and results/dreams_fig2c.eps (the three panels of figure 2); and results/dreams_figb1a.eps and results/dreams_figb1b.eps (the two panels of appendix figure B.1). They are also saved in PNG format with the same file names and extension .png.
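Exporting a figure in both formats takes two graph export calls in Stata. The sketch below uses Stata's bundled auto dataset and a hypothetical file name (example_fig) written to the current directory; it does not reproduce any figure from the article.

```stata
* Illustrative sketch only: draw any graph, then export it twice.
sysuse auto, clear
twoway scatter mpg weight

* Same base name, two formats: the file extension selects the export format.
graph export "example_fig.eps", replace
graph export "example_fig.png", replace
```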

The results mentioned in the text, besides those contained in tables, are calculated by code/analysis/ This Stata script automatically writes the relevant paragraphs to the text file results/dreams_text_results.txt. This text file reads as follows:

'City of dreams', by Jorge De la Roca, Gianmarco I. P. Ottaviano, and Diego Puga

Results mentioned in the text not contained in tables.

Section 1

According to our data, 56% of all individuals (and 42% of the college-educated) in the United States live in the same city at ages 14 and 40.

Our data show a low correlation of 0.21 between ability and self-confidence (our measure of ability self-assessment). Among college graduates, this correlation falls to 0.02.

Our primary measure of ability is the individual's percentile score in the Armed Forces Qualification Test (AFQT), a general ability test administered to respondents in 1980 when they were between 15 and 23 (with a median age of 19).

Section 3

Our measure of ability is the individual's percentile score in the Armed Forces Qualification Test (AFQT). This general ability test was administered in 1980 when NLSY respondents were between 15 and 23 (with a median age of 19), regardless of their interest in the military.

The correlation between the AFQT and the Rosenberg test scores is low (0.21) for the full sample, suggesting that ability assessment is imperfect. Our model assumes that labour market experience provides workers with a better self-assessment of ability. Since the age of NLSY79 respondents ranged between 15 and 23 when tested in 1980, a way to see if self-assessment improves over time with job experience is to analyse whether self-confidence and ability are more correlated for older respondents at the time of the tests.

Regarding timing, we set the junior period for all respondents at the year after their highest level of education is completed, excluding educational periods that happen after more than two years away from education (median age of 20 for individuals without post-secondary education and 24 for the college-educated).

Based on these counties, we determine whether each respondent lives in a Core Based Statistical Areas (CBSA) with a 2010 population above two million. If so, we classify them as living in a big city, otherwise as living in a small city. This population threshold leads to 40% and 39% of individuals living in big cities during their junior and senior periods respectively.

The initial sample includes all 6,111 individuals in the cross-sectional sample of the NLSY79. We exclude individuals for whom the AFQT or the Rosenberg self-esteem scores are missing, which reduces the sample to 5,671 individuals. We can determine the junior period location of 5,462 of these individuals and, due to sampling attrition, the senior period locations of 5,180 of them. The availability of the demographic controls that we include further reduces our sample to 5,254 individuals in the junior period analysis and 4,985 individuals in the senior period analysis.

Section 6

One year after completing their education, 71% of individuals in our sample are in the same city as at age 14, and 61% remain there by age 40.

In table 2, 33.6% of individuals move between both periods while only 13.4% change city-size class (i.e, SB or BS).

Importantly, self-assessment of ability relative to people with the same education is so imperfect that there is virtually no correlation (0.02) between self-confidence and ability among college-educated workers.
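A text file like the one above can be produced with Stata's file commands. The following is a minimal sketch under stated assumptions, not the authors' script: it uses the bundled auto dataset, a hypothetical output file name, and an unrelated correlation purely to show the mechanics of formatting a statistic and writing it into a sentence.

```stata
* Illustrative sketch only: write a computed number into a sentence on disk.
sysuse auto, clear

* Compute a statistic and format it to two decimals.
correlate mpg weight
local rho : display %4.2f r(rho)

* Write a sentence containing the formatted number to a plain-text file.
tempname fh
file open `fh' using "example_text_results.txt", write replace text
file write `fh' "The correlation between mpg and weight is `rho'." _n
file close `fh'

type "example_text_results.txt"
```

Generating the sentences from the estimates in this way keeps the numbers quoted in the text in step with the data, since rerunning the script rewrites them.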


References

De la Roca, Jorge, Gianmarco I. P. Ottaviano, and Diego Puga. Forthcoming. City of dreams. Journal of the European Economic Association.

Jann, Ben. 2004. estout: Stata module to export estimation results from estimates table.

Jann, Ben. 2017. grstyle: Stata module to customize the overall look of graphs.