Code for mcvl Spanish Social Security Data

Replication files for 'Learning by working in big cities'

by Jorge De la Roca and Diego Puga

This site distributes and documents computer programs to replicate the results obtained by Jorge De la Roca and Diego Puga in their article 'Learning by working in big cities', published in Review of Economic Studies, 84(1), January 2017: 106-142.

This research uses anonymized administrative data from the Muestra Continua de Vidas Laborales con Datos Fiscales (mcvl) with the permission of Spain's Dirección General de Ordenación de la Seguridad Social. We are NOT allowed to make the mcvl data available from this site. Thus, in addition to the replication files available here, interested researchers will need to request access to the mcvl data from Spain's Dirección General de Ordenación de la Seguridad Social, by following the application process described below.

The mcvl data are extremely rich, containing matched anonymized social security, income tax and census records for a 4% random sample of Spanish workers, pensioners and unemployment benefit recipients. The application process required to obtain the mcvl data is simple, and approved users are allowed to work with the data on their own computers. By providing this replication code, in addition to enabling easy replication of our results, we hope to substantially reduce entry costs for users of the mcvl data.

The replication files

The replication files, documented below, are available for download from this site as a zip file: esurban_replication.zip (167 Kb). This contains:

A series of Stata do files that create a monthly panel from the source mcvl data files provided by Spain's Dirección General de Ordenación de la Seguridad Social. These are contained in the directory code/panel/ within the zip file. Running the Stata do file code/make_panel.do produces a Stata data file containing the panel, output/mcvl_panel.dta.
A series of Stata do files that replicate all of the tables and figures in the published article. These are contained in the directory code/analysis/ within the zip file. Running the Stata do file code/make_esurban.do produces the individual tables in LaTeX format and the individual figures in PDF format and places them in the directory output/. Subsequently, compiling in LaTeX the file output/esurban_results.tex produces a PDF file with all the tables and figures put together.
A Stata do file, code/make_all.do, that calls code/make_panel.do and code/make_esurban.do to first create the panel and then replicate the results without any additional user intervention.
Four additional data files, found in the directory otherdata/, ua_muni.dta (Spain's official urban area definitions), ua_size.dta (our size measures for Spain's urban areas), ua_geog.dta (geographical variables for Spain's urban areas, used as instruments), and cpi_annual.dta (consumer price index data for Spain for 1980-2014).
A copy of this documentation: readme.html (the latest version can be found at https://diegopuga.org/data/mcvl/).

The MCVL data

Starting with the year 2004, an edition of the Muestra Continua de Vidas Laborales con Datos Fiscales (mcvl) data set has been released every year with social security records for a 4% non-stratified random sample of the population who had any relationship with Spain’s Social Security in that year (individuals who are working, receiving unemployment benefits, or receiving a pension).

There are two versions of the mcvl, with and without income tax data. This research requires access to the version with income tax data (con Datos Fiscales), where gross labour earnings are recorded separately for each job and are not subjected to any censoring.

Each mcvl edition provides social security records covering the complete labour market history of the individuals included in it. However, it only provides their uncensored earnings from income tax records for the year of that particular mcvl edition. Thus, we combine multiple editions of the mcvl, beginning with the first produced, for 2004, to have uncensored earnings throughout our study period.

Different editions of the mcvl can be combined because the criterion for inclusion in the mcvl (based on the individual’s permanent tax identification number) as well as the algorithm used to construct the individual’s anonymized identifier are maintained across mcvl editions. Combining multiple waves has the additional advantage of maintaining the representativeness of the sample throughout the study period, by enlarging the sample to include individuals who have an affiliation with the Social Security in one year but not in another. More recent editions add individuals who enter the labour force for the first time while they lose those who cease affiliation with the Social Security.

We track workers over time throughout their working lives to compute their job tenure and their work experience in different urban areas, but study their earnings only when employed in 2004-2009. In particular, we regress individual monthly earnings in 2004-2009 on a set of characteristics that capture the complete prior labour history of each individual. We do not study years prior to 2004 due to the lack of earnings from income tax data. We also do not study years after 2009 due to the extreme impact of the Great Recession on Spain after that year. In particular, our fixed-effects estimations rely on migrants to identify some key coefficients. Migrations across urban areas had remained very stable, with around 7% of workers relocating every year since 1998 through both bad and good times, but plummeted below 3% in the Great Recession.

The mcvl also provides individual characteristics contained in social security records, such as age and gender, and also matched characteristics contained in Spain’s Continuous Census of Population (Padrón Continuo), such as country of birth, nationality, and educational attainment. Information on educational attainment has improved in recent editions of the mcvl. The Ministry of Education now reports individuals’ highest educational attainment directly to the National Statistical Institute and this information is used to update the corresponding records in the Continuous Census of Population. To take advantage of this improved data on educational attainment, we use not only the editions of the mcvl that correspond to our study period 2004-2009, but also the editions for the period 2010-2013.

Interested users must therefore apply to Spain's Dirección General de Ordenación de la Seguridad Social to obtain the Muestra Continua de Vidas Laborales con Datos Fiscales data for the years 2004-2013.

Obtaining the MCVL data

The application process for obtaining the mcvl data requires completing 11 forms that can be downloaded from the website of Spain's Seguridad Social. At the time of writing, the forms (available only in Spanish) can be found at http://www.seg-social.es/Internet_1/Estadistica/Est/Muestra_Continua_de_Vidas_Laborales/SolicitarM/index.htm. One form (Ficha de Usuario) asks for details of the user and the research project. The remaining ten forms (Condiciones MCVL 20YY CDF), one for each edition of the mcvl 2004-2013, specify the terms and conditions of use. Note that there is a different form for every edition and for versions with and without income tax data (the versions with income tax data are marked con Datos Fiscales or CDF).

Once completed and signed, the forms must be sent to:

Dirección General de Ordenación de la Seguridad Social
Subdirección General de Seguimiento Económico
C/Jorge Juan 59
28001 Madrid
Spain

If the application is approved, the user will receive a set of DVDs with the data.

Combining the MCVL data and the replication code

After downloading the replication code from this site and obtaining the Muestra Continua de Vidas Laborales con Datos Fiscales data for the years 2004-2013 from Spain's Dirección General de Ordenación de la Seguridad Social, users need to unzip the replication files and copy the data files contained in the DVD for each mcvl edition to the paths specified below.

The DVD for each mcvl edition contains multiple data files with a .txt or .trs extension. We use the following files to construct the panel: a file with individual characteristics (the original name contains the terms PERSANON or PERSONAL); several files that comprise all spells in the lives of individuals, specifying start and end dates for each of them (their names contain the terms AFILANON, AFILIAD or AFILIA); files for each of the twelve calendar months that provide (top- and bottom-coded) earnings from social security records for each spell that takes place in a given month (their names contain the terms COTIANON or COTIZA); a file with earnings from social security records for self-employment and other types of non-standard work regimes (its name also contains COTIANON or COTIZA but has the number 13 instead of the calendar month number, or contains CPROANON); and a file with (uncensored) annual earnings from tax records for every tax source in a calendar year (the original name contains the terms BLOQUE5, DATOS_FISCALES or FISCAL).

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2004 must be placed in the directory mcvl/mcvl_cdf_2004/: AFILANON1.txt, AFILANON2.txt, AFILANON3.txt, BLOQUE5.txt, COTIANON1.txt, COTIANON2.txt, COTIANON3.txt, COTIANON4.txt, COTIANON5.txt, COTIANON6.txt, COTIANON7.txt, COTIANON8.txt, COTIANON9.txt, COTIANON10.txt, COTIANON11.txt, COTIANON12.txt, COTIANON13.txt, DIVISION.txt, PERSANON.txt, PREANON.txt.

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2005 must be placed in the directory mcvl/mcvl_cdf_2005/: AFILANON1.trs, AFILANON2.trs, AFILANON3.trs, BLOQUE5.trs, CONVIVI.trs, COTIANON1.trs, COTIANON2.trs, COTIANON3.trs, COTIANON4.trs, COTIANON5.trs, COTIANON6.trs, COTIANON7.trs, COTIANON8.trs, COTIANON9.trs, COTIANON10.trs, COTIANON11.trs, COTIANON12.trs, CPROANON.trs, DIVISION.trs, PERSANON.trs, PREANON.trs.

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2006 must be placed in the directory mcvl/mcvl_cdf_2006/: AFILANON1.trs, AFILANON2.trs, AFILANON3.trs, CONVIVI.trs, COTIANON1.trs, COTIANON2.trs, COTIANON3.trs, COTIANON4.trs, COTIANON5.trs, COTIANON6.trs, COTIANON7.trs, COTIANON8.trs, COTIANON9.trs, COTIANON10.trs, COTIANON11.trs, COTIANON12.trs, COTIANON13.trs, DATOS_FISCALES.trs, DIVISION.trs, PERSANON.trs, PREANON.trs.

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2007 must be placed in the directory mcvl/mcvl_cdf_2007/: AFILANON1.trs, AFILANON2.trs, AFILANON3.trs, CONVIVI.trs, COTIANON1.trs, COTIANON2.trs, COTIANON3.trs, COTIANON4.trs, COTIANON5.trs, COTIANON6.trs, COTIANON7.trs, COTIANON8.trs, COTIANON9.trs, COTIANON10.trs, COTIANON11.trs, COTIANON12.trs, COTIANON13.trs, DATOS_FISCALES.trs, DIVISION.trs, PERSANON.trs, PREANON.trs.

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2008 must be placed in the directory mcvl/mcvl_cdf_2008/: AFILANON1.trs, AFILANON2.trs, AFILANON3.trs, CONVIVI.trs, COTIANON1.trs, COTIANON2.trs, COTIANON3.trs, COTIANON4.trs, COTIANON5.trs, COTIANON6.trs, COTIANON7.trs, COTIANON8.trs, COTIANON9.trs, COTIANON10.trs, COTIANON11.trs, COTIANON12.trs, COTIANON13.trs, DATOS_FISCALES.trs, DIVISION.trs, PERSANON.trs, PREANON.trs.

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2009 must be placed in the directory mcvl/mcvl_cdf_2009/: MCVL2009AFILIAD1_CDF.TXT, MCVL2009AFILIAD2_CDF.TXT, MCVL2009AFILIAD3_CDF.TXT, MCVL2009CONVIVIR_CDF.TXT, MCVL2009COTIZA1.TXT, MCVL2009COTIZA2.TXT, MCVL2009COTIZA3.TXT, MCVL2009COTIZA4.TXT, MCVL2009COTIZA5.TXT, MCVL2009COTIZA6.TXT, MCVL2009COTIZA7.TXT, MCVL2009COTIZA8.TXT, MCVL2009COTIZA9.TXT, MCVL2009COTIZA10.TXT, MCVL2009COTIZA11.TXT, MCVL2009COTIZA12.TXT, MCVL2009COTIZA13.TXT, MCVL2009DIVISION_CDF.TXT, MCVL2009FISCAL_CDF.TXT, MCVL2009PERSONAL_CDF.TXT, MCVL2009PRESTAC_CDF.TXT.

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2010 must be placed in the directory mcvl/mcvl_cdf_2010/: MCVL2010AFILIAD1_CDF.TXT, MCVL2010AFILIAD2_CDF.TXT, MCVL2010AFILIAD3_CDF.TXT, MCVL2010CONVIVIR_CDF.TXT, MCVL2010COTIZA1.TXT, MCVL2010COTIZA2.TXT, MCVL2010COTIZA3.TXT, MCVL2010COTIZA4.TXT, MCVL2010COTIZA5.TXT, MCVL2010COTIZA6.TXT, MCVL2010COTIZA7.TXT, MCVL2010COTIZA8.TXT, MCVL2010COTIZA9.TXT, MCVL2010COTIZA10.TXT, MCVL2010COTIZA11.TXT, MCVL2010COTIZA12.TXT, MCVL2010COTIZA13.TXT, MCVL2010DIVISION_CDF.TXT, MCVL2010FISCAL_CDF.TXT, MCVL2010PERSONAL_CDF.TXT, MCVL2010PRESTAC_CDF.TXT.

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2011 must be placed in the directory mcvl/mcvl_cdf_2011/: MCVL2011.F2013.AFILIA1_CDF.txt, MCVL2011.F2013.AFILIA2_CDF.txt, MCVL2011.F2013.AFILIA3_CDF.txt (these three file names reflect that these are the updated versions now being provided by the Social Security Administration correcting an error in the initial version of these files), MCVL2011CONVIVIR_CDF.TXT, MCVL2011COTIZA1_CDF.TXT, MCVL2011COTIZA2_CDF.TXT, MCVL2011COTIZA3_CDF.TXT, MCVL2011COTIZA4_CDF.TXT, MCVL2011COTIZA5_CDF.TXT, MCVL2011COTIZA6_CDF.TXT, MCVL2011COTIZA7_CDF.TXT, MCVL2011COTIZA8_CDF.TXT, MCVL2011COTIZA9_CDF.TXT, MCVL2011COTIZA10_CDF.TXT, MCVL2011COTIZA11_CDF.TXT, MCVL2011COTIZA12_CDF.TXT, MCVL2011COTIZA13_CDF.TXT, MCVL2011DIVISION_CDF.TXT, MCVL2011FISCAL_CDF.TXT, MCVL2011PERSONAL_CDF.TXT, MCVL2011PRESTAC_CDF.TXT.

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2012 must be placed in the directory mcvl/mcvl_cdf_2012/: MCVL2012AFILIAD1_CDF.TXT, MCVL2012AFILIAD2_CDF.TXT, MCVL2012AFILIAD3_CDF.TXT, MCVL2012CONVIVIR_CDF.TXT, MCVL2012COTIZA1_CDF.TXT, MCVL2012COTIZA2_CDF.TXT, MCVL2012COTIZA3_CDF.TXT, MCVL2012COTIZA4_CDF.TXT, MCVL2012COTIZA5_CDF.TXT, MCVL2012COTIZA6_CDF.TXT, MCVL2012COTIZA7_CDF.TXT, MCVL2012COTIZA8_CDF.TXT, MCVL2012COTIZA9_CDF.TXT, MCVL2012COTIZA10_CDF.TXT, MCVL2012COTIZA11_CDF.TXT, MCVL2012COTIZA12_CDF.TXT, MCVL2012COTIZA13_CDF.TXT, MCVL2012DIVISION_CDF.TXT, MCVL2012FISCAL_CDF.TXT, MCVL2012PERSONAL_CDF.TXT, MCVL2012PRESTAC_CDF.TXT.

The following files contained in the DVD for Muestra Continua de Vidas Laborales con Datos Fiscales 2013 must be placed in the directory mcvl/mcvl_cdf_2013/: MCVL2012FISCAL_CDF.TXT, MCVL2012PRESTAC_CDF.TXT (note that these two files have the year 2012 in their names due to an error by the Social Security Administration when producing the DVDs, but they contain the 2013 data), MCVL2013AFILIAD1_CDF.TXT, MCVL2013AFILIAD2_CDF.TXT, MCVL2013AFILIAD3_CDF.TXT, MCVL2013AFILIAD4_CDF.TXT, MCVL2013CONVIVIR_CDF.TXT, MCVL2013COTIZA1_CDF.TXT, MCVL2013COTIZA2_CDF.TXT, MCVL2013COTIZA3_CDF.TXT, MCVL2013COTIZA4_CDF.TXT, MCVL2013COTIZA5_CDF.TXT, MCVL2013COTIZA6_CDF.TXT, MCVL2013COTIZA7_CDF.TXT, MCVL2013COTIZA8_CDF.TXT, MCVL2013COTIZA9_CDF.TXT, MCVL2013COTIZA10_CDF.TXT, MCVL2013COTIZA11_CDF.TXT, MCVL2013COTIZA12_CDF.TXT, MCVL2013COTIZA13_CDF.TXT, MCVL2013DIVISION_CDF.TXT, MCVL2013PERSONAL_CDF.TXT.
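Before running the code, it may help to check that every file has been copied to the right place. The following is a minimal sketch in Python, not part of the replication files, that reports which of the files listed above for the 2005 edition are missing from mcvl/mcvl_cdf_2005/. The function names are hypothetical, and the check assumes that file names keep exactly the upper and lower case shown above (which matters on Linux).

```python
from pathlib import Path


def expected_files_2005():
    """File names listed above for the 2005 edition (21 files)."""
    names = [f"AFILANON{i}.trs" for i in range(1, 4)]
    names += [f"COTIANON{m}.trs" for m in range(1, 13)]
    names += ["BLOQUE5.trs", "CONVIVI.trs", "CPROANON.trs",
              "DIVISION.trs", "PERSANON.trs", "PREANON.trs"]
    return names


def missing_files(directory, expected):
    """Return the expected file names not found in directory, sorted."""
    folder = Path(directory)
    # A directory that does not exist yet counts as having no files.
    present = {p.name for p in folder.iterdir()} if folder.is_dir() else set()
    return sorted(name for name in expected if name not in present)


# Report what is still missing for the 2005 edition (run from the
# directory where the replication files were unzipped).
print(missing_files("mcvl/mcvl_cdf_2005", expected_files_2005()))
```

The same check can be repeated for the other editions by replacing the list of expected names with the one given above for each year.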

Additional data

The following four required additional data files are provided, and can be found in the directory otherdata/:

ua_muni.dta.
ua_size.dta.
ua_geog.dta.
cpi_annual.dta.

The data file ua_muni.dta provides Spain's official urban area definitions, constructed by the Ministry of Housing in 2008 and maintained unchanged since then. The data file specifies the municipalities that make up each of Spain's urban areas and contains the following variables:

muni_id. Municipality code.
muni_name. Municipality name.
ua_id. Urban Area code.
ua_name. Urban Area name.

The data file ua_size.dta provides our size measures for Spain's urban areas. To measure the size of each urban area, we calculate the number of people within 10 kilometres of the average person in the urban area. We do so on the basis of the 1-kilometre-resolution population grid for Spain in 2006 created by Goerlich and Cantarino (2013). They begin with population data from Spain’s Continuous Census of Population (Padrón Continuo) at the level of the approximately 35,000 census tracts (áreas censales) that cover Spain. Within each tract, they allocate population to 1×1 kilometre cells based on the location of buildings as recorded in high-resolution remote sensing data. We take each 1×1 kilometre cell in the urban area, trace a circle of radius 10 kilometres around the cell (encompassing both areas inside and outside the urban area), count population in that circle, and average this count over all cells in the urban area weighting by the population in each cell. This yields the number of people within 10 kilometres of the average person in the urban area. See the article for a discussion of the advantages of using this measure of size for urban areas. We also construct a similar measure for the year 1900, which is used as an instrument in the estimations of table 3. For this purpose, we obtain historical population data from Goerlich, Mas, Azagra, and Chorén (2006) who construct decennial municipality population series using all available censuses from 1900 to 2001, keeping constant the areas of municipalities in 2001. As we do for current urban area size, we measure urban area size in 1900 with the number of people within 10 kilometres of the average person in the urban area. Since we lack a 1-kilometre-resolution population grid for 1900, we distribute population uniformly within the municipality when performing our historical size calculations. The data file contains the following variables:

ua_id. Urban Area code.
ua_name. Urban Area name.
size_pop10k. Urban Area size (people within 10km of average person), 2006.
size_pop10k_1900. Urban Area size (people within 10km of average person), 1900.
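The size measure described above can be illustrated with a minimal sketch in Python. This is not the code used to produce ua_size.dta. It assumes, for illustration only, that each grid cell is represented by the planar coordinates of its centre in kilometres and that a cell counts as within 10 kilometres when its centre does. The function and argument names are hypothetical.

```python
import math


def urban_area_size(cells, in_urban_area, radius_km=10.0):
    """People within radius_km of the average person in the urban area.

    cells: list of (x_km, y_km, population) for all grid cells, both
           inside and outside the urban area.
    in_urban_area: list of booleans, one per cell.
    """
    weighted_sum = 0.0
    urban_population = 0.0
    for (x, y, pop), inside in zip(cells, in_urban_area):
        if not inside or pop == 0:
            continue
        # Population in the circle around this cell, counting cells
        # inside and outside the urban area.
        nearby = sum(p for (cx, cy, p) in cells
                     if math.hypot(cx - x, cy - y) <= radius_km)
        # Weight the count by the population of the cell itself.
        weighted_sum += pop * nearby
        urban_population += pop
    return weighted_sum / urban_population


# Toy example: two urban cells and one cell outside the urban area.
toy_cells = [(0.0, 0.0, 100), (8.0, 0.0, 300), (16.0, 0.0, 50)]
toy_inside = [True, True, False]
print(urban_area_size(toy_cells, toy_inside))
```

In the toy example, the first cell has 400 people within 10 kilometres and the second has 450, so the population-weighted average is (100 × 400 + 300 × 450) / 400 = 437.5.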

The data file ua_geog.dta provides geographical variables for Spain's urban areas, which are used as instruments in the estimations of table 3. The data file contains the following variables:

ua_id. Urban Area code.
ua_name. Urban Area name.
pct25k_fertile. % of land within 25 kilometres of the city centre that has high potential quality. Potential land quality refers to the inherent physical quality of the land resources for agriculture, biomass production and vegetation growth, prior to any modern intervention such as irrigation. The source of the land quality data is the corine Project (Coordination of Information on the Environment), initiated by the European Commission in 1985 and later incorporated by the European Environment Agency into its work programme (European Environment Agency, 1990). We calculate the percentage of land within 25 kilometres of the city centre with high potential quality using Geographic Information Systems (gis). The city centre is defined as the centroid of the main municipality of the urban area (the municipality that gives the urban area its name or the most populated municipality when the urban area does not take its name from a municipality).
avg25k_elevation. Mean elevation within 25km of urban area centre (m.). This is calculated on the basis of elevation data from the Shuttle Radar Topographic Mission (Jarvis, Reuter, Nelson, and Guevara, 2008), which record elevation for points on a grid 3 arc-seconds apart (approximately 90 metres). Note the estimations use the natural logarithm of this variable as an instrument.
pct25k_water. % of area within 25km of urban area centre covered by water (oceans, rivers or lakes). Geographic information on the location of water bodies in and around urban areas is computed using gis and the digital map of Spain’s hydrography included in Goerlich, Mas, Azagra, and Chorén (2006).
pct25k_steep15. % of area within 25km of urban area centre with slope greater than 15%. Slope is calculated on the basis of elevation data from the Shuttle Radar Topographic Mission (Jarvis, Reuter, Nelson, and Guevara, 2008), which record elevation for points on a grid 3 arc-seconds apart.
roman_roads. Roman road rays crossing a circumference of 25km radius around the city centre. This is computed using gis and the digital map of Roman roads of McCormick, Huang, Zambotti, and Lavash (2008).

The data file cpi_annual.dta provides consumer price index data for Spain for the period 1980-2014, obtained from Spain's Instituto Nacional de Estadística, and contains the following variables:

year. Year.
cpi_change. Percentage variation in the Consumer Price Index between December of the current year and December of the previous year.
cpi_index2009. Consumer Price Index (2009 = 1).

Constructing the MCVL panel

The unit of observation in the social security data contained in the mcvl is any change in the individual’s labour market status or any variation in job characteristics (including changes in occupation or contractual conditions within the same firm). The data record all changes since the date of first employment, or since 1980 for earlier entrants. Using this information, together with the matched income tax and census records, we construct a panel with monthly observations for the period 2004-2009. This is done by running the Stata do file code/make_panel.do, which produces the Stata data file with the panel output/mcvl_panel.dta. This data file contains the following variables:

person_id. Individual anonymized identifier.
year. Year.
month. Month.
date. Sequential ordering of months (January 2004 - December 2009).
mcvl_wave. mcvl wave used to extract information.
sex. Sex (1 - Male; 2 - Female).
birth_date. Birth year-month (yyyymm).
age. Age (years).
birth_prov. Province of birth (2-digit code, province names provided as value labels).
ss_reg_prov. Province of social security registration (2-digit code, province names provided as value labels).
edu_belowsecondary. Less than secondary education indicator.
edu_secondary. Secondary education indicator.
edu_tertiary. University education indicator.
edu_code. Education mcvl code.
earnings. Total real daily earnings from tax returns (2009 cents of euro).
earnings_capped_ss. Top- and bottom-coded real daily earnings from social security records (2009 cents of euro).
days_worked_month. Days worked in current job in current month.
estab_id. Firm establishment anonymized identifier (based on secondary social security contribution account code).
firm_id. Firm anonymized identifier (based on firm tax identification code).
tenure. Tenure with current employer (calculated in days and expressed in years).
experience. Experience (calculated in days and expressed in years).
uexp_1_2. Experience in 1st-2nd biggest cities (calculated in days and expressed in years).
uexp_3_5. Experience in 3rd-5th biggest cities (calculated in days and expressed in years).
uexp_smaller. Experience outside 5 biggest cities (calculated in days and expressed in years).
uexp_current. Experience in the current city (calculated in days and expressed in years).
ua_id. Urban area of current job (2-digit code, urban area names provided as value labels).
muni_id. Municipality of current job (5-digit code).
prov_id. Province of current job (2-digit code, province names provided as value labels).
ua_1_2. Indicator that current job is in 1st-2nd biggest cities.
ua_3_5. Indicator that current job is in 3rd-5th biggest cities.
ua_1_5. Indicator that current job is in 1st-5th biggest cities.
ua_smaller. Indicator that current job is outside 5 biggest cities.
migrant_depart. Last observation prior to migration between urban areas.
migrant_arrive. First observation following migration between urban areas.
skill. Occupational skills (1 - Very-high-skilled occupation; 2 - High-skilled occupation; 3 - Medium-high-skilled occupation; 4 - Medium-low-skilled occupation; 5 - Low-skilled occupation).
occupation. Occupation (grupos de cotización: 1 - Engineers, college graduates and senior managers; 2 - Technical engineers and graduate assistants; 3 - Administrative and technical managers; 4 - Non-graduate assistants; 5 - Administrative officers; 6 - Subordinates; 7 - Administrative assistants; 8 - First and second class officers; 9 - Third class officers and technicians; 10 - Labourers).
sector2d. 2-digit sector code (cnae93, sector names provided as value labels).
sector3d. 3-digit sector code (cnae93).
contract_fixedterm. Fixed-term contract indicator.
contract_parttime. Part-time contract indicator.
date_1 - date_72. Sequential date indicators.
sector2d_15 - sector2d_95. 2-digit sector indicators (cnae93).
skill_1 - skill_5. Skill indicators.
ua_id_1 - ua_id_83. Urban area indicators.
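As an illustration of how the migrant_depart and migrant_arrive indicators listed above relate to ua_id, the following minimal sketch in Python flags the last observation before and the first observation after a change of urban area in one worker's chronologically ordered observations. It is not the code in code/panel/, and it assumes, for simplicity, that every observation has an urban area code. The function name is hypothetical.

```python
def migration_flags(ua_ids):
    """Return (depart, arrive) indicator lists for one worker.

    ua_ids: urban area codes of the worker's observations, in
            chronological order.
    """
    n = len(ua_ids)
    depart = [0] * n
    arrive = [0] * n
    for i in range(1, n):
        if ua_ids[i] != ua_ids[i - 1]:
            depart[i - 1] = 1  # last observation prior to migration
            arrive[i] = 1      # first observation following migration
    return depart, arrive


# Toy example: a worker who moves from area 3 to area 7 and back.
toy_depart, toy_arrive = migration_flags([3, 3, 7, 7, 7, 3])
print(toy_depart, toy_arrive)
```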

To construct this panel, for reasons explained above, we begin with the 2013 edition of the mcvl and extract social security records covering the complete labour market history of the individuals contained in this edition going back to their date of first employment as well as census records containing their personal characteristics. To reduce computation requirements, we process the data by birth-year cohorts.

For each worker, we combine consecutive job spells without gaps with the same employer into a single job spell, but keep track of changes in job characteristics within that job spell. An employer is identified in the data by two (anonymized) codes. One (estab_id in our panel data) is based on the contribution account code used by the Social Security Administration. Social Security legislation requires firms to keep separate contribution account codes for each province in which they conduct business. Furthermore, within a province, a municipality identification code is provided if the workplace establishment is located in a municipality with population greater than 40,000 inhabitants. We use this information to track not just the current workplace location but also cumulative experience in different locations or sets of locations. We also construct precise measures of tenure and experience, calculated as the actual number of days the individual has been employed, respectively, in the same establishment and overall. The second employer identifier (firm_id in our panel data) is based on the tax identification code assigned to each firm by the Tax Agency and is common to all establishments by the same firm.
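The merging of consecutive spells and the day-based tenure measure can be sketched as follows. This is a minimal illustration in Python, not the Stata code in code/panel/. It assumes that spell dates are inclusive and that two spells have no gap when the second starts no later than the day after the first ends. The conversion of days to years using 365.25 days per year is also an assumption made here, and all names are hypothetical.

```python
from datetime import date, timedelta


def merge_spells(spells):
    """Merge one worker's consecutive spells in the same establishment.

    spells: list of (estab_id, start, end) with inclusive dates.
    Returns the merged spells sorted by start date.
    """
    merged = []
    # Sort by establishment and start date so that spells in the same
    # establishment are adjacent even if other jobs overlap in time.
    for estab_id, start, end in sorted(spells, key=lambda s: (s[0], s[1])):
        if (merged and merged[-1][0] == estab_id
                and start <= merged[-1][2] + timedelta(days=1)):
            previous = merged[-1]
            merged[-1] = (estab_id, previous[1], max(previous[2], end))
        else:
            merged.append((estab_id, start, end))
    return sorted(merged, key=lambda s: s[1])


def days_employed(spell):
    """Actual number of days covered by a spell (inclusive dates)."""
    _, start, end = spell
    return (end - start).days + 1


def days_to_years(days, days_per_year=365.25):
    """Express days in years; the divisor is an assumption."""
    return days / days_per_year


toy_spells = [
    ("A", date(2004, 1, 1), date(2004, 6, 30)),
    ("A", date(2004, 7, 1), date(2004, 12, 31)),  # no gap: merged
    ("B", date(2005, 1, 1), date(2005, 2, 28)),
    ("A", date(2005, 3, 1), date(2005, 3, 31)),   # gap: kept separate
]
toy_merged = merge_spells(toy_spells)
print(toy_merged)
print(days_to_years(days_employed(toy_merged[0])))
```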

Since the monthly panel we construct records at most one job per individual and month, when the individual performs more than one job in the month, we record the main job. Typical instances of simultaneous jobs feature one long-term job and some occasional short-term jobs that last a week or even a few days, and in such cases we give precedence to the longest job spell. When an individual switches jobs partway through the month, we select the main job as that with the highest total monthly earnings reported in mcvl. Once we identify the main job in a month, we keep all characteristics associated with this job, such as (anonymized) establishment identifier, worker occupation and type of contract.
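The selection of the main job can be sketched as follows. This is a minimal illustration in Python, not the Stata code in code/panel/. How simultaneous jobs are told apart from a job switch is an assumption made here: jobs are treated as simultaneous when their days within the month overlap. The dictionary keys and the function name are hypothetical.

```python
def main_job(jobs):
    """Pick the main job among one worker's jobs in a given month.

    jobs: list of dicts with keys
        estab_id         establishment identifier
        first_day        first day of the month worked in the job
        last_day         last day of the month worked in the job
        spell_days       total length of the job spell, in days
        monthly_earnings earnings in the job during the month
    """
    if len(jobs) == 1:
        return jobs[0]
    overlapping = any(
        a["first_day"] <= b["last_day"] and b["first_day"] <= a["last_day"]
        for i, a in enumerate(jobs)
        for b in jobs[i + 1:]
    )
    if overlapping:
        # Simultaneous jobs: precedence to the longest job spell.
        return max(jobs, key=lambda job: job["spell_days"])
    # Job switch during the month: highest total monthly earnings.
    return max(jobs, key=lambda job: job["monthly_earnings"])


simultaneous = [
    {"estab_id": "long", "first_day": 1, "last_day": 31,
     "spell_days": 900, "monthly_earnings": 150000},
    {"estab_id": "short", "first_day": 10, "last_day": 12,
     "spell_days": 3, "monthly_earnings": 20000},
]
switch = [
    {"estab_id": "old", "first_day": 1, "last_day": 10,
     "spell_days": 400, "monthly_earnings": 60000},
    {"estab_id": "new", "first_day": 11, "last_day": 31,
     "spell_days": 21, "monthly_earnings": 140000},
]
print(main_job(simultaneous)["estab_id"], main_job(switch)["estab_id"])
```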

After processing the social security and census records of individuals contained in the 2013 edition of the mcvl, we turn to the 2012 edition and extract the social security and census records of the individuals contained in this edition but absent from the 2013 edition. Since the criterion for inclusion in the mcvl and the individual’s anonymized identifier are maintained across mcvl editions, these are individuals who ceased affiliation with the Social Security during 2012. We do the same for the 2011, 2010, 2009, 2008, 2007, 2006 and 2005 editions in this order. The 2004 edition, due to an error in the design of the source data files (corrected in later editions), includes an anonymized employer code based on the firm's tax identification code only in the source data file with the tax records and not in the source data files with the social security records. Thus, we do not extract social security records from the 2004 edition. However, since most of the relevant individuals included in the 2004 edition are also included in later editions, and since these later editions contain their entire social security records and the anonymized firm identifier required to match the 2004 income tax data, we can still extend our panel to include uncensored earnings for 2004.
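The backward pass over editions can be sketched as follows. This is a minimal illustration in Python, not the Stata code in code/panel/. It relies on the fact, stated above, that the anonymized identifier is maintained across editions. The data structure and the function name are hypothetical.

```python
def combine_editions(editions):
    """Keep each individual's records from the latest edition
    in which that individual appears.

    editions: dict mapping edition year to a dict that maps
              person_id to that person's records in that edition.
    Returns a dict mapping person_id to (edition year, records).
    """
    combined = {}
    # Start with the most recent edition and move backwards, adding
    # only individuals absent from all later editions.
    for year in sorted(editions, reverse=True):
        for person_id, records in editions[year].items():
            if person_id not in combined:
                combined[person_id] = (year, records)
    return combined


toy_editions = {
    2013: {"a": "records_a_2013", "b": "records_b_2013"},
    2012: {"a": "records_a_2012", "c": "records_c_2012"},
    2011: {"c": "records_c_2011", "d": "records_d_2011"},
}
toy_combined = combine_editions(toy_editions)
print({pid: year for pid, (year, _) in toy_combined.items()})
```

The edition year stored with each individual plays the role of the mcvl_wave variable in the panel.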

Next we assign a sector code to each observation in the panel. The 2004-2008 editions of the mcvl contain, for each job spell going back to 1993, a sector code from the cnae 93 sectoral classification that corresponds to the establishment's sector at the time when that mcvl edition was produced (but not necessarily at the time of the job spell) or, for establishments that are no longer active, the sector at the time they were last observed. The 2009 edition contains a sector code from the newer cnae 09 sectoral classification, but not a code from the cnae 93 classification, and only for establishments that were still active in 2009. The 2010-2013 editions of the mcvl contain sector codes from both the cnae 93 and the cnae 09 classifications. The cnae 93 code corresponds to the establishment's sector in 2009 or, for establishments that are no longer active, the sector at the time they were last observed. The cnae 09 code corresponds to the establishment's sector at the time when that mcvl edition was produced (but not necessarily at the time of the job spell) or, for establishments that are no longer active but were active at some point since 2009, the sector at the time they were last observed.

To provide consistent sector codes from the cnae 93 classification that reflect as accurately as possible the establishment's sector at the time of the job spell, we combine the information in all the mcvl editions. For job spells in 2004-2008, we use the cnae 93 sector code in the mcvl edition of the same year as the job spell. For job spells in 2009-2013 in establishments that were active in both 2009 and 2010, we take the establishment's cnae 93 code from the 2010 mcvl edition (which reflects the establishment's cnae 93 sector in 2009). For job spells in 2009-2013 in establishments that were not active in both 2009 and 2010, but were active in 2008 or earlier, we take the establishment's cnae 93 code from the 2008 mcvl edition (or the closest earlier mcvl edition when the establishment was active). For establishments that were active only in 2009 and for establishments created after 2009, we assign them the modal cnae 93 code of other establishments with the same cnae 09 code.
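The last step, assigning the modal cnae 93 code of other establishments with the same cnae 09 code, can be sketched as follows. This is a minimal illustration in Python, not the Stata code in code/panel/. The dictionary keys and function names are hypothetical, and how ties between equally frequent codes are resolved is not specified above (here the code that appears first in the data wins).

```python
from collections import Counter, defaultdict


def modal_cnae93_by_cnae09(establishments):
    """Most frequent cnae93 code among establishments with each cnae09 code."""
    counts = defaultdict(Counter)
    for est in establishments:
        if est["cnae93"] is not None and est["cnae09"] is not None:
            counts[est["cnae09"]][est["cnae93"]] += 1
    return {code09: counter.most_common(1)[0][0]
            for code09, counter in counts.items()}


def fill_missing_cnae93(establishments):
    """Return copies of the establishments with missing cnae93 filled in."""
    modal = modal_cnae93_by_cnae09(establishments)
    filled = []
    for est in establishments:
        est = dict(est)
        if est["cnae93"] is None:
            # Stays missing if no other establishment shares the cnae09 code.
            est["cnae93"] = modal.get(est["cnae09"])
        filled.append(est)
    return filled


# Toy data with made-up sector codes.
toy_establishments = [
    {"estab_id": 1, "cnae09": "10", "cnae93": "15"},
    {"estab_id": 2, "cnae09": "10", "cnae93": "15"},
    {"estab_id": 3, "cnae09": "10", "cnae93": "16"},
    {"estab_id": 4, "cnae09": "10", "cnae93": None},  # created after 2009
]
toy_filled = fill_missing_cnae93(toy_establishments)
print(toy_filled[3]["cnae93"])
```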

Finally, we assign uncensored earnings from income tax records to each observation in the panel. Each source of labour income can be matched between income tax records and social security records based on both employee and employer (anonymized) identifiers. In addition to uncensored earnings from income tax records, the mcvl contains earnings data from social security records going back to 1980. These alternative earnings data are either top- or bottom-coded for about 13% of observations. We therefore use the income tax data to compute monthly earnings, since these are completely uncensored. We express all earnings in real terms, using the consumer price index to obtain equivalent 2009 cents of euro. Given that labour earnings data from income tax records provide a value for each worker, firm and year, while the social security records provide a (top- and bottom-coded) value for each worker, establishment and month, we allocate an individual's annual labour earnings in a firm from income tax records by splitting them across months in proportion to the share of social security earnings for that worker, firm, and year that fall within that particular month. Finally, we express earnings in daily terms by dividing monthly earnings in a job by the days worked in that month in that job.
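The allocation of annual earnings across months can be sketched as follows. This is a minimal illustration in Python, not the Stata code in code/panel/. It takes one worker, firm and year, and assumes that the price index is expressed as in cpi_index2009 (2009 = 1), so that dividing by it gives 2009 values. The function and argument names are hypothetical.

```python
def real_daily_earnings(annual_tax_earnings, ss_monthly_earnings,
                        days_worked, cpi_index2009):
    """Real daily earnings by month for one worker, firm and year.

    annual_tax_earnings: annual earnings from income tax records (cents).
    ss_monthly_earnings: dict month -> social security earnings (cents).
    days_worked: dict month -> days worked in the job in that month.
    cpi_index2009: price index for the year (2009 = 1).
    """
    total_ss = sum(ss_monthly_earnings.values())
    if total_ss <= 0:
        raise ValueError("no social security earnings to compute shares")
    result = {}
    for month, ss_earnings in ss_monthly_earnings.items():
        # Share of the year's social security earnings in this month.
        share = ss_earnings / total_ss
        monthly_nominal = annual_tax_earnings * share
        monthly_real = monthly_nominal / cpi_index2009
        result[month] = monthly_real / days_worked[month]
    return result


toy_daily = real_daily_earnings(
    annual_tax_earnings=1200000,
    ss_monthly_earnings={1: 100000, 2: 100000, 3: 200000},
    days_worked={1: 30, 2: 30, 3: 30},
    cpi_index2009=1.0,
)
print(toy_daily)
```

In the toy example the three months receive 25%, 25% and 50% of the annual amount, which gives daily earnings of 10,000, 10,000 and 20,000 cents.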

After combining the social security and income tax records, our monthly panel covers job spells in 2004-2009 for individuals aged 18 and over, born since 1962, and employed at any point between January 2004 and December 2009. Additional sample restrictions, detailed in the article, are imposed in the file code/panel/mcvl_panel.do to obtain the final panel output/mcvl_panel.dta.

Replicating the tables and figures

After obtaining the Muestra Continua de Vidas Laborales con Datos Fiscales data from Spain's Dirección General de Ordenación de la Seguridad Social, and running code/make_panel.do to construct the monthly panel output/mcvl_panel.dta, interested users can replicate all of the tables and figures in the published article by running code/make_esurban.do. This calls a series of Stata do files contained in the directory code/analysis/. The order in which these files are called within code/make_esurban.do matters: code/analysis/esurban_table3.do must be run after code/analysis/esurban_table2.do; code/analysis/esurban_table6.do must be run after code/analysis/esurban_table4.do; and code/analysis/esurban_figures.do must be run after code/analysis/esurban_table1.do, code/analysis/esurban_table2.do, and code/analysis/esurban_table4.do.

After running code/make_esurban.do, compiling in LaTeX the file output/esurban_results.tex produces a PDF file with all the tables and figures put together.
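The two-step workflow described above can be run from the Stata command window as follows. This sketch assumes that the current working directory is the root of the unzipped replication files and that the mcvl source data are already in place. Whether the master do files expect this working directory is an assumption, so check the paths set at the top of each file.

```stata
* Assumes the working directory is the root of the unzipped replication
* files (an assumption; adjust to the paths set in the master do files).
do code/make_panel.do      // builds the monthly panel output/mcvl_panel.dta
do code/make_esurban.do    // writes LaTeX tables and PDF figures to output/
```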

Software and hardware notes

All of the tables and figures in the Review of Economic Studies article have been produced using the code provided, Stata version 14, Stat/Transfer version 13, and sas version 9.3 on a computer running the Linux operating system (although we have also successfully tested the Stata code under Mac OS X and Windows). The following additional Stata packages are required:

Instrumental variables and GMM: Estimation and testing, by Christopher F. Baum, Mark E. Schaffer, and Steven Stillman. To install, run the following command in Stata: net install st0030_3, from(http://www.stata-journal.com/software/sj7-4).
Est2tex: module to create LaTeX tables from estimation results, by Marc-Andreas Muendler. To install, run the following command in Stata: net install est2tex, from(http://fmwww.bc.edu/RePEc/bocode/e).
Parallel: module for Parallel Computing, by George Vega Yon and Brian Quistorff. To install, run the following command in Stata: net install parallel, from(http://fmwww.bc.edu/RePEc/bocode/p).
Stcmd: module to execute Stat/Transfer command from within Stata. To install, run the following command in Stata: net install stcmd, from(http://fmwww.bc.edu/RePEc/bocode/s).
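For convenience, the four installation commands listed above can be pasted into Stata together. The final line is an optional addition, not part of the original instructions, that lists the user-written packages Stata has installed so that the result can be checked.

```stata
* Install the four required user-written packages (commands as listed above)
net install st0030_3, from(http://www.stata-journal.com/software/sj7-4)
net install est2tex, from(http://fmwww.bc.edu/RePEc/bocode/e)
net install parallel, from(http://fmwww.bc.edu/RePEc/bocode/p)
net install stcmd, from(http://fmwww.bc.edu/RePEc/bocode/s)

* Optional check: list the user-written packages currently installed
ado dir
```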

The code has been written to be as portable as possible. Nevertheless, if running the code on a different type of system, the following considerations should be kept in mind:

sas and Stat/Transfer: sas and Stat/Transfer are only required to produce table 6, which uses the sas code by Combes, Duranton, Gobillon, Puga, and Roux (2012) documented at https://diegopuga.org/data/selectagg/. The path to the sas executable must be specified in line 20 in code/analysis/esurban_table6.do. sas is available for Linux and Windows, but not for Mac OS X. Mac users will need to run the sas code for table 6 under virtualization software or on a different computer. Otherwise, users who do not have access to sas can replicate all the results except for table 6 using Stata alone simply by commenting out line 30 in code/make_esurban.do.
Stata: The code has been tested with Stata versions 14 and 15. For the purpose of running the code provided, there are two relevant differences relative to Stata version 13. First, Stata 14 can read unicode files (if using Stata 13, one can convert the provided *.do and *.dta files from unicode to plain ascii, but some characters may not display correctly). Second, Stata 14 uses the 64-bit Mersenne-Twister generator (mt19937-64) as its default pseudo-random number generator, instead of the 32-bit kiss generator (thus, Stata 13 would yield slightly different bootstrapped standard errors).
Operating system: None of the Stata code is operating-system specific. However, lines 13-25 in mcvl/dofiles/reading_mcvl_2004.do use the sed command to deal with the fact that variables in the 2004 edition of the mcvl are delimited differently than in later editions. The sed command is available as part of any UNIX-based operating system, including Linux and Mac OS X. Windows users can obtain a native port of sed from http://unxutils.sourceforge.net.
Hardware: For best results, we recommend running the replication code on a computer with 8 or more processor cores, at least 64Gb of memory and at least 200Gb of free disk space (this includes the disk space required to store the mcvl data). However, we have written the replication files so that they can be used on systems with lower processor and memory specifications. When fewer processor cores and/or less memory are available, we recommend editing line 24 in the file code/analysis/esurban_table4.do from local bsjobs = 8 to set this local variable to a lower number (anything down to local bsjobs = 2 will work). This will reduce the number of parallel jobs launched to compute the bootstrapped standard errors for table 4 (by default the 100 bootstrap iterations are split into 8 parallel jobs to speed up computations when the hardware allows for this). Note that this edit will result in slightly different bootstrapped standard errors because the pseudo-random number generator will not be using the same seeds as when running the code exactly as provided.
Time: Running the replication code on the computer we used throughout the project, with two 4-core Intel Xeon X7560 processors and 256Gb of 1066 MHz ddr3 memory, took 21 hours of runtime to produce the panel and another 2 days and 15 hours of runtime to produce the tables and figures. A more recent run, on a computer with one 10-core Intel Xeon W2155 processor and 128Gb of 2666 MHz ddr4 memory, took 10 hours of runtime to produce the panel and another 27 hours of runtime to produce the tables and figures.
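Two of the considerations above lend themselves to a short illustration. The first two commands are a check that is not part of the replication files: they display the Stata version and the pseudo-random number generator in use (c(rng_current) is available from Stata 14 onwards). The last line shows what the edited line 24 of code/analysis/esurban_table4.do might look like on a machine with fewer cores. The value 4 is only an example.

```stata
* Check the running Stata version and the active random-number generator
* (not part of the replication files; c(rng_current) requires Stata 14+)
display c(stata_version)
display "`c(rng_current)'"

* Line 24 of code/analysis/esurban_table4.do as distributed:
*   local bsjobs = 8
* Example edit for a machine with fewer cores or less memory
* (any value down to 2 works; 4 is chosen here only for illustration):
local bsjobs = 4
```

As noted above, any value other than 8 changes the seeds used across parallel jobs, so the bootstrapped standard errors for table 4 will differ slightly from the published ones.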

References

Combes, Pierre-Philippe, Gilles Duranton, Laurent Gobillon, Diego Puga, and Sébastien Roux. 2012. The productivity advantages of large cities: Distinguishing agglomeration from firm selection. Econometrica 80(6): 2543-2594.

De la Roca, Jorge and Diego Puga. 2017. Learning by working in big cities. Review of Economic Studies 84(1): 106-142.

European Environment Agency. 1990. corine Land Quality Project. Copenhagen: European Environment Agency.

Goerlich, Francisco J. and Isidro Cantarino. 2013. A population density grid for Spain. International Journal of Geographical Information Science 27(12): 2247-2263.

Goerlich, Francisco J., Matilde Mas, Joaquín Azagra, and Pilar Chorén. 2006. La localización de la población española sobre el territorio. Un siglo de cambios: un estudio basado en series homogéneas. Bilbao, Spain: bbva Foundation.

Jarvis, Andrew, Hannes Isaak Reuter, Andrew Nelson, and Edward Guevara. 2008. Hole-filled Seamless srtm Data, Version 4.1. Cali, Colombia: International Centre for Tropical Agriculture.

McCormick, Michael, Guoping Huang, Giovanni Zambotti, and Jessica Lavash. 2008. Roman Road Network. Cambridge, ma: Digital Atlas of Roman and Medieval Civilizations (darmc), Center for Geographic Analysis, Harvard University.