Code for Distinguishing Agglomeration from Firm Selection

Replication files for 'The productivity advantages of large cities: Distinguishing agglomeration from firm selection'

by Pierre-Philippe Combes, Gilles Duranton, Laurent Gobillon, Diego Puga, and Sébastien Roux

This site distributes and documents computer programs that implement the estimation methodology developed by Pierre-Philippe Combes, Gilles Duranton, Laurent Gobillon, Diego Puga, and Sébastien Roux in their article 'The productivity advantages of large cities: Distinguishing agglomeration from firm selection', published in Econometrica 80(6), November 2012: 2543-2594.

Firms are more productive on average in larger cities. Two main explanations have been offered: firm selection (larger cities toughen competition, allowing only the most productive to survive) and agglomeration economies (larger cities promote interactions that increase productivity). If selection is tougher in larger cities, fewer of the weaker firms will survive there. Stronger selection should thus lead to a greater left truncation of the distribution of firm log productivity in larger cities. If agglomeration economies are stronger in larger cities, all firms located there will enjoy some productive advantages, with perhaps some benefiting more than others. Stronger agglomeration effects in larger cities should thus lead instead to a greater rightwards shift of the distribution of firm log productivity in larger cities. To the extent that more productive firms are better able to reap the benefits of agglomeration, agglomeration should also lead to an increased dilation of the distribution of firm log productivity in larger cities. While these properties should hold generally, the paper provides a nested model of selection and agglomeration that helps interpret the empirical results.
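The three transformations described above can be illustrated with a small simulation. The following is a minimal Python sketch, not part of the distributed SAS code: the function name `transform`, the normal baseline distribution, and the parameter values are assumptions chosen purely for illustration.

```python
import random
import statistics


def transform(log_productivity, shift=0.0, dilation=1.0, truncation=0.0):
    """Drop the lowest `truncation` share of the sample, then dilate and shift it.

    Hypothetical helper for illustration only; it is not the authors' code.
    """
    values = sorted(log_productivity)
    n_drop = int(truncation * len(values))  # weakest firms that do not survive
    survivors = values[n_drop:]
    return [dilation * v + shift for v in survivors]


random.seed(0)
# Assumed baseline: standard normal log productivity in a "small city".
small_city = [random.gauss(0.0, 1.0) for _ in range(10_000)]

shifted = transform(small_city, shift=0.1)           # common productivity gain
dilated = transform(small_city, dilation=1.2)        # gains rise with productivity
truncated = transform(small_city, truncation=0.05)   # tougher selection

for name, sample in [("baseline", small_city), ("shift", shifted),
                     ("dilation", dilated), ("truncation", truncated)]:
    print(name,
          round(statistics.mean(sample), 3),
          round(statistics.stdev(sample), 3),
          round(min(sample), 3))
```

In this sketch a shift moves the mean and leaves the spread unchanged, a dilation stretches the spread, and a truncation removes the lower tail, which raises both the minimum and the mean.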

The article then develops an empirical methodology to estimate the extent to which the log productivity distribution in larger cities is left-truncated (evidence of differences in selection effects) or dilated and right-shifted (evidence of common productivity advantages) compared to the log productivity distribution in smaller cities. This is applied to French establishment-level total factor productivity data.
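One way to write down what truncation, dilation and shift do to a distribution is sketched below. The notation is an assumption made for this page (F for the cumulative distribution of log productivity in smaller cities, G for the transformed distribution, S for the truncated share, D > 0 for the dilation factor, A for the shift); the article and its supplemental material give the authors' exact formulation.

```latex
% Removing the lowest share S of F, with 0 <= S < 1, gives the truncated distribution
\[
  F_{S}(x) = \max\left\{ 0,\; \frac{F(x) - S}{1 - S} \right\}.
\]
% Mapping each surviving x to \varphi = A + D x, with D > 0, then gives
\[
  G(\varphi) = F_{S}\!\left( \frac{\varphi - A}{D} \right)
             = \max\left\{ 0,\; \frac{F\!\left( \frac{\varphi - A}{D} \right) - S}{1 - S} \right\}.
\]
```

With S = 0, D = 1 and A = 0 the two distributions coincide, so each parameter measures a departure from identical distributions.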

While the administrative establishment-level data used in the article cannot be posted here, we make available the computer code required to implement our empirical methodology. We also provide an anonymized data set with ordinary least squares log total factor productivity estimates that can be used to replicate the baseline point estimates in the article (the log productivity data provided are regression residuals, so no actual data for individual establishments are revealed). More generally, the computer code is written so that it can easily be used to apply the same methodology to find the combination of left-truncation, shift and dilation that, when applied to one distribution, best approximates a second distribution. A detailed description of the methodology is available in the article and in the supplemental material that accompanies it. Researchers interested in accessing the complete source data need to obtain permission from France's Comité du secret statistique. The Comité can be contacted at comite-secret@cnis.fr. The relevant forms and details on the application procedure can be found on the website of the Conseil national de l'information statistique (http://www.cnis.fr/cms/Accueil/activites/_trois_comites/Comite_du_secret_statistique).

The files needed to implement this methodology are available for download from this site as a zip file: cdgpr_replication.zip (2,769 Kb). This contains:

Usage with the included replication data

To replicate the baseline point estimates in the article 'The productivity advantages of large cities: Distinguishing agglomeration from firm selection', download the zip file cdgpr_replication.zip, uncompress it, and place all the included files in the same directory (e.g., c:\yourdirectory). Edit line 15 of cdgprmainprogram.sas so that it points to the directory where you placed the files on your system. For example:
%let basedir=c:\yourdirectory; /* base directory */
Run cdgprmainprogram.sas in SAS. This will produce as output a SAS data file resultscdgprdata.sas7bdat with 6 rows and 12 columns. The 12 columns are:

The 6 rows correspond to the following cases:

This reproduces the point estimates of Table I and Table II in the article. Standard errors are not produced automatically: they also need to account for variation in the productivity estimation stage of the methodology, which requires access to the full establishment-level data used to estimate log total factor productivity. See the section 'Calculating bootstrapped standard errors' below on this page for a description of how to do this.

Usage with other data

The code provided can also be used to find the combination of left-truncation, shift and dilation that, when applied to one distribution, best approximates a second distribution. The source data set to which you wish to apply the methodology, in SAS data format, needs to be specified in line 16 of cdgprmainprogram.sas. For example:
%let data=yourdata; /* source data set (extension to be sas7bdat) */
This must include the following variables:

Whenever one estimates firm-level productivity, measurement errors are likely to result in a few extreme outliers. To minimise the impact of such outliers on our estimates, we exclude the 1 percent of observations with the highest productivity values and the 1 percent of observations with the lowest productivity values in each employment area density class. It is important to trim extreme values in both classes to avoid biasing the estimate of S. In other applications, it may be advisable to trim a different percentage of extreme observations. This can be set in line 18 of cdgprmainprogram.sas. For example, to trim 2% instead of 1% on each extreme of each of the two distributions:
%let trim=2; /* defines the percentage of observations to be trimmed on each extreme of each of the two distributions; set 1 for 1%, 2 for 2% etc */
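The trimming rule described above can be sketched as follows. This is an illustrative Python sketch rather than the SAS implementation: the function names, the (class, value) record layout, and the rounding of the number of trimmed observations are all assumptions.

```python
def trim_extremes(values, percent):
    """Drop `percent` percent of observations from each tail of one distribution."""
    ordered = sorted(values)
    n_cut = int(len(ordered) * percent / 100)  # assumed rounding: truncate towards zero
    return ordered[n_cut:len(ordered) - n_cut]


def trim_by_class(records, percent=1):
    """Trim each class separately.

    `records` is an iterable of (density_class, log_productivity) pairs,
    a hypothetical layout chosen for this example.
    """
    by_class = {}
    for density_class, value in records:
        by_class.setdefault(density_class, []).append(value)
    return {cls: trim_extremes(vals, percent) for cls, vals in by_class.items()}


# Toy data: 200 observations in one class and 100 in the other.
records = [("low", i / 10) for i in range(200)] + [("high", 1 + i / 10) for i in range(100)]
trimmed = trim_by_class(records, percent=1)
print({cls: len(vals) for cls, vals in trimmed.items()})
```

Trimming is applied within each class, so the same percentage is removed from both tails of both distributions being compared.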

Running cdgprmainprogram.sas in SAS will produce as output a SAS data file resultsyourdata.sas7bdat (if your source data set is named yourdata.sas7bdat), with 6 rows for each of the variables you specified and the same 12 columns described above.
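To give a sense of what finding the combination of left-truncation, shift and dilation that best approximates a second distribution involves, here is a deliberately simplified Python sketch. It is not the estimator implemented in cdgprmainprogram.sas: it swaps in a grid search over non-negative truncation shares combined with a least-squares fit of quantiles, and all function names and parameter values are assumptions made for this example.

```python
import random


def quantiles_at(sorted_values, ranks):
    """Empirical quantiles by linear interpolation at ranks in [0, 1)."""
    n = len(sorted_values)
    out = []
    for u in ranks:
        pos = u * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        out.append(sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo]))
    return out


def fit_shift_dilation(x, y):
    """Least-squares fit of y = A + D * x; returns (A, D, sum of squared residuals)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    d = sxy / sxx
    a = mean_y - d * mean_x
    ssr = sum((b - a - d * c) ** 2 for c, b in zip(x, y))
    return a, d, ssr


def estimate(reference, target, s_grid, n_ranks=199):
    """Grid search over truncation shares S >= 0; A and D are fitted for each S."""
    ref, tgt = sorted(reference), sorted(target)
    ranks = [(k + 1) / (n_ranks + 1) for k in range(n_ranks)]
    q_target = quantiles_at(tgt, ranks)
    best = None
    for s in s_grid:
        # Rank u in the truncated reference corresponds to rank s + (1 - s) * u
        # in the untruncated reference.
        q_ref = quantiles_at(ref, [s + (1 - s) * u for u in ranks])
        a, d, ssr = fit_shift_dilation(q_ref, q_target)
        if best is None or ssr < best[3]:
            best = (a, d, s, ssr)
    return best


random.seed(1)
reference = [random.gauss(0.0, 1.0) for _ in range(20_000)]
# The target is built from the reference sample itself, so the known
# parameters (S = 0.10, D = 1.2, A = 0.3) should be recovered closely.
survivors = sorted(reference)[2_000:]
target = [0.3 + 1.2 * v for v in survivors]

grid = [k / 100 for k in range(0, 21)]  # candidate truncation shares 0.00 to 0.20
a_hat, d_hat, s_hat, ssr = estimate(reference, target, grid)
print(a_hat, d_hat, s_hat)
```

Because an increasing linear map preserves ranks, the shift and dilation can be fitted by a simple regression of one set of quantiles on the other once a truncation share is fixed, which is why only S is searched over a grid in this sketch.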

Calculating bootstrapped standard errors

Standard errors of the estimated parameters are bootstrapped by drawing, with replacement, observations for some establishments out of the log productivity distribution. For each bootstrap iteration, we first reestimate log productivity for each observation employed in the iteration, and we then reestimate Â, D̂, and Ŝ. Finally, we use the distribution of estimates of Â, D̂, and Ŝ that results from all bootstrap iterations to compute the standard errors. Given that cdgprmainprogram.sas can be run on multiple variables simultaneously by editing line 15, one can make the first variable be log productivity estimated with all establishments, and then specify another 100 variables with bootstrapped log productivity estimates. Calculating the standard deviation of the estimated coefficients across the 100 bootstrapped iterations with the same values of the shift, dilation, and truncation indicators yields standard errors.
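The bootstrap logic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: `bootstrap_se` and `estimate_fn` are hypothetical names, and the toy estimator simply takes a mean, standing in for the two-stage reestimation of log productivity and of the parameters.

```python
import random
import statistics


def bootstrap_se(establishment_ids, estimate_fn, n_iterations=100, seed=0):
    """Standard deviation of each estimated parameter across bootstrap resamples.

    `estimate_fn` is a hypothetical callable that takes a list of resampled
    establishment identifiers and returns a dict of parameter estimates. In a
    real application it would reestimate log productivity on the resample and
    then reestimate the shift, dilation and truncation parameters.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_iterations):
        # Draw establishments with replacement, keeping the sample size fixed.
        resample = rng.choices(establishment_ids, k=len(establishment_ids))
        draws.append(estimate_fn(resample))
    return {name: statistics.stdev([d[name] for d in draws]) for name in draws[0]}


# Toy stand-in data and estimator, for illustration only.
toy_rng = random.Random(42)
toy_productivity = {est_id: toy_rng.gauss(0.0, 1.0) for est_id in range(500)}


def toy_estimate(ids):
    return {"A": statistics.mean([toy_productivity[i] for i in ids])}


standard_errors = bootstrap_se(list(toy_productivity), toy_estimate)
print(standard_errors)
```

The sketch uses the sample standard deviation across iterations; whether that exact convention matches the one used for the published standard errors is an assumption.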

References

Combes, Pierre-Philippe, Gilles Duranton, Laurent Gobillon, Diego Puga, and Sébastien Roux. 2012. The productivity advantages of large cities: Distinguishing agglomeration from firm selection. Econometrica 80(6): 2543-2594.