# Code for Distinguishing Agglomeration from Firm Selection

## Replication files for 'The productivity advantages of large cities: Distinguishing agglomeration from firm selection'

#### by Pierre-Philippe Combes, Gilles Duranton, Laurent Gobillon, Diego Puga, and Sébastien Roux

This site distributes and documents computer programs that implement the estimation methodology developed by Pierre-Philippe Combes, Gilles Duranton, Laurent Gobillon, Diego Puga, and Sébastien Roux in their article **'The productivity advantages of large cities: Distinguishing agglomeration from firm selection'**, published in *Econometrica* 80(6), November 2012: 2543-2594.

Firms are more productive on average in larger cities. Two main explanations have been offered: firm selection (larger cities toughen competition, allowing only the most productive to survive) and agglomeration economies (larger cities promote interactions that increase productivity). If selection is tougher in larger cities, fewer of the weaker firms will survive there. Stronger selection should thus lead to a greater left truncation of the distribution of firm log productivity in larger cities. If agglomeration economies are stronger in larger cities, all firms located there will enjoy some productive advantages, with perhaps some benefiting more than others. Stronger agglomeration effects in larger cities should thus lead instead to a greater rightwards shift of the distribution of firm log productivity in larger cities. To the extent that more productive firms are better able to reap the benefits of agglomeration, agglomeration should also lead to an increased dilation of the distribution of firm log productivity in larger cities. While these properties should hold generally, the paper provides a nested model of selection and agglomeration that helps interpret the empirical results.

The article then develops an empirical methodology to estimate the extent to which the log productivity distribution in larger cities is left-truncated (evidence of differences in selection effects) or dilated and right-shifted (evidence of common productivity advantages) compared to the log productivity distribution in smaller cities. This is applied to French establishment-level total factor productivity data.
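The core idea can be sketched outside SAS. The following Python snippet is an illustration under simplifying assumptions, not the authors' SAS implementation: it applies a left-truncation *S*, dilation *D*, and shift *A* to one sample and measures how closely the result matches a second sample using a simple quantile-based discrepancy (a stand-in for the criterion *M* defined precisely in the article).

```python
import numpy as np

def transform(x, A=0.0, D=1.0, S=0.0):
    """Left-truncate the lowest share S of a sample, then dilate by D
    and shift by A (illustrative; parameter names follow the article)."""
    x = np.sort(np.asarray(x, dtype=float))
    keep = x[int(np.ceil(S * len(x))):]  # drop the lowest S share
    return A + D * keep

def quantile_gap(x, y, n_q=99):
    """Mean squared gap between the quantiles of two samples -- a simple
    stand-in for the article's criterion M, not the exact estimator."""
    q = np.linspace(0.01, 0.99, n_q)
    return float(np.mean((np.quantile(x, q) - np.quantile(y, q)) ** 2))

rng = np.random.default_rng(0)
small = rng.normal(0.0, 1.0, 10_000)           # small-city log TFP
big = transform(rng.normal(0.0, 1.0, 10_000),  # larger cities: shifted,
                A=0.1, D=1.1, S=0.05)          # dilated, left-truncated

# The untransformed small-city sample fits the big-city sample worse
# than the correctly transformed one.
print(quantile_gap(transform(small), big) >
      quantile_gap(transform(small, A=0.1, D=1.1, S=0.05), big))
```

The estimation in the SAS programs amounts to searching for the (*A*, *D*, *S*) combination that minimizes such a discrepancy.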

While the administrative establishment-level data used in the article cannot be posted here, we make available the computer code required to implement our empirical methodology. We also provide an anonymized data set with ordinary least squares log total factor productivity estimates that can be used to replicate the baseline point estimates in the article (the log productivity data provided are regression residuals, so no actual data for individual establishments is being revealed). More generally, the computer code is written so that it can be easily used to apply the same methodology to find the combination of left-truncation, shift and dilation that, when applied to one distribution, best approximates a second distribution. A detailed description of the methodology is available in the article and in the supplemental material that accompanies it. Researchers interested in accessing the complete source data need to obtain permission from France's Comité du secret statistique. The Comité can be contacted at comite-secret@cnis.fr. The relevant forms and details on the application procedure can be found at the web site of the Conseil national de l'information statistique (http://www.cnis.fr/cms/Accueil/activites/_trois_comites/Comite_du_secret_statistique).

The files needed to implement this methodology are available for download from this site as a zip file:

*cdgpr_replication.zip* (2,769 Kb). This contains:

- A SAS program, `cdgprmainprogram.sas`, that estimates the combination of left-truncation, shift and dilation that, when applied to one distribution, best approximates a second distribution. The program has been tested to run properly on versions 9.2 and 9.3 of SAS in both Linux and Windows.
- A second SAS program, `cdgprmacros.sas`, which contains all the core routines and is called by `cdgprmainprogram.sas`.
- Establishment-level log total factor productivity data in SAS data format, `cdgprdata.sas7bdat`, that, together with the two programs provided, can be used to replicate the baseline point estimates (bottom rows of table I and table II) in the article 'The productivity advantages of large cities: Distinguishing agglomeration from firm selection'. These data include three variables:
  - nident: a unique code for each establishment, created specifically for this file and with no particular meaning.
  - cat: a code taking value 1 if the establishment is located in an employment area with below-median employment density and value 2 if it is located in an employment area with above-median employment density.
  - tfp_ols: anonymized ordinary least squares log total factor productivity estimates for each establishment.
- The same establishment-level log total factor productivity data in comma-delimited ASCII format: `cdgprdata.csv`.
- The output produced by `cdgprmainprogram.sas` when run in SAS 9.3 on 13 February 2012, which corresponds to the results in the article: `resultscdgprdata.sas7bdat`.
- A copy of this documentation, `readme.html` (the latest version can be found at https://diegopuga.org/data/selectagg/).

### Usage with the included replication data

To replicate the baseline point estimates in the article 'The productivity advantages of large cities: Distinguishing agglomeration from firm selection', download the zip file `cdgpr_replication.zip`, uncompress it, and place all the included files in the same directory (e.g., `c:\yourdirectory`). Edit line 15 of `cdgprmainprogram.sas` so that it points to the directory where you placed the files on your system. For example:

`%let basedir=c:\yourdirectory; /* base directory */`

Run `cdgprmainprogram.sas` in SAS. This will produce as output a SAS data file `resultscdgprdata.sas7bdat` with 6 rows and 12 columns. The 12 columns are:

- name: name of the variable on which the estimation is being run (tfp_ols when using the provided data file).
- shift: indicator variable taking value 1 if, for that row of the output file, the shift parameter *A* is being estimated, and value 0 if the estimation is run with the constraint *A* = 0.
- dilation: indicator variable taking value 1 if, for that row of the output file, the dilation parameter *D* is being estimated, and value 0 if the estimation is run with the constraint *D* = 1.
- truncation: indicator variable taking value 1 if, for that row of the output file, the truncation parameter *S* is being estimated, and value 0 if the estimation is run with the constraint *S* = 0.
- A: estimated value of the shift parameter, *Â*, or 0 if shift=0.
- D: estimated value of the dilation parameter, *D̂*, or 1 if dilation=0.
- S: estimated value of the truncation parameter, *Ŝ*, or 0 if truncation=0.
- R2: measure of the goodness of fit, *R*² = 1 − *M*(*Â*, *D̂*, *Ŝ*)/*M*(0, 1, 0).
- obs: total number of observations being used in the estimation.
- criteria: the criterion being minimized, *M*(*Â*, *D̂*, *Ŝ*).
- n1t: number of observations in the first distribution (source data observations with cat=1) after applying the estimated transformation. In rows with truncation=0, this corresponds to the actual number of observations with cat=1 being used. In rows with truncation=1, this is lower than the actual number of observations with cat=1 being used whenever *Ŝ* > 0.
- n2t: number of observations in the second distribution (source data observations with cat=2) after applying the estimated transformation. In rows with truncation=0, this corresponds to the actual number of observations with cat=2 being used. In rows with truncation=1, this is lower than the actual number of observations with cat=2 being used whenever *Ŝ* < 0.

The 6 rows correspond to the following cases:

- shift=0, dilation=0, truncation=0. This is the baseline case with no transformation, useful to see the difference between the untransformed distributions as reflected in the value of the criterion, as well as the actual number of observations in each distribution being used in the estimation.
- shift=1, dilation=1, truncation=1.
- shift=1, dilation=1, truncation=0.
- shift=1, dilation=0, truncation=0.
- shift=0, dilation=0, truncation=1.
- shift=1, dilation=1, truncation=0.

This reproduces the point estimates of table I and table II in the article. Standard errors are not produced automatically because computing them requires access to the full establishment-level data used to estimate log total factor productivity, since the standard errors also need to account for variation in the productivity estimation stage of the methodology. See the section on calculating bootstrapped standard errors below for a description of how to do this.
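As a toy illustration of how the R2 column is derived from the criteria column (the numbers below are made up, not the article's estimates):

```python
# Hypothetical criterion values, one per specification, keyed by
# (shift, dilation, truncation). NOT the article's actual numbers.
criteria = {
    (0, 0, 0): 0.0500,  # baseline with no transformation, M(0, 1, 0)
    (1, 1, 1): 0.0030,  # full model
    (1, 1, 0): 0.0035,  # shift and dilation only
}

# R2 = 1 - M(A^, D^, S^) / M(0, 1, 0)
baseline = criteria[(0, 0, 0)]
r2 = {spec: 1 - m / baseline for spec, m in criteria.items()}
print(round(r2[(1, 1, 1)], 2))  # 0.94
```

A fit of 1 would mean the transformed first distribution matches the second exactly; 0 would mean no improvement over leaving the distributions untransformed.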

### Usage with other data

The code provided can also be used to find the combination of left-truncation, shift and dilation that, when applied to one distribution, best approximates a second distribution. The source data set to which you wish to apply the methodology, in SAS data format, needs to be specified in line 16 of `cdgprmainprogram.sas`. For example:

`%let data=yourdata; /* source data set (extension to be sas7bdat) */`

This must include the following variables:

- nident: a unique code for each establishment, unit or individual in the two distributions.
- cat: a code taking value 1 if the establishment is located in an employment area with below median employment density and taking value 2 if the establishment is located in an employment area with above median employment density.
- one or more variables; these (unlike the other two variables, nident and cat) can be named freely, but the names need to be specified, separated by spaces, in line 15 of `cdgprmainprogram.sas`. For example:

`%let variables=var1 var2 var3 var4; /* variable name (or variable names separated by spaces) */`

Each variable name needs to be fewer than 50 characters long. When more than one variable is included, the estimation is performed for each of them successively, and a single results file is produced with all the results. The name of each variable is used as an identifier.
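For illustration, a minimal input data set with the required layout could be built as follows in Python (hypothetical names and values; the SAS program expects the data converted to SAS format, e.g. `yourdata.sas7bdat`):

```python
import numpy as np
import pandas as pd

# Toy data set with the layout the programs require (illustration only).
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "nident": np.arange(1, n + 1),      # unique identifier per unit
    "cat": rng.integers(1, 3, size=n),  # 1 = first distribution, 2 = second
    "var1": rng.normal(size=n),         # freely named variables, to be
    "var2": rng.normal(size=n),         # listed in the %let variables line
})
df.to_csv("yourdata.csv", index=False)  # convert to yourdata.sas7bdat in SAS
```

With this data set, line 15 of `cdgprmainprogram.sas` would read `%let variables=var1 var2;`.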

Whenever one estimates firm-level productivity, measurement errors are likely to result in a few extreme outliers. To minimize the impact of such outliers on our estimates, we exclude the 1 percent of observations with the highest productivity values and the 1 percent of observations with the lowest productivity values in each employment area density class. It is important to trim extreme values in both classes to avoid biasing the estimate of *S*. In other applications, it may be advisable to trim a different percentage of extreme observations. This can be set in line 18 of `cdgprmainprogram.sas`. For example, to trim 2% instead of 1% on each extreme of each of the two distributions:

`%let trim=2; /* defines the percentage of observations to be trimmed on each extreme of each of the two distributions; set 1 for 1%, 2 for 2% etc */`
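The trimming rule can be sketched in Python as follows (a rough analogue using simple quantile cutoffs; the SAS program's exact handling of cutoffs and ties may differ):

```python
import numpy as np
import pandas as pd

def trim_by_class(df, var, trim=1.0):
    """Drop the lowest and highest `trim` percent of `var` within each
    density class (column `cat`). Illustrative analogue of the trimming
    applied by the SAS code."""
    lo = df.groupby("cat")[var].transform(lambda s: s.quantile(trim / 100))
    hi = df.groupby("cat")[var].transform(lambda s: s.quantile(1 - trim / 100))
    return df[(df[var] >= lo) & (df[var] <= hi)]

example = pd.DataFrame({"cat": [1] * 1000, "x": list(range(1000))})
print(len(trim_by_class(example, "x", trim=1)))  # 980
```

Trimming within each class separately matters because, as noted above, trimming only one class would mechanically bias the estimate of *S*.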

Running `cdgprmainprogram.sas` in SAS will produce as output a SAS data file `resultsyourdata.sas7bdat` (if your source data set is named `yourdata.sas7bdat`), with 6 rows for each of the variables you specified and the same 12 columns described above.

### Calculating bootstrapped standard errors

Standard errors of the estimated parameters are bootstrapped by drawing establishment observations out of the log productivity distribution with replacement. For each bootstrap iteration, we first reestimate log productivity for each observation employed in the iteration, and we then reestimate *Â*, *D̂*, and *Ŝ*. Finally, we use the distribution of estimates of *Â*, *D̂*, and *Ŝ* across all bootstrap iterations to compute the standard errors. Given that `cdgprmainprogram.sas` can be run on multiple variables simultaneously by editing line 15, one can make the first variable log productivity estimated with all establishments, and then specify another 100 variables with bootstrapped log productivity estimates. Calculating the standard deviation of the estimated coefficients across the 100 bootstrap iterations with the same values of the shift, dilation, and truncation indicators yields the standard errors.
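The resampling step can be sketched in Python as a generic bootstrap of a scalar estimator (unlike the full procedure described above, this sketch does not re-estimate log productivity within each iteration):

```python
import numpy as np

def bootstrap_se(estimator, data, n_boot=100, seed=0):
    """Bootstrap standard error: resample observations with replacement,
    re-run the estimator on each resample, and take the standard
    deviation across iterations."""
    rng = np.random.default_rng(seed)
    n = len(data)
    draws = [estimator(data[rng.integers(0, n, size=n)])
             for _ in range(n_boot)]
    return float(np.std(draws, ddof=1))

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 500)
se = bootstrap_se(np.mean, x, n_boot=200)  # roughly 1/sqrt(500), about 0.045
```

In the article's application, `estimator` would be the full two-stage procedure: re-estimating log TFP on the resampled establishments and then re-running the truncation/shift/dilation estimation.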

### References

Combes, Pierre-Philippe, Gilles Duranton, Laurent Gobillon, Diego Puga, and Sébastien Roux. 2012. The productivity advantages of large cities: Distinguishing agglomeration from firm selection. *Econometrica* 80(6): 2543-2594.