Chapter 4 Metro Datasets and Code Standards
Brookings Metro publishes many novel datasets that accompany blogs and reports. During the research stage, we also create many useful interim datasets by collapsing or aggregating various administrative data to the metro scale. Depositing these commonly referenced datasets into the warehouse under a shared set of guidelines is crucial to ensuring both higher-quality and more accessible data files.
4.1 Metro Dataset Standards
The Metro data warehouse is a shared folder for final datasets. Researchers should still use their own project folders to store any input or interim datasets during the research process.
4.1.1 File Format
Save the final dataset as a comma-separated values (.csv) file; these are plain-text files that work well with most statistical software packages. It is also recommended to save with Unicode (UTF-8) encoding so the file is compatible with as wide an array of programs as possible.
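As a minimal sketch (assuming the readr package and a hypothetical data frame and file name), the final file could be written like this:

```r
library(readr)

# final_data is a hypothetical data frame holding the final output.
# readr::write_csv() writes UTF-8 encoded CSV files by default.
write_csv(final_data, "metro_unemployment_2023.csv")

# Base R alternative: request UTF-8 encoding explicitly.
write.csv(final_data, "metro_unemployment_2023.csv",
          row.names = FALSE, fileEncoding = "UTF-8")
```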
4.1.2 File Names
File names should use only lowercase letters, numbers, and underscores (_). Use “snake case” (words separated by underscores) within a name. Avoid spaces in file names.
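For example (illustrative names only), metro_gdp_2023.csv follows these rules, while Metro GDP 2023 (FINAL).csv does not.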
4.1.3 Metadata
Each dataset should include metadata that describes:
- Author(s)
- Last updated date
- Project folder location: Specify where to find your interim datasets, code, etc.
- Variable summary
- Variable labels: Describe the variables and units, if applicable.
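A hypothetical sketch of the kind of metadata that could accompany a dataset (all names and values below are placeholders):

```
Author(s):       <name(s)>
Last updated:    <YYYY-MM-DD>
Project folder:  <path to the project folder containing interim datasets and code>
Variable summary:
  cbsa_code        - CBSA code, stored as character (unique identifier)
  cbsa_name        - CBSA name
  cbsa_pct_college - share of adults with a bachelor's degree or higher (percent)
```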
4.1.4 Data Format
4.1.4.1 Flat Files
Whenever possible, save all output in “tidy” data format (Wickham 2014):
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

Figure 4.1: Tidy data
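As an illustration (hypothetical column names and made-up figures), the tidyr package can reshape a wide table, with one column per year, into tidy form:

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one employment column per year violates
# "each variable forms a column".
wide <- tribble(
  ~cbsa_code, ~emp_2021, ~emp_2022,
  "13820",       530000,    541000,   # illustrative figures
  "16980",      4600000,   4650000
)

# Tidy form: one row per metro-year observation.
tidy <- pivot_longer(
  wide,
  cols         = starts_with("emp_"),
  names_to     = "year",
  names_prefix = "emp_",
  values_to    = "emp"
)
```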
4.1.4.2 Spatial Data
For publication, US maps should be projected into Albers Equal Area Conic or Lambert Conformal Conic (ESRI codes 102003 through 102009). State- or metro-level maps should be projected into the appropriate State Plane projection. A shapefile consists of three to eight separate component files (including the .prj file that stores the projection), which must be kept together in the same directory; otherwise the data will not project correctly.
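A minimal sketch using the sf package (the shapefile path is hypothetical; the ESRI authority code requires a reasonably recent GDAL/PROJ installation):

```r
library(sf)

# Read a shapefile; its component files must sit together in one directory.
metros <- st_read("data/cbsa_boundaries.shp")  # hypothetical path

# Project to USA Contiguous Albers Equal Area Conic for a national map.
metros_albers <- st_transform(metros, "ESRI:102003")
```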
4.1.5 Unique Identifiers
Always preserve the unique identifiers for geographies, industries, and occupations in your final datasets, stored as character to preserve leading zeros. For example, you might have saved “Birmingham” as cbsa_name in your final output, but you should also keep its cbsa_code, “13820”, because when merging with other metro-level data, matching by code is much easier than matching by text.
The left-most column of the dataset should always be a unique identifier.
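For example (hypothetical file name), reading the identifier as character in readr keeps any leading zeros intact:

```r
library(readr)

metro_data <- read_csv(
  "metro_unemployment_2023.csv",   # hypothetical file
  col_types = cols(
    cbsa_code = col_character(),   # store codes as character to preserve leading zeros
    .default  = col_guess()
  )
)
```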
4.1.6 Variable Names
As well as following the general advice for file names, strive to adopt the following conventions for your variable names:
- Where possible, names should follow the key_attribute pattern for consistency. A key is a variable or set of variables that uniquely identifies the elements of a table; it could be a geography, a date, an industry, an occupation, or a combination of these. An attribute refers to a characteristic of the key, such as population, unemployment rate, or median wage.
- If the attribute is a percentage, the attribute name should start with pct_, followed by the variable name.
- If the attribute is a boolean variable, the attribute name should start with is_, indicating a logical proposition, followed by the positive boolean variable name. For example, in a metro-level dataset the variable cbsa_is_top100 takes a value of 1 (i.e., TRUE) if the metro is among the 100 largest metros. (A short example follows this list.)
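A hypothetical metro-level table that follows these conventions might look like the following (illustrative values only):

```r
library(tibble)

# cbsa_code is the key; the remaining columns are attributes of that key.
metro_example <- tribble(
  ~cbsa_code, ~cbsa_name,   ~cbsa_pct_college, ~cbsa_is_top100,
  "13820",    "Birmingham", 32.1,              1,   # illustrative values
  "46060",    "Tucson",     33.5,              1
)
```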
4.2 Metro Code Standards
4.2.1 Coding Style
Consistent coding style makes code easier to write because you need to make fewer decisions, and easier for readers to understand. The following two references describe the coding style you should follow in R and Stata.
- R: use the styler package to automatically style your R scripts.
- Stata: follow the Stata Coding Guide.
You should also use comments to record important findings and analysis decisions. For complicated scripts, write a README file that briefly summarizes the steps taken. Strive for code portability: explicitly list dependencies and use relative file paths (for R, use RStudio projects and the “here” package).
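A brief sketch of the portability advice (the folder layout and file names are hypothetical, assuming an RStudio project):

```r
library(styler)
library(here)

# Restyle a script in place so it follows a consistent style.
style_file(here("code", "01_clean_data.R"))   # hypothetical script path

# Build paths relative to the project root instead of hard-coding them.
raw <- read.csv(here("data", "raw", "metro_population.csv"))  # hypothetical data path
```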
4.2.2 Code Review
To avoid replicating errors, we follow the AEA data and code guidance for data checks and code review.
4.2.3 Version Control
Use Brookings GitHub repositories to track changes.
References
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software, Articles 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.