Chapter 4 Metro Datasets and Code Standards
Brookings Metro publishes many novel datasets that accompany blogs and reports. During the research stage, we also create many useful interim datasets by collapsing or aggregating various administrative data to the metro scale. Depositing these commonly referenced datasets into the warehouse under a shared set of guidelines is crucial to ensuring both higher-quality and more accessible data files.
4.1 Metro Dataset Standards
The Metro data warehouse is a shared folder for final datasets. Researchers should still use their own project folders to store any input or interim datasets during the research process.
4.1.1 File Format
Save the final dataset as a comma-separated values (.csv) file; these are plain-text files that work well with most statistical software packages. It is also recommended to save with Unicode (UTF-8) encoding so the file is compatible with as wide an array of programs as possible.
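As a minimal sketch (assuming the readr package and a hypothetical data frame and file name), the final file could be written like this:

```r
library(readr)

# final_data is a hypothetical data frame holding the final output.
# readr::write_csv() writes UTF-8 encoded CSV files by default.
write_csv(final_data, "metro_unemployment_2023.csv")

# Base R alternative: request UTF-8 encoding explicitly.
write.csv(final_data, "metro_unemployment_2023.csv",
          row.names = FALSE, fileEncoding = "UTF-8")
```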
4.1.2 File Names
File names should use only lowercase letters, numbers, and underscores (_). Use “snake case” (words separated by underscores) within a name. Avoid spaces in file names.
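For example (illustrative names only), metro_gdp_2023.csv follows these rules, while Metro GDP 2023 (FINAL).csv does not.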
4.1.3 Metadata
Each dataset should include metadata that describes:
- Author(s)
- Last updated date
- Project folder location: Specify where to find your interim datasets, code, etc.
- Variable summary
- Variable labels: Describe the variables and units, if applicable.
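A hypothetical sketch of the kind of metadata that could accompany a dataset (all names and values below are placeholders):

```
Author(s):       <name(s)>
Last updated:    <YYYY-MM-DD>
Project folder:  <path to the project folder containing interim datasets and code>
Variable summary:
  cbsa_code        - CBSA code, stored as character (unique identifier)
  cbsa_name        - CBSA name
  cbsa_pct_college - share of adults with a bachelor's degree or higher (percent)
```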
4.1.4 Data Format
4.1.4.1 Flat Files
Whenever possible, save all output in “tidy” data format (Wickham 2014):
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

Figure 4.1: Tidy data
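As an illustration (hypothetical column names and made-up figures), the tidyr package can reshape a wide table, with one column per year, into tidy form:

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one employment column per year violates
# "each variable forms a column".
wide <- tribble(
  ~cbsa_code, ~emp_2021, ~emp_2022,
  "13820",       530000,    541000,   # illustrative figures
  "16980",      4600000,   4650000
)

# Tidy form: one row per metro-year observation.
tidy <- pivot_longer(
  wide,
  cols         = starts_with("emp_"),
  names_to     = "year",
  names_prefix = "emp_",
  values_to    = "emp"
)
```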
4.1.4.2 Spatial Data
For publication, US maps should be projected into Albers Equal Area Conic or Lambert Conformal Conic (ESRI codes 102003 through 102009). State- or metro-level maps should be projected into the appropriate State Plane projection. A shapefile consists of three to eight separate component files (including the .prj file that stores the projection), which must be kept together in the same directory; otherwise the data will not project correctly.
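A minimal sketch using the sf package (the shapefile path is hypothetical; the ESRI authority code requires a reasonably recent GDAL/PROJ installation):

```r
library(sf)

# Read a shapefile; its component files must sit together in one directory.
metros <- st_read("data/cbsa_boundaries.shp")  # hypothetical path

# Project to USA Contiguous Albers Equal Area Conic for a national map.
metros_albers <- st_transform(metros, "ESRI:102003")
```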
4.1.5 Unique Identifiers
Always preserve the unique identifiers for geographies, industries, and occupations in your final datasets, stored as character to preserve leading zeros. For example, you might have saved “Birmingham” as cbsa_name in your final output, but you should also keep its cbsa_code, “13820”, because when merging with other metro-level data, matching by code is much easier than matching by text.
The left-most column of the dataset should always be a unique identifier.
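For example (hypothetical file name), reading the identifier as character in readr keeps any leading zeros intact:

```r
library(readr)

metro_data <- read_csv(
  "metro_unemployment_2023.csv",   # hypothetical file
  col_types = cols(
    cbsa_code = col_character(),   # store codes as character to preserve leading zeros
    .default  = col_guess()
  )
)
```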
4.1.6 Variable Names
As well as following the general advice for file names, strive to adopt the following conventions for your variable names:
- Where possible, names should follow the key_attribute pattern for consistency. A key is a variable or set of variables that uniquely identifies the elements of a table; it could be a geography, a date, an industry, an occupation, or a combination of these. An attribute refers to a characteristic of the key, such as population, unemployment rate, or median wage.
- If the attribute is a percentage, the attribute name should start with pct_, followed by the variable name.
- If the attribute is a boolean variable, the attribute name should start with is_, indicating a logical proposition, followed by the positive boolean variable name. For example, in a metro-level dataset the variable cbsa_is_top100 takes a value of 1 (i.e., TRUE) if the metro is among the 100 largest metros. (A short example follows this list.)
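A hypothetical metro-level table that follows these conventions might look like the following (illustrative values only):

```r
library(tibble)

# cbsa_code is the key; the remaining columns are attributes of that key.
metro_example <- tribble(
  ~cbsa_code, ~cbsa_name,   ~cbsa_pct_college, ~cbsa_is_top100,
  "13820",    "Birmingham", 32.1,              1,   # illustrative values
  "46060",    "Tucson",     33.5,              1
)
```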
4.2 Metro Code Standards
4.2.1 Coding Style
Consistent coding style makes code easier to write because you need to make fewer decisions, and easier for readers to understand. The following two references describe the coding style you should follow in R and Stata.
- R: use the styler package to automatically style your R scripts.
- Stata: follow the Stata Coding Guide.
You should also use comments to record important findings and analysis decisions. For complicated scripts, write a README file that briefly summarizes the steps taken. Strive for code portability: explicitly list dependencies and use relative file paths (for R, use RStudio projects and the “here” package).
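A brief sketch of the portability advice (the folder layout and file names are hypothetical, assuming an RStudio project):

```r
library(styler)
library(here)

# Restyle a script in place so it follows a consistent style.
style_file(here("code", "01_clean_data.R"))   # hypothetical script path

# Build paths relative to the project root instead of hard-coding them.
raw <- read.csv(here("data", "raw", "metro_population.csv"))  # hypothetical data path
```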
4.2.2 Code Review
To avoid replicating errors, we follow the AEA data and code guidance for data checks and code review.
4.2.3 Version Control
Use Brookings GitHub repositories to track changes.
References
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software, Articles 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.