Skip to content

2_Data Structure and Expectations 

jwalsh28 edited this page Oct 17, 2024 · 1 revision

Overview

The ultimate goal is to be able to seamlessly transition the separate metric datasets into a collection of main combined datasets which take on the several file formats described below. The structure of these combined files determines the standards set for individual final metric datasets.

The final files are separated by geographic level (city vs county) and whether the file includes a subgroup (e.g. race/ethnicity). The combined metric files are in the "long" format as opposed to a "wide" format, meaning each unique geography will account for more than one row because all the files contain multiple years of data. These data are hosted publicly on the Urban Institute data catalog.

The first three variables in every file (both overall and subgroups) should be year, state, and county/place. The variable year should be a four-digit numeric variable. 

  • state should be a two characters FIPS code. 

  • county should be a three-character FIPS code. 

  • place should the 5-digit census place FIPS. Intermediate files at the tract-level should include tract as the fourth variable. 

Overall

The overall county and place files contain every mobility metric for all available years. These files have exactly one row per county/place per year. As a metric lead, you should create an overall file for your metric all available geographies with a format the matches the tables below. Note that not all variables have confidence intervals (CIs) but we encourage adding them when possible. For those that do not have CIs, these columns are not required. 

Example data: County level

year state county Var1 Var1_lb Var1_ub Var1_quality
2014 01 001 v v_lb v_ub v_quality
2014 01 003 v v_lb v_ub v_quality
2014 01 005 v v_lb v_ub v_quality
2014 01 007 v v_lb v_ub v_quality

There should be a row in the overall data files for every county in each available year. If created correctly, the final row count will equal the product of the number of counties and the number of years. For information on the number of counties you should have per year please consult the crosswalks section. 

Example data: Place level

year state place Var1 Var1_lb Var1_ub Var1_quality
2015 01 03076 v v_lb v_ub v_quality
2015 01 07000 v v_lb v_ub v_quality
2015 01 35896 v v_lb v_ub v_quality
2015 01 37000 v v_lb v_ub v_quality

There should be a row in the overall data files for every place in each available year. If created correctly, the final row count will equal the product of the number of places and the number of years. For information on the number of places you should have per year please consult the crosswalks section. 

Subgroups

The subgroups files follow a similar structure to the overall file, but unlike the overall these data will have multiple observations per county per year due to the subgroup values (i.e. race/ethnicity, poverty status, etc.). There currently exists 9 different subgroup types and respective subgroup combined datasets. The table below lists these subgroups and their respective values. If you are updating a metric that includes on of these subgroups the values for the subgroups in your final data should match a selection of the values listed below.

Subgroup Values
Race-ethnicity
  • Black
  • Black, Non Hispanic
  • White
  • White, Non-Hispanic
  • Other Races and Ethnicities
  • Hispanic
Race-share
  • Neighborhoods of color 
  • White neighborhoods
  • Mixed neighborhoods
Income
  • Economically Disadvantaged
  • Not Economically Disadvantaged
  • Less than $50,000
  • $50,000 or More
  • High-Poverty
  • Not High-Poverty
Age
  • Under Age 45
  • Age 45 and Over
  • Age 10 to 14
  • Age 15 to 17
Gender
  • Male
  • Female
Tenure
  • Renter
  • Owner
Disability
  • With Disability
  • Without Disability
Industry
  • Goods Producing
  • Public Administration
  • Trade, Transit, Utilities
  • Information Services
  • Professional Services
  • Education and Health
  • Leisure and Other
Mother’s education
  • Less than High School
  • GED/High School Degree
  • Some College
  • College Degree or Higher

The first five variables in every subgroup file should be year, state, county/place, subgroup and subgroup type. The tables below provide an example of the Tenure subgroup type. 

Example data: County level

year state county subgroup subgroup_type Var1 Var1_lb Var1_ub Var1_quality
2014 01 001 All all v v_lb v_ub v_quality
2014 01 001 Renter tenure v v_lb v_ub v_quality
2014 01 001 Owner tenure v v_lb v_ub v_quality

There should be a row in the subgroup data for every county in each available year for each subgroup. If created correctly, the final row count will equal the product of the number of counties, the number of years and the number of subgroup values in the subgroup_type (including All). For information on the number of counties you should have per year please consult the crosswalks section.  

Example data: Place level  

year state place subgroup subgroup_type Var1 Var1_lb Var1_ub Var1_quality
2015 01 03076 All all v v_lb v_ub v_quality
2015 01 03076 Renter tenure v v_lb v_ub v_quality
2015 01 03076 Owner tenure v v_lb v_ub v_quality

There should be a row in the subgroup data for every place in each available year for each subgroup. If created correctly, the final row count will equal the product of the number of places, the number of years and the number of subgroup values in the subgroup_type (including All). For information on the number of places you should have per year please consult the crosswalks section. 

Subgroup “All” Values

Subgroup files should contain an all row that includes the metric value for the overall population. This value should be represented with a capital a “All” under the subgroup variable and a lower case a “all” under the subgroup_type variable.  

Subgroup all values should be calculated using the same data that you are calculating all other subgroup values with. Note this will result in cases where the All result from the overall file differs from the All result from the subgroup file – which is expected if the input data are different.

Sorting

All files should be sorted by year, state, and county/place, the first three variables in every file. Files at different geographic levels should be sorted by year and then in order by largest geographic level (i.e. state) to smallest geographic level (i.e. city). 

Subgroup files should be sorted by year, state, county/place, subgroup_type, and subgroup. All sorting should be alphanumeric. Importantly, the race/ethnicity groups should be sorted alphabetically so that "Black, Non-Hispanic" appears first and "White, Non-Hispanic" appears last.   

Missing Values

If a metric is missing the value should be reported as a character NA value. If the metric value is NA the variables related to the metric quality (_quality and confidence intervals (_lb, _ub)) should always also be NA – there should never be a value reported for quality when the metric is missing or a value reported for the metric when quality is missing, these must always align. Geography and subgroup information should still be included when the metric is missing so it is clear what is missing.