-
Notifications
You must be signed in to change notification settings - Fork 2
2_Data Structure and Expectations
The ultimate goal is to be able to seamlessly transition the separate metric datasets into a collection of main combined datasets which take on the several file formats described below. The structure of these combined files determines the standards set for individual final metric datasets.
The final files are separated by geographic level (city vs county) and whether the file includes a subgroup (e.g. race/ethnicity). The combined metric files are in the "long" format as opposed to a "wide" format, meaning each unique geography will account for more than one row because all the files contain multiple years of data. These data are hosted publicly on the Urban Institute data catalog.
The first three variables in every file (both overall and subgroups) should be year, state, and county/place. The variable year should be a four-digit numeric variable.
-
state should be a two characters FIPS code.
-
county should be a three-character FIPS code.
-
place should the 5-digit census place FIPS. Intermediate files at the tract-level should include tract as the fourth variable.
The overall county and place files contain every mobility metric for all available years. These files have exactly one row per county/place per year. As a metric lead, you should create an overall file for your metric all available geographies with a format the matches the tables below. Note that not all variables have confidence intervals (CIs) but we encourage adding them when possible. For those that do not have CIs, these columns are not required.
Example data: County level
year | state | county | Var1 | Var1_lb | Var1_ub | Var1_quality |
---|---|---|---|---|---|---|
2014 | 01 | 001 | v | v_lb | v_ub | v_quality |
2014 | 01 | 003 | v | v_lb | v_ub | v_quality |
2014 | 01 | 005 | v | v_lb | v_ub | v_quality |
2014 | 01 | 007 | v | v_lb | v_ub | v_quality |
There should be a row in the overall data files for every county in each available year. If created correctly, the final row count will equal the product of the number of counties and the number of years. For information on the number of counties you should have per year please consult the crosswalks section.
Example data: Place level
year | state | place | Var1 | Var1_lb | Var1_ub | Var1_quality |
---|---|---|---|---|---|---|
2015 | 01 | 03076 | v | v_lb | v_ub | v_quality |
2015 | 01 | 07000 | v | v_lb | v_ub | v_quality |
2015 | 01 | 35896 | v | v_lb | v_ub | v_quality |
2015 | 01 | 37000 | v | v_lb | v_ub | v_quality |
There should be a row in the overall data files for every place in each available year. If created correctly, the final row count will equal the product of the number of places and the number of years. For information on the number of places you should have per year please consult the crosswalks section.
The subgroups files follow a similar structure to the overall file, but unlike the overall these data will have multiple observations per county per year due to the subgroup values (i.e. race/ethnicity, poverty status, etc.). There currently exists 9 different subgroup types and respective subgroup combined datasets. The table below lists these subgroups and their respective values. If you are updating a metric that includes on of these subgroups the values for the subgroups in your final data should match a selection of the values listed below.
Subgroup | Values |
---|---|
Race-ethnicity |
|
Race-share |
|
Income |
|
Age |
|
Gender |
|
Tenure |
|
Disability |
|
Industry |
|
Mother’s education |
|
The first five variables in every subgroup file should be year, state, county/place, subgroup and subgroup type. The tables below provide an example of the Tenure subgroup type.
Example data: County level
year | state | county | subgroup | subgroup_type | Var1 | Var1_lb | Var1_ub | Var1_quality |
---|---|---|---|---|---|---|---|---|
2014 | 01 | 001 | All | all | v | v_lb | v_ub | v_quality |
2014 | 01 | 001 | Renter | tenure | v | v_lb | v_ub | v_quality |
2014 | 01 | 001 | Owner | tenure | v | v_lb | v_ub | v_quality |
There should be a row in the subgroup data for every county in each available year for each subgroup. If created correctly, the final row count will equal the product of the number of counties, the number of years and the number of subgroup values in the subgroup_type (including All). For information on the number of counties you should have per year please consult the crosswalks section.
Example data: Place level
year | state | place | subgroup | subgroup_type | Var1 | Var1_lb | Var1_ub | Var1_quality |
---|---|---|---|---|---|---|---|---|
2015 | 01 | 03076 | All | all | v | v_lb | v_ub | v_quality |
2015 | 01 | 03076 | Renter | tenure | v | v_lb | v_ub | v_quality |
2015 | 01 | 03076 | Owner | tenure | v | v_lb | v_ub | v_quality |
There should be a row in the subgroup data for every place in each available year for each subgroup. If created correctly, the final row count will equal the product of the number of places, the number of years and the number of subgroup values in the subgroup_type (including All). For information on the number of places you should have per year please consult the crosswalks section.
Subgroup files should contain an all row that includes the metric value for the overall population. This value should be represented with a capital a “All” under the subgroup variable and a lower case a “all” under the subgroup_type variable.
Subgroup all values should be calculated using the same data that you are calculating all other subgroup values with. Note this will result in cases where the All result from the overall file differs from the All result from the subgroup file – which is expected if the input data are different.
All files should be sorted by year, state, and county/place, the first three variables in every file. Files at different geographic levels should be sorted by year and then in order by largest geographic level (i.e. state) to smallest geographic level (i.e. city).
Subgroup files should be sorted by year, state, county/place, subgroup_type, and subgroup. All sorting should be alphanumeric. Importantly, the race/ethnicity groups should be sorted alphabetically so that "Black, Non-Hispanic" appears first and "White, Non-Hispanic" appears last.
If a metric is missing the value should be reported as a character NA value. If the metric value is NA the variables related to the metric quality (_quality and confidence intervals (_lb, _ub)) should always also be NA – there should never be a value reported for quality when the metric is missing or a value reported for the metric when quality is missing, these must always align. Geography and subgroup information should still be included when the metric is missing so it is clear what is missing.