-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Iss417 #440
base: main
Are you sure you want to change the base?
Conversation
Code & final output file updated
2020 data added to code and output files for both county and city level transit trips data
Update to code and output file for transit trips - county, nonsubgroup
forgot to do this earlier
Code and output files updated to include 2020 data for transit cost & transit trips at the COUNTY level
Code and output files updates for the city (place) -level data for both transit trips and transit costs with race subgroups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @tinatinc - I'm about halfway through the review but wanted to submit my comments so far to give you more time. Overall the changes look good and I do not see anything wrong with the code but have made some suggestions for improvemnets.
Two big things:
- A folder for "data" does not currently exist in the transportation folder when I checkout to your branch. This can cuase confusion because your instructions in the program state to download the raw data in the "data" folder. I think you should either add one or update the insturctions to state that you need to create a data folder.
- The historgrams in the city trips_cost code are currently not showing anything, I believe because the bin widths are set to high. These are important as they are the only viz of the post weighted data, would be key to improve these.
@@ -46,6 +46,7 @@ repository folder | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting that the "data" folder does not exist inside the Transportation folder, either add the data folder or instruct reviewers to create it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added instruction for users/reviewers to create this folder to use it
filter(year == 2020) | ||
``` | ||
|
||
The 2015 and 2019 files have the same number of observations (3134, down from 3142 due to removing the 8 CT counties). 2020 file has 3,143 for due to the Alaska county split. Checking that's the case below: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the above code chunk can you add a count
argument so that the count of the dataframes prints? This would make it easier to see what you are stating here in the text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
} | ||
``` | ||
|
||
No missing values for 2020. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I ran this I am seeing there is one missing value: State 48, County 243
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Thank you for catching - 1 missing value for 2020
``` | ||
|
||
Combined file has 9427 observations, which is correct (3142+3142+3143) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above, I would recommend adding a count argument so the number of observations is printed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
``` | ||
|
||
Combined file has 9427 observations, which is correct (3142+3142+3143) | ||
Keep variables of interest and order them appropriately also rename to correct var names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to see the distriubtion of transit costs by county for all three years visualized together to observe similarity or movements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added! Added commentary as well -- TLDR, the distributions are comparable, but costs increased from 2015 to 2019, and then decreased a lot in 2020 to below 2015 levels (which tracks, given this was the COVID year)
transit_trips_tracts_2020 <- transport_tracts_2020 %>% | ||
select(GEOID, state, county, tract, blkgrps, population, households, transit_trips_80ami) | ||
|
||
transit_cost_tracts_2020 <- transport_tracts_2020 %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This applies to all years - it would be great to have some test or viz here that confirms all files have been read in properly. Is there a count of tracts we should expect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really - tract data variable each year. But I have added a counter for each, and have checked for duplicates in the final outputs. Added distribution graphs for comparison and visual checking.
|
||
Collapse to places and also create data quality marker | ||
|
||
```{r} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add to the comment above this calculation. This is an important and relatively complicated step, please explain what the weighting process does and why we multiply afact and hhwt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
@@ -371,13 +505,17 @@ Examine outliers | |||
ggplot(transit_cost_city_2015, aes(x=index_transportation_cost)) + geom_histogram(binwidth=10) + labs(y="number of places", x="Annual Transit Cost for the Regional Moderate Income Household, 2015") | |||
|
|||
ggplot(transit_cost_city_2019, aes(x=index_transportation_cost)) + geom_histogram(binwidth=10) + labs(y="number of places", x="Annual Transit Cost for the Regional Moderate Income Household, 2019") | |||
|
|||
ggplot(transit_cost_city_2020, aes(x=index_transportation_cost)) + geom_histogram(binwidth=10) + labs(y="number of places", x="Annual Transit Cost for the Regional Moderate Income Household, 2020") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way these ggplot visuals are turning out currently does not offer much information, I think the bin widths are set too wide - it is just appearing as one large grey square.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, yes -- moved it to 0.01 bin widths, much clearer. Distributions look normal
Add folder for final forms
Added 2014 PEP population data into the crosswalk manually since the API is limited
…gh 2021 As discussed, aligning with the data team decision to maintain the original 8 CT counties through 2021. Also manually adding PEP population data for 2014 to complete previously missing data.
Kicked off more specific documentation about the crosswalks in the README - will revisit as needed
Updates made based on JP's feedback
Changes made according to JP's initial review
Applying JP's suggested changes across similar code(s) & rerunning the data because the CT crosswalks were updated
Moved over all the same edits JP suggested on earlier files. Re-ran with updated crosswalk files
Updated based on JP's feedback on other codes. Reran the data output accordingly
Mobility metric pull request template
Please include the following points in your PR:
A link to the issue that this PR relates to: Issue 417
A description of the content in this pull request.
What was changed?
I added/appended the 2020 data for both the transit cost and transit trip metrics (both overall and subgroup files). Previously, we only had 2015 and 2019 data.
What should the reviewer be focusing on?
Any errors I may have made in replicating the same calculations for 2020. Specifically, tract/county differences in crosswalks (since 2020 is a decade change, and there were jurisdictional changes in both counties and tract numbers).
Is there a logical order to review the files in?
I would start with transit_cost_county, then transit_trips_county, these two calculate the two metrics at the county level. Then transit_trips_cost_city, which calculates both metrics in one file instead, now at the city (place) level. Then, transit_trips_cost_county_subgroups, which creates the subgroup data at the county level for both metrics. Then transit_trips_cost_city_subgroups, which creates the subgroup data at the city (place) level for both metrics.
None that I can think of, but I may have messed up on when/if to include the 8 CT counties that became defunct recently. I need to double check on the crosswalks