-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Iss415 #438
base: version2025
Are you sure you want to change the base?
Iss415 #438
Conversation
"Relabel 'Overview' section as 'Housekeeping' in social_associations_county.qmd and remove unneeded spaces after punctuation"
- Back fill years and add 2022 - Add tests and update documentation - Visualize summary statistics - Pass unit tests
as.numeric() does not work on lists, so reversing the order of operations gives the intended outcome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manu--this looks great overall. In particular, I really appreciate the clarity of the introduction to the metric at the top.
I left some minor comments throughout. A few overarching thoughts:
- I haven't really reviewed the county-level file yet. I imagine many of the city-level comments apply there as well. If you have the bandwidth and interest, and if it seems plausible, I'd love to see the city- and county-level code files consolidated into a single .qmd, with conditional logic treating the differences between the two. If that's not a priority, I'll plan to review the county-level code once the city-level code is essentially finalized.
- Querying the data from the API is a pain (can only imagine this has been a 10x headache for you). One issue I've run into is that the query works for some years, but for other years it fails. This doesn't error out, which is fine, but the result is that a subsequent check throws an error because I've got data for, e.g., 2015:2022, but not for 2014. Can you write results to disk year-by-year, and then only query years for which there aren't already local data? This might be the top priority in my mind at this point.
- Generally, I left a decent number of comments about stylistic things, which you should feel free to take or ignore as you see fit--they're obviously just my preference, and as far as I can tell with a subset of all the years' data, everything runs well at first blush, so they're definitely not critical changes.
``` | ||
|
||
```{r} | ||
library(naniar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is loaded at the top, I think.
select(-year) %>% | ||
arrange(state, county) %>% | ||
write_csv(here("06_neighborhoods/social-capital/final/social_associations_2021_county.csv")) | ||
write_csv(here("06_neighborhoods/social-capital/data/social_associations_all_county.csv")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
county
should precede place
in the final file name (I think).
|
||
4. collapse estimates to unique Places | ||
|
||
5. check against official Census Place file & limit to population cutoff Places | ||
|
||
6. use crosswalk population data to construct the ratio (Numerator/Denominator) | ||
|
||
7. add data quality tag, final file cleaning and export to .csv file | ||
7. add data quality tag, final file cleaning, visualize metric and export to .csv file | ||
|
||
## Download social organization data | ||
|
||
|
||
We pull our data from `library(censusapi)`. | ||
|
||
**Note:** This will require a [Census API key](https://api.census.gov/data/key_signup.html). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I right in thinking the intended flow here is to manually open this file, insert the user's API key, run it, then delete the file and this code chunk in this file?
(If so): Alternately, it might be cleaner to add a test for the presence of the API key in the .Renviron file. If it passes, continue, if it fails, return an error message directing the user to follow the process described here. This removes the need to delete a code chunk in this document.
813930, 813910, 813920 | ||
) | ||
|
||
years <- 2014 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just from testing, right? If not, it appears to write over the years
variable listed above.
```{r} | ||
#| label: get-social-organization-data | ||
|
||
fetch_cbp_data <- function(year, naics_codes_to_keep) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible to add a check here to see if the data exists locally before querying the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Building on this: I haven't been able to get all the years' data to download successfully. On different runs, I've gotten errors on at least one, and often multiple, years' worth of data. The code executes fully, but the resulting dataset fails one of the checks below. As-is, this requires re-running the entire query, i.e., for all data years, which is time-intensive. Could you modify this to download each year's data into a single file? Then add a check, year by year, for whether the data has already been downloaded? And then, once all years have been downloaded, read them in and combine them into a single longitudinal dataset?
)) | ||
|
||
zcta_10_place_10_xwalk <- | ||
zcta_10_place_10_xwalk %>% | ||
mutate(portion_in = case_when( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preference for a clearer variable name here, e.g., zcta_fully_within_place_binary
, or something loosely along those lines
@@ -248,7 +335,7 @@ fall within the Place (`zips_in`) | |||
merged_sa_zip_city <- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strong preference to not write over existing objects. Moderate preference to consolidate this all into a single pipe chain, with comments above lines/chunks of code as needed, to make it easier to track the code logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, opportunity to re-name this object into some more expressive? sa
is not clear--possible just to use organizations
or associations
?
@@ -257,11 +344,11 @@ merged_sa_zip_city <- | |||
) | |||
``` | |||
|
|||
**Check:** Are there exactly 2 missing values in the `new_est_zip` variable? | |||
**Check:** Are there exactly 9 missing values in the `new_est_zip` variable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would that be the expected case? And should this be total_org
?
@@ -283,20 +370,42 @@ places_pop <- | |||
read_csv(here("geographic-crosswalks/data/place-populations.csv")) %>% | |||
rename(state_fips = state) %>% | |||
filter(year %in% years) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow this--why query 2014 data if it's just going to be all missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, had missed your earlier comment about this very issue, @malcalakovalski! Tagging @cdsolari for her input here. I think we should use the CBP population data and produce 2014 estimates. If so, perhaps we want to note at the top that we use a different population source for this year as compared to the others. (Though--why do we not have population data for 2014 if we're producing estimates across many metrics for 2014?)
## Run Final Tests | ||
|
||
```{r} | ||
source(here("functions/testing/evaluate_final_data.R")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you push the final expectations form in your next commit?
This PR addresses issue #415. In particular, it
I have some outstanding questions for review:
censusApi::listCensusMetadata("cbp", vintage = 2021). In particular, there's no "noise range" for
estab` (number of establishments, which we use to create the metric). Is this reasoning correct?