Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iss415 #438

Draft
wants to merge 7 commits into
base: version2025
Choose a base branch
from
Draft

Iss415 #438

wants to merge 7 commits into from

Conversation

malcalakovalski
Copy link
Contributor

@malcalakovalski malcalakovalski commented Dec 17, 2024

This PR addresses issue #415. In particular, it

  • Adds new 2022 data
  • Backfills years 2014-2019 (only 2020 and 2021 were present previously)
  • Creates a function to make the correct API call for different years and geographies
  • Checks that the expected number of counties and places are present
  • Checks that the expected number of years are present
  • Adds visualizations of key variables including missing variables
  • Runs the data against the final expectations testing function

I have some outstanding questions for review:

  • Am I crosswalking the pre-2020 ZCTA data correctly? I downloaded a 2010 ZCTA to 2010 Place crosswalk and followed the same steps as in the 2020 ZCTA to 2021 Place crosswalk used previously. Additionally, should that crosswalk belong in the geographic crosswalk folder instead?
  • I think I found a bug in the testing function and proposed a change. Specifically, the years extracted from the evaluation form csv are converted to a list and then to a numeric. However, this failed for me so I tweaked the function to unlist and then convert years to numeric.
  • One flag I have is that all values for 2014 are NA. This is because our population files are for 2015-2023. The CBP data itself does include 2014 data. Is this expected, or should we add 2014 population data to our places and county population files?
  • My understanding is this PR cannot include confidence intervals because the CBP files do not include margins of error. The available variables are shown in the screenshot below from a call to censusApi::listCensusMetadata("cbp", vintage = 2021). In particular, there's no "noise range" for estab` (number of establishments, which we use to create the metric). Is this reasoning correct?
Screenshot 2024-12-18 at 2 00 04 PM

"Relabel 'Overview' section as 'Housekeeping' in social_associations_county.qmd and remove unneeded spaces after punctuation"
- Back fill years and add 2022
- Add tests and update documentation
- Visualize summary statistics
- Pass unit tests
as.numeric() does not work on lists, so reversing the order of operations gives the intended outcome
Copy link
Collaborator

@wcurrangroome wcurrangroome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manu--this looks great overall. In particular, I really appreciate the clarity of the introduction to the metric at the top.

I left some minor comments throughout. A few overarching thoughts:

  • I haven't really reviewed the county-level file yet. I imagine many of the city-level comments apply there as well. If you have the bandwidth and interest, and if it seems plausible, I'd love to see the city- and county-level code files consolidated into a single .qmd, with conditional logic treating the differences between the two. If that's not a priority, I'll plan to review the county-level code once the city-level code is essentially finalized.
  • Querying the data from the API is a pain (can only imagine this has been a 10x headache for you). One issue I've run into is that the query works for some years, but for other years it fails. This doesn't error out, which is fine, but the result is that a subsequent check throws an error because I've got data for, e.g., 2015:2022, but not for 2014. Can you write results to disk year-by-year, and then only query years for which there aren't already local data? This might be the top priority in my mind at this point.
  • Generally, I left a decent number of comments about stylistic things, which you should feel free to take or ignore as you see fit--they're obviously just my preference, and as far as I can tell with a subset of all the years' data, everything runs well at first blush, so they're definitely not critical changes.

```

```{r}
library(naniar)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is loaded at the top, I think.

select(-year) %>%
arrange(state, county) %>%
write_csv(here("06_neighborhoods/social-capital/final/social_associations_2021_county.csv"))
write_csv(here("06_neighborhoods/social-capital/data/social_associations_all_county.csv"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

county should precede place in the final file name (I think).


4. collapse estimates to unique Places

5. check against official Census Place file & limit to population cutoff Places

6. use crosswalk population data to construct the ratio (Numerator/Denominator)

7. add data quality tag, final file cleaning and export to .csv file
7. add data quality tag, final file cleaning, visualize metric and export to .csv file

## Download social organization data


We pull our data from `library(censusapi)`.

**Note:** This will require a [Census API key](https://api.census.gov/data/key_signup.html).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right in thinking the intended flow here is to manually open this file, insert the user's API key, run it, then delete the file and this code chunk in this file?

(If so): Alternately, it might be cleaner to add a test for the presence of the API key in the .Renviron file. If it passes, continue, if it fails, return an error message directing the user to follow the process described here. This removes the need to delete a code chunk in this document.

813930, 813910, 813920
)

years <- 2014
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just from testing, right? If not, it appears to write over the years variable listed above.

```{r}
#| label: get-social-organization-data

fetch_cbp_data <- function(year, naics_codes_to_keep) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to add a check here to see if the data exists locally before querying the API?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building on this: I haven't been able to get all the years' data to download successfully. On different runs, I've gotten errors on at least one, and often multiple, years' worth of data. The code executes fully, but the resulting dataset fails one of the checks below. As-is, this requires re-running the entire query, i.e., for all data years, which is time-intensive. Could you modify this to download each year's data into a single file? Then add a check, year by year, for whether the data has already been downloaded? And then, once all years have been downloaded, read them in and combine them into a single longitudinal dataset?

))

zcta_10_place_10_xwalk <-
zcta_10_place_10_xwalk %>%
mutate(portion_in = case_when(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preference for a clearer variable name here, e.g., zcta_fully_within_place_binary, or something loosely along those lines

@@ -248,7 +335,7 @@ fall within the Place (`zips_in`)
merged_sa_zip_city <-
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strong preference to not write over existing objects. Moderate preference to consolidate this all into a single pipe chain, with comments above lines/chunks of code as needed, to make it easier to track the code logic.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, opportunity to re-name this object into some more expressive? sa is not clear--possible just to use organizations or associations?

@@ -257,11 +344,11 @@ merged_sa_zip_city <-
)
```

**Check:** Are there exactly 2 missing values in the `new_est_zip` variable?
**Check:** Are there exactly 9 missing values in the `new_est_zip` variable?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would that be the expected case? And should this be total_org?

@@ -283,20 +370,42 @@ places_pop <-
read_csv(here("geographic-crosswalks/data/place-populations.csv")) %>%
rename(state_fips = state) %>%
filter(year %in% years)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow this--why query 2014 data if it's just going to be all missing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, had missed your earlier comment about this very issue, @malcalakovalski! Tagging @cdsolari for her input here. I think we should use the CBP population data and produce 2014 estimates. If so, perhaps we want to note at the top that we use a different population source for this year as compared to the others. (Though--why do we not have population data for 2014 if we're producing estimates across many metrics for 2014?)

## Run Final Tests

```{r}
source(here("functions/testing/evaluate_final_data.R"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you push the final expectations form in your next commit?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants