Iss415 #438

malcalakovalski · 2024-12-17T21:44:35Z

This PR addresses issue #415. In particular, it:

- Adds new 2022 data
- Backfills years 2014-2019 (only 2020 and 2021 were present previously)
- Creates a function to make the correct API call for different years and geographies
- Checks that the expected number of counties and places are present
- Checks that the expected number of years are present
- Adds visualizations of key variables, including missing variables
- Runs the data against the final expectations testing function
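As a rough illustration of the year check listed above, here is a minimal base R sketch. The function name `check_years_present` and the assumption that the data has a `year` column are hypothetical and are not taken from the PR's code.

```r
# Error if any expected year is absent from the `year` column of `data`.
# `check_years_present` is a hypothetical helper, not part of the PR.
check_years_present <- function(data, expected_years) {
  missing_years <- setdiff(expected_years, unique(data$year))
  if (length(missing_years) > 0) {
    stop("Missing years: ", paste(missing_years, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}
```

Naming the missing years in the error message makes it easier to tell which API queries need to be rerun.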

I have some outstanding questions for review:

1. Am I crosswalking the pre-2020 ZCTA data correctly? I downloaded a 2010 ZCTA to 2010 Place crosswalk and followed the same steps as in the 2020 ZCTA to 2021 Place crosswalk used previously. Additionally, should that crosswalk belong in the geographic crosswalk folder instead?
2. I think I found a bug in the testing function and proposed a change. Specifically, the years extracted from the evaluation form csv are converted to a list and then to numeric. This failed for me, so I tweaked the function to unlist the years first and then convert them to numeric.
3. One flag I have is that all values for 2014 are NA. This is because our population files cover 2015-2023. The CBP data itself does include 2014 data. Is this expected, or should we add 2014 population data to our places and county population files?
4. My understanding is that this PR cannot include confidence intervals because the CBP files do not include margins of error. The available variables are shown in the screenshot below from a call to `censusapi::listCensusMetadata("cbp", vintage = 2021)`. In particular, there's no "noise range" for `estab` (number of establishments, which we use to create the metric). Is this reasoning correct?
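To illustrate the testing-function question above, here is a small base R sketch of why the order of operations matters. The semicolon-separated `years_raw` string is a hypothetical stand-in for the field read from the evaluation form csv; the real format may differ.

```r
# Hypothetical stand-in for the years field read from the evaluation form csv.
years_raw <- "2014;2015;2016"

# strsplit() returns a list holding one character vector.
years_list <- strsplit(years_raw, ";")

# Coercing the list directly fails because its single element has length > 1.
direct <- tryCatch(as.numeric(years_list), error = function(e) e)

# Unlisting first yields a plain character vector, which converts cleanly.
years <- as.numeric(unlist(years_list))
```

Here `direct` holds the captured error condition, while `years` is an ordinary numeric vector.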

wcurrangroome

Manu--this looks great overall. In particular, I really appreciate the clarity of the introduction to the metric at the top.

I left some minor comments throughout. A few overarching thoughts:

- I haven't really reviewed the county-level file yet. I imagine many of the city-level comments apply there as well. If you have the bandwidth and interest, and if it seems plausible, I'd love to see the city- and county-level code files consolidated into a single .qmd, with conditional logic treating the differences between the two. If that's not a priority, I'll plan to review the county-level code once the city-level code is essentially finalized.
- Querying the data from the API is a pain (can only imagine this has been a 10x headache for you). One issue I've run into is that the query works for some years but fails for others. This doesn't error out, which is fine, but the result is that a subsequent check throws an error because I've got data for, e.g., 2015:2022, but not for 2014. Can you write results to disk year-by-year, and then only query years for which there aren't already local data? This might be the top priority in my mind at this point.
- Generally, I left a decent number of comments about stylistic things, which you should feel free to take or ignore as you see fit. They're just my preference, and as far as I can tell with a subset of all the years' data, everything runs well at first blush, so they're definitely not critical changes.

wcurrangroome · 2025-01-03T14:42:17Z

06_neighborhoods/social-capital/social_associations_city.qmd

+```
+
+```{r}
+library(naniar)


This is loaded at the top, I think.

wcurrangroome · 2025-01-03T14:48:04Z

06_neighborhoods/social-capital/social_associations_county.qmd

-  select(-year) %>%
-  arrange(state, county) %>%
-  write_csv(here("06_neighborhoods/social-capital/final/social_associations_2021_county.csv"))
+  write_csv(here("06_neighborhoods/social-capital/data/social_associations_all_county.csv"))


county should precede place in the final file name (I think).

wcurrangroome · 2025-01-03T14:51:24Z

06_neighborhoods/social-capital/social_associations_city.qmd


 4.  collapse estimates to unique Places

 5.  check against official Census Place file & limit to population cutoff Places

 6.  use crosswalk population data to construct the ratio (Numerator/Denominator)

-7.  add data quality tag, final file cleaning and export to .csv file
+7.  add data quality tag, final file cleaning, visualize metric and export to .csv file

 ## Download social organization data


 We pull our data from `library(censusapi)`.

 **Note:** This will require a [Census API key](https://api.census.gov/data/key_signup.html).


Am I right in thinking the intended flow here is to manually open this file, insert the user's API key, run it, then delete the file and this code chunk in this file?

(If so): Alternatively, it might be cleaner to add a test for the presence of the API key in the `.Renviron` file. If it passes, continue; if it fails, return an error message directing the user to follow the process described here. This removes the need to delete a code chunk in this document.
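A minimal sketch of such a check, assuming the key is stored under the `CENSUS_KEY` environment variable (the name `censusapi` reads by default). The helper name `check_census_key` is hypothetical.

```r
# Stop early with instructions if the Census API key is not available.
# `check_census_key` is a hypothetical helper; `env_var` defaults to
# CENSUS_KEY, which is assumed to be the variable set in .Renviron.
check_census_key <- function(env_var = "CENSUS_KEY") {
  key <- Sys.getenv(env_var, unset = "")
  if (!nzchar(key)) {
    stop(
      "No Census API key found in the ", env_var, " environment variable. ",
      "Request a key at https://api.census.gov/data/key_signup.html, add `",
      env_var, "=<your key>` to your .Renviron file, and restart R.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

Calling `check_census_key()` at the top of the download chunk would halt rendering before any API request is made, so nothing in the document needs to be deleted after setup.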

wcurrangroome · 2025-01-03T19:39:31Z

06_neighborhoods/social-capital/social_associations_city.qmd

+  813930, 813910, 813920
+)
+
+years <- 2014


This is just from testing, right? If not, it appears to write over the years variable listed above.

wcurrangroome · 2025-01-03T19:39:58Z

06_neighborhoods/social-capital/social_associations_city.qmd

+```{r}
+#| label: get-social-organization-data
+
+fetch_cbp_data <- function(year, naics_codes_to_keep) {


Possible to add a check here to see if the data exists locally before querying the API?

Building on this: I haven't been able to get all the years' data to download successfully. On different runs, I've gotten errors on at least one, and often multiple, years' worth of data. The code executes fully, but the resulting dataset fails one of the checks below. As-is, this requires re-running the entire query, i.e., for all data years, which is time-intensive. Could you modify this to download each year's data into a single file? Then add a check, year by year, for whether the data has already been downloaded? And then, once all years have been downloaded, read them in and combine them into a single longitudinal dataset?
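One way the per-year caching could look, as a base R sketch under stated assumptions: `get_year_cached`, `combine_years`, `fetch_fun`, `cache_dir`, and the `cbp_<year>.csv` file naming are all hypothetical, and `fetch_fun` stands in for a one-argument wrapper around the existing download function.

```r
# Return one year's data, reading from a local csv when it already exists and
# calling `fetch_fun(year)` only when it does not.
# All names here are hypothetical and chosen for this sketch.
get_year_cached <- function(year, fetch_fun, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0("cbp_", year, ".csv"))

  if (!file.exists(path)) {
    result <- fetch_fun(year)
    # Write only after a successful fetch so failed years are retried later.
    write.csv(result, path, row.names = FALSE)
  }

  # Read everything as character so codes with leading zeros survive.
  read.csv(path, colClasses = "character")
}

# Fetch (or read) every year, then stack into one longitudinal data frame.
combine_years <- function(years, fetch_fun, cache_dir) {
  yearly <- lapply(years, get_year_cached,
                   fetch_fun = fetch_fun, cache_dir = cache_dir)
  do.call(rbind, yearly)
}
```

With this shape, something like `combine_years(years, function(y) fetch_cbp_data(y, naics_codes_to_keep), cache_dir)` would only hit the API for years that have no local file yet. Numeric columns would need converting after the read, since everything comes back as character.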

wcurrangroome · 2025-01-03T20:11:50Z

06_neighborhoods/social-capital/social_associations_city.qmd

+  ))
+
+zcta_10_place_10_xwalk <-
+  zcta_10_place_10_xwalk %>%
  mutate(portion_in = case_when(


Preference for a clearer variable name here, e.g., `zcta_fully_within_place_binary`, or something loosely along those lines.

wcurrangroome · 2025-01-03T20:16:17Z

06_neighborhoods/social-capital/social_associations_city.qmd

@@ -248,7 +335,7 @@ fall within the Place (`zips_in`)
 merged_sa_zip_city <-


Strong preference to not write over existing objects. Moderate preference to consolidate this all into a single pipe chain, with comments above lines/chunks of code as needed, to make it easier to track the code logic.

Also, this is an opportunity to rename this object to something more expressive. `sa` is not clear; would it be possible to just use `organizations` or `associations`?

wcurrangroome · 2025-01-03T20:21:32Z

06_neighborhoods/social-capital/social_associations_city.qmd

@@ -257,11 +344,11 @@ merged_sa_zip_city <-
  )
 ```

-**Check:** Are there exactly 2 missing values in the `new_est_zip` variable?
+**Check:** Are there exactly 9 missing values in the `new_est_zip` variable?


Why would that be the expected case? And should this be total_org?

wcurrangroome · 2025-01-03T21:59:38Z

06_neighborhoods/social-capital/social_associations_city.qmd

@@ -283,20 +370,42 @@ places_pop <-
  read_csv(here("geographic-crosswalks/data/place-populations.csv")) %>%
  rename(state_fips = state) %>%
  filter(year %in% years)
+


I don't follow this--why query 2014 data if it's just going to be all missing?

Sorry, had missed your earlier comment about this very issue, @malcalakovalski! Tagging @cdsolari for her input here. I think we should use the CBP population data and produce 2014 estimates. If so, perhaps we want to note at the top that we use a different population source for this year as compared to the others. (Though--why do we not have population data for 2014 if we're producing estimates across many metrics for 2014?)

wcurrangroome · 2025-01-03T22:00:03Z

06_neighborhoods/social-capital/social_associations_city.qmd

+## Run Final Tests
+
+```{r}
+source(here("functions/testing/evaluate_final_data.R"))


Can you push the final expectations form in your next commit?

malcalakovalski added 7 commits November 19, 2024 12:21

Relabel overview section as housekeeping

16d106e

"Relabel 'Overview' section as 'Housekeeping' in social_associations_county.qmd and remove unneeded spaces after punctuation"

Modify number of missing values since we removed CT counties

f5dc7ef

Write final data to metric data folder.

3a79771

Update social-capital places metric

2f7f134

- Back fill years and add 2022 - Add tests and update documentation - Visualize summary statistics - Pass unit tests

Update county metric

70982f6

Remove bug in testing function

991ec4e

as.numeric() does not work on lists, so reversing the order of operations gives the intended outcome

Add naniar to setup chunk

4335e1a

cdsolari requested a review from wcurrangroome December 18, 2024 15:46

wcurrangroome requested changes Jan 6, 2025
