Merge pull request #163 from fhdsl/update-factors-lab

[Factors] Update dataset to CES for the lab
fhdsl · Oct 8, 2024 · b173d6d
2 parents 30fac04 + 130faa7
commit b173d6d

Showing 3 changed files with 680 additions and 369 deletions.
diff --git a/modules/Factors/lab/Factors_Lab.Rmd b/modules/Factors/lab/Factors_Lab.Rmd
@@ -13,41 +13,44 @@ library(tidyverse)
 
 ### 1.0
 
-Load the Youth Tobacco Survey data and `select` "Sample_Size",  "Education", and "LocationAbbr". Name this data "yts". 
+Load the CalEnviroScreen dataset and use `select` to choose the `CaliforniaCounty`, `ImpWaterBodies`, and `ZIP` variables. Then subset this data using `filter` to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data "ces". 
+
+`ImpWaterBodies`: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.
 
 ```{r}
-yts <- 
-  read_csv("https://daseh.org/data/Youth_Tobacco_Survey_YTS_Data.csv") %>% 
-  select(Sample_Size, Education, LocationAbbr)
+ces <- 
+  read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>% 
+  select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
+  filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
 ```
 
 ### 1.1
 
-Create a boxplot showing the difference in "Sample_Size" between Middle School and High School "Education". **Hint**: Use `aes(x = Education, y = Sample_Size)` and `geom_boxplot()`.
+Create a boxplot showing the difference in impaired water body pollution (`ImpWaterBodies`) among Amador, Napa, San Francisco, and Ventura counties (`CaliforniaCounty`). **Hint**: Use `aes(x = CaliforniaCounty, y = ImpWaterBodies)` and `geom_boxplot()`.
 
 ```{r 1.1response}
 
 ```
 
 ### 1.2
 
-Use `count` to count up the number of observations of data for each "Education" group.
+Use `count` to count up the number of observations of data for each `CaliforniaCounty` group.
 
 ```{r 1.2response}
 
 ```
 
 ### 1.3
 
-Make "Education" a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder "Education". Reorder this variable so that "Middle School" comes before "High School". Assign the output the name "yts_fct".
+Make `CaliforniaCounty` a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder `CaliforniaCounty`. Reorder this variable so the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name "ces_fct".
 
 ```{r 1.3response}
 
 ```
 
 ### 1.4
 
-Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different ordering in the plot and `count` table.
+Repeat questions 1.1 and 1.2 using the "ces_fct" data. You should see different ordering in the plot and `count` table.
 
 ```{r 1.4response}
 
@@ -57,39 +60,38 @@ Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different o
 # Practice on Your Own!
 
 ### P.1
-
-Convert "LocationAbbr" (state) in "yts_fct" into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument.
+Subset `ces_fct` so that it only includes data from Ventura County. Then convert `ZIP` (zip code) into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument. Assign the output the name "ces_Ventura".
 
 ```{r P.1response}
 
 ```
 
 ### P.2
 
-We want to create a new column that contains the group-level median sample size. 
+We want to create a new column that contains the group-level median values for `ImpWaterBodies`. 
 
-- Using the "yts_fct" data, `group_by` "LocationAbbr". 
-- Then, use `mutate` to create a new column "med_sample_size" that is the median "Sample_Size". 
-- **Hint**: Since you have already done `group_by`, a median "Sample_Size" will automatically be created for each unique level in "LocationAbbr". Use the `median` function with `na.rm = TRUE`.
+- Using the "ces_Ventura" data, group the data by `ZIP` using `group_by`.
+- Then, use `mutate` to create a new column `med_ImpWaterBodies` that is the median of `ImpWaterBodies`. 
+- **Hint**: Since you have already done `group_by`, a median `ImpWaterBodies` will automatically be created for each unique level in `ZIP`. Use the `median` function with `na.rm = TRUE`.
 
 ```{r P.2response}
 
 ```
 
 ### P.3
 
-We want to plot the "LocationAbbr" (state) by the "med_sample_size" column we created above. Using the `forcats` package, create a plot that:
+We want to plot the `med_ImpWaterBodies` column we created above in the `ces_Ventura` data, separated by `ZIP`. Using the `forcats` package, create a plot that:
 
-- Has "LocationAbbr" on the x-axis
-- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by "med_sample_size"
-- Has "Sample_Size" on the y-axis
+- Has `ZIP` on the x-axis
+- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by `med_ImpWaterBodies`
+- Has `med_ImpWaterBodies` on the y-axis
 - Is a boxplot (`geom_boxplot`)
-- Has the x axis label of `State`
+- Has the x axis label of "Zipcode"
 (Don't worry if you get a warning about not being able to plot `NA` values.)
 
 Save your plot using `ggsave()` with a width of 10 and height of 3.
 
-Which state has the largest median sample size?
+Which zipcode has the largest median measure of water pollution?
 
 ```{r P.3response}
 

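A note on the `filter` step in 1.0: when matching a column against several values, `==` and `%in%` behave differently. The sketch below uses a small made-up data frame (the rows and values are hypothetical, not taken from the CalEnviroScreen file) and base R subsetting in place of `filter` so it runs without extra packages. With `==`, R recycles the vector of names and compares position by position, so matching rows can be dropped silently. `%in%` tests each row for membership.

```r
# Hypothetical toy data standing in for two CalEnviroScreen columns
toy <- data.frame(
  CaliforniaCounty = c("Napa", "Amador", "Ventura", "San Francisco",
                       "Amador", "Napa", "Kern", "Ventura"),
  ImpWaterBodies = c(1, 2, 3, 4, 5, 6, 7, 8)
)
keep <- c("Amador", "Napa", "Ventura", "San Francisco")

# `==` recycles `keep` down the column: row 1 is compared with "Amador",
# row 2 with "Napa", and so on, so only rows that line up by position match
by_equals <- toy[toy$CaliforniaCounty == keep, ]

# `%in%` asks, for every row, whether the county appears anywhere in `keep`
by_in <- toy[toy$CaliforniaCounty %in% keep, ]

nrow(by_equals) # 4 rows survive, although 7 rows belong to the four counties
nrow(by_in)     # 7 rows, everything except "Kern"
```

The same logic applies inside `dplyr::filter()`, since the comparison is evaluated by base R before `filter` sees the result.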
diff --git a/modules/Factors/lab/Factors_Lab_Key.Rmd b/modules/Factors/lab/Factors_Lab_Key.Rmd
@@ -13,114 +13,118 @@ library(tidyverse)
 
 ### 1.0
 
-Load the Youth Tobacco Survey data and `select` "Sample_Size",  "Education", and "LocationAbbr". Name this data "yts". 
+Load the CalEnviroScreen dataset and use `select` to choose the `CaliforniaCounty`, `ImpWaterBodies`, and `ZIP` variables. Then subset this data using `filter` to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data "ces". 
+
+`ImpWaterBodies`: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.
 
 ```{r}
-yts <- 
-  read_csv("https://daseh.org/data/Youth_Tobacco_Survey_YTS_Data.csv") %>% 
-  select(Sample_Size, Education, LocationAbbr)
+ces <- 
+  read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>% 
+  select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
+  filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
 ```
 
 ### 1.1
 
-Create a boxplot showing the difference in "Sample_Size" between Middle School and High School "Education". **Hint**: Use `aes(x = Education, y = Sample_Size)` and `geom_boxplot()`.
+Create a boxplot showing the difference in impaired water body pollution (`ImpWaterBodies`) among Amador, Napa, San Francisco, and Ventura counties (`CaliforniaCounty`). **Hint**: Use `aes(x = CaliforniaCounty, y = ImpWaterBodies)` and `geom_boxplot()`.
 
 ```{r 1.1response}
-yts %>%
-  ggplot(mapping = aes(x = Education, y = Sample_Size)) +
+ces %>%
+  ggplot(mapping = aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
   geom_boxplot()
 ```
 
 ### 1.2
 
-Use `count` to count up the number of observations of data for each "Education" group.
+Use `count` to count up the number of observations of data for each `CaliforniaCounty` group.
 
 ```{r 1.2response}
-yts %>%
-  count(Education)
+ces %>%
+  count(CaliforniaCounty)
 ```
 
 ### 1.3
 
-Make "Education" a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder "Education". Reorder this variable so that "Middle School" comes before "High School". Assign the output the name "yts_fct".
+Make `CaliforniaCounty` a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder `CaliforniaCounty`. Reorder this variable so the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name "ces_fct".
 
 ```{r 1.3response}
-yts_fct <-
-  yts %>% mutate(Education = factor(Education,
-    levels = c("Middle School", "High School")
+ces_fct <-
+  ces %>% mutate(CaliforniaCounty = factor(CaliforniaCounty,
+    levels = c("San Francisco", "Ventura", "Napa", "Amador")
   ))
 ```
 
 ### 1.4
 
-Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different ordering in the plot and `count` table.
+Repeat questions 1.1 and 1.2 using the "ces_fct" data. You should see different ordering in the plot and `count` table.
 
 ```{r 1.4response}
-yts_fct %>%
-  ggplot(mapping = aes(x = Education, y = Sample_Size)) +
+ces_fct %>%
+  ggplot(mapping = aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
   geom_boxplot()
 
-yts_fct %>%
-  count(Education)
+ces_fct %>%
+  count(CaliforniaCounty)
 ```
 
 
 # Practice on Your Own!
 
 ### P.1
-
-Convert "LocationAbbr" (state) in "yts_fct" into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument.
+Subset `ces_fct` so that it only includes data from Ventura County. Then convert `ZIP` (zip code) into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument. Assign the output the name "ces_Ventura".
 
 ```{r P.1response}
-yts_fct <- yts_fct %>% mutate(LocationAbbr = factor(LocationAbbr))
+ces_Ventura <- ces_fct %>% 
+  filter(CaliforniaCounty == "Ventura") %>%
+  mutate(ZIP = factor(ZIP))
 ```
 
 ### P.2
 
-We want to create a new column that contains the group-level median sample size. 
+We want to create a new column that contains the group-level median values for `ImpWaterBodies`. 
 
-- Using the "yts_fct" data, `group_by` "LocationAbbr". 
-- Then, use `mutate` to create a new column "med_sample_size" that is the median "Sample_Size". 
-- **Hint**: Since you have already done `group_by`, a median "Sample_Size" will automatically be created for each unique level in "LocationAbbr". Use the `median` function with `na.rm = TRUE`.
+- Using the "ces_Ventura" data, group the data by `ZIP` using `group_by`.
+- Then, use `mutate` to create a new column `med_ImpWaterBodies` that is the median of `ImpWaterBodies`. 
+- **Hint**: Since you have already done `group_by`, a median `ImpWaterBodies` will automatically be created for each unique level in `ZIP`. Use the `median` function with `na.rm = TRUE`.
 
 ```{r P.2response}
-yts_fct <- yts_fct %>%
-  group_by(LocationAbbr) %>%
-  mutate(med_sample_size = median(Sample_Size, na.rm = TRUE))
+ces_Ventura <- ces_Ventura %>%
+  group_by(ZIP) %>%
+  mutate(med_ImpWaterBodies = median(ImpWaterBodies, na.rm = TRUE))
 ```
 
 ### P.3
 
-We want to plot the "LocationAbbr" (state) by the "med_sample_size" column we created above. Using the `forcats` package, create a plot that:
+We want to plot the `med_ImpWaterBodies` column we created above in the `ces_Ventura` data, separated by `ZIP`. Using the `forcats` package, create a plot that:
 
-- Has "LocationAbbr" on the x-axis
-- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by "med_sample_size"
-- Has "Sample_Size" on the y-axis
+- Has `ZIP` on the x-axis
+- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by `med_ImpWaterBodies`
+- Has `med_ImpWaterBodies` on the y-axis
 - Is a boxplot (`geom_boxplot`)
-- Has the x axis label of `State`
+- Has the x axis label of "Zipcode"
 (Don't worry if you get a warning about not being able to plot `NA` values.)
 
 Save your plot using `ggsave()` with a width of 10 and height of 3.
 
-Which state has the largest median sample size?
+Which zipcode has the largest median measure of water pollution?
 
 ```{r P.3response}
 library(forcats)
 
-yts_fct_plot <- yts_fct %>%
+ces_Ventura_plot <- ces_Ventura %>%
   drop_na() %>%
   ggplot(mapping = aes(
     x = fct_reorder(
-      LocationAbbr, med_sample_size
+      ZIP, med_ImpWaterBodies
     ),
-    y = Sample_Size
+    y = med_ImpWaterBodies
   )) +
   geom_boxplot() +
-  labs(x = "State")
+  labs(x = "Zipcode")
 
 ggsave(
-  filename = "yts_fct.png", # will save in working directory
-  plot = yts_fct_plot,
+  filename = "ces_Ventura.png", # will save in working directory
+  plot = ces_Ventura_plot,
   width = 10, height = 3
 )
 ```
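
A note on level ordering in 1.3, P.1, and P.3: the sketch below shows, on small made-up vectors (the county sequence, zip codes, and scores are illustrative assumptions, not values from the dataset), how the three approaches in the lab set the order of factor levels. It assumes the `forcats` package is installed.

```r
library(forcats)

# Hypothetical county values
county <- c("Napa", "Ventura", "Amador", "Napa", "San Francisco")

# Without `levels`, factor() sorts the levels alphabetically (as in P.1)
default_levels <- levels(factor(county))

# With `levels`, the order is exactly the one supplied (as in 1.3)
county_fct <- factor(county,
  levels = c("San Francisco", "Ventura", "Napa", "Amador")
)
manual_levels <- levels(county_fct)

# fct_reorder() orders the levels of a factor by a summary of a second
# variable; the default summary function is the median (as in P.3)
zip <- factor(c("93001", "93001", "93003", "93003", "93004"))
score <- c(9, 7, 2, 4, 5) # made-up values; group medians are 8, 3, and 5
zip_ordered <- fct_reorder(zip, score)

default_levels      # "Amador" "Napa" "San Francisco" "Ventura"
manual_levels       # "San Francisco" "Ventura" "Napa" "Amador"
levels(zip_ordered) # "93003" "93004" "93001"
```

Plots and `count` tables follow the level order, which is why the output in 1.4 differs from 1.1 and 1.2 even though the underlying rows are unchanged.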