Merge pull request #163 from fhdsl/update-factors-lab
[Factors] Update dataset to CES for the lab
carriewright11 authored Oct 8, 2024
2 parents 30fac04 + 130faa7 commit b173d6d
Showing 3 changed files with 680 additions and 369 deletions.
42 changes: 22 additions & 20 deletions modules/Factors/lab/Factors_Lab.Rmd
@@ -13,41 +13,44 @@ library(tidyverse)

### 1.0

Load the Youth Tobacco Survey data and `select` "Sample_Size", "Education", and "LocationAbbr". Name this data "yts".
Load the CalEnviroScreen dataset and use `select` to choose the `CaliforniaCounty`, `ImpWaterBodies`, and `ZIP` variables. Then subset this data using `filter` to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data "ces".

`ImpWaterBodies`: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.

```{r}
yts <-
read_csv("https://daseh.org/data/Youth_Tobacco_Survey_YTS_Data.csv") %>%
select(Sample_Size, Education, LocationAbbr)
ces <-
read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>%
select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
```

### 1.1

Create a boxplot showing the difference in "Sample_Size" between Middle School and High School "Education". **Hint**: Use `aes(x = Education, y = Sample_Size)` and `geom_boxplot()`.
Create a boxplot showing the difference in impaired water body pollutants (`ImpWaterBodies`) among Amador, Napa, San Francisco, and Ventura counties (`CaliforniaCounty`). **Hint**: Use `aes(x = CaliforniaCounty, y = ImpWaterBodies)` and `geom_boxplot()`.

```{r 1.1response}
```

### 1.2

Use `count` to count up the number of observations of data for each "Education" group.
Use `count` to count up the number of observations of data for each `CaliforniaCounty` group.

```{r 1.2response}
```

### 1.3

Make "Education" a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder "Education". Reorder this variable so that "Middle School" comes before "High School". Assign the output the name "yts_fct".
Make `CaliforniaCounty` a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder `CaliforniaCounty`. Reorder this variable so the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name "ces_fct".

```{r 1.3response}
```

### 1.4

Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different ordering in the plot and `count` table.
Repeat question 1.1 and 1.2 using the "ces_fct" data. You should see different ordering in the plot and `count` table.

```{r 1.4response}
@@ -57,39 +60,38 @@ Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different o
# Practice on Your Own!

### P.1

Convert "LocationAbbr" (state) in "yts_fct" into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument.
Subset `ces_fct` so that it only includes data from Ventura County. Then convert `ZIP` (zip code) into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument. Assign the output the name "ces_Ventura".

```{r P.1response}
```

### P.2

We want to create a new column that contains the group-level median sample size.
We want to create a new column that contains the group-level median values for `ImpWaterBodies`.

- Using the "yts_fct" data, `group_by` "LocationAbbr".
- Then, use `mutate` to create a new column "med_sample_size" that is the median "Sample_Size".
- **Hint**: Since you have already done `group_by`, a median "Sample_Size" will automatically be created for each unique level in "LocationAbbr". Use the `median` function with `na.rm = TRUE`.
- Using the "ces_Ventura" data, group the data by `ZIP` using `group_by`.
- Then, use `mutate` to create a new column `med_ImpWaterBodies` that is the median of `ImpWaterBodies`.
- **Hint**: Since you have already done `group_by`, a median `ImpWaterBodies` will automatically be created for each unique level in `ZIP`. Use the `median` function with `na.rm = TRUE`.

```{r P.2response}
```

### P.3

We want to plot the "LocationAbbr" (state) by the "med_sample_size" column we created above. Using the `forcats` package, create a plot that:
We want to make a plot of the `med_ImpWaterBodies` column we created above in the `ces_Ventura` data, separated by `ZIP`. Using the `forcats` package, create a plot that:

- Has "LocationAbbr" on the x-axis
- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by "med_sample_size"
- Has "Sample_Size" on the y-axis
- Has `ZIP` on the x-axis
- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by `med_ImpWaterBodies`
- Has `med_ImpWaterBodies` on the y-axis
- Is a boxplot (`geom_boxplot`)
- Has the x axis label of `State`
- Has the x axis label of "Zipcode"
(Don't worry if you get a warning about not being able to plot `NA` values.)

Save your plot using `ggsave()` with a width of 10 and height of 3.

Which state has the largest median sample size?
Which zipcode has the largest median measure of water pollution?

```{r P.3response}
86 changes: 45 additions & 41 deletions modules/Factors/lab/Factors_Lab_Key.Rmd
@@ -13,114 +13,118 @@ library(tidyverse)

### 1.0

Load the Youth Tobacco Survey data and `select` "Sample_Size", "Education", and "LocationAbbr". Name this data "yts".
Load the CalEnviroScreen dataset and use `select` to choose the `CaliforniaCounty`, `ImpWaterBodies`, and `ZIP` variables. Then subset this data using `filter` to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data "ces".

`ImpWaterBodies`: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.

```{r}
yts <-
read_csv("https://daseh.org/data/Youth_Tobacco_Survey_YTS_Data.csv") %>%
select(Sample_Size, Education, LocationAbbr)
ces <-
read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>%
select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
```
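
A note on filtering by several values: `==` compares element by element and recycles the shorter vector, while `%in%` tests membership in a set. The sketch below illustrates the difference on a made-up vector; `county` and `targets` are illustrative names and are not part of the lab data.

```r
# Hypothetical toy vectors, not taken from the CalEnviroScreen data
county <- c("Napa", "Amador", "Ventura", "Napa")
targets <- c("Amador", "Napa")

# `==` recycles `targets` to c("Amador", "Napa", "Amador", "Napa")
# and compares position by position, so most matches are missed
by_equals <- county == targets

# `%in%` asks, for each element of `county`, whether it appears in `targets`
by_in <- county %in% targets

print(by_equals)
print(by_in)
```

Inside `filter()`, a recycled `==` comparison can silently drop rows that belong to the requested counties, which is why `%in%` is usually the safer choice for multi-value subsets.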

### 1.1

Create a boxplot showing the difference in "Sample_Size" between Middle School and High School "Education". **Hint**: Use `aes(x = Education, y = Sample_Size)` and `geom_boxplot()`.
Create a boxplot showing the difference in impaired water body pollutants (`ImpWaterBodies`) among Amador, Napa, San Francisco, and Ventura counties (`CaliforniaCounty`). **Hint**: Use `aes(x = CaliforniaCounty, y = ImpWaterBodies)` and `geom_boxplot()`.

```{r 1.1response}
yts %>%
ggplot(mapping = aes(x = Education, y = Sample_Size)) +
ces %>%
ggplot(mapping = aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
geom_boxplot()
```

### 1.2

Use `count` to count up the number of observations of data for each "Education" group.
Use `count` to count up the number of observations of data for each `CaliforniaCounty` group.

```{r 1.2response}
yts %>%
count(Education)
ces %>%
count(CaliforniaCounty)
```

### 1.3

Make "Education" a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder "Education". Reorder this variable so that "Middle School" comes before "High School". Assign the output the name "yts_fct".
Make `CaliforniaCounty` a factor using the `mutate` and `factor` functions. Use the `levels` argument inside `factor` to reorder `CaliforniaCounty`. Reorder this variable so the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name "ces_fct".

```{r 1.3response}
yts_fct <-
yts %>% mutate(Education = factor(Education,
levels = c("Middle School", "High School")
ces_fct <-
ces %>% mutate(CaliforniaCounty = factor(CaliforniaCounty,
levels = c("San Francisco", "Ventura", "Napa", "Amador")
))
```
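
To see what the `levels` argument changes, the following minimal sketch compares the default (alphabetical) level order with a custom one. The `county` vector is a small made-up example, not the lab data.

```r
# Hypothetical toy vector of county names
county <- c("Napa", "Amador", "Ventura", "San Francisco", "Napa")

# Without `levels`, factor() sorts the unique values alphabetically
default_fct <- factor(county)

# With `levels`, the order is exactly the one supplied
custom_fct <- factor(county,
  levels = c("San Francisco", "Ventura", "Napa", "Amador")
)

print(levels(default_fct))
print(levels(custom_fct))

# Summaries such as table() follow the level order, not the data order
print(table(custom_fct))
```

The same level order drives the x-axis of `geom_boxplot()` and the row order of `count()`, which is why the output of 1.4 differs from 1.1 and 1.2.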

### 1.4

Repeat question 1.1 and 1.2 using the "yts_fct" data. You should see different ordering in the plot and `count` table.
Repeat question 1.1 and 1.2 using the "ces_fct" data. You should see different ordering in the plot and `count` table.

```{r 1.4response}
yts_fct %>%
ggplot(mapping = aes(x = Education, y = Sample_Size)) +
ces_fct %>%
ggplot(mapping = aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
geom_boxplot()
yts_fct %>%
count(Education)
ces_fct %>%
count(CaliforniaCounty)
```


# Practice on Your Own!

### P.1

Convert "LocationAbbr" (state) in "yts_fct" into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument.
Subset `ces_fct` so that it only includes data from Ventura County. Then convert `ZIP` (zip code) into a factor using the `mutate` and `factor` functions. Do not add a `levels =` argument. Assign the output the name "ces_Ventura".

```{r P.1response}
yts_fct <- yts_fct %>% mutate(LocationAbbr = factor(LocationAbbr))
ces_Ventura <- ces_fct %>%
filter(CaliforniaCounty == "Ventura") %>%
mutate(ZIP = factor(ZIP))
```

### P.2

We want to create a new column that contains the group-level median sample size.
We want to create a new column that contains the group-level median values for `ImpWaterBodies`.

- Using the "yts_fct" data, `group_by` "LocationAbbr".
- Then, use `mutate` to create a new column "med_sample_size" that is the median "Sample_Size".
- **Hint**: Since you have already done `group_by`, a median "Sample_Size" will automatically be created for each unique level in "LocationAbbr". Use the `median` function with `na.rm = TRUE`.
- Using the "ces_Ventura" data, group the data by `ZIP` using `group_by`.
- Then, use `mutate` to create a new column `med_ImpWaterBodies` that is the median of `ImpWaterBodies`.
- **Hint**: Since you have already done `group_by`, a median `ImpWaterBodies` will automatically be created for each unique level in `ZIP`. Use the `median` function with `na.rm = TRUE`.

```{r P.2response}
yts_fct <- yts_fct %>%
group_by(LocationAbbr) %>%
mutate(med_sample_size = median(Sample_Size, na.rm = TRUE))
ces_Ventura <- ces_Ventura %>%
group_by(ZIP) %>%
mutate(med_ImpWaterBodies = median(ImpWaterBodies, na.rm = TRUE))
```
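
The grouped `mutate` pattern can be checked on a small example. This is a sketch on made-up values (the `toy` tibble and its ZIP codes are hypothetical), showing that every row in a group receives that group's median and that `na.rm = TRUE` ignores missing values.

```r
library(dplyr)

# Hypothetical toy data: two ZIP groups, one missing measurement
toy <- tibble(
  ZIP = factor(c("93001", "93001", "93003", "93003", "93003")),
  ImpWaterBodies = c(2, 4, 1, NA, 7)
)

toy_med <- toy %>%
  group_by(ZIP) %>%
  # median() is evaluated once per ZIP group and repeated on each row
  mutate(med_ImpWaterBodies = median(ImpWaterBodies, na.rm = TRUE)) %>%
  ungroup()

print(toy_med)
```

Unlike `summarize`, `mutate` keeps all original rows, so the median is repeated within each group rather than collapsed to one row per ZIP code.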

### P.3

We want to plot the "LocationAbbr" (state) by the "med_sample_size" column we created above. Using the `forcats` package, create a plot that:
We want to make a plot of the `med_ImpWaterBodies` column we created above in the `ces_Ventura` data, separated by `ZIP`. Using the `forcats` package, create a plot that:

- Has "LocationAbbr" on the x-axis
- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by "med_sample_size"
- Has "Sample_Size" on the y-axis
- Has `ZIP` on the x-axis
- Uses the `mapping` argument and the `fct_reorder` function to order the x-axis by `med_ImpWaterBodies`
- Has `med_ImpWaterBodies` on the y-axis
- Is a boxplot (`geom_boxplot`)
- Has the x axis label of `State`
- Has the x axis label of "Zipcode"
(Don't worry if you get a warning about not being able to plot `NA` values.)

Save your plot using `ggsave()` with a width of 10 and height of 3.

Which state has the largest median sample size?
Which zipcode has the largest median measure of water pollution?

```{r P.3response}
library(forcats)
yts_fct_plot <- yts_fct %>%
ces_Ventura_plot <- ces_Ventura %>%
drop_na() %>%
ggplot(mapping = aes(
x = fct_reorder(
LocationAbbr, med_sample_size
ZIP, med_ImpWaterBodies
),
y = Sample_Size
y = med_ImpWaterBodies
)) +
geom_boxplot() +
labs(x = "State")
labs(x = "Zipcode")
ggsave(
filename = "yts_fct.png", # will save in working directory
plot = yts_fct_plot,
filename = "ces_Ventura.png", # will save in working directory
plot = ces_Ventura_plot,
width = 10, height = 3
)
```
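
As a rough illustration of what `fct_reorder` does to the x-axis, the sketch below reorders a toy factor by a numeric column. The ZIP codes and values are made up, and `zip` and `med` are illustrative names.

```r
library(forcats)

# Hypothetical toy data: three ZIP codes, each with a repeated median value
zip <- factor(c("93001", "93003", "93004", "93001", "93003", "93004"))
med <- c(3, 1, 2, 3, 1, 2)

# Default level order is alphabetical
print(levels(zip))

# fct_reorder() sorts levels by a summary of `med` (the median by default),
# from smallest to largest
zip_reordered <- fct_reorder(zip, med)
print(levels(zip_reordered))
```

With this ordering, the rightmost category on the plot is the one with the largest median, which makes the final question easier to answer by eye.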