Merge pull request #213 from fhdsl/manip-updates

[Manipulating Data] Last-minute updates.
fhdsl · Oct 7, 2024 · e5b4870
2 parents 27c8999 + a9b33db
commit e5b4870

Showing 2 changed files with 129 additions and 119 deletions.
diff --git a/modules/Manipulating_Data_in_R/Manipulating_Data_in_R.Rmd b/modules/Manipulating_Data_in_R/Manipulating_Data_in_R.Rmd
@@ -125,9 +125,9 @@ Long: **Easier for R to make plots & do analysis**
 ex_long
 ```
 
-## Pivoting using `tidyr` package
+## Pivoting using the `tidyr` package (part of `tidyverse`)
 
-`tidyr` allows you to "tidy" your data.  We will be talking about:
+We will be talking about:
 
 - `pivot_longer` - make multiple columns into variables (wide to long)
 - `pivot_wider` - make a variable into multiple columns (long to wide)
@@ -154,10 +154,20 @@ You might see old functions `gather` and `spread` when googling. These are older
 
 ```{r}
 ex_wide
-ex_long <- ex_wide %>% pivot_longer(cols = !ends_with("State"))
+ex_long <- ex_wide %>% pivot_longer(cols = ends_with("rate"))
 ex_long
 ```
 
+## GUT CHECK!
+
+What does `pivot_longer()` do?
+
+A. Summarize data
+
+B. Import data
+
+C. Reshape data
+
 ## Reshaping wide to long: Better column names {.codesmall}
 
 `pivot_longer()` - puts column data into rows (`tidyr` package)
@@ -169,29 +179,29 @@ ex_long
 <div class = "codeexample">
 ```{r, eval=FALSE}
 {long_data} <- {wide_data} %>% pivot_longer(cols = {columns to pivot},
-                                        names_to = {name for old columns},
-                                        values_to = {name for cell values})
+                                            names_to = {name for old columns},
+                                            values_to = {name for cell values})
 ```
 </div>
 
-## Reshaping data from **wide to long**
+## Reshaping wide to long: Better column names {.codesmall}
+
+Newly created column names ("Month" and "Rate") are enclosed in quotation marks. Naming them ourselves lets us be more specific than the defaults, "name" and "value".
 
 ```{r}
-ex_wide
-ex_long <- ex_wide %>% pivot_longer(cols = !ends_with("State"),
-                                        names_to = "Month",
-                                        values_to = "Rate")
+ex_long <- ex_wide %>% pivot_longer(cols = ends_with("rate"),
+                                    names_to = "Month",
+                                    values_to = "Rate")
 ex_long
 ```
 
-Newly created column names are enclosed in quotation marks.
-
-## Data used: Nitrate exposure
+## Data used: Nitrate exposure{.codesmall}
 
 Let's look at some data on levels of nitrate in water from Washington. This dataset reports the number of people in Washington exposed to excess levels of nitrate in their water between 1999 and 2020.
 
 ```{r, message = FALSE}
-wide_nitrate <- read_csv(file = "https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv")
+wide_nitrate <- 
+  read_csv(file = "https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv")
 head(wide_nitrate)
 ```
 
@@ -219,9 +229,7 @@ wide_nitrate
 
 ```{r}
 long_nitrate <- wide_nitrate %>%
-  pivot_longer(!c(year, quarter, pop_on_sampled_PWS),
-               names_to = "conc_cat",
-               values_to = "conc_count")
+  pivot_longer(!c(year, quarter, pop_on_sampled_PWS))
 long_nitrate
 ```
 
@@ -239,7 +247,7 @@ Let's make the `conc_count` into a proportion.
 
 ```{r}
 long_nitrate <- long_nitrate %>%
-  mutate(conc_prop = conc_count / pop_on_sampled_PWS)
+  mutate(conc_prop = value / pop_on_sampled_PWS)
 long_nitrate
 ```
 
@@ -249,15 +257,14 @@ Now our data is more tidy, and we can take the averages easily!
 
 ```{r}
 long_nitrate %>% 
-  group_by(conc_cat) %>% 
+  group_by(name) %>% 
   summarize("avg_prop_exposedpop" = mean(conc_prop))
 ```
 
 ## Reshaping data from **wide to long**
 
 There are many ways to **select** the columns we want. Check out https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html to look at more column selection options.
 
-
 # `pivot_wider`...
 
 ## Reshaping data from **long to wide**
@@ -281,24 +288,42 @@ We can use `pivot_wider` to convert long data to wide format. Let's try it with
 
 ```{r}
 ex_long
+```
+
+## Reshaping data from **long to wide**
+
+We can use `pivot_wider` to convert long data to wide format. Let's try it with the vaccine data from earlier.
+
+```{r}
 ex_wide2 <- ex_long %>% pivot_wider(names_from = "Month", 
                                        values_from = "Rate") 
 ex_wide2
 ```
 
 ## Reshaping nitrate exposure data{.codesmall}
 
-Let's go back to the nitrate exposure dataset. What if we wanted to make a wide version of the data that displayed the number of people at each level of nitrate exposure for each quarter?
+Let's go back to the nitrate exposure dataset. What if we wanted to make a wide version of the data that displayed the proportion of people at each level of nitrate exposure, with each quarter as a column?
 
 ```{r}
 long_nitrate
 ```
 
 ## Reshaping nitrate exposure data
 
+Drop some columns we don't need.
+
+```{r}
+long_nitrate <- long_nitrate %>%
+  select(!c(pop_on_sampled_PWS, value))
+long_nitrate
+```
+
+## Reshaping nitrate exposure data
+
+Pivot the data!
+
 ```{r}
 wide_nitrate <- long_nitrate %>%
-  select(!c(pop_on_sampled_PWS, conc_count)) %>%
   pivot_wider(names_from = "quarter", values_from = "conc_prop")
 wide_nitrate
 ```
@@ -337,7 +362,6 @@ knitr::include_graphics("images/joins.png")
 * `anti_join(x, y)` - all rows from `x` not in `y` keeping just columns from `x`.
 
 ## Merging: Simple Data
-Let's load in some datasets about vaccination rates by state. These data are saved in two different files.
 
 ```{r message=FALSE}
 data_As <- read_csv(
@@ -447,7 +471,7 @@ fj
 
 <IMG style="position:absolute;bottom:10.5%;left:85%;width:120px;"SRC="images/full.png">
 
-## Watch out for "`includes duplicates`"
+## "`includes duplicates`"
 
 
 ```{r message=FALSE}
@@ -462,15 +486,15 @@ data_As
 data_cold
 ```
 
-## Watch out for "`includes duplicates`"
+## "`includes duplicates`"
 
 ```{r}
 lj <- left_join(data_As, data_cold)
 ```
 
 <IMG style="position:absolute;bottom:10.5%;left:85%;width:120px;"SRC="images/left.png">
 
-## Watch out for "`includes duplicates`"
+## "`includes duplicates`"
 
 Data including the joining column ("State") has been duplicated.
 
@@ -484,7 +508,7 @@ Note that "Alaska willow ptarmigan" appears twice.
 
 <IMG style="position:absolute;bottom:10.5%;left:85%;width:120px;"SRC="images/left.png">
 
-## Watch out for "`includes duplicates`"
+## "`includes duplicates`"
 
 https://github.com/gadenbuie/tidyexplain/blob/main/images/left-join-extra.gif
 
@@ -538,6 +562,16 @@ anti_join(data_cold, data_As, by = "State") # order switched
 
 <IMG style="position:absolute;bottom:10.5%;left:85%;width:120px;"SRC="images/anti.png">
 
+## GUT CHECK!
+
+Why use `join` functions?
+
+A. Combine different data sources
+
+B. Connect Rmd to other files
+
+C. Using one data source is too easy and we want our analysis ~ fancy ~
+
 ## Summary
 
 * Merging/joining data sets together - assumes all column names that overlap
@@ -555,6 +589,12 @@ anti_join(data_cold, data_As, by = "State") # order switched
 
 💻 [Lab](https://daseh.org/modules/Manipulating_Data_in_R/lab/Manipulating_Data_in_R_Lab.Rmd)
 
+📃 [Day 6 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-6.pdf)
+
+📃 [Posit's `tidyr` Cheatsheet](https://rstudio.github.io/cheatsheets/tidyr.pdf)
+
+📃 [Posit's `dplyr` Cheatsheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf)
+
 ```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
 knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
 ```
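
The `pivot_longer()` changes in this diff switch between the default output columns (`name` and `value`) and explicit ones set with `names_to` / `values_to`. A minimal sketch of that difference follows, assuming `tidyr` and `dplyr` are installed. The small `ex_wide` table here is a hypothetical stand-in with made-up values, not the dataset the slides load.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the deck's `ex_wide` table (values are made up).
ex_wide <- tibble(
  State = c("Alabama", "Alaska"),
  June_vacc_rate = c(0.516, 0.627),
  May_vacc_rate = c(0.514, 0.626)
)

# Defaults: old column headers go to `name`, cell contents go to `value`.
ex_long_default <- ex_wide %>%
  pivot_longer(cols = ends_with("rate"))

# Explicit names: same reshaping, but with more descriptive columns.
ex_long_named <- ex_wide %>%
  pivot_longer(cols = ends_with("rate"),
               names_to = "Month",
               values_to = "Rate")

names(ex_long_default)
names(ex_long_named)

# pivot_wider() reverses the reshaping, using the named columns.
ex_wide2 <- ex_long_named %>%
  pivot_wider(names_from = "Month", values_from = "Rate")
ex_wide2
```

Any later step has to refer to whichever column names were produced, which is why the nitrate chunks in this diff use `value` and `name` once `names_to` and `values_to` are dropped.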