Merge pull request #213 from fhdsl/manip-updates
[Manipulating Data] Last minute updates..
avahoffman authored Oct 7, 2024
2 parents 27c8999 + a9b33db commit e5b4870
Showing 2 changed files with 129 additions and 119 deletions.
94 changes: 67 additions & 27 deletions modules/Manipulating_Data_in_R/Manipulating_Data_in_R.Rmd
@@ -125,9 +125,9 @@ Long: **Easier for R to make plots & do analysis**
ex_long
```

## Pivoting using `tidyr` package
## Pivoting using the `tidyr` package (part of `tidyverse`)

`tidyr` allows you to "tidy" your data. We will be talking about:
We will be talking about:

- `pivot_longer` - make multiple columns into variables, (wide to long)
- `pivot_wider` - make a variable into multiple columns, (long to wide)
@@ -154,10 +154,20 @@ You might see old functions `gather` and `spread` when googling. These are older

```{r}
ex_wide
ex_long <- ex_wide %>% pivot_longer(cols = !ends_with("State"))
ex_long <- ex_wide %>% pivot_longer(cols = ends_with("rate"))
ex_long
```
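
As a minimal sketch of the selection change in this hunk, the toy tibble below (hypothetical values, not the course's `ex_wide`) suggests that `ends_with("rate")` and `!ends_with("State")` pick the same columns when `State` is the only column that is not a rate. The name `toy_wide` and its contents are assumptions made for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for ex_wide; values are made up for illustration
toy_wide <- tibble(
  State = c("Alabama", "Alaska"),
  June_vacc_rate = c(0.516, 0.627),
  May_vacc_rate = c(0.514, 0.626)
)

# Select the columns to pivot by what they are ...
by_rate <- toy_wide %>% pivot_longer(cols = ends_with("rate"))

# ... or by what they are not
by_not_state <- toy_wide %>% pivot_longer(cols = !ends_with("State"))

# Both results have 4 rows and the columns State, name, value
by_rate
by_not_state
```

Selecting by what the columns are tends to be more robust if further identifier columns are added later, which may be why the positive form is preferred here.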

## GUT CHECK!

What does `pivot_longer()` do?

A. Summarize data

B. Import data

C. Reshape data

## Reshaping wide to long: Better column names {.codesmall}

`pivot_longer()` - puts column data into rows (`tidyr` package)
@@ -169,29 +179,29 @@ ex_long
<div class = "codeexample">
```{r, eval=FALSE}
{long_data} <- {wide_data} %>% pivot_longer(cols = {columns to pivot},
names_to = {name for old columns},
values_to = {name for cell values})
names_to = {name for old columns},
values_to = {name for cell values})
```
</div>

## Reshaping data from **wide to long**
## Reshaping wide to long: Better column names {.codesmall}

Newly created column names ("Month" and "Rate") are enclosed in quotation marks. It helps us be more specific than "name" and "value".

```{r}
ex_wide
ex_long <- ex_wide %>% pivot_longer(cols = !ends_with("State"),
names_to = "Month",
values_to = "Rate")
ex_long <- ex_wide %>% pivot_longer(cols = ends_with("rate"),
names_to = "Month",
values_to = "Rate")
ex_long
```
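
A self-contained sketch of the same call on made-up data may help when reading this hunk without the course dataset. `toy_wide` and its values are hypothetical; only the `pivot_longer()` arguments mirror the slide.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for ex_wide
toy_wide <- tibble(
  State = c("Alabama", "Alaska"),
  June_vacc_rate = c(0.516, 0.627),
  May_vacc_rate = c(0.514, 0.626)
)

# names_to / values_to replace the default column names "name" and "value"
toy_long <- toy_wide %>%
  pivot_longer(cols = ends_with("rate"),
               names_to = "Month",
               values_to = "Rate")

names(toy_long)  # "State" "Month" "Rate"
```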

Newly created column names are enclosed in quotation marks.

## Data used: Nitrate exposure
## Data used: Nitrate exposure{.codesmall}

Let's look at some data on levels of nitrate in water from Washington. This dataset reports the number of people in Washington exposed to excess levels of nitrate in their water between 1999 and 2020.

```{r, message = FALSE}
wide_nitrate <- read_csv(file = "https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv")
wide_nitrate <-
read_csv(file = "https://daseh.org/data/Nitrate_Exposure_for_WA_Public_Water_Systems_byquarter_data.csv")
head(wide_nitrate)
```

@@ -219,9 +229,7 @@ wide_nitrate

```{r}
long_nitrate <- wide_nitrate %>%
pivot_longer(!c(year, quarter, pop_on_sampled_PWS),
names_to = "conc_cat",
values_to = "conc_count")
pivot_longer(!c(year, quarter, pop_on_sampled_PWS))
long_nitrate
```

@@ -239,7 +247,7 @@ Let's make the `conc_count` into a proportion.

```{r}
long_nitrate <- long_nitrate %>%
mutate(conc_prop = conc_count / pop_on_sampled_PWS)
mutate(conc_prop = value / pop_on_sampled_PWS)
long_nitrate
```

@@ -249,15 +257,14 @@ Now our data is more tidy, and we can take the averages easily!

```{r}
long_nitrate %>%
group_by(conc_cat) %>%
group_by(name) %>%
summarize("avg_prop_exposedpop" = mean(conc_prop))
```
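
The three nitrate hunks above switch to the default `name` and `value` columns that `pivot_longer()` creates when `names_to` and `values_to` are omitted. The sketch below chains those steps on a hypothetical miniature dataset. The columns `conc_low` and `conc_high` and all values are invented, and only `year`, `quarter`, and `pop_on_sampled_PWS` come from the slides.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical miniature of the nitrate data (invented values and categories)
toy_nitrate <- tibble(
  year = c(1999, 1999),
  quarter = c("Q1", "Q2"),
  pop_on_sampled_PWS = c(100, 200),
  conc_low = c(80, 150),
  conc_high = c(20, 50)
)

toy_summary <- toy_nitrate %>%
  # No names_to / values_to, so the new columns are called name and value
  pivot_longer(!c(year, quarter, pop_on_sampled_PWS)) %>%
  mutate(conc_prop = value / pop_on_sampled_PWS) %>%
  group_by(name) %>%
  summarize(avg_prop_exposedpop = mean(conc_prop))

toy_summary
```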

## Reshaping data from **wide to long**

There are many ways to **select** the columns we want. Check out https://dplyr.tidyverse.org/reference/dplyr_tidy_select.html to look at more column selection options.


# `pivot_wider`...

## Reshaping data from **long to wide**
@@ -281,24 +288,42 @@ We can use `pivot_wider` to convert long data to wide format. Let's try it with

```{r}
ex_long
```

## Reshaping data from **long to wide**

We can use `pivot_wider` to convert long data to wide format. Let's try it with the vaccine data from earlier.

```{r}
ex_wide2 <- ex_long %>% pivot_wider(names_from = "Month",
values_from = "Rate")
ex_wide2
```
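
To see the long-to-wide step in isolation, here is a minimal sketch on a hypothetical long table. `toy_long` and its values are assumptions, and the `pivot_wider()` arguments follow the slide.

```r
library(tidyr)
library(tibble)

# Hypothetical long data: one row per State and Month
toy_long <- tibble(
  State = c("Alabama", "Alabama", "Alaska", "Alaska"),
  Month = c("June_vacc_rate", "May_vacc_rate", "June_vacc_rate", "May_vacc_rate"),
  Rate = c(0.516, 0.514, 0.627, 0.626)
)

# Each distinct Month becomes a column, filled with the matching Rate
toy_wide2 <- toy_long %>%
  pivot_wider(names_from = "Month", values_from = "Rate")

toy_wide2
```

The result has one row per `State`, which is the reverse of the earlier `pivot_longer()` call.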

## Reshaping nitrate exposure data{.codesmall}

Let's go back to the nitrate exposure dataset. What if we wanted to make a wide version of the data that displayed the number of people at each level of nitrate exposure for each quarter?
Let's go back to the nitrate exposure dataset. What if we wanted to make a wide version of the data that displayed the proportion of people at each level of nitrate exposure, with each quarter as a column?

```{r}
long_nitrate
```

## Reshaping nitrate exposure data

Drop some columns we don't need.

```{r}
long_nitrate <- long_nitrate %>%
select(!c(pop_on_sampled_PWS, value))
long_nitrate
```

## Reshaping nitrate exposure data

Pivot the data!

```{r}
wide_nitrate <- long_nitrate %>%
select(!c(pop_on_sampled_PWS, conc_count)) %>%
pivot_wider(names_from = "quarter", values_from = "conc_prop")
wide_nitrate
```
@@ -337,7 +362,6 @@ knitr::include_graphics("images/joins.png")
* `anti_join(x, y)` - all rows from `x` not in `y` keeping just columns from `x`.

## Merging: Simple Data
Let's load in some datasets about vaccination rates by state. These data are saved in two different files.

```{r message=FALSE}
data_As <- read_csv(
@@ -447,7 +471,7 @@ fj

<IMG style="position:absolute;bottom:10.5%;left:85%;width:120px;"SRC="images/full.png">

## Watch out for "`includes duplicates`"
## "`includes duplicates`"


```{r message=FALSE}
@@ -462,15 +486,15 @@ data_As
data_cold
```

## Watch out for "`includes duplicates`"
## "`includes duplicates`"

```{r}
lj <- left_join(data_As, data_cold)
```

<IMG style="position:absolute;bottom:10.5%;left:85%;width:120px;"SRC="images/left.png">

## Watch out for "`includes duplicates`"
## "`includes duplicates`"

Data including the joining column ("State") has been duplicated.

@@ -484,7 +508,7 @@ Note that "Alaska willow ptarmigan" appears twice.

<IMG style="position:absolute;bottom:10.5%;left:85%;width:120px;"SRC="images/left.png">

## Watch out for "`includes duplicates`"
## "`includes duplicates`"

https://github.com/gadenbuie/tidyexplain/blob/main/images/left-join-extra.gif
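
The duplication these slides describe can be reproduced on a small scale. The sketch below uses hypothetical tables (`toy_As`, `toy_cold`, and their values are invented) in which the right-hand table has two rows for one state.

```r
library(dplyr)
library(tibble)

# Hypothetical data: toy_cold has two rows for Alaska
toy_As <- tibble(State = c("Alabama", "Alaska"),
                 state_bird = c("wild turkey", "willow ptarmigan"))
toy_cold <- tibble(State = c("Alaska", "Alaska", "Maine"),
                   vacc_rate = c(0.627, 0.626, 0.795))

toy_lj <- left_join(toy_As, toy_cold, by = "State")

# Alaska's row from toy_As is repeated once per matching row in toy_cold
toy_lj %>% count(State)
```

Counting rows per key before and after a join is one quick way to notice this kind of duplication.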

@@ -538,6 +562,16 @@ anti_join(data_cold, data_As, by = "State") # order switched

<IMG style="position:absolute;bottom:10.5%;left:85%;width:120px;"SRC="images/anti.png">
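
Because the `anti_join()` result depends on argument order, a small sketch with hypothetical tables may make the "order switched" comment concrete. `toy_As` and `toy_cold` are invented for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical tables sharing only Alaska
toy_As <- tibble(State = c("Alabama", "Alaska"))
toy_cold <- tibble(State = c("Alaska", "Maine"))

# Rows of the first table that have no match in the second
only_in_As <- anti_join(toy_As, toy_cold, by = "State")     # Alabama
only_in_cold <- anti_join(toy_cold, toy_As, by = "State")   # order switched: Maine

only_in_As
only_in_cold
```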

## GUT CHECK!

Why use `join` functions?

A. Combine different data sources

B. Connect Rmd to other files

C. Using one data source is too easy and we want our analysis ~ fancy ~

## Summary

* Merging/joining data sets together - assumes all column names that overlap
@@ -555,6 +589,12 @@ anti_join(data_cold, data_As, by = "State") # order switched

💻 [Lab](https://daseh.org/modules/Manipulating_Data_in_R/lab/Manipulating_Data_in_R_Lab.Rmd)

📃 [Day 6 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-6.pdf)

📃 [Posit's `tidyr` Cheatsheet](https://rstudio.github.io/cheatsheets/tidyr.pdf)

📃 [Posit's `dplyr` Cheatsheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf)

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```