Addressing manipulating data buggies
cansavvy committed Aug 12, 2024
1 parent b3508d1 commit 2544b08
Showing 3 changed files with 33 additions and 27 deletions.
32 changes: 16 additions & 16 deletions modules/Manipulating_Data_in_R/Manipulating_Data_in_R.Rmd
@@ -35,7 +35,7 @@ library(tidyverse)

📃[Cheatsheet](https://daseh.org/modules/cheatsheets/Day-5.pdf)

-## Manipulating Data
+## Manipulating Data

In this module, we will show you how to:

@@ -89,7 +89,7 @@ ex_wide <- tibble(State = c("Alabama", "Alaska"),
ex_long <- pivot_longer(ex_wide, cols = !State)
```

-Wide: multiple columns per individual, values spread across multiple columns
+Wide: multiple columns per individual, values spread across multiple columns

```{r, echo = FALSE}
ex_wide
@@ -136,7 +136,7 @@ You might see old functions `gather` and `spread` when googling. These are older

# `pivot_longer`...

-## Reshaping data from **wide to long** {.codesmall}
+## Reshaping data from **wide to long** {.codesmall}

`pivot_longer()` - puts column data into rows (`tidyr` package)

@@ -161,7 +161,7 @@ long_vacc <- wide_vacc %>% pivot_longer(cols = everything())
long_vacc
```

-## Reshaping wide to long: Better column names {.codesmall}
+## Reshaping wide to long: Better column names {.codesmall}

`pivot_longer()` - puts column data into rows (`tidyr` package)
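As a minimal sketch of the "better column names" idea, the example below passes `names_to` and `values_to` to `pivot_longer()`. The data frame `wide_example`, its column names, and its values are hypothetical stand-ins, not the course's `wide_vacc` data; the new names "Month" and "Rate" follow the ones used later in this module.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per month (values are made up)
wide_example <- tibble(
  June_vacc_rate = 0.516,
  May_vacc_rate = 0.514,
  April_vacc_rate = 0.511
)

# names_to and values_to name the two newly created columns;
# the new names are given as quoted strings
long_example <- wide_example %>%
  pivot_longer(
    cols = everything(),
    names_to = "Month",
    values_to = "Rate"
  )

long_example
```

Without `names_to` and `values_to`, the new columns get the default names `name` and `value`.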

@@ -189,7 +189,7 @@ long_vacc

Newly created column names are enclosed in quotation marks.

-## Data used: Nitrate exposure
+## Data used: Nitrate exposure

Nitrate exposure by quarter for populations on public water systems in the state of Washington for 1999-2020.

@@ -239,7 +239,7 @@ Un-pivoted columns (`year`, `quarter`, `pop_on_sampled_PWS`) are still columns.
long
```

-## Cleaning up long data{.codesmall}
+## Cleaning up long data{.codesmall}

Let's make the `conc_count` into a proportion.

@@ -254,8 +254,8 @@ long
Now our data is more tidy, and we can take the averages easily!

```{r}
-long %>%
-group_by(conc_cat) %>%
+long %>%
+group_by(conc_cat) %>%
summarize("avg_prop" = mean(conc_prop))
```
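The two steps on this slide (turning a count into a proportion, then averaging by category) can be sketched end to end on a small made-up data frame. The column names mirror the ones in this module, but the values are invented, and dividing `conc_count` by `pop_on_sampled_PWS` is an assumption about how the proportion is defined, since that code is not shown in this diff.

```r
library(dplyr)

# Hypothetical long data (values are made up for illustration)
long_example <- tibble(
  quarter = c("Q1", "Q1", "Q2", "Q2"),
  pop_on_sampled_PWS = c(100, 100, 200, 200),
  conc_cat = c("low", "high", "low", "high"),
  conc_count = c(80, 20, 150, 50)
)

# Assumed definition: proportion = count / population on sampled systems
long_example <- long_example %>%
  mutate(conc_prop = conc_count / pop_on_sampled_PWS)

# One average proportion per concentration category
avg_by_cat <- long_example %>%
  group_by(conc_cat) %>%
  summarize(avg_prop = mean(conc_prop))

avg_by_cat
```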

@@ -275,7 +275,7 @@ There are many ways to **select** the columns we want. Check out https://dplyr.t

<div class = "codeexample">
```{r, eval=FALSE}
-{wide_data} <- {long_data} %>%
+{wide_data} <- {long_data} %>%
pivot_wider(names_from = {Old column name: contains new column names},
values_from = {Old column name: contains new cell values})
```
@@ -285,12 +285,12 @@ There are many ways to **select** the columns we want. Check out https://dplyr.t

```{r}
long_vacc
-wide_vacc <- long_vacc %>% pivot_wider(names_from = "Month",
-values_from = "Rate")
+wide_vacc <- long_vacc %>% pivot_wider(names_from = "Month",
+values_from = "Rate")
wide_vacc
```

-## Reshaping nitrate exposure data{.codesmall}
+## Reshaping nitrate exposure data{.codesmall}

What if we wanted different columns for each quarter?
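One way to get a separate column for each quarter is `pivot_wider()` with `names_from` pointing at the quarter column. The sketch below uses a hypothetical data frame, `long_example`, with made-up values; it is not the nitrate dataset itself, and the course's own answer may select or rename columns differently.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: one row per year and quarter (values are made up)
long_example <- tibble(
  year = c(1999, 1999, 1999, 1999),
  quarter = c("Q1", "Q2", "Q3", "Q4"),
  conc_prop = c(0.10, 0.12, 0.09, 0.11)
)

# Each distinct value of `quarter` becomes its own column,
# filled with the matching `conc_prop` value
wide_by_quarter <- long_example %>%
  pivot_wider(names_from = "quarter", values_from = "conc_prop")

wide_by_quarter
```

Columns that are not named in `names_from` or `values_from` (here `year`) identify the rows of the wide result.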

@@ -335,7 +335,7 @@ knitr::include_graphics("images/joins.png")
* Merging/joining data sets together - usually on key variables, usually "id"
* `?join` - see different types of joining for `dplyr`
* `inner_join(x, y)` - only rows that match for `x` and `y` are kept
-* `full_join(x, y)` - all rows of `x` and `y` are kept
+* `full_join(x, y)` - all rows of `x` and `y` are kept
* `left_join(x, y)` - all rows of `x` are kept even if not merged with `y`
* `right_join(x, y)` - all rows of `y` are kept even if not merged with `x`
* `anti_join(x, y)` - all rows from `x` not in `y` keeping just columns from `x`.
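The join types listed above differ mainly in which rows survive. The sketch below counts the rows each join returns for two small, hypothetical tables (`left_tbl` and `right_tbl`, with invented values) that share two of their three states.

```r
library(dplyr)

# Hypothetical tables sharing the key column `State`
left_tbl <- tibble(State = c("Alabama", "Alaska", "Arizona"), x_val = c(1, 2, 3))
right_tbl <- tibble(State = c("Alaska", "Arizona", "Arkansas"), y_val = c(10, 20, 30))

join_rows <- c(
  inner = nrow(inner_join(left_tbl, right_tbl, by = "State")), # matches only
  full  = nrow(full_join(left_tbl, right_tbl, by = "State")),  # everything
  left  = nrow(left_join(left_tbl, right_tbl, by = "State")),  # all of left_tbl
  right = nrow(right_join(left_tbl, right_tbl, by = "State")), # all of right_tbl
  anti  = nrow(anti_join(left_tbl, right_tbl, by = "State"))   # left_tbl rows with no match
)

join_rows
# inner = 2, full = 4, left = 3, right = 3, anti = 1
```

Rows kept without a match get `NA` in the columns that come from the other table.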
@@ -545,11 +545,11 @@ anti_join(data_cold, data_As, by = "State") # order switched
* Merging/joining data sets together - assumes all column names that overlap
- use the `by = c("a" = "b")` if they differ
* `inner_join(x, y)` - only rows that match for `x` and `y` are kept
-* `full_join(x, y)` - all rows of `x` and `y` are kept
+* `full_join(x, y)` - all rows of `x` and `y` are kept
* `left_join(x, y)` - all rows of `x` are kept even if not merged with `y`
* `right_join(x, y)` - all rows of `y` are kept even if not merged with `x`
* Use the `tidylog` package for a detailed summary
-* `antijoin(x, y)` shows what is only in `x` (missing from `y`)
+* `anti_join(x, y)` shows what is only in `x` (missing from `y`)
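When the key column has a different name in each table, the `by = c("a" = "b")` form from the summary above maps one name to the other. A minimal sketch, using hypothetical tables `cold_tbl` and `arsenic_tbl` with invented column names and values:

```r
library(dplyr)

# Hypothetical tables whose key columns are named differently
cold_tbl <- tibble(State = c("Alaska", "Maine"), avg_temp = c(-3, 5))
arsenic_tbl <- tibble(state_name = c("Maine", "Nevada"), As_level = c(1.2, 3.4))

# Left-hand name comes from the first table, right-hand name from the second
joined <- left_join(cold_tbl, arsenic_tbl, by = c("State" = "state_name"))

joined
```

The result keeps the key under the first table's name (`State`); Alaska has no match in `arsenic_tbl`, so its `As_level` is `NA`.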

## Lab Part 2

@@ -596,7 +596,7 @@ dplyr::setdiff(cold_states, A_states)

## Getting the set difference with `setdiff`

-Why did we use `dplyr::setdiff`?
+Why did we use `dplyr::setdiff`?

There is a base R function, also called `setdiff` that requires vectors.
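To make the contrast concrete, the sketch below applies `dplyr::setdiff` to two data frames and `base::setdiff` to the corresponding vectors. The objects `cold_example` and `a_example` are hypothetical stand-ins for the slide's `cold_states` and `A_states`, with made-up contents.

```r
library(dplyr)

# Hypothetical one-column data frames
cold_example <- tibble(State = c("Alaska", "Maine", "Vermont"))
a_example <- tibble(State = c("Alabama", "Alaska", "Arizona"))

# Data frame in, data frame out: rows of the first that are not in the second
df_diff <- dplyr::setdiff(cold_example, a_example)

# Vector in, vector out: pull the column out first
vec_diff <- base::setdiff(cold_example$State, a_example$State)

df_diff
vec_diff
```

Writing the `dplyr::` prefix makes it explicit which of the two same-named functions is being called.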

Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
title: "Manipulating Data in R Lab"
output: html_document
-editor_options:
+editor_options:
chunk_output_type: console
---

@@ -41,7 +41,10 @@ Look at the column names using `colnames` - do you notice any patterns?

### 1.3

-Let's rename the column "2011" in "co2" to "CO2_2011" using `rename`. Repeat this for the years 2012, 2013, and 2014. Make sure to reassign to `co2` here and in subsequent steps.
+Let's rename the year columns in "co2" from this format: "2011" to this: "CO2_2011" using `rename`.
+Be sure to do this for the years 2012, 2013, and 2014 as well. Make sure that you end up with the renamed columns in a data frame named `co2` here and in subsequent steps.
+
+Hint: If you rename the columns and store the result back into a data frame of the same name, like `co2`, you will not be able to re-run the renaming code without an error (the columns are already renamed, so `rename` can no longer find the old column names).
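The re-run problem described in the hint can be reproduced on a toy data frame. In the sketch below, `co2_example` and its values are hypothetical and are not the lab's `co2` data; `tryCatch` is used only so the example can record the failure instead of stopping.

```r
library(dplyr)

# Hypothetical data with a year as a column name (backticks are needed
# because the name starts with a digit)
co2_example <- tibble(country = c("A", "B"), `2011` = c(1.5, 2.5))

# First run: works, and overwrites the original object
co2_example <- co2_example %>% rename(CO2_2011 = `2011`)

# Second run of the same code: the column `2011` no longer exists,
# so rename() signals an error
rerun_failed <- tryCatch(
  {
    co2_example %>% rename(CO2_2011 = `2011`)
    FALSE
  },
  error = function(e) TRUE
)

rerun_failed
# TRUE
```

If this happens, re-reading the original data and then running the renaming code once restores a clean state.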

```
# General format
@@ -119,7 +122,7 @@ Take the code from Questions 1.1 and 1.3-1.7. Chain all of this code together us

Modify the code from Question P.1:

-- Choose 4 different years to examine
+- Choose 4 different years to examine
- Select different countries to compare
- Call your data `co2_compare2`

@@ -176,7 +179,7 @@ What countries are present in "co2" that are not present in "cc"? Use `anti_join

```
# General format
-anti_join(data1, data2, by = "") %>% select(index)
+anti_join(data1, data2, by = "") %>% select(columnname)
```

```{r 2.4response}
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
title: "Manipulating Data in R Lab - Key"
output: html_document
-editor_options:
+editor_options:
chunk_output_type: console
---

@@ -44,7 +44,10 @@ colnames(co2)

### 1.3

-Let's rename the column "2011" in "co2" to "CO2_2011" using `rename`. Repeat this for the years 2012, 2013, and 2014. Make sure to reassign to `co2` here and in subsequent steps.
+Let's rename the year columns in "co2" from this format: "2011" to this: "CO2_2011" using `rename`.
+Be sure to do this for the years 2012, 2013, and 2014 as well. Make sure that you end up with the renamed columns in a data frame named `co2` here and in subsequent steps.
+
+Hint: If you rename the columns and store the result back into a data frame of the same name, like `co2`, you will not be able to re-run the renaming code without an error (the columns are already renamed, so `rename` can no longer find the old column names).

```
# General format
@@ -146,7 +149,7 @@ co2_compare

Modify the code from Question P.1:

-- Choose 4 different years to examine
+- Choose 4 different years to examine
- Select different countries to compare
- Call your data `co2_compare2`

@@ -177,7 +180,7 @@ Open the `Yearly_CC_Disasters` dataset using the url below. Save the dataset as


```{r 2.1response}
-cc <- read_csv("https://daseh.org/data/Yearly_CC_Disasters.csv") %>%
+cc <- read_csv("https://daseh.org/data/Yearly_CC_Disasters.csv") %>%
rename(country = Country)
```

@@ -218,7 +221,7 @@ What countries are present in "co2" that are not present in "cc"? Use `anti_join

```
# General format
-anti_join(data1, data2, by = "") %>% select(index)
+anti_join(data1, data2, by = "") %>% select(columnname)
```

```{r 2.4response}
@@ -234,7 +237,7 @@ anti_join(cc, co2, by = "country") %>% select(country) %>% distinct()
Take the code from 2.2 and save the output as an object "co2_cc". Filter the dataset. Filter so that you only keep Indonesia and Canada.

```{r P.3response}
-co2_cc <- full_join(co2, cc, by = "country") %>%
+co2_cc <- full_join(co2, cc, by = "country") %>%
filter(country %in% c("Indonesia", "Canada"))
```

@@ -279,7 +282,7 @@ Pivot the dataset so that there are columns for country, emissions, and a column

```{r P.6response}
co2_cc %>% pivot_wider(
-names_from = Indicator,
+names_from = Indicator,
values_from = disasters
)
```
