Commit
Merge pull request #252 from fhdsl/post_daseh2_subset
Updating subsetting
carriewright11 authored Dec 6, 2024
2 parents ae04137 + 46d9a73 commit 96277ff
Showing 2 changed files with 59 additions and 26 deletions.
67 changes: 50 additions & 17 deletions modules/Subsetting_Data_in_R/Subsetting_Data_in_R.Rmd
@@ -212,7 +212,8 @@ knitr::include_graphics("images/rename.png")
"Artwork by @allison_horst". https://allisonhorst.com/


## checking names of columns, we can use the `colnames()` function (or `names()`)
## checking names of columns, we can use the `colnames()` function

```{r}
colnames(er)
```
@@ -379,7 +380,7 @@ test
clean_names(test)
```

## GUT CHECK: Which of the following would NOT always work with a column called `counties_of_seattle_with_population_over_10,000`?
## GUT CHECK: Which of the following would work well with a column called `counties_of_seattle_with_population_over_10,000`?

A. Renaming it using `rename` function to something simpler like `seattle_counties_over_10thous`.

@@ -420,15 +421,15 @@ We'll work with the CO heat-related ER visits dataset again.
This time let's also take a smaller subset so it is easier to see the full dataset as we work through the examples.

```{r}
#read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
# er<-read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
set.seed(1234)
er_30 <-slice_sample(er, n = 30)
```

## Subset columns of a data frame - `tidyverse` way:

To grab (or "pull" out) the `year` column the `tidyverse` way we can use the `pull` function:
To grab a vector version of (or "pull" out) the `year` column the `tidyverse` way, we can use the `pull` function:

```{r}
pull(er_30, year)
```
@@ -437,10 +438,17 @@ pull(er_30, year)
## Subset columns of a data frame: dplyr

The `select` command from `dplyr` allows you to subset columns (the result is still a `tibble`!)

```{r}
select(er_30, year)
```
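
The difference between the two functions can be sketched with a small made-up tibble. Note that `toy` and its values are hypothetical stand-ins for illustration, not the ER dataset:

```{r}
library(dplyr)

# Hypothetical toy data standing in for er_30
toy <- tibble(year = c(2011, 2012, 2013), rate = c(4.2, 5.1, 3.8))

year_vec <- pull(toy, year)    # a plain numeric vector
year_tbl <- select(toy, year)  # still a tibble, with one column

class(year_vec)
class(year_tbl)
```

A vector is usually what functions like `mean()` expect, while a tibble is what other `dplyr` functions expect.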

## GUT CHECK: What function would be useful for getting a vector version of a column?

A. `pull()`

B. `select()`

## Select multiple columns

We can use `select` to select for multiple columns.
@@ -505,11 +513,6 @@ select(er_30, where(is.numeric))
```

## GUT CHECK: What function would be useful for getting a vector version of a column?

A. `pull()`

B. `select()`



@@ -650,11 +653,8 @@ https://media.giphy.com/media/5b5OU7aUekfdSAER5I/giphy.gif

A. `filter()` with `|`

B. `select()` with `|`

C. `filter()` with `&`
B. `filter()` with `&`

D. `select()` with `&`

## Summary

@@ -663,7 +663,7 @@ D. `select()` with `&`
- you can `select()` based on patterns in the column names
- you can also `select()` based on column class with the `where()` function
- you can combine multiple tidyselect functions together like `select(starts_with("C"), ends_with("state"))`
- you can combine multiple patterns with the `c()` function like `select(starts_with(c("A", "C")))`
- you can combine multiple patterns with the `c()` function like `select(starts_with(c("A", "C")))` (see extra slides at the end for more info!)
- `filter()` can be used to filter out rows based on logical conditions
- avoid using quotes when referring to column names with `filter()`
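
As a rough sketch of the `select()` points above, here is a small made-up tibble. The data and column names in `toy` are hypothetical and chosen only to make the patterns visible:

```{r}
library(dplyr)

# Hypothetical toy data; column names are invented for illustration
toy <- tibble(
  county = c("Denver", "Boulder"),
  city_state = c("Denver, CO", "Boulder, CO"),
  rate = c(4.2, 5.1),
  year = c(2011, 2012)
)

# Columns whose names start with "c" OR with "r"
by_pattern <- select(toy, starts_with(c("c", "r")))

# Columns whose class is numeric
by_class <- select(toy, where(is.numeric))

colnames(by_pattern)
colnames(by_class)
```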

@@ -675,8 +675,10 @@ D. `select()` with `&`

## Lab Part 2

🏠 [Class Website](https://daseh.org)
🏠 [Class Website](https://daseh.org)

💻 [Lab](https://daseh.org/modules/Subsetting_Data_in_R/lab/Subsetting_Data_in_R_Lab.Rmd)

📃 [Day 3 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-3.pdf)

📃 [Posit's `dplyr` Cheatsheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf)
@@ -685,7 +687,7 @@ D. `select()` with `&`

```{r}
#er <- read_csv("https://daseh.org/data/Colorado_ER_heat_visits.csv")
#er <- read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
set.seed(1234)
er_30 <-slice_sample(er, n = 30)
```
@@ -733,6 +735,13 @@ The pipe `%>%` makes this much more readable. It reads left side "pipes" into r
er_30 %>% filter(year > 2000 & county == "Denver") %>% select(year, rate)
```

## Alternative Pipes

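There are multiple ways to write a pipe and you might see either of these (they work the same in most cases!):

`|>` (the native pipe, built into R 4.1 and later)

`%>%` (the `magrittr` pipe, loaded with the `tidyverse`)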
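
A minimal sketch comparing the two pipes, assuming R 4.1 or later (needed for `|>`). The `toy` tibble is hypothetical and only stands in for `er_30`:

```{r}
library(dplyr)

# Hypothetical toy data standing in for er_30
toy <- tibble(year = c(1999, 2005, 2010), rate = c(3.0, 4.2, 5.1))

with_magrittr <- toy %>% filter(year > 2000) %>% select(rate)
with_native   <- toy |> filter(year > 2000) |> select(rate)

# Both pipelines should return the same tibble
identical(with_magrittr, with_native)
```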

# Adding/Removing Columns

@@ -993,6 +1002,30 @@ select(er_30, starts_with(c("r", "l"))) # here we combine two patterns
```

## Nuances about `filter()`

```{r}
test <- tibble(A = c(1,2,3,4), B = c(1,2,3,4))
test
# These return the same rows, but >= is easier to read
# Splitting one comparison into two conditions can cause issues
filter(test, B > 2 | B==2)
filter(test, B >= 2)
```

## Order of operations for `filter()`

Order can matter: `&` is evaluated before `|` unless parentheses say otherwise. Think of the individual statements separately first.
```{r}
# `&` is evaluated first: A is greater than 3 OR (B is equal to 2 AND B is greater than 2).
# B cannot be both, so only the row where A is greater than 3 is kept and the 2 row is dropped.
filter(test, A>3 | B==2 & B>2)
# (A is greater than 3 AND B is greater than 2) leaves only the 4 row,
# OR B is equal to 2, so the 2 row is preserved as well.
filter(test, A>3 & B>2 | B==2)
```
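
One way to make the intended grouping explicit is to add parentheses. The sketch below reuses the `test` tibble from the slide above; the names `implicit`, `explicit`, and `regrouped` are made up for illustration:

```{r}
library(dplyr)

test <- tibble(A = c(1, 2, 3, 4), B = c(1, 2, 3, 4))

# Without parentheses, `&` is evaluated before `|`
implicit <- filter(test, A > 3 & B > 2 | B == 2)

# Parentheses that spell out the default grouping give the same rows
explicit <- filter(test, (A > 3 & B > 2) | B == 2)
identical(implicit, explicit)

# Moving the parentheses changes which rows are kept
regrouped <- filter(test, A > 3 & (B > 2 | B == 2))

implicit$A
regrouped$A
```

Writing the parentheses out, even when they match the default, tends to make a long condition easier to check.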




## Ordering the column names of a data frame: alphabetically {.codesmall}

Using the base R `order()` function.
Original file line number Diff line number Diff line change
@@ -130,29 +130,29 @@ head(select(ces_sub, Asthma))
Subset the rows of `ces_sub` that have **more** than 100 for `Asthma` - how many rows are there? Use `filter()`.

```{r 2.4response}
ces_sub <- filter(ces_sub, Asthma > 100)
nrow(ces_sub)
nrow(filter(ces_sub, Asthma > 100))
```

### 2.5

Subset the rows of `ces_sub` that have a `Traffic` value **less** than 500 and an `Asthma` value **more** than 100 - how many are there?
Subset the rows of `ces_sub` that have an `Asthma` value **more** than 100 and a `Traffic` value **less** than 500 - how many are there?


```{r 2.5response}
filter(ces_sub, Traffic < 500 & Asthma > 100) # all of these options work
nrow(filter(ces_sub, Traffic < 500 & Asthma > 100))
nrow(filter(ces_sub, Traffic < 500, Asthma > 100))
filter(ces_sub, Asthma > 100 & Traffic < 500) # all of these options work
nrow(filter(ces_sub, Asthma > 100 & Traffic < 500))
nrow(filter(ces_sub, Asthma > 100, Traffic < 500))
```

### 2.6

Subset the rows of `ces_sub` that have a `Traffic` value **less than or equal to** 500 and an `Asthma` value **more** than 100 - how many are there?
Subset the rows of `ces_sub` that have an `Asthma` value **more** than 100 and a `Traffic` value **less than or equal to (`<=`)** 500 — how many are there?

```{r 2.6response}
filter(ces_sub, Traffic <= 500 & Asthma > 100) # all of these options work
nrow(filter(ces_sub, Traffic <= 500 & Asthma > 100))
nrow(filter(ces_sub, Traffic <= 500, Asthma > 100))
nrow(filter(ces_sub, Asthma > 100 & Traffic <= 500))
nrow(filter(ces_sub, Asthma > 100, Traffic <= 500))
```
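
To see why the `&` and `,` versions agree, and how `<` differs from `<=`, here is a small sketch. `ces_toy` is a hypothetical stand-in for `ces_sub` with invented values:

```{r}
library(dplyr)

# Hypothetical stand-in for ces_sub; the values are invented
ces_toy <- tibble(
  Asthma  = c(50, 120, 150, 200),
  Traffic = c(300, 450, 600, 500)
)

# Conditions separated by a comma are combined with AND, just like `&`
with_and   <- filter(ces_toy, Asthma > 100 & Traffic < 500)
with_comma <- filter(ces_toy, Asthma > 100, Traffic < 500)
identical(with_and, with_comma)

# `<=` also keeps the row where Traffic is exactly 500
with_lte <- filter(ces_toy, Asthma > 100, Traffic <= 500)

nrow(with_and)
nrow(with_lte)
```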

### 2.7