Why are my changes not taking effect? It’s making my results look
Here we are creating a new object from an existing one:
new_rivers <- sample(rivers, 5)
new_rivers
-## [1] 250 380 444 1054 350
+## [1] 327 1885 460 3710 280
Running just this only prints the result; it does not actually change new_rivers:
new_rivers + 1
-## [1] 251 381 445 1055 351
+## [1] 328 1886 461 3711 281
If we want to modify new_rivers and save that modified version, then we need to reassign new_rivers like so:
new_rivers <- new_rivers + 1
new_rivers
-## [1] 251 381 445 1055 351
+## [1] 328 1886 461 3711 281
If we forget to reassign, subsequent steps may not work as expected, because we will still be working with the unmodified data.
@@ -409,7 +409,7 @@ Error: object ‘X’ not found
Make sure you run something like this, with the <- operator:
rivers2 <- new_rivers + 1
rivers2
-## [1] 252 382 446 1056 352
+## [1] 329 1887 462 3712 282
diff --git a/index.html b/index.html
index 19103f83..65c03ea6 100644
--- a/index.html
+++ b/index.html
@@ -348,7 +348,7 @@
Testimonials from our other courses:
Find an Error!?
Feel free to submit typos/errors/etc via the GitHub repository associated with the class: https://github.com/fhdsl/DaSEH
-
This page was last updated on 2024-12-05.
+
This page was last updated on 2024-12-06.
diff --git a/modules/Subsetting_Data_in_R/Subsetting_Data_in_R.html b/modules/Subsetting_Data_in_R/Subsetting_Data_in_R.html
index 45c404a0..b3fe70a2 100644
--- a/modules/Subsetting_Data_in_R/Subsetting_Data_in_R.html
+++ b/modules/Subsetting_Data_in_R/Subsetting_Data_in_R.html
@@ -233,16 +233,6 @@
library(tidyverse) # loads dplyr and other packages!
-
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
-✔ forcats 1.0.0 ✔ readr 2.1.5
-✔ ggplot2 3.5.1 ✔ stringr 1.5.1
-✔ lubridate 1.9.3 ✔ tibble 3.2.1
-✔ purrr 1.0.2 ✔ tidyr 1.3.1
-── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
-✖ dplyr::filter() masks stats::filter()
-✖ dplyr::lag() masks stats::lag()
-ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
-
Getting data to work with
We will work with data called er
about heat-related ER visits between 2011 and 2022, as reported by the state of Colorado, specifically made available by the Colorado Environmental Public Health Tracking program website. Full dataset available at https://coepht.colorado.gov/heat-related-illness.
@@ -299,18 +289,18 @@
slice_sample(er, n = 2)
# A tibble: 2 × 6
- county rate lower95cl upper95cl visits year
- <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
-1 Garfield NA NA NA NA 2011
-2 Morgan NA NA NA NA 2020
+ county rate lower95cl upper95cl visits year
+ <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
+1 Teller NA NA NA NA 2021
+2 Broomfield NA NA NA NA 2021
slice_sample(er, n = 2)
# A tibble: 2 × 6
- county rate lower95cl upper95cl visits year
- <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
-1 Boulder 3.79 1.90 6.32 12 2013
-2 Montezuma NA NA NA NA 2022
+ county rate lower95cl upper95cl visits year
+ <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
+1 Mesa 8.88 4.53 14.7 12 2012
+2 Huerfano 0 0 0 0 2020
Data frames and tibbles
@@ -449,7 +439,7 @@
“Artwork by @allison_horst”. https://allisonhorst.com/
-checking names of columns, we can use the colnames() function (or names())
+checking names of columns, we can use the colnames() function
colnames(er)
@@ -680,7 +670,7 @@
2 2 3
3 3 4
-GUT CHECK: Which of the following would NOT always work with a column called counties_of_seattle_with_population_over_10,000?
+GUT CHECK: Which of the following would work well with a column called counties_of_seattle_with_population_over_10,000?
A. Renaming it using rename
function to something simpler like seattle_counties_over_10thous
.
@@ -720,13 +710,13 @@
This time let's also make it a smaller subset so it is easier for us to see the full dataset as we work through examples.
-#read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
+# er<-read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
set.seed(1234)
er_30 <-slice_sample(er, n = 30)
Subset columns of a data frame - tidyverse
way:
-To grab (or “pull” out) the year column the tidyverse way we can use the pull function:
+To grab a vector version (or “pull” out) the year column the tidyverse way we can use the pull function:
pull(er_30, year)
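A minimal sketch of the difference between a "vector version" of a column and a one-column data frame. The `toy` tibble is a hypothetical stand-in for `er_30` (its values are made up for illustration), so the example runs without downloading the data:

```r
library(dplyr)

# Hypothetical stand-in for er_30; not the course data
toy <- tibble(
  county = c("Mesa", "Boulder", "Teller"),
  year   = c(2012, 2013, 2021)
)

# pull() returns the column's values as a plain vector
year_vec <- pull(toy, year)

# select() keeps the data frame structure, even for a single column
year_tbl <- select(toy, year)

is.numeric(year_vec)     # TRUE: a numeric vector
is.data.frame(year_tbl)  # TRUE: still a tibble/data frame
```

A vector is what functions such as `mean()` or `unique()` expect, which is why `pull()` is often the last step before a summary calculation.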
@@ -754,6 +744,12 @@
10 2012
# ℹ 20 more rows
+GUT CHECK: What function would be useful for getting a vector version of a column?
+
+A. pull()
+
+B. select()
+
Select multiple columns
We can use select
to select for multiple columns.
@@ -894,12 +890,6 @@
10 0 0 0 0 2012
# ℹ 20 more rows
-GUT CHECK: What function would be useful for getting a vector version of a column?
-
-A. pull()
-
-B. select()
-
Subsetting Rows
filter
function
@@ -1102,11 +1092,7 @@
A. filter() with |
-B. select() with |
-
-C. filter() with &
-
-D. select() with &
+B. filter() with &
Summary
@@ -1116,7 +1102,7 @@
you can select() based on patterns in the column names
you can also select() based on column class with the where() function
you can combine multiple tidyselect functions together like select(starts_with("C"), ends_with("state"))
-you can combine multiple patterns with the c() function like select(starts_with(c("A", "C")))
+you can combine multiple patterns with the c() function like select(starts_with(c("A", "C"))) (see extra slides at the end for more info!)
filter() can be used to filter out rows based on logical conditions
avoid using quotes when referring to column names with filter()
@@ -1131,13 +1117,17 @@
Lab Part 2
-🏠 Class Website
💻 Lab 📃 Day 3 Cheatsheet
+🏠 Class Website
+
+💻 Lab
+
+📃 Day 3 Cheatsheet
📃 Posit’s dplyr
Cheatsheet
Get the data
-#er <- read_csv("https://daseh.org/data/Colorado_ER_heat_visits.csv")
+#er <- read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
set.seed(1234)
er_30 <-slice_sample(er, n = 30)
@@ -1208,6 +1198,14 @@
<dbl> <dbl>
1 2013 2.95
+Alternative Pipes
+
+There are multiple ways to write a pipe and you might see these (they work the same!):
+
+ |>
+
+ %>%
+
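As a rough illustration of the two pipes giving the same result for a simple chain, here is a sketch using a hypothetical `toy` tibble (made-up values, not the course data). The native `|>` pipe requires R 4.1 or later; `%>%` comes from the magrittr package and is available after loading dplyr or the tidyverse:

```r
library(dplyr)

# Hypothetical stand-in data; not the course data
toy <- tibble(
  county = c("Mesa", "Boulder", "Teller"),
  year   = c(2012, 2013, 2021)
)

# Native pipe (built into R 4.1+)
native_result <- toy |> filter(year > 2012) |> pull(county)

# magrittr pipe (available once dplyr is loaded)
magrittr_result <- toy %>% filter(year > 2012) %>% pull(county)

identical(native_result, magrittr_result)
```

For basic chains like this one the two are interchangeable. They do differ in some advanced usage, for example the placeholder is `_` for `|>` and `.` for `%>%`.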
Adding/Removing Columns
Adding columns to a data frame: dplyr (tidyverse
way)
@@ -1580,6 +1578,58 @@
10 0 0
# ℹ 20 more rows
+Nuances about filter()
+
+test <- tibble(A = c(1,2,3,4), B = c(1,2,3,4))
+test
+
+# A tibble: 4 × 2
+ A B
+ <dbl> <dbl>
+1 1 1
+2 2 2
+3 3 3
+4 4 4
+
+# These are technically the same but >= is easier to read
+# Separating can cause issues
+filter(test, B > 2 | B==2)
+
+# A tibble: 3 × 2
+ A B
+ <dbl> <dbl>
+1 2 2
+2 3 3
+3 4 4
+
+filter(test, B >= 2)
+
+# A tibble: 3 × 2
+ A B
+ <dbl> <dbl>
+1 2 2
+2 3 3
+3 4 4
+
+Order of operations for filter()
+
+Order can matter: R evaluates & before |. Think of individual statements separately first.
+
+filter(test, A>3 | B==2 & B>2) # & is evaluated first: B==2 & B>2 is never TRUE, so only rows where A is greater than 3 are kept and the row with 2 is dropped.
+
+# A tibble: 1 × 2
+ A B
+ <dbl> <dbl>
+1 4 4
+
+filter(test, A>3 & B>2 | B==2) # A>3 & B>2 is evaluated first, leaving only the row of 4s; the | then adds rows where B is equal to 2, so 2 is preserved.
+
+# A tibble: 2 × 2
+ A B
+ <dbl> <dbl>
+1 2 2
+2 4 4
+
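One way to avoid relying on operator order is to write the grouping out with parentheses. The sketch below rebuilds the `test` tibble from the slides; the names `default_grouping` and `other_grouping` are made up for illustration. It contrasts the grouping R applies by default (`&` before `|`) with a different, explicit grouping:

```r
library(dplyr)

test <- tibble(A = c(1, 2, 3, 4), B = c(1, 2, 3, 4))

# Same as filter(test, A > 3 & B > 2 | B == 2): & is grouped first
default_grouping <- filter(test, (A > 3 & B > 2) | B == 2)

# A different grouping gives a different answer
other_grouping <- filter(test, A > 3 & (B > 2 | B == 2))

nrow(default_grouping)  # rows where B is 2 or where both A > 3 and B > 2
nrow(other_grouping)    # only rows where A > 3
```

Adding parentheses costs a few characters and makes the intended logic visible to anyone reading the code later.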
Ordering the column names of a data frame: alphabetically
Using the base R order()
function.
diff --git a/modules/Subsetting_Data_in_R/lab/Subsetting_Data_in_R_Lab.Rmd b/modules/Subsetting_Data_in_R/lab/Subsetting_Data_in_R_Lab.Rmd
index 29a2cd95..9d513da8 100644
--- a/modules/Subsetting_Data_in_R/lab/Subsetting_Data_in_R_Lab.Rmd
+++ b/modules/Subsetting_Data_in_R/lab/Subsetting_Data_in_R_Lab.Rmd
@@ -125,7 +125,7 @@ Subset the rows of `ces_sub` that have **more** than 100 for `Asthma` - how many
### 2.5
-Subset the rows of `ces_sub` that have a `Traffic` value **less** than 500 and an `Asthma` value **more** than 100 - how many are there?
+Subset the rows of `ces_sub` that have an `Asthma` value **more** than 100 and a `Traffic` value **less** than 500 - how many are there?
```{r 2.5response}
@@ -134,7 +134,7 @@ Subset the rows of `ces_sub` that have a `Traffic` value **less** than 500 and a
### 2.6
-Subset the rows of `ces_sub` that have a `Traffic` value **less than or equal to** 500 and an `Asthma` value **more** than 100 - how many are there?
+Subset the rows of `ces_sub` that have an `Asthma` value **more** than 100 and a `Traffic` value **less than or equal to (`<=`)** 500 — how many are there?
```{r 2.6response}
diff --git a/modules/Subsetting_Data_in_R/lab/Subsetting_Data_in_R_Lab_Key.html b/modules/Subsetting_Data_in_R/lab/Subsetting_Data_in_R_Lab_Key.html
index a06f2e72..a2662771 100644
--- a/modules/Subsetting_Data_in_R/lab/Subsetting_Data_in_R_Lab_Key.html
+++ b/modules/Subsetting_Data_in_R/lab/Subsetting_Data_in_R_Lab_Key.html
@@ -362,14 +362,13 @@ 2.3
2.4
Subset the rows of ces_sub
that have more than 100 for Asthma
- how many rows are there? Use filter()
.
-
ces_sub <- filter(ces_sub, Asthma > 100)
-nrow(ces_sub)
+
nrow(filter(ces_sub, Asthma > 100))
## [1] 592
2.5
-
Subset the rows of ces_sub
that have a Traffic
value less than 500 and an Asthma
value more than 100 - how many are there?
-
filter(ces_sub, Traffic < 500 & Asthma > 100) # all of these options work
+
Subset the rows of ces_sub that have an Asthma value more than 100 and a Traffic value less than 500 - how many are there?
+
filter(ces_sub, Asthma > 100 & Traffic < 500) # all of these options work
## # A tibble: 130 × 3
## CensusTract Traffic Asthma
## <dbl> <dbl> <dbl>
@@ -384,14 +383,14 @@ 2.5
## 9 6001407600 311. 115.
## 10 6001408200 437. 187.
## # ℹ 120 more rows
-
nrow(filter(ces_sub, Traffic < 500 & Asthma > 100))
+
nrow(filter(ces_sub, Asthma > 100 & Traffic < 500))
## [1] 130
-
nrow(filter(ces_sub, Traffic < 500, Asthma > 100))
+
nrow(filter(ces_sub, Asthma > 100, Traffic < 500))
## [1] 130
2.6
-
Subset the rows of ces_sub
that have a Traffic
value less than or equal to 500 and an Asthma
value more than 100 - how many are there?
+
Subset the rows of ces_sub
that have an Asthma
value more than 100 and a Traffic
value less than or equal to (<=
) 500 — how many are there?
filter(ces_sub, Traffic <= 500 & Asthma > 100) # all of these options work
## # A tibble: 130 × 3
## CensusTract Traffic Asthma
@@ -407,9 +406,9 @@ 2.6
## 9 6001407600 311. 115.
## 10 6001408200 437. 187.
## # ℹ 120 more rows
-
nrow(filter(ces_sub, Traffic <= 500 & Asthma > 100))
+
nrow(filter(ces_sub, Asthma > 100 & Traffic <= 500))
## [1] 130
-
nrow(filter(ces_sub, Traffic <= 500, Asthma > 100))
+
nrow(filter(ces_sub, Asthma > 100, Traffic <= 500))
## [1] 130