diff --git a/modules/Data_Summarization/Data_Summarization.Rmd b/modules/Data_Summarization/Data_Summarization.Rmd
index 1df85a57..720344f1 100644
--- a/modules/Data_Summarization/Data_Summarization.Rmd
+++ b/modules/Data_Summarization/Data_Summarization.Rmd
@@ -34,34 +34,44 @@ pre { /* Code block - slightly smaller in this lecture */
📃[Day 3 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-3.pdf)
-## Summarization with Data
+## The Data
We can use the CO heat-related ER visits dataset to explore different ways of summarizing data.
-(*Reminder* This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.)
+*Reminder*: This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.
-The `head` command displays the first rows of an object:
+
+## The Data
+
+We can use the CO heat-related ER visits dataset to explore different ways of summarizing data.
+
+The `head` function displays the first rows of an object:
+
+
+
+
+
+
+
```{r}
er <-
- read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
+ read_csv("../../data/CO_ER_heat_visits.csv")
head(er)
```
-
## Behavior of `pull()` function
-`pull()` converts a single data column into a vector. This allows you to run summary functions.
+`pull()` converts a single data column into a vector.
```{r, eval=FALSE}
er %>% pull(visits)
```
-
## Data Summarization
-Now that we have a vector of numbers.. what can we do with it?
+Now that we have a vector of numbers.. what can we do with it?
* Basic statistical summarization
* `mean(x)`: takes the mean of x
@@ -73,12 +83,19 @@ Now that we have a vector of numbers.. what can we do with it?
* `max(x)`: maximum value in x
* `min(x)`: minimum value in x
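
As a minimal sketch of the functions listed above, the following applies a few of them to a small made-up vector. The name `toy_visits` and its values are hypothetical; the vector stands in for the result of `er %>% pull(visits)` and is not taken from the dataset.

```r
# Hypothetical vector standing in for er %>% pull(visits)
toy_visits <- c(3, 7, 1, 9, 5)

toy_mean <- mean(toy_visits)  # arithmetic mean: (3 + 7 + 1 + 9 + 5) / 5 = 5
toy_min  <- min(toy_visits)   # smallest value: 1
toy_max  <- max(toy_visits)   # largest value: 9
```

Each function takes the whole vector and returns a single number, which is why the column has to be pulled out of the data frame first.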
-## Statistical summarization the "tidy" way
+## Pipe (`%>%`) vectors into summarizing functions
-**Add the ** `na.rm =` **argument for missing data**
+A vector can be summarized:
```{r}
er %>% pull(visits) %>% mean()
+```
+
+
+
+Add the `na.rm =` argument for missing data
+
+```{r}
er %>% pull(visits) %>% mean(na.rm=T)
```
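
To see why `na.rm` matters, here is a small sketch that assumes a made-up data frame with one missing value. The object `er_toy` and its values are hypothetical and only mimic the shape of the `visits` column.

```r
library(dplyr)

# Hypothetical data frame; the NA mimics a missing count
er_toy <- tibble(visits = c(10, NA, 30))

# Without na.rm, a single NA makes the result NA
mean_with_na <- er_toy %>% pull(visits) %>% mean()

# With na.rm = TRUE, the NA is dropped first: (10 + 30) / 2 = 20
mean_no_na <- er_toy %>% pull(visits) %>% mean(na.rm = TRUE)
```

Writing `na.rm = TRUE` in full rather than `T` is a common style choice, since `T` is an ordinary variable that can be reassigned.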
@@ -98,7 +115,7 @@ C. A dataset
`summarize` works on datasets without `pull()`.
-Multiple summary statistics can be calculated at once. `pull()` can only do one column.
+Multiple summary statistics can be calculated at once!
```{r, eval = FALSE}
@@ -150,10 +167,11 @@ summary(er)
## Summary & Lab Part 1
-- summary stats (`mean()`) work with `pull()`
+- `pull()` creates a *vector*
- don't forget the `na.rm = TRUE` argument!
- `summary(x)`: quantile information
- `summarize`: creates a summary table of columns of interest
+- summary stats (`mean()`) work with vectors or with `summarize()`
🏠 [Class Website](https://daseh.org/)
@@ -172,7 +190,9 @@ er %>%
## How many `distinct()` values?
-`n_distinct()` tells you the number of unique elements. _Must pull the column first!_
+`n_distinct()` tells you the number of unique elements.
+
+It needs a vector so you _must pull the column first!_
```{r}
er %>%
@@ -186,29 +206,45 @@ options(max.print = 1000)
```
-## `dplyr`: `count`
-
-Use `count` to return row count by category.
+## Use `count()` to return row count per category.
```{r, message = FALSE}
er %>% count(county)
```
+_Looks like 12 rows/observations per county!_
-## `dplyr`: `count`
-
-Multiple columns listed further subdivides the count.
+## Listing multiple columns further subdivides the `count()`
```{r, message = FALSE}
er %>% count(county, year)
```
+_Looks like 1 row/observation per county and year!_
+
+## GUT CHECK!
+The `count()` function can help us tally:
+
+A. Sample size
+
+B. Rows per category
+
+C. How many categories
# Grouping
+## Goal
+
+We want to find the mean number of ER visits per year in the dataset.
+
+_How do we do this?_
+
+
## Perform Operations By Groups: dplyr
+First, let's group the data.
+
`group_by` allows you to group the data set by variables/columns you specify:
```{r}
@@ -227,7 +263,7 @@ er_grouped %>%
```
-## Use the `pipe` to string these together!
+## Do it in one step: use `%>%` to string these together!
Pipe `er` into `group_by`, then pipe that into `summarize`:
@@ -247,9 +283,9 @@ er %>%
summarize(avg_visits = mean(visits, na.rm = TRUE))
```
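
The same group-then-summarize pattern can be sketched on a tiny made-up data frame, where the result is easy to check by hand. The object `er_toy` and its values are hypothetical, not rows from the ER visits dataset.

```r
library(dplyr)

# Hypothetical data: two rows per year, one missing value
er_toy <- tibble(
  year   = c(2011, 2011, 2012, 2012),
  visits = c(10, 20, NA, 40)
)

# One output row per year; na.rm = TRUE drops the NA within its group
avg_by_year <- er_toy %>%
  group_by(year) %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE))
# 2011: (10 + 20) / 2 = 15; 2012: 40 (the NA is ignored)
```

Because the data are grouped first, `summarize()` returns one row per group instead of a single row for the whole data frame.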
-## Counting
+## Counting rows/observations
-There are other functions, such as `n()` count the number of observations (NAs included).
+There are other summarizing functions, such as `n()`, which counts the number of rows/observations (NAs included).
```{r}
er %>%
@@ -259,15 +295,23 @@ er %>%
```
-## Counting{.codesmall}
+## Counting: `count()` and `n()`
`count()` and `n()` can give very similar information.
```{r}
+# Here we use count()
er %>% count(year)
-er %>% group_by(year) %>% summarize(n()) # n() typically used with summarize
```
+## Counting: `count()` and `n()`
+
+`count()` and `n()` can give very similar information.
+
+```{r}
+# n() with summarize
+er %>% group_by(year) %>% summarize(n())
+```
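
A short sketch, assuming a made-up data frame, shows that the two approaches give the same counts. The object `er_toy` is hypothetical, and naming the summary column `n` (rather than leaving it unnamed as in the slide) is an assumption made here so the two results line up.

```r
library(dplyr)

# Hypothetical data: 2 rows for 2011 and 3 rows for 2012
er_toy <- tibble(
  county = c("A", "B", "A", "B", "A"),
  year   = c(2011, 2011, 2012, 2012, 2012)
)

# count() groups and tallies in one step
by_count <- er_toy %>% count(year)

# n() does the tallying inside summarize() after an explicit group_by()
by_n <- er_toy %>% group_by(year) %>% summarize(n = n())
```

Both results have one row per year with a column `n` holding the row counts. `count()` is the shortcut; `n()` is useful when other summaries are computed in the same `summarize()` call.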
# A few miscellaneous topics ..
@@ -278,6 +322,7 @@ These functions require a column as a vector using `pull()`.
```{r, message = FALSE}
er_year <- er %>% pull(year) # pull() to make a vector
+
er_year %>% unique() # similar to distinct()
```
@@ -301,7 +346,6 @@ er_year %>% unique() %>% length() # similar to n_distinct()
- `n_distinct()` with `pull()`: how many distinct values?
- `group_by()`: changes all subsequent functions
- combine with `summarize()` to get statistics per group
- - combine with `mutate()` to add column
- `summarize()` with `n()` gives the count (NAs included)
## Lab Part 2
@@ -314,9 +358,12 @@ er_year %>% unique() %>% length() # similar to n_distinct()
📃[Posit's data transformation Cheatsheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf)
-For more advanced learning, check out https://www.danieldsjoberg.com/gtsummary/ for tables of summary statistics and the extra slides in this file.
+**For more advanced learning:**
+
+- https://www.danieldsjoberg.com/gtsummary/ for tables
+- extra slides in this file.
-```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
+```{r, fig.alt="The End", out.width = "30%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```
@@ -333,8 +380,13 @@ Image by
+
+
+
```{r}
-yearly_co2 <- read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
+yearly_co2 <- read_csv("../../data/Yearly_CO2_Emissions_1000_tonnes.csv")
```
diff --git a/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd b/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd
index a69a5bcf..f3332907 100644
--- a/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd
+++ b/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd
@@ -22,7 +22,7 @@ ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")
### 1.1
-How observations/rows are in the `ces` data set? You can use `dim()` or `nrow()` or examine the Environment.
+How many observations/rows are in the `ces` data set? You can use `dim()` or `nrow()` or examine the Environment.
```{r 1.1response}
nrow(ces)
diff --git a/modules/cheatsheets/Day-4.md b/modules/cheatsheets/Day-4.md
index 32e0ba5d..45252d5c 100644
--- a/modules/cheatsheets/Day-4.md
+++ b/modules/cheatsheets/Day-4.md
@@ -1,6 +1,6 @@
---
-classoption:
-- landscape
+classoption: landscape
+output: pdf_document
---
# Day 4 Cheatsheet
@@ -10,30 +10,29 @@ classoption:
### Functions
|Library/Package|Piece of code|Example of usage|What it does|
|---------------|-------------|----------------|-------------|
-|Base `R`| [`min(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Extremes) |`min(x)`| Returns the minimum value of all values in an object `x`.|
-|Base `R`| [`sum(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sum) | `sum(x)`| Returns the sum of all values (values must be integer, numeric, or logical) in object `x`.|
-|Base `R`| [`mean(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean) |`mean(x)`| Returns the arithmetic mean of all values (values must be integer or numeric) in object `x` or logical vector `x`.|
-| Base `R`|[`log(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/log) |`log(x)`| Gives the natural logarithm of object `x`. `log2(x)` can be used to give the logarithm of the object in base 2. Or the base can be specified as an argument.|
-| Base `R`|[`range(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/range) |`range(x)`| Gives the min and max for object `x`.|
-| Base `R`|[`sd(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd) |`sd(x)`| Gives the standard deviation for object `x`.|
-| Base `R`|[`sqrt(x)`](https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/sqrt) |`sqrt(x)`| Gives the square root for object `x`.|
-| Base `R`|[`quantile(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile)|`quantile(x, probs = .5)`| Produces sample quantiles corresponding to the given probabilities `x`.|
+| Base `R`| [`min(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Extremes) |`min(x)`| Returns the minimum value of all values in an object `x`.|
+| Base `R`| [`sum(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sum) | `sum(x)`| Returns the sum of all values (values must be integer, numeric, or logical) in object `x`.|
+| Base `R`| [`mean(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean) |`mean(x)`| Returns the arithmetic mean of all values (values must be integer, numeric, or logical) in object `x`.|
+| Base `R`| [`log(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/log) |`log(x)`| Gives the natural logarithm of object `x`. `log2(x)` can be used to give the logarithm of the object in base 2. Or the base can be specified as an argument.|
+| Base `R`| [`range(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/range) |`range(x)`| Gives the min and max for object `x`.|
+| Base `R`| [`sd(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd) |`sd(x)`| Gives the standard deviation for object `x`.|
+| Base `R`| [`sqrt(x)`](https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/sqrt) |`sqrt(x)`| Gives the square root for object `x`.|
+| Base `R`| [`quantile(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile)|`quantile(x, probs = .5)`| Produces sample quantiles of `x` corresponding to the probabilities given in `probs`.|
| Base `R`| [`summary(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/summary)|`summary(x)`| Returns a summary of the values in object `x`.|
+| `dplyr`| [`pull()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/pull)| `x_vect <- df %>% pull(x)` | Extract a single column into vector form. `pull()` is very handy before summary functions like `mean()`, `sum()`, etc. |
+| `dplyr`| [`summarize()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarize) | `df <- df %>% summarize(mean_x = mean(x))` | Summarizes multiple values in an object into a single value. This function can be used with other functions to retrieve a single output value for the grouped values. `summarize` and `summarise` are synonyms in this package. However, note that this function does not work in the same manner as the base R `summary` function.|
+| `dplyr`| [`distinct()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/distinct) |`df %>% distinct(factor_name)`| Display unique/distinct rows from a data frame or tibble|
+| `dplyr`| [`n_distinct()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/n_distinct) |`x_vect %>% n_distinct()`|Counts the number of unique/distinct combinations in a set of one or more vectors.|
+| `dplyr`| [`count()`](https://dplyr.tidyverse.org/reference/count.html)|`df %>% count(factor_name)`|Counts the number of rows in each group (unique value) of one or more variables of a data frame or tibble|
+| `dplyr`| [`group_by()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% group_by(factor_name)`| Groups data into rows that contain the same specified value(s)|
+| `dplyr`| [`ungroup()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% ungroup()`| Undo a grouping that was done by `group_by()`|
+| Base `R`| [`unique()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/unique)| `unique(df)`|Returns a vector, data frame or array like x but with duplicate elements/rows removed.|
| Base `R`| [`rowSums()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rowsum) | `rowSums(df)`|Calculates sums for each row|
| Base `R`| [`colSums()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/colSums)|`colSums(df)`| Calculates sums for each column|
| Base `R`| [`rowMeans()`](https://www.rdocumentation.org/packages/fame/versions/1.03/topics/rowMeans)| `rowMeans(df)`|Calculates means for each row|
| Base `R`| [`colMeans()`](https://www.statology.org/colmeans-in-r/)|`colMeans(df)`| Calculates means for each column|
-| `dplyr`|[`summarize()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarize) | `df <- df %>% summarize(mean_x = mean(x))` | Summarizes multiple values in an object into a single value. This function can be used with other functions to retrieve a single output value for the grouped values. `summarize` and `summarise` are synonyms in this package. However, note that this function does not work in the same manner as the base R `summary` function.|
-| `dplyr`|[`across()`](https://dplyr.tidyverse.org/reference/across.html)| `df %>% summarize(across( c('col_a', 'col_b'), ~ sum(.x)))`| Use the across function with summarize to summarize across multiple columns of your data.|
-| Base `R`| [`unique()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/unique)| `unique(df)`|Returns a vector, data frame or array like x but with duplicate elements/rows removed.|
-| Base `R`| [`table()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/table)| `table(x)`| Builds a contingency table of the counts at each combination of factor levels.|
-| `dplyr`| [`count()`](https://dplyr.tidyverse.org/reference/count.html)|`df %>% count(factor_name)`|Count the number of groups in a factor variable of a data frame or tibble|
-| `dplyr`| [`group_by()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% count(factor_name)`| Groups data into rows that contain the same specified value(s)|
-| `dplyr`| [`ungroup()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% count(factor_name)`| Undo a grouping that was done by `group_by()`|
-| Base `R`| [`plot()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/plot)|`plot(x, y)`| Creates a scatterplot of x and y vector data|
-| Base `R`| [`boxplot()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/boxplot)|`boxplot(x, y)`| Creates a boxplot of y against levels of x|
-| Base `R`| [`hist()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/hist)|`hist(x)`| Creates a histogram of x|
-| Base `R`| [`density()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/density) |`plot(density(x))`| Creates a kernel density plot of x when used with `plot()`|
+
+- Many summarizing functions (e.g., `mean()`, `sum()`) have the argument `na.rm = TRUE`. This can be used to ignore missing data.
diff --git a/modules/cheatsheets/Day-4.pdf b/modules/cheatsheets/Day-4.pdf
index 7f6374e8..b8d4797f 100644
Binary files a/modules/cheatsheets/Day-4.pdf and b/modules/cheatsheets/Day-4.pdf differ