Finalize some changes to summarization

fhdsl · Oct 2, 2024 · c7f2f14
1 parent 7d10d18
commit c7f2f14
Showing 4 changed files with 101 additions and 50 deletions.
diff --git a/modules/Data_Summarization/Data_Summarization.Rmd b/modules/Data_Summarization/Data_Summarization.Rmd
@@ -34,34 +34,44 @@ pre { /* Code block - slightly smaller in this lecture */
 📃[Day 3 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-3.pdf)
 
 
-## Summarization with Data
+## The Data
 
 We can use the CO heat-related ER visits dataset to explore different ways of summarizing data. 
 
-(*Reminder* This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.)  
+*Reminder*: This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.
 
-The `head` command displays the first rows of an object:
+
+## The Data
+
+We can use the CO heat-related ER visits dataset to explore different ways of summarizing data. 
+
+The `head` function displays the first rows of an object:
+
+<!-- ```{r} -->
+<!-- er <-  -->
+<!--   read_csv("https://daseh.org/data/CO_ER_heat_visits.csv") -->
+
+<!-- head(er) -->
+<!-- ``` -->
 
 ```{r}
 er <- 
-  read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
+  read_csv("../../data/CO_ER_heat_visits.csv")
 
 head(er)
 ```
 
-
 ## Behavior of `pull()` function
 
-`pull()` converts a single data column into a vector. This allows you to run summary functions.
+`pull()` converts a single data column into a <span style="color:blue">vector</span>.
 
 ```{r, eval=FALSE}
 er %>% pull(visits)
 ```
 
-
 ## Data Summarization
 
-Now that we have a vector of numbers.. what can we do with it?
+Now that we have a <span style="color:blue">vector of numbers</span>.. what can we do with it?
 
 * Basic statistical summarization
     * `mean(x)`: takes the mean of x
@@ -73,12 +83,19 @@ Now that we have a vector of numbers.. what can we do with it?
     * `max(x)`: maximum value in x
     * `min(x)`: minimum value in x
 
-## Statistical summarization the "tidy" way
+## Pipe (`%>%`) vectors into summarizing functions
 
-**Add the ** `na.rm =` **argument for missing data**
+A vector can be summarized:
 
 ```{r}
 er %>% pull(visits) %>% mean()
+```
+
+<br> 
+
+Add the `na.rm =` argument for missing data
+
+```{r}
 er %>% pull(visits) %>% mean(na.rm=T)
 ```
 
@@ -98,7 +115,7 @@ C. A dataset
 
 `summarize` works on datasets without `pull()`.
 
-Multiple summary statistics can be calculated at once. `pull()` can only do one column.
+Multiple summary statistics can be calculated at once!
 
 <div class = "codeexample">
 ```{r, eval = FALSE}
@@ -150,10 +167,11 @@ summary(er)
 
 ## Summary & Lab Part 1
 
-- summary stats (`mean()`) work with `pull()`
+- `pull()` creates a *vector*
 - don't forget the `na.rm = TRUE` argument!
 - `summary(x)`: quantile information
 - `summarize`: creates a summary table of columns of interest
+- summary stats (`mean()`) work with vectors or with `summarize()`
 
 🏠 [Class Website](https://daseh.org/)
 
@@ -172,7 +190,9 @@ er %>%
 
 ## How many `distinct()` values?
 
-`n_distinct()` tells you the number of unique elements. _Must pull the column first!_
+`n_distinct()` tells you the number of unique elements. 
+
+It needs a vector so you _must pull the column first!_
 
 ```{r}
 er %>%
@@ -186,29 +206,45 @@ options(max.print = 1000)
 ```
 
 
-## `dplyr`: `count` 
-
-Use `count` to return row count by category.
+## Use `count()` to return row count per category.
 
 ```{r, message = FALSE}
 er %>% count(county)
 ```
 
+_Looks like 12 rows/observations per county!_
 
-## `dplyr`: `count` 
-
-Multiple columns listed further subdivides the count.
+## Multiple columns listed further subdivides the `count()`
 
 ```{r, message = FALSE}
 er %>% count(county, year)
 ```
 
+_Looks like 1 row/observation per county and year!_
+
+## GUT CHECK!
 
+The `count()` function can help us tally:
+
+A. Sample size
+
+B. Rows per each category
+
+C. How many categories
 
 # Grouping
 
+## Goal
+
+We want to find the mean number of ER visits per year in the dataset.
+
+_How do we do this?_
+
+
 ## Perform Operations By Groups: dplyr
 
+First, let's group the data.
+
 `group_by` allows you to group the data set by the variables/columns you specify:
 
 ```{r}
@@ -227,7 +263,7 @@ er_grouped %>%
 ```
 
 
-## Use the `pipe` to string these together!
+## Do it in one step: use `%>%` to string these together!
 
 Pipe `er` into `group_by`, then pipe that into `summarize`:
 
@@ -247,9 +283,9 @@ er %>%
   summarize(avg_visits = mean(visits, na.rm = TRUE))
 ```
 
-## Counting
+## Counting rows/observations
 
-There are other functions, such as `n()` count the number of observations (NAs included).
+There are other summarizing functions, such as `n()`, which counts the number of rows/observations (NAs included).
 
 ```{r}
 er %>%
@@ -259,15 +295,23 @@ er %>%
 ```
 
 
-## Counting{.codesmall}
+## Counting: `count()` and `n()`
 
 `count()` and `n()` can give very similar information.
 
 ```{r}
+# Here we use count()
 er %>% count(year)
-er %>% group_by(year) %>% summarize(n()) # n() typically used with summarize
 ```
 
+## Counting: `count()` and `n()`
+
+`count()` and `n()` can give very similar information.
+
+```{r}
+# n() with summarize
+er %>% group_by(year) %>% summarize(n()) 
+```
 
 # A few miscellaneous topics .. 
 
@@ -278,6 +322,7 @@ These functions require a column as a vector using `pull()`.
 
 ```{r, message = FALSE}
 er_year <- er %>% pull(year) # pull() to make a vector
+
 er_year %>% unique() # similar to distinct()
 ```
 
@@ -301,7 +346,6 @@ er_year %>% unique() %>% length() # similar to n_distinct()
   - `n_distinct()` with `pull()`: how many distinct values?
 - `group_by()`: changes all subsequent functions
   - combine with `summarize()` to get statistics per group
-  - combine with `mutate()` to add column
 - `summarize()` with `n()` gives the count (NAs included) 
 
 ## Lab Part 2
@@ -314,9 +358,12 @@ er_year %>% unique() %>% length() # similar to n_distinct()
 
 📃[Posit's data transformation Cheatsheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf)
 
-For more advanced learning, check out https://www.danieldsjoberg.com/gtsummary/ for tables of summary statistics and the extra slides in this file.
+**For more advanced learning:**
+
+- https://www.danieldsjoberg.com/gtsummary/ for tables
+- extra slides in this file.
 
-```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
+```{r, fig.alt="The End", out.width = "30%", echo = FALSE, fig.align='center'}
 knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
 ```
 
@@ -333,8 +380,13 @@ Image by <a href="https://pixabay.com/users/geralt-9301/?utm_source=link-attribu
     * `rowSums(x)`: takes the sum of each row of x
     * `colSums(x)`: takes the sum of each column of x
 
+<!-- ```{r} -->
+<!-- yearly_co2 <- read_csv("https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv") -->
+<!-- ``` -->
+
+
 ```{r}
-yearly_co2 <- read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
+yearly_co2 <- read_csv("../../data/Yearly_CO2_Emissions_1000_tonnes.csv")
 ```
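
Reviewer note: a minimal sketch of the two routes these slides contrast, piping a pulled vector into a summary function versus calling `summarize()` on the data frame itself. The `toy` tibble and its values are made up for illustration and only stand in for `er`; `dplyr` is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data standing in for `er` (values are invented)
toy <- tibble(
  county = c("A", "A", "B", "B"),
  year   = c(2011, 2012, 2011, 2012),
  visits = c(10, NA, 4, 6)
)

# Route 1: pull() returns a plain vector, so base summary functions apply
avg_all <- toy %>% pull(visits) %>% mean(na.rm = TRUE)

# Route 2: summarize() works on the data frame and can compute
# several statistics at once, here per year after group_by()
by_year <- toy %>%
  group_by(year) %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE), n_rows = n())

# count() gives the same per-group row tally as summarize(n())
counted <- toy %>% count(year)
```

Here `counted$n` and `by_year$n_rows` should agree, which is the point of the paired "Counting: `count()` and `n()`" slides.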
 
 

diff --git a/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd b/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd
@@ -22,7 +22,7 @@ ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")
 
 ### 1.1 
 
-How observations/rows are in the `ces` data set? You can use `dim()` or `nrow()` or examine the Environment.
+How many observations/rows are in the `ces` data set? You can use `dim()` or `nrow()` or examine the Environment.
 
 ```{r 1.1response}
 nrow(ces)

diff --git a/modules/cheatsheets/Day-4.md b/modules/cheatsheets/Day-4.md
@@ -1,6 +1,6 @@
 ---
-classoption:
-- landscape
+classoption: landscape
+output: pdf_document
 ---
 
 # Day 4 Cheatsheet
@@ -10,30 +10,29 @@ classoption:
 ### Functions
 |Library/Package|Piece of code|Example of usage|What it does|
 |---------------|-------------|----------------|-------------|
-|Base `R`| [`min(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Extremes) |`min(x)`| Returns the minimum value of all values in an object `x`.|
-|Base `R`| [`sum(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sum) | `sum(x)`| Returns the sum of all values (values must be integer, numeric, or logical) in object `x`.|
-|Base `R`| [`mean(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean) |`mean(x)`| Returns the arithmetic mean of all values (values must be integer or numeric) in object `x` or logical vector `x`.|
-| Base `R`|[`log(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/log) |`log(x)`| Gives the natural logarithm of object `x`. `log2(x)` can be used to give the logarithm of the object in base 2. Or the base can be specified as an argument.|
-| Base `R`|[`range(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/range) |`range(x)`| Gives the min and max for object `x`.|
-| Base `R`|[`sd(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd) |`sd(x)`| Gives the standard deviation for object `x`.|
-| Base `R`|[`sqrt(x)`](https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/sqrt) |`sqrt(x)`| Gives the square root for object `x`.|
-| Base `R`|[`quantile(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile)|`quantile(x, probs = .5)`| Produces sample quantiles corresponding to the given probabilities `x`.|
+| Base `R`| [`min(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Extremes) |`min(x)`| Returns the minimum value of all values in an object `x`.|
+| Base `R`| [`sum(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sum) | `sum(x)`| Returns the sum of all values (values must be integer, numeric, or logical) in object `x`.|
+| Base `R`| [`mean(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean) |`mean(x)`| Returns the arithmetic mean of all values (values must be integer or numeric) in object `x` or logical vector `x`.|
+| Base `R`| [`log(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/log) |`log(x)`| Gives the natural logarithm of object `x`. `log2(x)` can be used to give the logarithm of the object in base 2. Or the base can be specified as an argument.|
+| Base `R`| [`range(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/range) |`range(x)`| Gives the min and max for object `x`.|
+| Base `R`| [`sd(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd) |`sd(x)`| Gives the standard deviation for object `x`.|
+| Base `R`| [`sqrt(x)`](https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/sqrt) |`sqrt(x)`| Gives the square root for object `x`.|
+| Base `R`| [`quantile(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile)|`quantile(x, probs = .5)`| Produces sample quantiles corresponding to the given probabilities `x`.|
 | Base `R`| [`summary(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/summary)|`summary(x)`| Returns a summary of the values in object `x`.|
+| `dplyr`| [`pull()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/pull)| `x_vect <- df %>% pull(x)` | Extract a single column into vector form. `pull()` is very handy before summary functions like `mean()`, `sum()`, etc. |
+| `dplyr`| [`summarize()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarize)      | `df <- df %>% summarize(mean_x = mean(x))` | Summarizes multiple values in an object into a single value. This function can be used with other functions to retrieve a single output value for the grouped values. `summarize` and `summarise` are synonyms in this package. However, note that this function does not work in the same manner as the base R `summary` function.|
+| `dplyr`| [`distinct()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/distinct) |`df %>% distinct(factor_name)`| Display unique/distinct rows from a data frame or tibble|
+| `dplyr`| [`n_distinct()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/n_distinct) |`x_vect %>% n_distinct()`|Counts the number of unique/distinct combinations in a set of one or more vectors.|
+| `dplyr`| [`count()`](https://dplyr.tidyverse.org/reference/count.html)|`df %>% count(factor_name)`|Count the number of groups in a factor variable of a data frame or tibble|
+| `dplyr`| [`group_by()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% group_by(factor_name)`| Groups data into rows that contain the same specified value(s)|
+| `dplyr`| [`ungroup()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% ungroup()`| Undo a grouping that was done by `group_by()`|
+| Base `R`| [`unique()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/unique)| `unique(df)`|Returns a vector, data frame or array like x but with duplicate elements/rows removed.|
 | Base `R`| [`rowSums()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rowsum) | `rowSums(df)`|Calculates sums for each row|
 | Base `R`| [`colSums()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/colSums)|`colSums(df)`| Calculates sums for each column|
 | Base `R`| [`rowMeans()`](https://www.rdocumentation.org/packages/fame/versions/1.03/topics/rowMeans)| `rowMeans(df)`|Calculates means for each row|
 | Base `R`| [`colMeans()`](https://www.statology.org/colmeans-in-r/)|`colMeans(df)`| Calculates means for each column|
-| `dplyr`|[`summarize()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarize)      | `df <- df %>% summarize(mean_x = mean(x))` | Summarizes multiple values in an object into a single value. This function can be used with other functions to retrieve a single output value for the grouped values. `summarize` and `summarise` are synonyms in this package. However, note that this function does not work in the same manner as the base R `summary` function.|
-| `dplyr`|[`across()`](https://dplyr.tidyverse.org/reference/across.html)| `df %>% summarize(across( c('col_a', 'col_b'), ~ sum(.x)))`| Use the across function with summarize to summarize across multiple columns of your data.|
-| Base `R`| [`unique()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/unique)| `unique(df)`|Returns a vector, data frame or array like x but with duplicate elements/rows removed.|
-| Base `R`| [`table()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/table)| `table(x)`| Builds a contingency table of the counts at each combination of factor levels.|
-| `dplyr`| [`count()`](https://dplyr.tidyverse.org/reference/count.html)|`df %>% count(factor_name)`|Count the number of groups in a factor variable of a data frame or tibble|
-| `dplyr`| [`group_by()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% count(factor_name)`| Groups data into rows that contain the same specified value(s)|
-| `dplyr`| [`ungroup()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% count(factor_name)`| Undo a grouping that was done by `group_by()`|
-| Base `R`| [`plot()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/plot)|`plot(x, y)`| Creates a scatterplot of x and y vector data|
-| Base `R`| [`boxplot()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/boxplot)|`boxplot(x, y)`| Creates a boxplot of y against levels of x|
-| Base `R`| [`hist()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/hist)|`hist(x)`| Creates a histogram of x|
-| Base `R`| [`density()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/density) |`plot(density(x))`| Creates a kernel density plot of x when used with `plot()`|
+
+- Many summarizing functions (e.g., `mean()`, `sum()`) have the argument `na.rm = TRUE`. This can be used to ignore missing data.
 
 <div style="page-break-after: always;"></div>
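
Reviewer note: a short sketch of what the new `na.rm = TRUE` bullet means in practice. The vector `x` is hypothetical and `dplyr` is assumed to be installed; it also shows that `n_distinct()` counts `NA` as its own value unless told otherwise.

```r
library(dplyr)

# Hypothetical vector with one missing value (made up for illustration)
x <- c(3, 5, NA, 5)

mean_default <- mean(x)                # NA: the missing value propagates
mean_no_na   <- mean(x, na.rm = TRUE)  # mean of 3, 5, 5
sum_no_na    <- sum(x, na.rm = TRUE)   # 3 + 5 + 5

# n_distinct() treats NA as a distinct value unless na.rm = TRUE
n_all   <- n_distinct(x)
n_no_na <- n_distinct(x, na.rm = TRUE)
```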
 

diff --git a/modules/cheatsheets/Day-4.pdf b/modules/cheatsheets/Day-4.pdf