diff --git a/modules/Data_Summarization/Data_Summarization.Rmd b/modules/Data_Summarization/Data_Summarization.Rmd index 1df85a57..720344f1 100644 --- a/modules/Data_Summarization/Data_Summarization.Rmd +++ b/modules/Data_Summarization/Data_Summarization.Rmd @@ -34,34 +34,44 @@ pre { /* Code block - slightly smaller in this lecture */ 📃[Day 3 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-3.pdf) -## Summarization with Data +## The Data We can use the CO heat-related ER visits dataset to explore different ways of summarizing data. -(*Reminder* This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.) +*Reminder*: This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age. -The `head` command displays the first rows of an object: + +## The Data + +We can use the CO heat-related ER visits dataset to explore different ways of summarizing data. + +The `head` function displays the first rows of an object: + + + + + + + ```{r} er <- - read_csv("https://daseh.org/data/CO_ER_heat_visits.csv") + read_csv("../../data/CO_ER_heat_visits.csv") head(er) ``` - ## Behavior of `pull()` function -`pull()` converts a single data column into a vector. This allows you to run summary functions. +`pull()` converts a single data column into a vector. ```{r, eval=FALSE} er %>% pull(visits) ``` - ## Data Summarization -Now that we have a vector of numbers.. what can we do with it? +Now that we have a vector of numbers.. what can we do with it? * Basic statistical summarization * `mean(x)`: takes the mean of x @@ -73,12 +83,19 @@ Now that we have a vector of numbers.. what can we do with it? 
* `max(x)`: maximum value in x * `min(x)`: minimum value in x -## Statistical summarization the "tidy" way +## Pipe (`%>%`) vectors into summarizing functions -**Add the ** `na.rm =` **argument for missing data** +A vector can be summarized: ```{r} er %>% pull(visits) %>% mean() +``` + +
+ +Add the `na.rm =` argument for missing data + +```{r} er %>% pull(visits) %>% mean(na.rm=T) ``` @@ -98,7 +115,7 @@ C. A dataset `summarize` works on datasets without `pull()`. -Multiple summary statistics can be calculated at once. `pull()` can only do one column. +Multiple summary statistics can be calculated at once!
```{r, eval = FALSE} @@ -150,10 +167,11 @@ summary(er) ## Summary & Lab Part 1 -- summary stats (`mean()`) work with `pull()` +- `pull()` creates a *vector* - don't forget the `na.rm = TRUE` argument! - `summary(x)`: quantile information - `summarize`: creates a summary table of columns of interest +- summary stats (`mean()`) work with vectors or with `summarize()` 🏠 [Class Website](https://daseh.org/) @@ -172,7 +190,9 @@ er %>% ## How many `distinct()` values? -`n_distinct()` tells you the number of unique elements. _Must pull the column first!_ +`n_distinct()` tells you the number of unique elements. + +It needs a vector so you _must pull the column first!_ ```{r} er %>% @@ -186,29 +206,45 @@ options(max.print = 1000) ``` -## `dplyr`: `count` - -Use `count` to return row count by category. +## Use `count()` to return row count per category. ```{r, message = FALSE} er %>% count(county) ``` +_Looks like 12 rows/observations per county!_ -## `dplyr`: `count` - -Multiple columns listed further subdivides the count. +## Multiple columns listed further subdivide the `count()` ```{r, message = FALSE} er %>% count(county, year) ``` +_Looks like 1 row/observation per county and year!_ + +## GUT CHECK! +The `count()` function can help us tally: + +A. Sample size + +B. Rows per category + +C. How many categories # Grouping +## Goal + +We want to find the mean number of ER visits per year in the dataset. + +_How do we do this?_ + + ## Perform Operations By Groups: dplyr +First, let's group the data. + `group_by` allows you group the data set by variables/columns you specify: ```{r} @@ -227,7 +263,7 @@ er_grouped %>% ``` -## Use the `pipe` to string these together! +## Do it in one step: use `%>%` to string these together! 
Pipe `CO_heat_ER` into `group_by`, then pipe that into `summarize`: @@ -247,9 +283,9 @@ er %>% summarize(avg_visits = mean(visits, na.rm = TRUE)) ``` -## Counting +## Counting rows/observations -There are other functions, such as `n()` count the number of observations (NAs included). +There are other summarizing functions, such as `n()`, which counts the number of rows/observations (NAs included). ```{r} er %>% @@ -259,15 +295,23 @@ er %>% ``` -## Counting{.codesmall} +## Counting: `count()` and `n()` `count()` and `n()` can give very similar information. ```{r} +# Here we use count() er %>% count(year) -er %>% group_by(year) %>% summarize(n()) # n() typically used with summarize ``` +## Counting: `count()` and `n()` + +`count()` and `n()` can give very similar information. + +```{r} +# n() with summarize +er %>% group_by(year) %>% summarize(n()) +``` # A few miscellaneous topics .. @@ -278,6 +322,7 @@ These functions require a column as a vector using `pull()`. ```{r, message = FALSE} er_year <- er %>% pull(year) # pull() to make a vector + er_year %>% unique() # similar to distinct() ``` @@ -301,7 +346,6 @@ er_year %>% unique() %>% length() # similar to n_distinct() - `n_distinct()` with `pull()`: how many distinct values? - `group_by()`: changes all subsequent functions - combine with `summarize()` to get statistics per group - - combine with `mutate()` to add column - `summarize()` with `n()` gives the count (NAs included) ## Lab Part 2 @@ -314,9 +358,12 @@ er_year %>% unique() %>% length() # similar to n_distinct() 📃[Posit's data transformation Cheatsheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf) -For more advanced learning, check out https://www.danieldsjoberg.com/gtsummary/ for tables of summary statistics and the extra slides in this file. +**For more advanced learning:** + +- https://www.danieldsjoberg.com/gtsummary/ for tables +- extra slides in this file. 
-```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'} +```{r, fig.alt="The End", out.width = "30%", echo = FALSE, fig.align='center'} knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg")) ``` @@ -333,8 +380,13 @@ Image by + + + ```{r} -yearly_co2 <- read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv") +yearly_co2 <- read_csv("../../data/Yearly_CO2_Emissions_1000_tonnes.csv") ``` diff --git a/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd b/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd index a69a5bcf..f3332907 100644 --- a/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd +++ b/modules/Data_Summarization/lab/Data_Summarization_Lab_Key.Rmd @@ -22,7 +22,7 @@ ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv") ### 1.1 -How observations/rows are in the `ces` data set? You can use `dim()` or `nrow()` or examine the Environment. +How many observations/rows are in the `ces` data set? You can use `dim()` or `nrow()` or examine the Environment. 
```{r 1.1response} nrow(ces) diff --git a/modules/cheatsheets/Day-4.md b/modules/cheatsheets/Day-4.md index 32e0ba5d..45252d5c 100644 --- a/modules/cheatsheets/Day-4.md +++ b/modules/cheatsheets/Day-4.md @@ -1,6 +1,6 @@ --- -classoption: -- landscape +classoption: landscape +output: pdf_document --- # Day 4 Cheatsheet @@ -10,30 +10,29 @@ classoption: ### Functions |Library/Package|Piece of code|Example of usage|What it does| |---------------|-------------|----------------|-------------| -|Base `R`| [`min(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Extremes) |`min(x)`| Returns the minimum value of all values in an object `x`.| -|Base `R`| [`sum(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sum) | `sum(x)`| Returns the sum of all values (values must be integer, numeric, or logical) in object `x`.| -|Base `R`| [`mean(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean) |`mean(x)`| Returns the arithmetic mean of all values (values must be integer or numeric) in object `x` or logical vector `x`.| -| Base `R`|[`log(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/log) |`log(x)`| Gives the natural logarithm of object `x`. `log2(x)` can be used to give the logarithm of the object in base 2. 
Or the base can be specified as an argument.| -| Base `R`|[`range(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/range) |`range(x)`| Gives the min and max for object `x`.| -| Base `R`|[`sd(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd) |`sd(x)`| Gives the standard deviation for object `x`.| -| Base `R`|[`sqrt(x)`](https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/sqrt) |`sqrt(x)`| Gives the square root for object `x`.| -| Base `R`|[`quantile(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile)|`quantile(x, probs = .5)`| Produces sample quantiles corresponding to the given probabilities `x`.| +| Base `R`| [`min(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Extremes) |`min(x)`| Returns the minimum value of all values in an object `x`.| +| Base `R`| [`sum(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sum) | `sum(x)`| Returns the sum of all values (values must be integer, numeric, or logical) in object `x`.| +| Base `R`| [`mean(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean) |`mean(x)`| Returns the arithmetic mean of all values (values must be integer or numeric) in object `x` or logical vector `x`.| +| Base `R`| [`log(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/log) |`log(x)`| Gives the natural logarithm of object `x`. `log2(x)` can be used to give the logarithm of the object in base 2. 
Or the base can be specified as an argument.| +| Base `R`| [`range(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/range) |`range(x)`| Gives the min and max for object `x`.| +| Base `R`| [`sd(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd) |`sd(x)`| Gives the standard deviation for object `x`.| +| Base `R`| [`sqrt(x)`](https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/sqrt) |`sqrt(x)`| Gives the square root for object `x`.| +| Base `R`| [`quantile(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile)|`quantile(x, probs = .5)`| Produces sample quantiles corresponding to the given probabilities `x`.| | Base `R`| [`summary(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/summary)|`summary(x)`| Returns a summary of the values in object `x`.| +| `dplyr`| [`pull()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/pull)| `x_vect <- df %>% pull(x)` | Extract a single column into vector form. `pull()` is very handy before summary functions like `mean()`, `sum()`, etc. | +| `dplyr`| [`summarize()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarize) | `df <- df %>% summarize(mean_x = mean(x))` | Summarizes multiple values in an object into a single value. This function can be used with other functions to retrieve a single output value for the grouped values. `summarize` and `summarise` are synonyms in this package. 
However, note that this function does not work in the same manner as the base R `summary` function.| +| `dplyr`| [`distinct()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/distinct) |`df %>% distinct(factor_name)`| Display unique/distinct rows from a data frame or tibble| +| `dplyr`| [`n_distinct()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/n_distinct) |`x_vect %>% n_distinct()`|Counts the number of unique/distinct combinations in a set of one or more vectors.| +| `dplyr`| [`count()`](https://dplyr.tidyverse.org/reference/count.html)|`df %>% count(factor_name)`|Count the number of groups in a factor variable of a data frame or tibble| +| `dplyr`| [`group_by()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% group_by(factor_name)`| Groups data into rows that contain the same specified value(s)| +| `dplyr`| [`ungroup()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% ungroup()`| Undo a grouping that was done by `group_by()`| +| Base `R`| [`unique()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/unique)| `unique(df)`|Returns a vector, data frame or array like x but with duplicate elements/rows removed.| | Base `R`| [`rowSums()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rowsum) | `rowSums(df)`|Calculates sums for each row| | Base `R`| [`colSums()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/colSums)|`colSums(df)`| Calculates sums for each column| | Base `R`| [`rowMeans()`](https://www.rdocumentation.org/packages/fame/versions/1.03/topics/rowMeans)| `rowMeans(df)`|Calculates means for each row| | Base `R`| [`colMeans()`](https://www.statology.org/colmeans-in-r/)|`colMeans(df)`| Calculates means for each column| -| `dplyr`|[`summarize()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarize) | `df <- df %>% summarize(mean_x = mean(x))` | Summarizes 
multiple values in an object into a single value. This function can be used with other functions to retrieve a single output value for the grouped values. `summarize` and `summarise` are synonyms in this package. However, note that this function does not work in the same manner as the base R `summary` function.| -| `dplyr`|[`across()`](https://dplyr.tidyverse.org/reference/across.html)| `df %>% summarize(across( c('col_a', 'col_b'), ~ sum(.x)))`| Use the across function with summarize to summarize across multiple columns of your data.| -| Base `R`| [`unique()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/unique)| `unique(df)`|Returns a vector, data frame or array like x but with duplicate elements/rows removed.| -| Base `R`| [`table()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/table)| `table(x)`| Builds a contingency table of the counts at each combination of factor levels.| -| `dplyr`| [`count()`](https://dplyr.tidyverse.org/reference/count.html)|`df %>% count(factor_name)`|Count the number of groups in a factor variable of a data frame or tibble| -| `dplyr`| [`group_by()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% count(factor_name)`| Groups data into rows that contain the same specified value(s)| -| `dplyr`| [`ungroup()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% count(factor_name)`| Undo a grouping that was done by `group_by()`| -| Base `R`| [`plot()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/plot)|`plot(x, y)`| Creates a scatterplot of x and y vector data| -| Base `R`| [`boxplot()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/boxplot)|`boxplot(x, y)`| Creates a boxplot of y against levels of x| -| Base `R`| [`hist()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/hist)|`hist(x)`| Creates a histogram of x| -| Base `R`| 
[`density()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/density) |`plot(density(x))`| Creates a kernel density plot of x when used with `plot()`| + +- Many summarizing functions (e.g., `mean()`, `sum()`) have the argument `na.rm = TRUE`. This can be used to ignore missing data.
diff --git a/modules/cheatsheets/Day-4.pdf b/modules/cheatsheets/Day-4.pdf index 7f6374e8..b8d4797f 100644 Binary files a/modules/cheatsheets/Day-4.pdf and b/modules/cheatsheets/Day-4.pdf differ