Finalize some changes to summarization
avahoffman committed Oct 2, 2024
1 parent 7d10d18 commit c7f2f14
Showing 4 changed files with 101 additions and 50 deletions.
108 changes: 80 additions & 28 deletions modules/Data_Summarization/Data_Summarization.Rmd
@@ -34,34 +34,44 @@ pre { /* Code block - slightly smaller in this lecture */
📃[Day 3 Cheatsheet](https://daseh.org/modules/cheatsheets/Day-3.pdf)


## Summarization with Data
## The Data

We can use the CO heat-related ER visits dataset to explore different ways of summarizing data.

(*Reminder* This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.)
*Reminder*: This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.

The `head` command displays the first rows of an object:

## The Data

We can use the CO heat-related ER visits dataset to explore different ways of summarizing data.

The `head` function displays the first rows of an object:

<!-- ```{r} -->
<!-- er <- -->
<!-- read_csv("https://daseh.org/data/CO_ER_heat_visits.csv") -->

<!-- head(er) -->
<!-- ``` -->

```{r}
er <-
read_csv("https://daseh.org/data/CO_ER_heat_visits.csv")
read_csv("../../data/CO_ER_heat_visits.csv")
head(er)
```


## Behavior of `pull()` function

`pull()` converts a single data column into a vector. This allows you to run summary functions.
`pull()` converts a single data column into a <span style="color:blue">vector</span>.

```{r, eval=FALSE}
er %>% pull(visits)
```
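
As a rough sketch of what `pull()` does, the example below compares it with `select()`. The `toy_er` tibble and its values are made up for illustration and only stand in for the real `er` data.

```r
library(dplyr)

# Hypothetical toy data standing in for the `er` dataset
toy_er <- tibble(
  county = c("Adams", "Adams", "Boulder"),
  year   = c(2011, 2012, 2011),
  visits = c(10, 12, 7)
)

visits_vec <- toy_er %>% pull(visits)    # plain numeric vector
visits_tbl <- toy_er %>% select(visits)  # still a one-column tibble

is.vector(visits_vec)      # TRUE
is.data.frame(visits_tbl)  # TRUE
```

Functions like `mean()` expect the vector form, which is why `pull()` comes first.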


## Data Summarization

Now that we have a vector of numbers.. what can we do with it?
Now that we have a <span style="color:blue">vector of numbers</span>.. what can we do with it?

* Basic statistical summarization
* `mean(x)`: takes the mean of x
@@ -73,12 +83,19 @@ Now that we have a vector of numbers.. what can we do with it?
* `max(x)`: maximum value in x
* `min(x)`: minimum value in x

## Statistical summarization the "tidy" way
## Pipe (`%>%`) vectors into summarizing functions

**Add the ** `na.rm =` **argument for missing data**
A vector can be summarized:

```{r}
er %>% pull(visits) %>% mean()
```

<br>

Add the `na.rm =` argument for missing data

```{r}
er %>% pull(visits) %>% mean(na.rm = TRUE)
```
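
To see why `na.rm` matters, here is a small sketch using a made-up vector with one missing value (the numbers are hypothetical, not taken from `er`):

```r
# Hypothetical vector with one missing value
visits <- c(10, NA, 7)

mean_with_na <- mean(visits)                # NA: the missing value propagates
mean_no_na   <- mean(visits, na.rm = TRUE)  # 8.5: the NA is dropped first
```

Writing `TRUE` in full is a bit safer than `T`, since `T` is an ordinary variable that can be reassigned.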

@@ -98,7 +115,7 @@ C. A dataset

`summarize` works on datasets without `pull()`.

Multiple summary statistics can be calculated at once. `pull()` can only do one column.
Multiple summary statistics can be calculated at once!
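
A minimal sketch of several statistics in one `summarize()` call. The `toy_er` tibble, its `rate` column, and the output names `mean_visits` and `max_rate` are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data; column names are assumptions
toy_er <- tibble(
  year   = c(2011, 2011, 2012, 2012),
  visits = c(10, NA, 7, 9),
  rate   = c(2.1, 1.4, NA, 1.8)
)

# One row out, one column per requested statistic
toy_summary <- toy_er %>%
  summarize(
    mean_visits = mean(visits, na.rm = TRUE),
    max_rate    = max(rate, na.rm = TRUE)
  )
```

The result is still a tibble, so it can be piped into further `dplyr` steps.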

<div class = "codeexample">
```{r, eval = FALSE}
@@ -150,10 +167,11 @@ summary(er)

## Summary & Lab Part 1

- summary stats (`mean()`) work with `pull()`
- `pull()` creates a *vector*
- don't forget the `na.rm = TRUE` argument!
- `summary(x)`: quantile information
- `summarize`: creates a summary table of columns of interest
- summary stats (`mean()`) work with vectors or with `summarize()`

🏠 [Class Website](https://daseh.org/)

@@ -172,7 +190,9 @@ er %>%

## How many `distinct()` values?

`n_distinct()` tells you the number of unique elements. _Must pull the column first!_
`n_distinct()` tells you the number of unique elements.

It needs a vector so you _must pull the column first!_

```{r}
er %>%
@@ -186,29 +206,45 @@ options(max.print = 1000)
```
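
As a sketch of the `pull()` then `n_distinct()` pattern described above, using made-up data (the `toy_er` tibble is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: four rows, three different counties
toy_er <- tibble(
  county = c("Adams", "Adams", "Boulder", "Denver")
)

n_counties <- toy_er %>%
  pull(county) %>%  # column -> character vector
  n_distinct()      # number of unique values in that vector
```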


## `dplyr`: `count`

Use `count` to return row count by category.
## Use `count()` to return row count per category.

```{r, message = FALSE}
er %>% count(county)
```

_Looks like 12 rows/observations per county!_

## `dplyr`: `count`

Listing multiple columns further subdivides the count.
## Listing multiple columns further subdivides the `count()`

```{r, message = FALSE}
er %>% count(county, year)
```

_Looks like 1 row/observation per county and year!_
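
A small sketch of how adding a column changes what `count()` returns. The `toy_er` tibble is made up and deliberately unbalanced so the counts differ:

```r
library(dplyr)

# Hypothetical toy data with uneven group sizes
toy_er <- tibble(
  county = c("Adams", "Adams", "Boulder", "Boulder", "Boulder"),
  year   = c(2011, 2012, 2011, 2011, 2012)
)

by_county      <- toy_er %>% count(county)        # one row per county
by_county_year <- toy_er %>% count(county, year)  # one row per county/year pair
```

Each result has a column `n` holding the number of rows in that category.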

## GUT CHECK!

The `count()` function can help us tally:

A. Sample size

B. Rows per each category

C. How many categories

# Grouping

## Goal

We want to find the mean number of ER visits per year in the dataset.

_How do we do this?_


## Perform Operations By Groups: dplyr

First, let's group the data.

`group_by` allows you to group the data set by variables/columns you specify:

```{r}
@@ -227,7 +263,7 @@ er_grouped %>%
```


## Use the `pipe` to string these together!
## Do it in one step: use `%>%` to string these together!

Pipe `er` into `group_by`, then pipe that into `summarize`:

@@ -247,9 +283,9 @@ er %>%
summarize(avg_visits = mean(visits, na.rm = TRUE))
```
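
The same group-then-summarize pattern, sketched on a made-up tibble so the per-year means are easy to check by hand (`toy_er` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: two years, one missing visit count
toy_er <- tibble(
  year   = c(2011, 2011, 2012, 2012),
  visits = c(10, NA, 7, 9)
)

avg_by_year <- toy_er %>%
  group_by(year) %>%                                  # one group per year
  summarize(avg_visits = mean(visits, na.rm = TRUE))  # one row per group
```

Here 2011 averages to 10 (the `NA` is dropped) and 2012 averages to 8.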

## Counting
## Counting rows/observations

There are other functions, such as `n()`, which counts the number of observations (NAs included).
There are other summarizing functions, such as `n()`, which counts the number of rows/observations (NAs included).

```{r}
er %>%
@@ -259,15 +295,23 @@ er %>%
```


## Counting{.codesmall}
## Counting: `count()` and `n()`

`count()` and `n()` can give very similar information.

```{r}
# Here we use count()
er %>% count(year)
er %>% group_by(year) %>% summarize(n()) # n() typically used with summarize
```

## Counting: `count()` and `n()`

`count()` and `n()` can give very similar information.

```{r}
# n() with summarize
er %>% group_by(year) %>% summarize(n())
```
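
To check that the two approaches agree, here is a sketch on made-up data. The `toy_er` tibble is hypothetical, and the summary column is named `n` here only so it matches the column that `count()` creates.

```r
library(dplyr)

# Hypothetical toy data; note the NA in `visits`
toy_er <- tibble(
  year   = c(2011, 2011, 2012),
  visits = c(10, NA, 7)
)

via_count <- toy_er %>% count(year)
via_n     <- toy_er %>% group_by(year) %>% summarize(n = n())

identical(via_count$n, via_n$n)  # TRUE
```

The row with the missing `visits` value is still counted in both versions, since both count rows rather than non-missing values.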

# A few miscellaneous topics ..

@@ -278,6 +322,7 @@ These functions require a column as a vector using `pull()`.

```{r, message = FALSE}
er_year <- er %>% pull(year) # pull() to make a vector
er_year %>% unique() # similar to distinct()
```
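
A short sketch comparing the base R and `dplyr` routes to unique values, using a made-up `toy_er` tibble (hypothetical data):

```r
library(dplyr)

# Hypothetical toy data with repeated years
toy_er <- tibble(year = c(2011, 2011, 2012, 2013, 2013))

er_year <- toy_er %>% pull(year)              # vector

unique_years   <- er_year %>% unique()        # vector of unique values
n_unique       <- unique_years %>% length()   # how many unique values
distinct_years <- toy_er %>% distinct(year)   # tibble of unique rows
```

`unique()` works on the vector and returns a vector, while `distinct()` works on the dataset and returns a tibble.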

@@ -301,7 +346,6 @@ er_year %>% unique() %>% length() # similar to n_distinct()
- `n_distinct()` with `pull()`: how many distinct values?
- `group_by()`: changes all subsequent functions
- combine with `summarize()` to get statistics per group
- combine with `mutate()` to add column
- `summarize()` with `n()` gives the count (NAs included)

## Lab Part 2
@@ -314,9 +358,12 @@ er_year %>% unique() %>% length() # similar to n_distinct()

📃[Posit's data transformation Cheatsheet](https://rstudio.github.io/cheatsheets/data-transformation.pdf)

For more advanced learning, check out https://www.danieldsjoberg.com/gtsummary/ for tables of summary statistics and the extra slides in this file.
**For more advanced learning:**

- https://www.danieldsjoberg.com/gtsummary/ for tables
- extra slides in this file.

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
```{r, fig.alt="The End", out.width = "30%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```

@@ -333,8 +380,13 @@ Image by <a href="https://pixabay.com/users/geralt-9301/?utm_source=link-attribu
* `rowSums(x)`: takes the sum of each row of x
* `colSums(x)`: takes the sum of each column of x

<!-- ```{r} -->
<!-- yearly_co2 <- read_csv("https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv") -->
<!-- ``` -->


```{r}
yearly_co2 <- read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
yearly_co2 <- read_csv("../../data/Yearly_CO2_Emissions_1000_tonnes.csv")
```
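
As a sketch of `rowSums()` and `colSums()`, the example below uses a small made-up data frame rather than the CO2 file. The `toy_co2` object and its column names are hypothetical.

```r
# Hypothetical table: one row per country, one numeric column per year
toy_co2 <- data.frame(
  y2010 = c(1, 2, 3),
  y2011 = c(4, NA, 6)
)

row_totals <- rowSums(toy_co2, na.rm = TRUE)  # 5, 2, 9 (one total per row)
col_totals <- colSums(toy_co2, na.rm = TRUE)  # y2010 = 6, y2011 = 10
```

Both functions need all-numeric input, so non-numeric columns (such as a country name) would have to be dropped first.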


@@ -22,7 +22,7 @@ ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")

### 1.1

How observations/rows are in the `ces` data set? You can use `dim()` or `nrow()` or examine the Environment.
How many observations/rows are in the `ces` data set? You can use `dim()` or `nrow()` or examine the Environment.

```{r 1.1response}
nrow(ces)
41 changes: 20 additions & 21 deletions modules/cheatsheets/Day-4.md
@@ -1,6 +1,6 @@
---
classoption:
- landscape
classoption: landscape
output: pdf_document
---

# Day 4 Cheatsheet
@@ -10,30 +10,29 @@ classoption:
### Functions
|Library/Package|Piece of code|Example of usage|What it does|
|---------------|-------------|----------------|-------------|
|Base `R`| [`min(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Extremes) |`min(x)`| Returns the minimum value of all values in an object `x`.|
|Base `R`| [`sum(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sum) | `sum(x)`| Returns the sum of all values (values must be integer, numeric, or logical) in object `x`.|
|Base `R`| [`mean(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean) |`mean(x)`| Returns the arithmetic mean of all values (values must be integer or numeric) in object `x` or logical vector `x`.|
| Base `R`|[`log(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/log) |`log(x)`| Gives the natural logarithm of object `x`. `log2(x)` can be used to give the logarithm of the object in base 2. Or the base can be specified as an argument.|
| Base `R`|[`range(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/range) |`range(x)`| Gives the min and max for object `x`.|
| Base `R`|[`sd(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd) |`sd(x)`| Gives the standard deviation for object `x`.|
| Base `R`|[`sqrt(x)`](https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/sqrt) |`sqrt(x)`| Gives the square root for object `x`.|
| Base `R`|[`quantile(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile)|`quantile(x, probs = .5)`| Produces sample quantiles corresponding to the given probabilities `x`.|
| Base `R`| [`min(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Extremes) |`min(x)`| Returns the minimum value of all values in an object `x`.|
| Base `R`| [`sum(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sum) | `sum(x)`| Returns the sum of all values (values must be integer, numeric, or logical) in object `x`.|
| Base `R`| [`mean(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean) |`mean(x)`| Returns the arithmetic mean of all values (values must be integer or numeric) in object `x` or logical vector `x`.|
| Base `R`| [`log(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/log) |`log(x)`| Gives the natural logarithm of object `x`. `log2(x)` can be used to give the logarithm of the object in base 2. Or the base can be specified as an argument.|
| Base `R`| [`range(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/range) |`range(x)`| Gives the min and max for object `x`.|
| Base `R`| [`sd(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/sd) |`sd(x)`| Gives the standard deviation for object `x`.|
| Base `R`| [`sqrt(x)`](https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/sqrt) |`sqrt(x)`| Gives the square root for object `x`.|
| Base `R`| [`quantile(x)`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile)|`quantile(x, probs = .5)`| Produces sample quantiles of `x` corresponding to the given probabilities `probs`.|
| Base `R`| [`summary(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/summary)|`summary(x)`| Returns a summary of the values in object `x`.|
| `dplyr`| [`pull()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/pull)| `x_vect <- df %>% pull(x)` | Extract a single column into vector form. `pull()` is very handy before summary functions like `mean()`, `sum()`, etc. |
| `dplyr`| [`summarize()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarize) | `df <- df %>% summarize(mean_x = mean(x))` | Summarizes multiple values in an object into a single value. This function can be used with other functions to retrieve a single output value for the grouped values. `summarize` and `summarise` are synonyms in this package. However, note that this function does not work in the same manner as the base R `summary` function.|
| `dplyr`| [`distinct()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/distinct) |`df %>% distinct(factor_name)`| Display unique/distinct rows from a data frame or tibble|
| `dplyr`| [`n_distinct()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/n_distinct) |`x_vect %>% n_distinct()`|Counts the number of unique/distinct combinations in a set of one or more vectors.|
| `dplyr`| [`count()`](https://dplyr.tidyverse.org/reference/count.html)|`df %>% count(factor_name)`|Counts the number of rows in each group of a variable in a data frame or tibble|
| `dplyr`| [`group_by()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% group_by(factor_name)`| Groups data into rows that contain the same specified value(s)|
| `dplyr`| [`ungroup()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% ungroup()`| Undo a grouping that was done by `group_by()`|
| Base `R`| [`unique()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/unique)| `unique(df)`|Returns a vector, data frame or array like x but with duplicate elements/rows removed.|
| Base `R`| [`rowSums()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/rowsum) | `rowSums(df)`|Calculates sums for each row|
| Base `R`| [`colSums()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/colSums)|`colSums(df)`| Calculates sums for each column|
| Base `R`| [`rowMeans()`](https://www.rdocumentation.org/packages/fame/versions/1.03/topics/rowMeans)| `rowMeans(df)`|Calculates means for each row|
| Base `R`| [`colMeans()`](https://www.statology.org/colmeans-in-r/)|`colMeans(df)`| Calculates means for each column|
| `dplyr`|[`summarize()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarize) | `df <- df %>% summarize(mean_x = mean(x))` | Summarizes multiple values in an object into a single value. This function can be used with other functions to retrieve a single output value for the grouped values. `summarize` and `summarise` are synonyms in this package. However, note that this function does not work in the same manner as the base R `summary` function.|
| `dplyr`|[`across()`](https://dplyr.tidyverse.org/reference/across.html)| `df %>% summarize(across( c('col_a', 'col_b'), ~ sum(.x)))`| Use the across function with summarize to summarize across multiple columns of your data.|
| Base `R`| [`unique()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/unique)| `unique(df)`|Returns a vector, data frame or array like x but with duplicate elements/rows removed.|
| Base `R`| [`table()`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/table)| `table(x)`| Builds a contingency table of the counts at each combination of factor levels.|
| `dplyr`| [`count()`](https://dplyr.tidyverse.org/reference/count.html)|`df %>% count(factor_name)`|Count the number of groups in a factor variable of a data frame or tibble|
| `dplyr`| [`group_by()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% count(factor_name)`| Groups data into rows that contain the same specified value(s)|
| `dplyr`| [`ungroup()`](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/group_by)|`df %>% count(factor_name)`| Undo a grouping that was done by `group_by()`|
| Base `R`| [`plot()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/plot)|`plot(x, y)`| Creates a scatterplot of x and y vector data|
| Base `R`| [`boxplot()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/boxplot)|`boxplot(x, y)`| Creates a boxplot of y against levels of x|
| Base `R`| [`hist()`](https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/hist)|`hist(x)`| Creates a histogram of x|
| Base `R`| [`density()`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/density) |`plot(density(x))`| Creates a kernel density plot of x when used with `plot()`|

- Many summarizing functions (e.g., `mean()`, `sum()`) have the argument `na.rm = TRUE`. This can be used to ignore missing data.

<div style="page-break-after: always;"></div>

Binary file modified modules/cheatsheets/Day-4.pdf
Binary file not shown.
