26-iteration.Rmd

# Iteration

**Learning objectives:**

-   Modify multiple columns using the same patterns.
-   Filter based on the contents of multiple columns.
-   Process multiple files.
-   Write multiple files.

```{r iteration-packages_used, message=FALSE, warning=FALSE}
library(tidyverse)
```

## Intro to iteration {-}

**Iteration** = repeatedly performing the same action on different objects

R is full of *hidden* iteration!

-   `ggplot2::facet_wrap()` / `ggplot::facet_grid()`
-   `dplyr::group_by()` + `dplyr::summarize()`
-   `tidyr::unnest_wider()` / `tidyr::unnest_longer()`
-   Anything with a vector! 
    -   `1:10 + 1` requires loops in other languages!

## Summarize w/ `across()`: setup {-}

```{r iteration-summarize-setup}
df <- tibble(a = rnorm(10), b = rnorm(10), c = rnorm(10))
glimpse(df)
```

## Summarize w/ `across()`: motivation {-}

```{r iteration-summarize-motivation}
messy <- df |> summarize(
  n = n(), 
  a = median(a), 
  b = median(b), 
  c = median(c)
)
```

## Summarize w/ `across()`: cleaner {-}

```{r iteration-summarize-across}
clean <- df |> summarize(
  n = n(), 
  dplyr::across(a:c, median)
)
identical(clean, messy)
```

## Selecting columns {-}

-   `everything()` for all non-grouping columns
-   `where()` to select based on a condition
    -   `where(is.numeric)` = all numeric columns
    -   `where(is.character)` = all character columns
    -   `starts_with("a") & !where(is_numeric)` = all columns that start with "a" and are not numeric
    -   `where(\(x) any(stringr::str_detect("name")))` = all columns that contain the word "name" in at least one value

## Passing functions {-}

Pass actual function to `across()`, ***not*** a call to the function!

-   ✅ `across(a:c, mean)`
-   ❌ `across(a:c, mean())`

## Multiple functions {-}

```{r iteration-multiple_functions}
df |> summarize(
  across(a:c, list(mean, median))
)
```

## Multiple functions with names {-}

```{r iteration-multiple_functions-names}
df |> summarize(across(a:c, 
    list(mean = mean, median = median)
))
```

## Multiple functions with names & args {-}

```{r iteration-multiple_functions-args}
df |> summarize(across(a:c, 
    list(
      mean = \(x) mean(x, na.rm = TRUE), 
      median = \(x) median(x, na.rm = TRUE)
    )
))
```

## Fancier naming {-}

```{r iteration-across-glue_names}
df |> summarize(across(a:c,
    list(
      mean = \(x) mean(x, na.rm = TRUE), 
      median = \(x) median(x, na.rm = TRUE)
    ),
    .names = "{.fn}_of_{.col}"
))
```

## Filtering with if_any() and if_all() {-}

```{r iteration-if_any}
df2 <- tibble(x = 1:3, y = c(1, 2, NA), z = c(NA, 2, 3))
df2 |> filter(if_any(everything(), is.na))
df2 |> filter(if_all(everything(), \(x) !is.na(x)))
```

## across() in functions: setup {-}

```{r iteration-across-in-functions}
summarize_datatypes <- function(df) {
  df |> summarize(
    across(
      where(is.numeric), 
      list(mean = \(x) mean(x, na.rm = TRUE))
    ),
    across(
      where(is.factor) | where(is.character), 
      list(n_distinct = n_distinct)
    )
  ) |> 
    glimpse()
}
```

## across() in functions: mpg {-}

```{r iteration-across_in_functions-mpg}
mpg |> summarize_datatypes()
```

## across() in functions: diamonds {-}

```{r iteration-across_in_functions-diamonds}
diamonds |> summarize_datatypes()
```

## Iterate over files {-}

```{r iteration-map_files, eval = FALSE}
list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) |>
  set_names(basename) |> 
  purrr::map(readxl::read_excel) |> 
  map(\(df) "Fix something that might be weird in each df") |> 
  map(\(df) "Fix a different thing") |> 
  purrr::list_rbind(names_to = "filename")
```

## One vs everything {-}

> We recommend this approach [perform each step on each file instead of in a function] because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.

Discuss!

-   Jon's preference: Do 1-2 files first, iterate on iteration
-   Book: Do everything on everything

## Walk vs map {-}

-   Use `purrr::walk()` to do things without keeping result
    -   Book example: Saving things
-   `purrr::map2()` & `purrr::walk2()`: 2 inputs
-   `purrr::pmap()` & `purrr::pwalk()`: list of inputs (largely replaced by `across()`)

## Meeting Videos {-}

### Cohort 5 {-}

`r knitr::include_url("https://www.youtube.com/embed/0rsV1jlxhws")`

<details>
  <summary> Meeting chat log </summary>  
```
00:03:23	Becki R. (she/her):	I'm having trouble with a buzz again
00:03:37	Njoki Njuki Lucy:	I so look forward to the discussion. I have been struggling with understanding this particular chapter! :)
00:26:58	Jon Harmon (jonthegeek):	> x <- purrr::set_names(month.name, month.abb)
> x
00:27:21	Jon Harmon (jonthegeek):	x[["Jan"]]
00:30:12	Jon Harmon (jonthegeek):	results <- purrr::set_names(vector("list", length(x)), names(x))
00:35:10	lucus w:	A data frame is simply a list in disguise
00:45:37	Njoki Njuki Lucy:	It makes sense now, thanks!
00:48:05	Ryan Metcalf:	Sorry team, I have to drop. Sister-in-law is stranded and needs a jump-start. I’ll finish watching the recording and catch any questions.
00:51:19	Jon Harmon (jonthegeek):	> paste(month.name, collapse = "")
[1] "JanuaryFebruaryMarchAprilMayJuneJulyAugustSeptemberOctoberNovemberDecember"
> paste(month.name, collapse = " ")
[1] "January February March April May June July August September October November December"
01:09:10	Njoki Njuki Lucy:	that's so cool! I wondered! Ha!!
01:10:30	Federica Gazzelloni:	thanks Jon!!
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/CEKPAUWTA3c")`

<details>
  <summary> Meeting chat log </summary>  
```
00:10:17	Becki R. (she/her):	brb!
00:26:00	Njoki Njuki Lucy:	does this mean, in this case, we will have 3 regression models?
00:26:21	Federica Gazzelloni:	can you specify: mtcars%>%map(.f=
00:27:12	Federica Gazzelloni:	mtcars%>%split(.$cyl)%>%map(.f=lm(….))
00:27:41	Ryan Metcalf:	I’m reading it as `mpg` being dependent (y) and `wt` being independent.
00:29:08	Jon Harmon (jonthegeek):	# A more realistic example: split a data frame into pieces, fit a
# model to each piece, summarise and extract R^2
mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared")
00:29:55	Jon Harmon (jonthegeek):	mtcars %>%
  split(.$cyl) %>%
  map(.f = ~ lm(mpg ~ wt, data = .x))
00:30:22	Jon Harmon (jonthegeek):	mtcars %>%
  split(.$cyl) %>%
  map(.f = lm(mpg ~ wt, data = .x))
00:45:11	Federica Gazzelloni:	coalesce()
01:17:01	Ryan Metcalf:	Great Job Becki!!!
01:17:25	Becki R. (she/her):	Thanks :)
```
</details>

### Cohort 6 {-}

`r knitr::include_url("https://www.youtube.com/embed/NVUHFpYUmA4")`

<details>
  <summary> Meeting chat log </summary>  
```
00:04:39	Marielena Soilemezidi:	I'll be back in 3'! No haste with the setup:)
00:04:52	Adeyemi Olusola:	Ok
00:07:22	Adeyemi Olusola:	Let me know when you return
```
</details>

`r knitr::include_url("https://www.youtube.com/embed/YnZSfzMGhTE")`

<details>
  <summary> Meeting chat log </summary>  
```
00:11:10	Marielena Soilemezidi:	hello! :)
00:20:31	Marielena Soilemezidi:	yep, it looks good!
00:46:18	Daniel Adereti:	How does it get the list of 3?
00:47:55	Marielena Soilemezidi:	sorry, got disconnected for some time, but I'm back!
00:48:05	Daniel Adereti:	No worries!
00:58:28	Adeyemi Olusola:	olusolaadeyemi.ao@gmail.com
```
</details>


### Cohort 7 {-}

`r knitr::include_url("https://www.youtube.com/embed/vPEgWgs0q7s")`

<details>
<summary> Meeting chat log </summary>

```
00:06:31	Oluwafemi Oyedele:	Hi Tim!!!
00:06:47	Tim Newby:	Hi Oluwafemi,  can you here me?
00:06:48	Oluwafemi Oyedele:	We will start in 6 minutes time!!!
00:13:26	Oluwafemi Oyedele:	start
00:56:02	Oluwafemi Oyedele:	https://adv-r.hadley.nz/functionals.html
00:56:25	Oluwafemi Oyedele:	stop
```
</details>


### Cohort 8 {-}

`r knitr::include_url("https://www.youtube.com/embed/TQabUIBbJKs")`