# Factors

**Learning objectives:**

-   **Create `factor()`** variables.
-   **Explore** the **General Social Survey** dataset via `forcats::gss_cat`.
-   **Reorder factor levels.**
    -   `forcats::fct_reorder()`
    -   `forcats::fct_relevel()`
    -   `forcats::fct_reorder2()`
    -   `forcats::fct_infreq()`
    -   `forcats::fct_rev()`
-   **Modify factor levels.**
    -   `forcats::fct_recode()`
    -   `forcats::fct_collapse()`
    -   `forcats::fct_lump()`

## Introduction {-}

- Factors represent categorical variables: variables that have a fixed and known set of possible values.
- Also useful when you want to display character vectors in a non-alphabetical order.

```{r, warning=FALSE, message=FALSE}
library(tidyverse)
```

## Factor basics {-}

- A variable that records month:

```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
```

Problems with using a string:

  - There are only 12 valid values, but nothing enforces that.
  - Nothing catches typos:

```{r}
x2 <- c("Dec", "Apr", "Jam", "Mar")
```

  - It sorts alphabetically, which is not a useful order for months:

```{r}
sort(x1)
```

## Fix issues with strings using factors {-}

- Fix those problems with a factor. 
- Start by creating a list of the valid levels:

```{r}
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

month_levels
```

- Now create a factor:

```{r}
y1 <- factor(x1, levels = month_levels)

sort(y1)
```
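
- As noted in the Cohort 5 chat, base R already ships the abbreviated month names as the constant `month.abb`, so the levels need not be typed by hand. A minimal sketch (the names `months_seen` and `y3` are only illustrative):

```{r}
# month.abb is a built-in constant: "Jan", "Feb", ..., "Dec"
months_seen <- c("Dec", "Apr", "Jan", "Mar")   # same values as x1

y3 <- factor(months_seen, levels = month.abb)

# Sorting now follows calendar order rather than alphabetical order
sort(y3)
```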

- If you omit the levels, they'll be taken from the data in alphabetical order:

```{r}
factor(x1)
```

## Values not included in levels {-}

- <b style="color:red;">WARNING:</b> Any values not in `levels` will be silently converted to NA:

```{r}
y2 <- factor(x2, levels = month_levels)

y2
```
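
- Because the conversion is silent, it can help to check which input values were lost. One possible check, sketched with illustrative names (`months_raw`, `months_fct`, `bad_values`) and the built-in `month.abb` as the levels:

```{r}
months_raw <- c("Dec", "Apr", "Jam", "Mar")   # same values as x2; "Jam" is a typo
months_fct <- factor(months_raw, levels = month.abb)

# Values that are NA in the factor but were not NA in the input
bad_values <- months_raw[is.na(months_fct) & !is.na(months_raw)]
bad_values
```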

- But if you use `forcats::fct()` instead, an unknown value raises an error:

```{r, eval=FALSE}
y2 <- fct(x2, levels = month_levels)
#> Error in `fct()`:
#> ! All values of `x` must appear in `levels` or `na`
#> ℹ Missing level: "Jam"
```


- If you need to access the valid levels, use `levels()`:

```{r}
levels(y2)
```
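
- A "level" is one of the allowed values of the factor, not a count. Internally, a factor stores an integer position for each element plus the vector of levels. A small sketch of that structure (`f_demo` is an illustrative name):

```{r}
f_demo <- factor(c("Dec", "Apr", "Jan", "Mar"), levels = month.abb)

levels(f_demo)       # all allowed values, in level order
nlevels(f_demo)      # number of levels, including ones that never occur
as.integer(f_demo)   # position of each element within the levels
```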

## Other ways of dealing with factors {-}

- If we want the order of the levels to match the order of first appearance in the data, set the levels with `unique()` when creating the factor, or reorder after the fact with [`fct_inorder()`](https://forcats.tidyverse.org/reference/fct_inorder.html):

```{r}
f1 <- factor(x1, levels = unique(x1))
f1
```

```{r}
f2 <- x1 |> factor() |> fct_inorder()
f2
```

```{r}
levels(f2)
```

- We can also create a factor while reading data with readr, using [`col_factor()`](https://readr.tidyverse.org/reference/parse_factor.html):

```{r}
csv <- "
month,value
Jan,12
Feb,56
Mar,12"
```

```{r}
df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))

df$month
```

## General Social Survey {-}

The General Social Survey (GSS) is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so for `gss_cat` Hadley selected a handful that illustrate some common challenges you’ll encounter when working with factors.

```{r}
gss_cat
```

- In a tibble we can use `count()` to see the levels of a factor:

```{r}
gss_cat |>
  count(race)
```
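
- By default `count()` only lists levels that actually occur in the data. The sketch below uses a made-up tibble (`toy`, not part of `gss_cat`) to show how `.drop = FALSE` keeps levels with zero rows:

```{r}
library(dplyr)

# Toy data (hypothetical): "medium" is a valid level that never occurs
toy <- tibble(
  size = factor(c("small", "large", "small"),
                levels = c("small", "medium", "large"))
)

toy |> count(size)                                # only levels that occur

with_empty <- toy |> count(size, .drop = FALSE)   # also lists "medium" with n = 0
with_empty
```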

## Modifying factor order {-}

- Let's look at a plot in which we will modify the order of the factor levels on the y-axis:

```{r}
relig_summary <- gss_cat |>
  group_by(relig) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(x = tvhours, y = relig)) + 
  geom_point()
```

## Let's reorder the factor levels with `fct_reorder()` {-}

- It is hard to read this plot because there's no overall pattern. We can improve it by reordering the levels of `relig` using [`fct_reorder()`](https://forcats.tidyverse.org/reference/fct_reorder.html), which takes three arguments:

  - `.f`, the factor whose levels you want to modify.

  - `.x`, a numeric vector that you want to use to reorder the levels.

  - Optionally, `.fun`, a function that's used if there are multiple values of `.x` for each value of `.f`. The default is `median`.


```{r}
relig_summary |>
  mutate(
    relig = fct_reorder(relig, tvhours)
  ) |>
  ggplot(aes(x = tvhours, y = relig)) +
  geom_point()
```
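
- In `relig_summary` each level has a single `tvhours` value, so `.fun` never comes into play. The toy sketch below (made-up vectors `grp` and `val`) has several values per level, so the choice of `.fun` changes the resulting order:

```{r}
library(forcats)

grp <- factor(c("a", "a", "a", "b", "b", "b"))
val <- c(1, 2, 100, 5, 6, 7)

by_median <- fct_reorder(grp, val)               # medians: a = 2,   b = 6
by_max    <- fct_reorder(grp, val, .fun = max)   # maxima:  a = 100, b = 7

levels(by_median)
levels(by_max)
```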

## Change levels of a factor {-}

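- `fct_relevel()` changes the level order by hand: it moves the levels you name to the front. Here "Not applicable" is moved to the front, next to the other special levels: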

```{r}
rincome_summary <- gss_cat |>
  group_by(rincome) |>
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )


ggplot(rincome_summary, 
       aes(x = age, 
           y = fct_relevel(rincome, 
                           "Not applicable"))) +
  geom_point()
```
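
- A small sketch of `fct_relevel()` on a made-up factor (`f` is illustrative). By default the named levels go to the front; the `after` argument places them elsewhere, and `after = Inf` sends them to the end:

```{r}
library(forcats)

f <- factor(c("a", "b", "c", "d"))

to_front <- fct_relevel(f, "c")                # "c" becomes the first level
to_back  <- fct_relevel(f, "a", after = Inf)   # "a" becomes the last level

levels(to_front)
levels(to_back)
```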

```{r}
by_age <- gss_cat |>
  filter(!is.na(age)) |>
  count(age, marital) |>
  group_by(age) |>
  mutate(
    prop = n / sum(n)
  )

ggplot(by_age, aes(x = age, 
                   y = prop, color = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(x = age, 
                   y = prop, 
                   color = fct_reorder2(marital, 
                                        age, prop))) +
  geom_line() +
  labs(color = "marital")
```
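
- In the second plot above, `fct_reorder2(.f, .x, .y)` reorders the levels by the `.y` value associated with the largest `.x` value, in descending order, so the legend lines up with the right-hand ends of the lines. A toy sketch with made-up vectors (`grp`, `x`, `y`) shows the rule in isolation:

```{r}
library(forcats)

grp <- factor(c("a", "a", "b", "b", "c", "c"))
x   <- c(1, 2, 1, 2, 1, 2)
y   <- c(10, 1, 5, 9, 2, 4)

# y at the largest x (x = 2): a = 1, b = 9, c = 4
reordered <- fct_reorder2(grp, x, y)
levels(reordered)
```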

- For bar plots, [`fct_infreq()`](https://forcats.tidyverse.org/reference/fct_inorder.html) orders the levels by decreasing frequency. Add [`fct_rev()`](https://forcats.tidyverse.org/reference/fct_rev.html), which reverses the level order, to get increasing frequency:

```{r}
gss_cat |>
  mutate(marital = marital |> 
           fct_infreq() |> 
           fct_rev()) |>
  ggplot(aes(x = marital)) +
  geom_bar()
```
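
- The same two steps on a made-up factor (`votes` is illustrative), looking only at the resulting level order:

```{r}
library(forcats)

votes <- factor(c("x", "y", "y", "z", "z", "z"))

most_first  <- fct_infreq(votes)            # decreasing frequency
least_first <- fct_rev(fct_infreq(votes))   # reversed: increasing frequency

levels(most_first)
levels(least_first)
```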

## Modifying factor levels {-}

- We can also change the values of the levels, not just their order. This lets us:
  - Clarify labels for publication
  - Collapse levels for high-level displays
- `fct_recode()` changes the value of each level you name, using `"new level" = "old level"` pairs. It leaves levels that aren't mentioned as is and warns if you refer to a level that doesn't exist.

```{r}
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) |>
  count(partyid)
```
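
- A toy sketch of both behaviours on a made-up factor (`fruit` is illustrative): unmentioned levels are kept, and a misspelled old level produces a warning rather than an error. The warning is captured with `tryCatch()` here only so that its message can be printed:

```{r}
library(forcats)

fruit <- factor(c("apple", "bear", "banana"))

# "banana" is not mentioned, so it is left as is
recoded <- fct_recode(fruit, "Apple" = "apple", "Pear" = "bear")
levels(recoded)

# "kiwi" is not an existing level, so fct_recode() warns
warning_text <- tryCatch(
  fct_recode(fruit, "Kiwi" = "kiwi"),
  warning = function(w) conditionMessage(w)
)
warning_text
```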

## Combine groups {-}

  - To combine groups, we can assign multiple old levels to the same new level ('Other'):

```{r}
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  ) |>
  count(partyid)
```

## Collapse levels into one {-}

  - If we want to collapse a lot of levels, [`fct_collapse()`](https://forcats.tidyverse.org/reference/fct_collapse.html) is a useful variant of [`fct_recode()`](https://forcats.tidyverse.org/reference/fct_recode.html).
    - For each new level, you can provide a vector of old levels:

```{r}
gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
```
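
- On a made-up factor (`party` is illustrative), the sketch below suggests that `fct_collapse()` with a vector of old levels gives the same result as repeating the new level name in `fct_recode()`:

```{r}
library(forcats)

party <- factor(c("a", "b", "c", "a"))

via_recode   <- fct_recode(party, "ab" = "a", "ab" = "b")
via_collapse <- fct_collapse(party, "ab" = c("a", "b"))

levels(via_recode)
levels(via_collapse)
```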

## Lump together several small groups {-}

  - If you want to lump together small groups to make a simpler table or plot, use the `fct_lump_*()` family. For example, `fct_lump_lowfreq()` progressively lumps the smallest groups into "Other", always keeping "Other" as the smallest category:

```{r}
gss_cat |>
  mutate(relig = fct_lump_lowfreq(relig)) |>
  count(relig)
```

-   Read the documentation to learn about [`fct_lump_min()`](https://forcats.tidyverse.org/reference/fct_lump.html) and [`fct_lump_prop()`](https://forcats.tidyverse.org/reference/fct_lump.html), which are useful in other cases.
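
-   A small sketch of those two variants on a made-up factor (`answers`, with counts a = 10, b = 5, c = 2, d = 1). The thresholds are arbitrary choices for illustration:

```{r}
library(forcats)

answers <- factor(c(rep("a", 10), rep("b", 5), rep("c", 2), "d"))

# Lump levels that appear fewer than 5 times
by_min <- fct_lump_min(answers, min = 5)

# Lump levels that make up less than 20% of the values
by_prop <- fct_lump_prop(answers, prop = 0.2)

table(by_min)
table(by_prop)
```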

## Ordered factors {-}

- Ordered factors, created with `ordered()`, imply a strict ordering and equal distance between levels:

```{r}
ordered(c('a', 'b', 'c'))
```

- If you map an ordered factor to color or fill in ggplot2, it will default to `scale_color_viridis_d()` or `scale_fill_viridis_d()`, a color scale that implies a ranking.
- If you use an ordered factor in a linear model, it will use polynomial contrasts. These are mildly useful and you can learn more here: `vignette("contrasts", package = "faux")` by [Lisa DeBruine](https://debruine.github.io/faux/articles/contrasts.html).
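
- A short base R sketch of what the ordering adds (`sizes` is a made-up example): ordered factors support comparisons such as `<`, and their default contrasts are the polynomial ones from `contr.poly()`:

```{r}
sizes <- ordered(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"))

is.ordered(sizes)

# Comparisons follow the level order, not the alphabet
below_high <- sizes < "high"
below_high

# Linear (.L) and quadratic (.Q) polynomial contrast columns
contrast_names <- colnames(contrasts(sizes))
contrast_names
```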


## Meeting Videos {-}

### Cohort 5

`r knitr::include_url("https://www.youtube.com/embed/2ySAk-lgT88")`

<details>

<summary>Meeting chat log</summary>

    00:05:04    Federica Gazzelloni:    Hello
    00:23:34    Jon Harmon (jonthegeek):    Useful: R has month.name and month.abb character vectors built in. So you can do things like y3 <- factor(month.abb, levels = month.abb)
    00:35:46    Ryan Metcalf:   Open ended question for the team. If Factors are a built-in enumeration in categorical data….what if the data in review has a dictionary and the variable (column) of each record is entered as a numeral. Would a best practice to use a join or mutate to enter the text instead of a numeral.
    01:00:25    Ryan Metcalf:   I’m not finding a direct definition of “level” in the Forecats text. Would it be appropriate to state a “level” in this Factor chapter is the “quantity of a given category?”
    01:05:05    Jon Harmon (jonthegeek):    state.abb

</details>

### Cohort 6

`r knitr::include_url("https://www.youtube.com/embed/Xaax7EX-WIQ")`

<details>

<summary>Meeting chat log</summary>

    00:12:43    Daniel Adereti: https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/
    00:13:46    Adeyemi Olusola:    Its freezing but I don’t know if it’s from my end
    00:15:05    Shannon:    Yes, Adeyemi, it's freezing a bit...cleared up for now
    00:19:52    Adeyemi Olusola:    I guess as.factor( ) does the same without aorting
    01:01:46    Marielena Soilemezidi:  thank you Daniel!
    01:01:52    Adeyemi Olusola:    Thank you Daniel
    01:02:34    Marielena Soilemezidi:  bye all!
    01:02:40    Daniel Adereti: Bye!

</details>


### Cohort 7

`r knitr::include_url("https://www.youtube.com/embed/KUTSJFGy3kY")`


### Cohort 8

`r knitr::include_url("https://www.youtube.com/embed/PeJICZRmwvI")`

<details>
<summary> Meeting chat log </summary>

```
00:39:21	shamsuddeen:	https://forcats.tidyverse.org/reference/fct_lump.html
00:39:43	Abdou:	Sometimes you just want to lump together the small groups to make a plot or table simpler
00:39:52	shamsuddeen:	yes
00:39:55	shamsuddeen:	exactly
00:40:50	Abdou:	Reacted to "https://forcats.tidy..." with 👍
```
</details>