-
Notifications
You must be signed in to change notification settings - Fork 41
/
16-factors.Rmd
384 lines (281 loc) · 10.6 KB
/
16-factors.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
# Factors
**Learning objectives:**
- **Create `factor()`** variables.
- **Explore** the **General Social Survey** dataset via `forcats::gss_cat`.
- **Reorder factor levels.**
- `forcats::fct_reorder()`
- `forcats::fct_relevel()`
- `forcats::fct_reorder2()`
- `forcats::fct_infreq()`
- `forcats::fct_rev()`
- **Modify factor levels.**
- `forcats::fct_recode()`
- `forcats::fct_collapse()`
- `forcats::fct_lump()`
## Introduction {-}
- Factors -> categorical variables: variables that have a fixed and known set of possible values.
- Also useful when you want to display character vectors in a non-alphabetical order.
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
```
## Factor basics {-}
- A variable that records month:
```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
```
Problems with using a string:
- Only 12 possible values
- No way of accounting for typos
```{r}
x2 <- c("Dec", "Apr", "Jam", "Mar")
```
- Doesn't sort in a useful way but alphabetically
```{r}
sort(x1)
```
## Fix issues with strings using factors {-}
- Fix those problems with a factor.
- Start by creating a list of the valid levels:
```{r}
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
month_levels
```
- Now create a factor:
```{r}
y1 <- factor(x1, levels = month_levels)
sort(y1)
```
- If you omit the levels, they'll be taken from the data alphabetically:
```{r}
factor(x1)
```
## Values not included in levels {-}
- <b style="color:red;">WARNING:</b> Any values not in the level will be silently converted to NA:
```{r}
y2 <- factor(x2, levels = month_levels)
y2
```
- But if you use `forcats::fct()` insted:
```{r, eval=FALSE}
y2 <- fct(x2, levels = month_levels)
```
#> Error in `fct()`:
#> ! All values of `x` must appear in `levels` or `na`
#> ℹ Missing level: "Jam"
- If you need to access the valid levels use `levels()`:
```{r}
levels(y2)
```
## Other ways of dealing with factors {-}
- If we want the order of the levels match the order of the first appearance in the data then use `unique()`, or after the fact, with [fct_inorder():]('https://forcats.tidyverse.org/reference/fct_inorder.html')
```{r}
f1 <- factor(x1, levels = unique(x1))
f1
```
```{r}
f2 <- x1 |> factor() |> fct_inorder()
f2
```
```{r}
levels(f2)
```
- We can also create a factor when reading your data with readr with [col_factor():]('https://readr.tidyverse.org/reference/parse_factor.html')
```{r}
csv <- "
month,value
Jan,12
Feb,56
Mar,12"
```
```{r}
df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
df$month
```
## General Social Survey {-}
It's a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in `gss_cat` Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.
```{r}
gss_cat
```
- In a tibble we can use `count()` to see the levels of a factor:
```{r}
gss_cat |>
count(race)
```
## Modifying factor order {-}
- Let's look at an example in a plot for which we will modify the order of factors on the y-axis:
```{r}
relig_summary <- gss_cat |>
group_by(relig) |>
summarize(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig_summary, aes(x = tvhours, y = relig)) +
geom_point()
```
## Let's re order the factos with `fct_reorder()` {-}
- It is hard to read this plot because there's no overall pattern. We can improve it by reordering the levels of relig using [fct_reorder()]('https://forcats.tidyverse.org/reference/fct_reorder.html'). [fct_reorder()]('https://forcats.tidyverse.org/reference/fct_reorder.html') takes three arguments:
- f, the factor whose levels you want to modify.
- x, a numeric vector that you want to use to reorder the levels.
- Optionally, fun, a function that's used if there are multiple values of x for each value of f. The default value is median.
```{r}
relig_summary |>
mutate(
relig = fct_reorder(relig, tvhours)
) |>
ggplot(aes(x = tvhours, y = relig)) +
geom_point()
```
## Change levels of a factor {-}
- Changing the levels of a factor
```{r}
rincome_summary <- gss_cat |>
group_by(rincome) |>
summarize(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(rincome_summary,
aes(x = age,
y = fct_relevel(rincome,
"Not applicable"))) +
geom_point()
```
```{r}
by_age <- gss_cat |>
filter(!is.na(age)) |>
count(age, marital) |>
group_by(age) |>
mutate(
prop = n / sum(n)
)
ggplot(by_age, aes(x = age,
y = prop, color = marital)) +
geom_line(na.rm = TRUE)
ggplot(by_age, aes(x = age,
y = prop,
color = fct_reorder2(marital,
age, prop))) +
geom_line() +
labs(color = "marital")
```
- Changing the order of a bar plot in decreasing frequency with [fct_infreq()]('https://forcats.tidyverse.org/reference/fct_inorder.html') and in increasing frequency with [fct_rev()]('https://forcats.tidyverse.org/reference/fct_rev.html')
```{r}
gss_cat |>
mutate(marital = marital |>
fct_infreq() |>
fct_rev()) |>
ggplot(aes(x = marital)) +
geom_bar()
```
## Modifying factor levels {-}
- We can change the values of the levels.
- Now we can clarify levels for publication
- Collapse levels for high-level displays
- `fct_recode()` allows you to recode or change the value of each level. It will leave levels not mentioned as is and will warn if you refer to a level that doesn't exist.
```{r}
gss_cat |>
mutate(
partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)
) |>
count(partyid)
```
## Combine groups {-}
- To combine groups, we can assign multiple old levels to the same new level ('Other'):
```{r}
gss_cat |>
mutate(
partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)
) |>
count(partyid)
```
## Collapse levels into one {-}
- If we want to collapse a lot of levels, [fct_collapse()]('https://forcats.tidyverse.org/reference/fct_collapse.html') is a useful variant of [fct_recode().]('https://forcats.tidyverse.org/reference/fct_recode.html').
- For each new variable, you can provide a vector of old levels:
```{r}
gss_cat |>
mutate(
partyid = fct_collapse(partyid,
"other" = c("No answer", "Don't know", "Other party"),
"rep" = c("Strong republican", "Not str republican"),
"ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
"dem" = c("Not str democrat", "Strong democrat")
)
) |>
count(partyid)
```
## Lump together several small groups {-}
- If you want to lump together small groups to make a simpler table or plot then use `fct_lump*()`. For example: `fct_lump_lofreq()` lumps progressively smallest groups categories into 'Other':
```{r}
gss_cat |>
mutate(relig = fct_lump_lowfreq(relig)) |>
count(relig)
```
- Read the documentation to learn about [fct_lump_min()]('https://forcats.tidyverse.org/reference/fct_lump.html') and [fct_lump_prop()]('https://forcats.tidyverse.org/reference/fct_lump.html') which are useful in other cases.
## Ordered factors {-}
- Ordered factors, created with ordered(), imply a strict ordering and equal distance between levels:
```{r}
ordered(c('a', 'b', 'c'))
```
- If you map an ordered factor to color or fill in ggplot2, it will default to `scale_color_viridis()` or `scale_fill_viridis()`.
- If you use an ordered function in a linear model, it will use 'polygonal contrasts'. These are mildly useful and you can learn more here: `vignette("contrasts", package = "faux")` by [Lisa DeBruine](https://debruine.github.io/faux/articles/contrasts.html).
## Meeting Videos {-}
### Cohort 5
`r knitr::include_url("https://www.youtube.com/embed/2ySAk-lgT88")`
<details>
<summary>Meeting chat log</summary>
00:05:04 Federica Gazzelloni: Hello
00:23:34 Jon Harmon (jonthegeek): Useful: R has month.name and month.abb character vectors built in. So you can do things like y3 <- factor(month.abb, levels = month.abb)
00:35:46 Ryan Metcalf: Open ended question for the team. If Factors are a built-in enumeration in categorical data….what if the data in review has a dictionary and the variable (column) of each record is entered as a numeral. Would a best practice to use a join or mutate to enter the text instead of a numeral.
01:00:25 Ryan Metcalf: I’m not finding a direct definition of “level” in the Forecats text. Would it be appropriate to state a “level” in this Factor chapter is the “quantity of a given category?”
01:05:05 Jon Harmon (jonthegeek): state.abb
</details>
### Cohort 6
`r knitr::include_url("https://www.youtube.com/embed/Xaax7EX-WIQ")`
<details>
<summary>Meeting chat log</summary>
00:12:43 Daniel Adereti: https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/
00:13:46 Adeyemi Olusola: Its freezing but I don’t know if it’s from my end
00:15:05 Shannon: Yes, Adeyemi, it's freezing a bit...cleared up for now
00:19:52 Adeyemi Olusola: I guess as.factor( ) does the same without aorting
01:01:46 Marielena Soilemezidi: thank you Daniel!
01:01:52 Adeyemi Olusola: Thank you Daniel
01:02:34 Marielena Soilemezidi: bye all!
01:02:40 Daniel Adereti: Bye!
</details>
### Cohort 7
`r knitr::include_url("https://www.youtube.com/embed/KUTSJFGy3kY")`
### Cohort 8
`r knitr::include_url("https://www.youtube.com/embed/PeJICZRmwvI")`
<details>
<summary> Meeting chat log </summary>
```
00:39:21 shamsuddeen: https://forcats.tidyverse.org/reference/fct_lump.html
00:39:43 Abdou: Sometimes you just want to lump together the small groups to make a plot or table simpler
00:39:52 shamsuddeen: yes
00:39:55 shamsuddeen: exactly
00:40:50 Abdou: Reacted to "https://forcats.tidy..." with 👍
```
</details>