-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathsession_08.Rmd
440 lines (300 loc) · 9.89 KB
/
session_08.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
---
title: Session 8
date: "September 5, 2018"
output:
html_document:
toc: yes
toc_float: yes
editor_options:
chunk_output_type: console
---
```{r session-08-init, include=FALSE}
session_links <- list(
"R for Data Science: Strings" = "http://r4ds.had.co.nz/strings.html",
"STAT545: Character Data" = "http://stat545.com/block028_character-data.html"
)
```
## Review
### Working with Strings
**Question:** You have a vector that contains two strings.
```{r}
library(stringr)
txt <- c("pfx-003-sfx", "pfx-987-sfx")
```
Match the function that was used to the output that it created.
```{r echo=FALSE, results = "asis"}
fns <- c(
'str_detect(txt, "003")',
'str_extract(txt, "\\\\d+")',
'str_remove(txt, "-sfx")',
'str_c(txt, collapse = " and ")',
'str_replace(txt, "pfx", "prefix")'
)
answers <- letters[sample(seq_along(fns), length(fns))]
names(answers) <- fns
results <- purrr::map_chr(names(sort(answers)), ~ capture.output(dput(eval(parse(text = .)))))
cat(glue::glue("\n1. `{names(answers)}`"), sep = "\n")
cat("\n")
cat(glue::glue("\n{sort(answers)}. `{results}`"), sep = "\n")
```
<details><summary>Answers</summary>
```{r echo=FALSE, results="asis"}
cat(glue::glue("{.i}. {answers}", .i = seq_along(answers)), sep = "; ")
```
</details>
**Question:** The following ICD10 codes describe a selection of ICD codes for "Malignant melanomas of the skin".
```{r results='asis', echo=FALSE}
icd.data::icd10cm2016 %>%
as_tibble() %>%
filter(str_detect(chapter, regex("neoplasm", TRUE)), str_detect(code, "C43")) %>%
slice(1:10) %>%
select(code, Description = short_desc) %>%
mutate(code = glue::glue('<span style="color: #CD4F39; font-family: Fira Mono, monospace;">{code}</span>')) %>%
rename(`ICD Code` = code) %>%
knitr::kable(escape = FALSE)
```
1. Describe in words all of the patterns you can find in the ICD code and its description.
1. How can you extract the specific location referred to in the description text?
1. How could you find all patients with melonoma located on the eyelid? On the ear?
1. Write a regular expression that finds ICD codes indicate melanoma on the _right side_ of the body.
## Outline
- Homework and Sharing
- Base R: The other side of R
- Subsetting
- Factors (forcats)
- Project Management
- Looking Forward
- Workshop format
- Topics we can talk about
## Show and Tell
<img src="https://media.giphy.com/media/l2JejW7K5EDjFyT9m/giphy.gif" width="100%" />
<details><summary>Who's ready?</summary>
<img src="https://media.giphy.com/media/3orifdyb5uQljmxgeA/giphy.gif" width="100%" />
</details>
<details><summary>You're Awesome!</summary>
<img src="https://media.giphy.com/media/xT5LMBXjnbVFzRLVZu/giphy.gif" width="100%" />
</details>
## Base R: The Other R
### Another way to do things
**R** is about `r lubridate::year(Sys.Date()) - 1993` years old, or
`r lubridate::year(Sys.Date()) - 1976` if you count its "first version" **S**.
[dplyr]{.pkg}, on the other hand, started in October of 2012 but the version most consistent with the things that we've learned wasn't released until **October 2014** -- or
`r round(difftime(Sys.time(), lubridate::ymd_h("2014-10-06 0"), "days")/365.25, 1)` years ago.
This means that when you run into issues and search around online in places like [StackOverflow](https://stackoverflow.com/questions/tagged/r), you'll often run into R code written in _base R_.
It's also useful to know a bit about base R functions when you need to do more programming-style work, as opposed to the data analysis workflows we've been learning.
This section is intended to give a very quick overview of several parts of base R that we haven't talked about much, primarily to help you translate code you find online into [dplyr]{.pkg}-based code that you're used to.
We'll use the `example` data frame from [Session 3](session_03_answers.html#workspaces__rstudio_projects) for this section.
```{r echo=FALSE, results="show"}
source(here::here("materials/03/example_single_patient.R"))
x <- example
```
```{r echo=TRUE, prompt = TRUE, comment = ""}
x
```
### Selecting Elements of a Vector
First, let's say that you have a vector.
```{r}
v <- c(10, 14, 6)
```
You can get a particular element of the vector using `[ ]`.
```{r}
v[2]
```
[Sidenote: does R start counting elements from 0 or from 1?]{.muted}
You can extract more than one element by using a vector.
```{r}
v[c(1, 3)]
```
You can discard elements with negative indices.
```{r}
v[-2]
```
And you can use `TRUE` and `FALSE` to discard elements that are in the same position as the `FALSE`.
```{r}
v != 14
v[v != 14]
v[c(TRUE, FALSE, TRUE)]
```
Suppose this vector is actually a count of the number of fruit that we have.
If the vector has names, we can use the names to select specific elements.
```{r}
names(v) <- c("kiwi", "lime", "mango")
v["mango"]
v[c("kiwi", "lime")]
```
Suppose instead of a vector, we have a list.
```{r}
l <- list(
orange = 25,
plum = 13,
quince = 30
)
```
<details><summary>Yes, it's a real fruit</summary>
![](images/quince.jpg)
</details>
Lists are slightly more tricky.
Using one set of `[ ]` gives you back another list with the elements in the brackets, or you can get a specific element with double brackets `[[ ]]`.
```{r}
l["orange"]
l[c("orange", "plum")]
l[["quince"]]
```
The double brackets can only give you **one** entry; you'll get an error if you try to get more than one.
```{r error=TRUE}
l[[c("orange", "plum")]]
```
Did you notice the `$` in the output?
That's because the dollar sign is _yet another subsetting operator_ and it works exactly the same as `[[ ]]`.
```{r}
l$plum
l[["plum"]]
```
Why are there ~~two~~ three ways to subset lists?
First: why does there need to be a difference between `[ ]` and `[[ ]]`?
```{r}
l$other <- c("rambutan", "starfruit", "tomato")
l
```
<details><summary>Yup, it really is, too...</summary>
![](images/rambutan.png)
</details>
```{r}
l[c("orange", "other")]
l[["orange"]]
l[["other"]]
c(l$other, l$orange)
```
### Subsetting Columns and Rows
The syntax for data frames (or tibbles) is very similar.
The only difference is that the single brackets `[ ]` take two arguments -- _rows_ and _columns_ -- and the double brackets `[[ ]]` (or `$`) return _columns_.
```{r}
x
x[1, 2]
x[c(1, 4), c(2, 3)]
x[1:2, c("age_dx", "age_visit")]
x$patient_id
x[["age_dx"]]
```
What if you just want to select certain rows or certain columns?
```{r}
x[c(1, 4), ]
x[, 1:2]
```
What does this code do?
```{r}
x[x$age_visit > x$age_dx, ]$tumor_size
```
#### In [dplyr]{.pkg}-speak
```r
# Base R
x[, c("age_dx", "age_visit")]
```
```{r}
x %>% select(age_dx, age_visit)
```
```r
# Base R
x[c(1, 4), ]
```
```{r}
x %>% slice(c(1, 4))
```
```r
# Base R
x[1, 2]
```
```{r}
x %>%
slice(1) %>%
select(2)
```
```r
# Base R
x[x$age_visit > x$age_dx, "tumor_size"]
```
```{r}
x %>%
filter(age_visit > age_dx) %>%
select(tumor_size)
```
```r
# Base R
x[x$age_visit > x$age_dx, ]$tumor_size
```
```{r}
x %>%
filter(age_visit > age_dx) %>%
pull(tumor_size)
```
### WTF: What the Factor?
Imagine you have a variable that records months:
```{r}
m <- c("Dec", "Oct", "Sep", "May")
```
Using a string for this variable has two problems:
1. There are only 12 possible values. What if you make a mistake?
```{r}
m2 <- c("Dev", "Oct", "Seo", "May")
```
1. You can't sort in a useful way.
```{r}
sort(m)
```
A **factor** solves both of these problems:
1. It adds the concept of a categorical *level*
```{r}
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
m <- factor(m, levels = month_levels)
m
```
1. that have an explicit order
```{r}
sort(m)
```
If you try to create a factor with a typo, you'll get a missing value
```{r}
factor(m2, levels = month_levels)
```
#### Package Spotlings: [forcats]{.pkg}
If you end up needing to work with factors, try [[forcats]{.pkg}][forcats].
[forcats]: https://forcats.tidyverse.org
[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/forcats.png" class="img-right" style="max-width: 150px" />][forcats]
> R uses factors to handle categorical variables, variables that have a fixed and known set of possible values.
Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors.
>
> These days, making factors automatically is no longer so helpful, so packages in the tidyverse never create them automatically.
However, factors are still useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display.
The goal of [the forcats package][forcats] is to provide a suite of useful tools that solve common problems with factors.
To load [[forcats]{.pkg}][forcats]:
```r
library(tidyverse)
# or
library(forcats)
```
## Project Management Tips
1. Start a new project, start a new **RStudio** project.
1. Save "clean" code: have one file that runs your analysis
- Write scratch code in console or a script
- Save working code into a _good_ script
- Restart R and run from the beginning after you lock-in each step
1. When you're comfortable, turn off saving `.RData` in RStudio Settings
![](images/rstudio-13-no-save-restore.png){.img-mw500}
## Looking Forward
### Bi-Weekly Workshops
Every other week, starting the week of September 17, 2018.
What time works for you?
<http://whenisgood.net/gerkelab/cds-r-follow-up>
#### Goals
1. Collaborative workspace for problem solving
1. Learn more programming and data processing techniques
### Potential Topics
1. Creating data-driven reports with [R Markdown](https://rmarkdown.rstudio.com/).
1. Visualizing data with [ggplot2](https://ggplot2.tidyverse.org/)
1. Writing functions for repeatable analysis
1. Using version control software to track changes
1. Accessing data stored in databases with [dbplyr](https://dbplyr.tidyverse.org/)
### Stay in Touch!