-
Notifications
You must be signed in to change notification settings - Fork 5
/
lab04.qmd
376 lines (260 loc) · 15.4 KB
/
lab04.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
# Lab 4: Data Wrangling I {.unnumbered}
## Package(s)
- [dplyr](https://dplyr.tidyverse.org)
- [readr](https://readr.tidyverse.org)
- [tibble](https://tibble.tidyverse.org)
## Schedule
- 08.00 - 08.15: [Recap of Lab 3](https://raw.githack.com/r4bds/r4bds.github.io/main/recap_lab04.html)
- 08.15 - 08.30: Assignment 2 walk-through
- 08.30 - 09.00: [Lecture](https://raw.githack.com/r4bds/r4bds.github.io/main/lecture_lab04.html)
- 09.00 - 09.15: Break
- 09.00 - 12.00: [Exercises](#sec-exercises)
## Learning Materials
_Please prepare the following materials:_
- Book: [R4DS2e: Chapter 3 Data transformation](https://r4ds.hadley.nz/data-transform)
- Book: [R4DS2e: Chapter 4 Workflow: Code Style](https://r4ds.hadley.nz/workflow-style)
- Book: [R4DS2e: Chapter 7 Data import](https://r4ds.hadley.nz/data-import)
- Book: [R4DS2e: Chapter 8 Workflow: getting help](https://r4ds.hadley.nz/workflow-help)
- Web: [What is data wrangling? Intro, Motivation, Outline, Setup -- Pt. 1 Data Wrangling Introduction](https://www.youtube.com/watch?v=jOd65mR1zfw)
- Web: (NB! STOP at 7:45, i.e. skip tidyr) [Tidy Data and tidyr -- Pt 2 Intro to Data Wrangling with R and the Tidyverse](https://www.youtube.com/watch?v=1ELALQlO-yM)
- Web: [Data Manipulation Tools: dplyr -- Pt 3 Intro to the Grammar of Data Manipulation with R](https://www.youtube.com/watch?v=Zc_ufg4uW4U)
_Unless explicitly stated, do not do the per-chapter exercises in the R4DS2e book_
## Learning Objectives
_A student who has met the objectives of the session will be able to:_
- Understand and apply the 6 basic dplyr verbs: `filter()`, `arrange()`, `select()`, `mutate()`, `summarise()` and `group_by()`
- Construct and apply logical statements in context with `dplyr` pipelines
- Understand and apply the additional verbs `count()`, `drop_na()`, `View()`
- Combine dplyr verbs to form a data manipulation pipeline using the pipe `|>` operator
- Decipher the components and functions hereof in a `dplyr` pipeline
## Exercises {#sec-exercises}
```{r}
#| echo: false
#| eval: true
#| message: false
library("tidyverse")
diabetes_data <- read_csv(file = "data/diabetes.csv")
```
*Important: As we progress, avoid using black-box do-everything-with-one-command R-packages like e.g. `ggpubr` - I want you to learn the underlying technical bio data science! ...and why is that? Because if you use these types and-then-magic-happens packages, you will be limited to their functionality and what if you want to create something custom? Something beyond stock? Some awesome idea that you got? If you truly understand ggplot - the sky is the limit! Basically, it's the cliché: "If you give a man a fish, you feed him for a day. If you teach a man to fish, you feed him for a lifetime"*
### Getting Started
_First, make sure to read and discuss the feedback you got from last week's assignment!_
Then, once again go to the [R for Bio Data Science RStudio Cloud Server](https://teaching.healthtech.dtu.dk/22100/rstudio.php)
First things first:
- Let's create a new Quarto Document for today's exercises, e.g. `lab04_exercises.qmd`
### Brief refresh of Quarto so far
Recall the syntax for a new code chunk, where all your `R` code goes. Any text and notes must be _outside_ the chunk tags:
```{r}
#| echo: fenced
# Here the code goes
1 + 1
x <- c(1, 2, 3)
mean(x)
```
Our markdown goes outside the code-chunks, e.g.:
````
# Header level 1
## Header level 2
### Header level 3
*A text in italics*
**A text in bold**
Normal text describing and explaining
````
Now, in your new Quarto document, create a new `Header level 2`, e.g.:
````
## Load Libraries
````
and under your new header, add a new code chunk, like so
```{r}
#| echo: fenced
#| message: false
library("tidyverse")
```
And run the chunk. This will load our data science toolbox, including `dplyr` (and `ggplot`). But wait, what does `#| message: false` do? 🤷️ 🤔
**Try to structure your Quarto document as your course notes, i.e. add notes while solving the exercises aiming to create a reference that you can revisit for your final project work... And... For future reference after this course, where hopefully you'll really reap what you sow**
Bonus info: We are engineers, so of course, we love equations, we can include standard $\LaTeX$ syntax, e.g.:
````
$E(x) = \frac{1}{n} \cdot \sum_{i=1}^{n} x_{i}$
````
Try it!
### A few handy short cuts
Insert new code chunk:
- Mac: CMD + OPTION + i
- Windows: CTRL + ALT + i
Render my Quarto document:
- Mac: CMD + SHIFT + k
- Win: CTRL + SHIFT + k
Run line in chunk:
- Mac: CMD + ENTER
- Win: CTRL + ENTER
Run entire chunk:
- Mac: CMD + SHIFT + ENTER
- Win: CTRL + SHIFT + ENTER
Insert the pipe symbol `|>`:
- Mac: CMD + SHIFT + m
- Win: CTRL + SHIFT + m
*Note, if you're trying this out and you see `%>%` instead of `|>`, then go to `Tools`, `Global Options...`, `Code` and check the box `Use native pipeoperator, |>`*
## A few initial questions
First, if you don't feel completely comfortable with the `group_by |> summarise` workflow, then no worries - Feel free to visit this short [R Tutorial: Grouping and summarizing](https://www.youtube.com/watch?v=zAlbrPozMHI)
Then, in your groups, discuss the following primer questions. Note, when asked for _"what is the output"_, do not run the code in the console, instead try to **talk and think about it** and write your answers and notes in your Quarto document for the day:
First, in a new chunk, run `tibble(x = c(4, 3, 5, 1, 2))`, so you understand what it does, then - Discuss in your group, what is the output of, remember first talk, then check understanding by running code:
- **Q1: `tibble(x = c(4, 3, 5, 1, 2)) |> filter(x > 2)`?**
- **Q2: `tibble(x = c(4, 3, 5, 1, 2)) |> arrange(x)`?**
- **Q3: `tibble(x = c(4, 3, 5, 1, 2)) |> arrange(desc(x))`?**
- **Q4: `tibble(x = c(4, 3, 5, 1, 2)) |> arrange(desc(desc(x)))`?**
- **Q5: `tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(x)`?**
- **Q6: `tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(y)`?**
- **Q7: `tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(-x)`?**
- **Q8: `tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) |> select(-x, -y)`?**
- **Q9: `tibble(x = c(4, 3, 5, 1, 2)) |> mutate(x_dbl = 2*x)`?**
- **Q10: `tibble(x = c(4, 3, 5, 1, 2)) |> mutate(x_dbl = 2 * x, x_qdr = 2*x_dbl)`?**
- **Q11: `tibble(x = c(4, 3, 5, 1, 2)) |> summarise(x_mu = mean(x))`?**
- **Q12: `tibble(x = c(4, 3, 5, 1, 2)) |> summarise(x_max = max(x))`?**
- **Q13: `tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, NA, 5, 1, 2)) |> group_by(lbl) |> summarise(x_mu = mean(x), x_max = max(x))`?**
- **Q14: `tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) |> group_by(lbl) |> summarise(n = n())`?**
- **Q15: `tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) |> count(lbl)`?**
In the following, return to these questions and your answers for reference on the `dplyr` verbs!
## Load data
Wait... If your write some `R`-code, which for some reason does not run, `R` will let you know what went wrong. Make sure to **read** these error messages. During this course the teaching team is available, but after the course, you'll have to be comfortable with `debugging`, i.e. finding and fixing errors in your code. By-the-way, did you know the term `debugging` is attested to [Admiral Grace Hopper](https://aws.amazon.com/what-is/debugging/#:~:text=The%20term%20debugging%20can%20be,they%20were%20debugging%20the%20system.)?
Let's continue - Again, add a new header to your Quarto document, e.g. `## Load Data`, then:
1. Go to the [Vanderbilt Biostatistics Datasets](https://hbiostat.org/data/) site
1. Find **Diabetes data** and download the `diabetes.csv` file
1. You should have a `data` folder, if not, then in the `Files` pane, click the `New Folder` button, enter folder name `data` and click `ok`
1. Now, click on the folder you created
1. Click the ![](images/button_upload.png) `Upload` button and navigate to the `diabetes.csv` file you downloaded
1. Clicking the two dots `..` above the file you uploaded, look for ![](images/paths_dir_up.png) `..`, will take you one level up in your project path
1. Insert a new code chunk in your Quarto document
1. Add and then run the following code
```{r}
#| eval: false
#| message: false
diabetes_data <- read_csv(file = data/diabetes.csv)
diabetes_data
```
Then realise that we could simply have run the following code to do the exact same thing (Yes, `readr` is pretty nifty):
```{r}
#| echo: true
#| eval: false
# Create the data directory programmatically
dir.create(path = "data")
# Retrieve the data directly
diabetes_data <- read_csv(file = "https://hbiostat.org/data/repo/diabetes.csv")
# Write the data to disk
write_csv(x = diabetes_data,
file = "data/diabetes.csv")
```
_Just remember the echo/eval trick from last session to avoid retrieving online data each time you render your Quarto document_
## Work with the diabetes data set
First, take a few minutes to read about this dataset. Go to the [Vanderbilt Department of Biostatistics site](https://hbiostat.org/data/) and find the `Diabetes data` header and directly below that, click the link `diabetes.html`. This will contain some meta data on the data.
Now, use the pipe `|>` to use the `View()` function to inspect the data set. Note, if you click the ![](images/button_view.png)-button, you will get a spreadsheet-like view of the data, allowing you to get an overview (Psst... No need to use Excel...).
- **Q1: How many observations and how many variables?**
- **Q2: Is this a tidy data set? Which three rules must be satisfied?**
- **Q3: When you run the chunk, then underneath each column name is stated `<chr>` and `<dbl>` what is that?**
Before we continue
- **T1: Change the `height`, `weight`, `waist` and `hip` from the imperial system (inches/pounds) to the metric system (cm/kg), rounding to 1 decimal**
```{r}
#| echo: false
diabetes_data <- diabetes_data |>
mutate(height = round(height * 2.54, 1),
weight = round(weight * 0.453592, 1),
waist = round(waist * 2.54, 1),
hip = round(hip * 2.54, 1))
```
Let us try to take a closer look at various subsets of the data. For the following questions, "How many ..." refers to the number of rows in the subset of the data you create:
- **Q4: How many weigh less than 100kg?**
- **Q5: How many weigh more than 100kg?**
- **Q6: How many weigh more than 100kg and are less than 1.6m tall?**
- **Q7: How many women are taller than 1.8m?**
- **Q8: How many men are taller than 1.8m?**
- **Q9: How many women in Louisa are older than 30?**
- **Q10: How many men in Buckingham are younger than 30 and taller than 1.9m?**
- **T2: Make a scatter plot of `weight` versus `height` and colour by sex for inhabitants of Louisa above the age of 40**
- **T3: Make a box plot of height versus location stratified on sex for people above the age of 50**
Sorting columns can aid in getting an overview of variable ranges (don't use the `summary()` function yet for this one)
- **Q11: How old is the youngest person?**
- **Q12: How old is the oldest person?**
- **Q13: Of all the 20-year-olds, what is the height of the tallest?**
- **Q14: Of all the 20-year-olds, what is the height of the shortest?**
Choosing specific columns can be used to work with a subset of the data for a specific purpose
- **Q15: How many columns (variables) `starts_with` a "b"?**
- **Q16: How many columns (variables) `contains` the word "eight"?**
Creating new variables is an integral part of data manipulation
- **T4: Create a new variable, where you calculate the BMI**
```{r}
#| echo: false
diabetes_data <- diabetes_data |>
mutate(BMI = weight / (height / 100)^2)
```
- **T5: Create a `BMI_class` variable**
*Take a look at the following code snippet to get you started:*
```{r}
#| echo: true
#| eval: false
tibble(x = rnorm(10)) |>
mutate(trichotomised = case_when(
x < -1 ~ "Less than -1",
-1 <= x & x < 1 ~ "larger than or equal to -1 and smaller than 1",
1 <= x ~ "Larger than or equal to 1"))
```
*and then go read about [BMI classification here](https://www.ncbi.nlm.nih.gov/books/NBK541070/) and discuss in your group how to extract classifications from the Definition/Introduction section*
Note, the `cut()` function could be used here, but you should try to use `case_when()` as illustrated in the example chunk above.
```{r}
#| echo: false
diabetes_data <- diabetes_data |>
mutate(BMI_class = case_when(BMI < 16.5 ~ "Severely underweight",
16.5 <= BMI & BMI < 18.5 ~ "Underweight",
18.5 <= BMI & BMI < 25.0 ~ "Normal weight",
25.0 <= BMI & BMI < 30.0 ~ "Overweight",
30.0 <= BMI & BMI < 35.0 ~ "Obesity class I",
35.0 <= BMI & BMI < 40.0 ~ "Obesity class II",
40.0 <= BMI ~ "Obesity class III"))
```
Once you have created the variable, you will need to convert it to a categorical variable, in `R`, these are called a `factor` and you can set the levels like so:
```{r}
diabetes_data <- diabetes_data |>
mutate(BMI_class = factor(BMI_class,
levels = c("my 1st category", "my 2nd category",
"my 3rd category", "my nth category")))
```
This is **very** important for plotting, as this will determine the order in which the categories appear on the plot!
- **T6: Create a box plot of `hdl` versus BMI_class**
- **Q17: What do you see?**
- **T7: Create a `BFP` (Body fat percentage) variable**
<details>
<p><summary>Click here for hint</summary></p>
BFP can be calculated using the following equation from @Jackson2002:
$$BFP = 1.39 \cdot BMI + 0.16 \cdot age - 10.34 \cdot sex - 9$$
where $sex$ is defined as being $0$ for female and $1$ for male.
</details>
```{r}
#| echo: false
diabetes_data <- diabetes_data |>
mutate(sex_class = case_when(gender == "female" ~ 0,
gender == "male" ~ 1),
BFP = 1.39 * BMI + 0.16 * age - 10.34 * sex_class - 9,
WHR = waist / hip)
```
* **T8: Create a `WHR` (waist-to-hip ratio) variable**
* **Q18: What correlates better with `BMI`: `WHR` or `BFP`?** <span style="color: red;">GROUP ASSIGNMENT</span> (Important, see: [how to](assignments.qmd))
<details>
<p><summary>Click here for hint</summary></p>
*Is there a certain plot-type, which can visualise if the relationship between two variables and give insights to if they are correlated? Can you perhaps use an `R` function to compute the "correlation coefficient"?. Do not use e.g. `ggpubr` or any other and-then-magic-happens package, use the knowledge you've gained so far)*
</details>
Now, with this augmented data set, let us create some summary statistics
- **Q19: How many women and men are there in the data set?**
- **Q20: How many women and men are there from Buckingham and Louisa respectively in the data set?**
- **Q21: How many are in each of the `BMI_class` groups?**
- **Q22: Given the code below, explain the difference between A and B?**
```{r}
#| eval: false
# A
diabetes_data |>
ggplot(aes(x = BMI_class)) +
geom_bar()
# B
diabetes_data |>
count(BMI_class) |>
ggplot(aes(x = BMI_class, y = n)) +
geom_col()
```
- **T9: For each `BMI_class` group, calculate the average weight and associated standard deviation**
- **Q23: What was the average age of the women living in Buckingham in the study?**
Finally, if you reach this point and there is still time left. Take some time to do some exploratory plots of the data set and see if you can find something interesting.