# Lab 5: Data Wrangling II {.unnumbered}
## Package(s)
- [dplyr](https://dplyr.tidyverse.org)
- [stringr](https://stringr.tidyverse.org)
- [tidyr](https://tidyr.tidyverse.org)
- [forcats](https://forcats.tidyverse.org)
- [patchwork](https://patchwork.data-imaginist.com/index.html)
- [ggseqlogo](https://omarwagih.github.io/ggseqlogo/)
- [table1](https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html)
## Schedule
- 08.00 - 08.30: [Recap of Lab 4](https://raw.githack.com/r4bds/r4bds.github.io/main/recap_lab05.html)
- 08.30 - 08.35: [Lecture](https://raw.githack.com/r4bds/r4bds.github.io/main/lecture_lab05.html)
- 08.35 - 08.45: Break
- 08.45 - 12.00: [Exercises](#sec-exercises)
## Learning Materials
*Please prepare the following materials*
- Book: [R4DS2e: Chapter 5 Data tidying](https://r4ds.hadley.nz/data-tidy)
- Book: [R4DS2e: Chapter 14 Strings](https://r4ds.hadley.nz/strings)
- Book: [R4DS2e: Chapter 16 Factors](https://r4ds.hadley.nz/factors)
- Book: [R4DS2e: Chapter 19 Joins](https://r4ds.hadley.nz/joins)
- Video: [Tidy Data and tidyr](https://youtu.be/1ELALQlO-yM?t=465) - NB! Start at 7:45 and please note: `gather()` is now `pivot_longer()` and `spread()` is now `pivot_wider()`
- Video: [Working with Two Datasets: Binds, Set Operations, and Joins](https://www.youtube.com/watch?v=AuBgYDCg1Cg)
- Video: [stringr](https://www.youtube.com/watch?v=oIu5jK8DeX8&list=PLiC1doDIe9rDwsUhd3FtN1XGCV2ES1xZ2) (Playlist with 7 short videos)
_Unless explicitly stated, do not do the per-chapter exercises in the R4DS2e book_
## Learning Objectives
*A student who has met the objectives of the session will be able to:*
- Understand and apply the various `str_*()` functions for string manipulation
- Understand and apply the family of `*_join()` functions for combining data sets
- Understand and apply `pivot_wider()` and `pivot_longer()`
- Use factors in context with plotting categorical data using `ggplot`
## Exercises {#sec-exercises}
### Prologue
Today will not be easy! But please try to remember Hadley's words of advice:
- *"The bad news is, whenever you're learning a new tool, for a long time, you're going to suck! It's gonna be very frustrating! But the good news is that that is typical and something that happens to everyone and it's only temporary! Unfortunately, there is no way to going from knowing nothing about the subject to knowing something about a subject and being an expert in it without going through a period of great frustration and much suckiness! Keep pushing through!"* - H. Wickham [(dplyr tutorial at useR 2014, 4:10 - 4:48)](https://www.youtube.com/watch?v=8SGif63VW6E)
```{r}
#| echo: false
#| eval: true
#| message: false
library("tidyverse")
library("patchwork")
library("ggseqlogo")
```
## Intro
*We are upping the game here, so expect to get stuck at some of the questions. Remember - Discuss with your group how to solve the task, revisit the materials you prepared for today and naturally, the TAs and I are happy to nudge you in the right direction. Finally, remember... Have fun!*
Remember what you have worked on so far:
- RStudio
- Quarto
- `ggplot`
- `filter`
- `arrange`
- `select`
- `mutate`
- `group_by`
- `summarise`
- The pipe and creating pipelines
- `stringr`
- joining data
- pivoting data
*That's quite a lot! Well done - You've come quite far already! Remember to think about the above tools in the following as we will synthesise your learnings so far into an analysis!*
## Background {#sec-background}
In the early 2020s, the world was hit by the coronavirus disease 2019 (COVID-19) pandemic. The pandemic was caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In Denmark, the first confirmed case of the virus was recorded on 27 February 2020.
While initially very little was known about the SARS-CoV-2 virus, we did know the general pathology of viruses. Briefly, the virus invades the cells and hijacks the intra-cellular machinery. Using the hijacked machinery, components for new virus particles are produced, eventually being packed into the viral envelope and released from the infected cell. Some of these components, viral proteins, are broken down into smaller fragments called peptides by the proteasome. These peptides are transported into the endoplasmic reticulum by the Transporter associated with Antigen Processing (TAP) protein complex. Here, aided by chaperones, they are bound to the Major Histocompatibility Complex class I (MHC-I), transported through the Golgi apparatus and finally displayed on the surface of the cell. Note that in humans, MHC is also called Human Leukocyte Antigen (HLA), and the HLA genes are the most diverse genes in the human genome. Each of us has a total of 6 HLA class I alleles, 3 from the maternal and 3 from the paternal side. These are divided into the 3 genes HLA-A, HLA-B and HLA-C, and the combination of these constitutes the HLA-haplotype of an individual. Once the peptide is bound to the MHC class I at the cell surface and exposed, the MHC-I peptide complex can be recognised by CD8+ Cytotoxic T-Lymphocytes (CTLs) via the T-cell Receptor (TCR). If a cell displays peptides of viral origin, the CTL gets activated and, via a cascade, induces apoptosis (programmed cell death) of the infected cell. The process is summarised in the figure below [@McCarthy2015].
![](images/mhc_class_I_antigen_presentation_pathway.png){fig-align="center" width="60%"}
The data we will be working with today contains data on sequenced T-cell receptors, viral antigens, HLA-haplotypes and clinical meta data for a cohort:
- "A large-scale database of T-cell receptor beta (TCR$\beta$) sequences and binding associations from natural and synthetic exposure to SARS-CoV-2" [@Nolan2020].
## Your Task Today
Today, we will emulate the situation where you are working as a Bioinformatician / Bio Data Scientist and have been given the data and the task of answering these two burning questions:
1. What characterises the peptides binding to the HLAs?
1. What characterises T-cell Receptors binding to the pMHC-complexes?
[GROUP ASSIGNMENT: Today, your assignment will be to create a micro-report on these 2 questions!]{style="color: red;"} (Important, see: [how to](assignments.qmd))
[MAKE SURE TO READ THE LAST SECTION ON THE ASSIGNMENT](#sec-assignment)
## Getting Started
_First, make sure to read and discuss the feedback you got from last week's assignment!_
1. Then, once again go to the [R for Bio Data Science RStudio Cloud Server](https://teaching.healthtech.dtu.dk/22100/rstudio.php)
1. Make sure you are in your `r_for_bio_data_science` project; you can verify this in the upper right corner
1. In the same place as your `r_for_bio_data_science.Rproj` file and existing `data` folder, create a new folder and name it `doc`
1. Go to the aforementioned [manuscript](https://www.researchsquare.com/article/rs-51964/v1). Download the PDF and upload it to your new `doc` folder
1. Open the PDF and find the link to the data
1. Go to the data site (*Note, you may have to create an account to download; it shouldn't take too long*)
1. Find and download the file `ImmuneCODE-MIRA-Release002.1.zip` (*CAREFUL, do not download the superseded files*)
1. Unpack the downloaded file
1. Find the files `peptide-detail-ci.csv` and `subject-metadata.csv` and compress them to `.zip` files
1. Upload the compressed `peptide-detail-ci.csv.zip` and `subject-metadata.csv.zip` files to your `data` folder in your RStudio Cloud session
1. Finally, once again, create a new Quarto document for today's exercises, containing the sections:
1. Background
1. Aim
1. Load Libraries
1. Load Data
1. Data Description
1. Analysis
## Creating the Micro-Report
### Background
*Feel free to copy-paste the one stated in the [background](#sec-background) section above*
### Aim
*State the aim of the micro-report, i.e. what are the questions you are addressing?*
### Load Libraries
```{r}
#| echo: false
#| eval: true
#| message: false
library("tidyverse")
```
*Load the libraries needed*
### Load Data
Read the two data sets into variables `peptide_data` and `meta_data`.
<details><p><summary>Click here for hint</summary></p>
*Think about which Tidyverse package deals with reading data and what are the file types we want to read here?*
</details>
```{r}
#| echo: false
#| eval: true
#| message: false
#| warning: false
peptide_data <- read_csv(file = "data/peptide-detail-ci.csv.gz")
meta_data <- read_csv(file = "data/subject-metadata.csv",
na = "N/A")
```
### Data Description
*It is customary to include a description of the data, helping the reader of the report, i.e. your stakeholder, to get an easy overview*
#### The Subject Meta Data
Let's take a look at the meta data:
```{r}
meta_data |>
sample_n(10)
```
- **Q1: How many observations of how many variables are in the data?**
- **Q2: Are there groupings in the variables, i.e. do certain variables "go together" somehow?**
- **T1: Re-create this plot**
Read this first:
- *Think about: What is on the x-axis? What is on the y-axis? And also, it looks like we need to do some `count`ing stratified by `Cohort` and `Gender`. Recall that we can stick together a `dplyr` pipeline with a call to `ggplot`.*
```{r}
#| echo: false
#| eval: true
meta_data |>
drop_na(Cohort, Gender) |>
count(Cohort, Gender) |>
ggplot(aes(x = Cohort,
y = n,
fill = Gender)) +
geom_col(position = position_dodge(preserve = "single"),
colour = "black",
alpha = 0.5) +
geom_hline(yintercept = 0) +
theme_minimal() +
theme(legend.position = "bottom",
panel.grid.major.x = element_blank(),
axis.text.x = element_text(angle = 10,
hjust = 1,
vjust = 1.5))
```
Does your plot look different somehow? Consider peeking at the hint...
<details><p><summary>Click here for hint</summary></p>
*Perhaps not everyone agrees on how to denote `NA`s in data. I have seen `-99`, `-11`, `_` and so on... Perhaps this can be dealt with at the point where we read the data from the file, i.e. in the actual call to your `read_csv()` function. Recall how we can get information on the parameters of a function using `?function_name`*
</details>
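*If the idea is still unclear, here is a minimal sketch (not run) of the kind of call the hint points at; the exact missing-value marker is an assumption you should verify against the raw file:*

```{r}
#| echo: true
#| eval: false
# Minimal sketch: tell read_csv() which string(s) denote missing values.
# The marker "N/A" is an assumption - check the raw csv to confirm it.
meta_data <- read_csv(file = "data/subject-metadata.csv",
                      na = "N/A")
```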
- **T2: Re-create this plot**
```{r}
#| echo: false
#| eval: true
meta_data |>
mutate(Age_group = cut(Age,
breaks = seq(from = 0,
to = 100,
by = 10))) |>
drop_na(Gender, Age_group) |>
count(Gender, Age_group) |>
ggplot(aes(x = Age_group,
y = n,
fill = Gender)) +
geom_col(position = position_dodge(preserve = "single"),
colour = "black",
alpha = 0.5) +
geom_hline(yintercept = 0) +
theme_minimal() +
theme(legend.position = "bottom",
panel.grid.major.x = element_blank(),
axis.text.x = element_text(vjust = 5))
```
<details><p><summary>Click here for hint</summary></p>
*Perhaps there is a function, which can `cut` continuous observations into a set of bins?*
</details>
##### **STOP! Make sure you handled how `NA`s are denoted in the data before proceeding, see hint below T1**
- **T3: Look at the data and create yet another plot as you see fit. Also skip the redundant variables `Subject`, `Cell Type` and `Target Type`**
```{r}
#| echo: false
#| eval: true
meta_data <- meta_data |>
select(-`Subject`, -`Cell Type`, -`Target Type`)
```
```{r}
meta_data |>
sample_n(10)
```
Now, a classic way of describing a cohort, i.e. the group of subjects used for the study, is the so-called `table1`, and while we could build this ourselves, this one time, in the interest of exercise focus and time, we are going to "cheat" and use an R package, like so:
*NB!: This may look a bit odd initially, but if you render your document, you should be all good!*
```{r}
#| echo: true
#| eval: true
#| message: false
library("table1") # <= Yes, this should normally go at the beginning!
meta_data |>
mutate(Gender = factor(Gender),
Cohort = factor(Cohort)) |>
table1(x = formula(~ Gender + Age + Race | Cohort),
data = _)
```
*Note how good this looks! If you have ever done a "Table 1" before, you know how painful they can be and especially if something changes in your cohort - Dynamic reporting to the rescue!*
Lastly, before we proceed: the `meta_data` contains HLA data for both class I and class II (see background), but here we are **only** interested in class I. Recall that the class I genes are denoted `HLA-A`, `HLA-B` and `HLA-C`, so make sure to remove any non-class I alleles, i.e. the class II ones denoted `D`-something.
- **T4: Create a new version of the `meta_data`, which with respect to allele data only contains information on class I, and also fix the odd naming, e.g. `HLA-A...9` becomes `A1` and `HLA-A...10` becomes `A2`, and so on for `B1`, `B2`, `C1` and `C2` (Think: How can we `rename` variables? And here, just do it "manually" per variable). Remember to assign this new data to the same `meta_data` variable**
<details><p><summary>Click here for hint</summary></p>
*Which `tidyverse` function subsets variables? Perhaps there is a function, which somehow `matches` a set of variables? And perhaps, for the initiated, this is compatible with regular expressions (if you don't know what this means - no worries! If you do, see if you can utilise this to simplify your variable selection)*
</details>
```{r}
#| echo: false
#| eval: true
meta_data <- meta_data |>
select(-matches(match = "D[PQR]")) |>
rename("A1" = `HLA-A...9`,
"A2" = `HLA-A...10`,
"B1" = `HLA-B...11`,
"B2" = `HLA-B...12`,
"C1" = `HLA-C...13`,
"C2" = `HLA-C...14`)
```
Before we proceed, this is the data we will carry on with:
```{r}
#| echo: true
#| eval: true
meta_data |>
sample_n(10)
```
Now we have a beautiful `tidy` dataset; recall that this entails that each row is an observation, each column is a variable and each cell holds one value.
#### The Peptide Details Data
Let's start with simply having a look see:
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(10)
```
- **Q3: How many observations of how many variables are in the data?**
This is a rather big data set, so let us start with two "tricks" to handle this. First:
1. *Write the data back into your `data` folder, using the filename `peptide-detail-ci.csv.gz`, note the appending of `.gz`, which is automatically recognised and results in gz-compression*
1. *Now, check in your data folder, that you have two files `peptide-detail-ci.csv` and `peptide-detail-ci.csv.gz`, delete the former*
1. *Adjust your reading-the-data-code in the "Load Data"-section, to now read in the `peptide-detail-ci.csv.gz` file*
<details><p><summary>Click here for hint</summary></p>
*Just as you can `read` a file, you can of course also `write` a file. Note the filetype we want to write here is `csv`. If you in the console type e.g. `readr::wr` and then hit the Tab key, you will see the different functions for writing different filetypes*
</details>
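*As a minimal sketch (not run, and assuming a reasonably recent `readr` version that infers compression from the file extension), the write-then-read round trip could look like this:*

```{r}
#| echo: true
#| eval: false
# Write the data back out; the .gz extension triggers gz-compression
write_csv(x = peptide_data, file = "data/peptide-detail-ci.csv.gz")

# Reading the compressed file works just like reading the plain csv
peptide_data <- read_csv(file = "data/peptide-detail-ci.csv.gz")
```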
Then:
- **T5: As before, let's immediately subset the `peptide_data` to the variables of interest: `TCR BioIdentity`, `Experiment` and `Amino Acids`. Remember to assign this new data to the same `peptide_data` variable to avoid cluttering your environment with redundant variables. Bonus: Did you know you can click the `Environment` pane and see which variables you have?**
```{r}
#| echo: false
#| eval: true
peptide_data <- peptide_data |>
select(Experiment, `TCR BioIdentity`, `Amino Acids`)
```
Once again, before we proceed, this is the data we will carry on with:
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(10)
```
- **Q4: Is this tidy data? Why/why not?**
- **T6: See if you can find a way to create the below data, from the above**
```{r}
#| echo: false
#| eval: true
peptide_data <- peptide_data |>
separate(col = `TCR BioIdentity`,
into = c("CDR3b", "V_gene", "J_gene"),
sep = "\\+")
```
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(size = 10)
```
<details><p><summary>Click here for hint</summary></p>
*First: Compare the two datasets and identify what happened. Did any variables "disappear" and did any "appear"? Ok, so this is a bit tricky, but perhaps there is a function to `separate` a composite (untidy) `col`umn `into` a set of new variables based on a `sep`arator? But what is a `sep`arator? Just like when you read a file with `C`omma `S`eparated `V`alues, a separator denotes how a composite string is divided into fields. So, look for such a repeated character, which seems to indeed separate the fields. Also, be aware that a character which can mean more than one thing may need to be "escaped" using two initial backslashes, i.e. "\\x", where x denotes the character needing to be "escaped"*
</details>
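*If the escaping part of the hint feels abstract, here is a small console experiment (not run, using a made-up string in the same style as `TCR BioIdentity`) showing why the separator needs escaping:*

```{r}
#| echo: true
#| eval: false
# "+" is a regex quantifier, so match it literally by escaping it: "\\+"
str_detect(string = "CASSL+TCRBV07+TCRBJ02", pattern = "\\+")
#> TRUE

# The same escaped pattern splits the composite string into its three fields
str_split(string = "CASSL+TCRBV07+TCRBJ02", pattern = "\\+")
#> [[1]]
#> [1] "CASSL"   "TCRBV07" "TCRBJ02"
```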
- **T7: Add a variable, which counts how many peptides are in each observation of `Amino Acids`**
```{r}
#| echo: false
#| eval: true
peptide_data <- peptide_data |>
mutate(n_peptides = str_count(`Amino Acids`,
pattern = ",") + 1)
```
<details><p><summary>Click here for hint</summary></p>
*We have been working with the `stringr` package; perhaps it contains a function to somehow count the number of occurrences of a given character in a string? Again, remember you can type e.g. `stringr::str_` and then hit the Tab key to see relevant functions*
</details>
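*To see the logic of the counting in the console (not run, made-up string): two separators imply three peptides, hence adding one:*

```{r}
#| echo: true
#| eval: false
# Two commas separate three peptides, so count the commas and add one
str_count(string = "AAAAA,BBBBB,CCCCC", pattern = ",") + 1
#> [1] 3
```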
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(size = 10)
```
- **T8: Re-create the following plot**
```{r}
#| echo: false
#| eval: true
peptide_data |>
ggplot(aes(x = n_peptides)) +
geom_histogram(binwidth = 1,
colour = "black",
alpha = 0.5) +
geom_hline(yintercept = 0) +
scale_x_continuous(breaks = 1:13) +
theme_minimal() +
theme(panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
axis.text.x = element_text(vjust = 5)) +
labs(x = "Number of peptides per observation",
y = "Counts")
```
- **Q5: What is the maximum number of peptides assigned to one observation?**
- **T9: Using the `str_c()` and the `seq()` functions, re-create the below**
```{r}
#| echo: false
#| eval: true
str_c("peptide_", seq(1:5))
```
<details><p><summary>Click here for hint</summary></p>
*If you're uncertain about how a function works, try going into the console and, in this case, e.g. type `str_c("a", "b")` and `seq(from = 1, to = 3)` and see if you can combine them*
</details>
- **T10: Use what you learned about separating in T6 and the vector-of-strings you created in T9, adjusted to the number from Q5, to create the below data**
```{r}
#| echo: false
#| eval: true
#| warning: false
peptide_data <- peptide_data |>
separate(col = `Amino Acids`,
into = str_c("peptide_", 1:13),
sep = ",")
```
<details><p><summary>Click here for hint</summary></p>
*In the console, write `?separate` and think about how you used it earlier. Perhaps you can not only specify a vector to separate `into`, but also specify a function, which returns a vector?*
</details>
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(size = 10)
```
- **Q6: Now, presumably you got a warning; discuss in your group why that is**
- **Q7: With respect to the `peptide_1`, `peptide_2`, ... variables, discuss in your group whether this is wide or long data**
Now we will finally use what we prepared for today: data pivoting. There are two functions, namely `pivot_wider()` and `pivot_longer()`. We will also use a trick for developing a data pipeline while working with new functions that one might not be completely comfortable with yet. You have seen the `sample_n()` function several times above; we can use it to randomly sample `n` observations from the data. This allows us to work with a smaller data set during the development phase, and once we are ready, we can increase `n` gradually to see if everything continues to work as anticipated.
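*A minimal sketch of that development trick (not run; the exact values of `n` are just examples) could look like this:*

```{r}
#| echo: true
#| eval: false
# Prototype the pipeline on a small random sample first...
peptide_data |>
  sample_n(size = 1000)

# ...then increase n gradually once everything works as expected
peptide_data |>
  sample_n(size = 100000)
```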
- **T11: Using the `peptide_data`, run a few `sample_n()` calls with varying degree of `n` to make sure, that you get a feeling for what is going on**
- **T12: From the `peptide_data` above, with `peptide_1`, `peptide_2`, etc., create this data set using one of the data pivoting functions. Remember to start by sampling a smaller data set and work on that first! Also, once you're sure you're good to go, reuse the `peptide_data` variable, as we don't want huge redundant data sets floating around in our environment**
```{r}
#| echo: false
#| eval: true
peptide_data <- peptide_data |>
pivot_longer(cols = contains("peptide_"),
names_to = "peptide_n",
values_to = "peptide")
```
<details><p><summary>Click here for hint</summary></p>
If the pivoting is not clear at all, then do what I do, create some example data:
```{r}
#| echo: true
#| eval: false
my_data <- tibble(
id = str_c("id_", 1:10),
var_1 = round(rnorm(10),1),
var_2 = round(rnorm(10),1),
var_3 = round(rnorm(10),1))
```
...and then play around with that. A small set like the one above is easy to handle, so perhaps start with that and then pivot back and forth a few times using `pivot_wider()`/`pivot_longer()`. Use `View()` to inspect and get a better overview of the results of pivoting.
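For instance, a sketch of such back-and-forth pivoting on the example data (not run; the names `my_data_long` and `my_data_wide` are just for illustration) could be:

```{r}
#| echo: true
#| eval: false
# From wide to long: one row per id/variable combination
my_data_long <- my_data |>
  pivot_longer(cols = starts_with("var_"),
               names_to = "variable",
               values_to = "value")

# ...and back from long to wide again
my_data_wide <- my_data_long |>
  pivot_wider(names_from = variable,
              values_from = value)
```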
</details>
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(10)
```
- **Q8: You will see some `NA`s in the `peptide` variable; discuss in your group where these arise from**
- **Q9: How many rows and columns are there now, and how does this compare with Q3? Discuss why it is or is not different**
- **T13: Now, lose the redundant variables `n_peptides` and `peptide_n`, get rid of the `NA`s in the `peptide` column, and make sure that we only have unique observations (i.e. there are no repeated rows/observations).**
```{r}
#| echo: false
#| eval: true
peptide_data <- peptide_data |>
select(-n_peptides, -peptide_n) |>
drop_na(peptide) |>
distinct()
```
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(10)
```
- **Q10: Now how many rows and columns are there, and is this data tidy? Discuss in your group why/why not**
Again, we turn to the `stringr` package, as we need to make sure that the sequence data does indeed only contain valid characters. There are a total of 20 proteinogenic amino acids, which we symbolise using `ARNDCQEGHILKMFPSTWYV`.
- **T14: Use the `str_detect()` function to `filter` the `CDR3b` and `peptide` variables using a `pattern` of `[^ARNDCQEGHILKMFPSTWYV]`, and then play with the `negate` parameter to see what happens**
```{r}
#| echo: false
#| eval: true
peptide_data <- peptide_data |>
filter(str_detect(CDR3b, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE),
str_detect(peptide, pattern = "[^ARNDCQEGHILKMFPSTWYV]", negate = TRUE))
```
<details><p><summary>Click here for hint</summary></p>
*Again, try playing around a bit with the function in the console: type e.g. `str_detect(string = "ARND", pattern = "A")` and `str_detect(string = "ARND", pattern = "C")`, and then recall that the `filter()` function requires a logical vector, i.e. a vector of `TRUE` and `FALSE`, to filter the rows*
</details>
- **T15: Add two new variables to the data, `k_CDR3b` and `k_peptide` each signifying the length of the respective sequences**
```{r}
#| echo: false
#| eval: true
peptide_data <- peptide_data |>
mutate(k_CDR3b = str_length(CDR3b),
k_peptide = str_length(peptide))
```
<details><p><summary>Click here for hint</summary></p>
*Again, we're working with strings, so perhaps there is a package of interest and perhaps in that package, there is a function, which can get the length of a string?*
</details>
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(10)
```
- **T16: Re-create this plot**
```{r}
#| echo: false
#| eval: true
peptide_data |>
ggplot(aes(x = k_CDR3b)) +
geom_histogram(binwidth = 1,
colour = "black",
alpha = 0.5) +
geom_hline(yintercept = 0) +
theme_minimal()
```
- **Q11: What is the most predominant length of the CDR3b-sequences?**
- **T17: Re-create this plot**
```{r}
#| echo: false
#| eval: true
peptide_data |>
ggplot(aes(x = k_peptide)) +
geom_histogram(binwidth = 1,
colour = "black",
alpha = 0.5) +
geom_hline(yintercept = 0) +
theme_minimal()
```
- **Q12: What is the most predominant length of the peptide-sequences?**
- **Q13: Discuss in your group whether this data set is tidy or not**
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(10)
```
#### Creating one data set from two data sets
Before we move on to using the family of `*_join()` functions you prepared for today, we will just take a quick peek at the meta data again:
```{r}
#| echo: true
#| eval: true
meta_data |>
sample_n(10)
```
Remember you can scroll in the data.
- **Q14: Discuss in your group whether this data, with respect to the `A1`, `A2`, `B1`, `B2`, `C1` and `C2` variables, is in a wide or a long data format**
As with the `peptide_data`, we will now have to use data pivoting again. I.e.:
- **T18: Use either `pivot_wider()` or `pivot_longer()` to create the following data:**
```{r}
#| echo: false
#| eval: true
meta_data <- meta_data |>
pivot_longer(cols = c("A1", "A2", "B1", "B2", "C1", "C2"),
names_to = "Gene",
values_to = "Allele")
```
```{r}
#| echo: true
#| eval: true
meta_data |>
sample_n(10)
```
Remember, what we are aiming for here, is to create one data set from two. So:
- **Q15: Discuss in your group which variable(s) define the same observations in the `peptide_data` and the `meta_data`**
Once you have agreed upon `Experiment`, then use that knowledge to subset the `meta_data` to the variables-of-interest:
```{r}
#| echo: false
#| eval: true
meta_data <- meta_data |>
select(Experiment, Allele)
```
```{r}
#| echo: true
#| eval: true
meta_data |>
sample_n(10)
```
Use the `View()` function again to look at the `meta_data`. Notice something? Some alleles are e.g. `A*11:01`, whereas others are `B*51:01:02`. You can find information on why by visiting [Nomenclature for Factors of the HLA System](http://hla.alleles.org/nomenclature/naming.html).
Long story short, we only want to include `Field 1` (allele group) and `Field 2` (specific HLA protein). You have prepared the `stringr` package for today. See if you can find a way to reduce e.g. `B*51:01:02` to `B*51:01` and create a new variable `Allele_F_1_2` accordingly, while also removing the `...x` (where `x` is a number) subscripts from the `Gene` variable (an artifact from having the data in a wide format, where you cannot have two variables with the same name), and also remove any `NA`s and empty strings (`""`) denoting empty entries.
<details><p><summary>Click here for hint</summary></p>
*There are several ways this can be achieved, the easiest being to consider whether a part of the string, based on indices, could be of interest. Such "a part of a string" is called a substring; perhaps the `stringr` package contains a function to work with substrings? In the console, type `stringr::` and hit `tab`. This will display the functions available in the `stringr` package. Scroll down, find the functions starting with `str_` and look for one which might be relevant, and remember you can use `?function_name` to get more information on how a given function works.*
</details>
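*Here is a sketch (not run) of the index-based idea the hint alludes to; it assumes every allele of interest has two-digit fields, so that Field 1 and Field 2 always occupy the first seven characters (e.g. `A*11:01`) - verify this before relying on it, as a pattern-based approach is more general:*

```{r}
#| echo: true
#| eval: false
# Keep only the first seven characters, e.g. "B*51:01:02" -> "B*51:01"
# (assumes two-digit fields; a pattern-based approach is more robust)
meta_data |>
  mutate(Allele_F_1_2 = str_sub(Allele, start = 1, end = 7))
```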
```{r}
#| eval: true
#| echo: false
meta_data <- meta_data |>
mutate(Allele_F_1_2 = str_extract(Allele,
pattern = "[ABC]\\*\\d+\\:\\d+")) |>
filter(Allele_F_1_2 != "") |>
drop_na()
```
- **T19: Create the following data, according to specifications above:**
```{r}
#| echo: true
#| eval: true
meta_data |>
sample_n(10)
```
The asterisk, i.e. `*`, is a rather annoying character because of its ambiguity, so:
- **T20: Clean the data a bit more, by removing the asterisk and redundant variables:**
```{r}
#| echo: false
#| eval: true
meta_data <- meta_data |>
mutate(Allele = str_remove(Allele_F_1_2, "\\*")) |>
select(-Allele_F_1_2) |>
distinct()
```
```{r}
#| echo: true
#| eval: true
meta_data |>
sample_n(size = 10)
```
<details><p><summary>Click here for hint 1</summary></p>
*Again, the `stringr` package may come in handy. Perhaps there is a function to `remove` one or more such pesky characters?*
</details>
<details><p><summary>Click here for hint 2</summary></p>
*Getting a weird error? Recall that character ambiguity needs to be "escaped"; you did this somehow earlier on...*
</details>
Recall the `peptide_data`?
```{r}
#| echo: true
#| eval: true
peptide_data |>
sample_n(10)
```
- **T21: Create a `dplyr` pipeline, starting with the `peptide_data`, which joins it with the `meta_data`, and remember to make sure that you get only unique observations of rows. Save this data into a new variable named `peptide_meta_data` (if you get a warning, discuss in your group what it means)**
```{r}
#| echo: false
#| eval: true
peptide_meta_data <- peptide_data |>
full_join(meta_data,
by = "Experiment",
relationship = "many-to-many") |>
distinct()
```
<details><p><summary>Click here for hint 1</summary></p>
*Which family of functions do we use to join data? Also, perhaps here it would be prudent to start by working on a smaller data set; recall that we can sample a number of rows to yield a smaller development data set*
</details>
<details><p><summary>Click here for hint 2</summary></p>
*You should get a data set of around 3,000,000+ rows, **take a moment to consider how that would have been to work with in Excel!** Also, in case the servers are not liking this, you can consider subsetting the `peptide_data` prior to joining, to e.g. 100,000 or 10,000 rows.*
</details>
```{r}
#| echo: true
#| eval: true
peptide_meta_data |>
sample_n(10)
```
### Analysis
Now, that we have the data in a prepared and ready-to-analyse format, let us return to the two burning questions we had:
1. What characterises the peptides binding to the HLAs?
1. What characterises T-cell Receptors binding to the pMHC-complexes?
#### Peptides binding to HLA
As we have touched upon multiple times, `R` is *very* flexible and naturally you can also create sequence logos. Finally, let us create a binding motif using the package `ggseqlogo` ([More info here](https://omarwagih.github.io/ggseqlogo/)).
- **T22: Subset the final `peptide_meta_data` data to `A02:01` and unique observations of peptides of length 9 and re-create the below sequence logo**
<details><p><summary>Click here for hint</summary></p>
*You can pipe a vector of peptides into `ggseqlogo`, but perhaps you first need to `pull` that vector from the relevant variable in your tibble? Also, before that, consider that you'll need to make sure you are only looking at peptides of length 9*
</details>
```{r}
#| echo: false
#| eval: true
#| fig-align: center
peptide_meta_data |>
filter(Allele == "A02:01",
str_length(peptide) == 9) |>
select(peptide) |>
distinct() |>
pull(peptide) |>
ggseqlogo()
```
- **T23: Repeat for e.g. `B07:02` or another of your favourite alleles**
Now, let's take a closer look at the sequence logo:
- **Q16: Which positions in the peptide determine binding to HLA?**
<details><p><summary>Click here for hint</summary></p>
*Recall your Introduction to Bioinformatics course? And/or perhaps ask your fellow group members if they know?*
</details>
#### CDR3b-sequences binding to pMHC
- **T24: Subset the `peptide_meta_data`, such that the length of the CDR3b is 15, the allele is A02:01 and the peptide is LLFLVLIML and re-create the below sequence logo of the CDR3b sequences:**
```{r}
#| echo: false
#| eval: true
#| fig-align: center
peptide_meta_data |>
filter(k_CDR3b == 15,
Allele == "A02:01",
peptide == "LLFLVLIML") |>
pull(CDR3b) |>
ggseqlogo()
```
- **Q17: In your group, discuss what you see**
- **T25: Play around with other combinations of `k_CDR3b`, `Allele`, and `peptide` and inspect how the logo changes**
*Disclaimer: In this data set, we only get that a given CDR3b was found to recognise a given peptide in a given subject, and that that subject had a given haplotype - something's missing... Perhaps if you have had immunology, you can spot it? There is a trick to get around this missing information, but that's beyond the scope of what we're working with here.*
## Epilogue
That's it for today - I know this is overwhelming now, but commit to it and you WILL be plenty rewarded! I hope today was at least a glimpse into the flexibility and capabilities of using the `tidyverse` for applied Bio Data Science.
...also, did you notice something? We spent maybe 80% of the time here dealing with data wrangling, and then once we were good to go, the analysis wasn't that time consuming - that's often the way it ends up going. You'll spend a lot of time on data handling, and getting the tidyverse toolbox in your tool belt will allow you to be much more efficient in your data wrangling, so you can get to the fun part as quickly as possible!
## [Today's Assignment]{style="color: red;"} {#sec-assignment}
After today, we are halfway through the labs of the course, so now is a good time to spend some time recalling what we have been over and practising writing a reproducible Quarto-report.
Your group assignment today is to condense the exercises into a group micro-report! Talk together and figure out how to distil the exercises from today into one small end-to-end runnable reproducible micro-report. DO NOT include ALL of the exercises, but rather include as few steps as possible to arrive at your results. Be very concise!
But WHY? WHY are you not specifying exactly what we need to hand in? *Because we are practising making independent decisions, which is crucial in applied bio data science. So take a look at the combined group code, select relevant sections and condense - if you don't make it all the way through the exercises, then condense and present what you were able to arrive at! What do you think is central/important/indispensable? Also, these hand-ins are NOT for us to evaluate you, but for you to train creating products and to get feedback on your progress!*
**IMPORTANT:** Remember to check the [ASSIGNMENT GUIDELINES](https://r4bds.github.io/assignments.html)
...and as always - Have fun!