# Lab 1: Course Intro & the Very Basics {.unnumbered}
## Package(s)
- [base](https://www.r-project.org)
## Schedule
- 08.00 - 08.15: Arrival, [pre-course anonymous questionnaire](https://forms.gle/Hyxa93VdsXU5gKHw6) and [interest-based group formation](https://docs.google.com/spreadsheets/d/1P3wbjztVje5s5REtUhldcIr9gTtEMkDLJo6YMBfPOis/edit?usp=sharing)
- 08.15 - 08.45: Lecture: [Course Introduction](https://raw.githack.com/r4bds/r4bds.github.io/main/lecture_lab01.html)
- 08.45 - 09.00: Break
- 09.00 - 11.15: [Exercises](#sec-exercises)
- 11.15 - 11.30: Break
- 11.30 - 12.00: Lecture: [Reproducibility in Modern Bio Data Science](https://raw.githack.com/r4bds/r4bds.github.io/main/how_to_organise_a_project.html)
## Learning Materials {#sec-learning-materials}
*Please prepare the following materials:*
- Read the full course description here: [22100](https://kurser.dtu.dk/course/22100) / [22160](https://kurser.dtu.dk/course/22160)
- Answer the brief anonymous _R for Bio Data Science Pre-course Questionnaire_ (See schedule above)
- Read course site sections: [Welcome to R for Bio Data Science](index.qmd), [Prologue](prologue.qmd) and lastly [Getting Started](getting_started.qmd), where it is important that you **perform any small tasks mentioned**
- Book: [R4DS2e: Welcome](https://r4ds.hadley.nz)
- Book: [R4DS2e: Preface to the second edition](https://r4ds.hadley.nz/preface-2e)
- Book: [R4DS2e: Introduction](https://r4ds.hadley.nz/intro)
- Book: [R4DS2e: Chapter 2 Workflow: basics](https://r4ds.hadley.nz/workflow-basics)
- Book: [R4DS2e: Chapter 28 Quarto](https://r4ds.hadley.nz/quarto.html) (Don't do the exercises)
- Video: [RStudio for the Total Beginner](https://www.youtube.com/watch?v=FIrsOBy5k58)
- Paper: [A Quick Guide to Organizing Computational Biology Projects](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000424)
_Unless explicitly stated, do not do the per-chapter exercises in the R4DS2e book_
## Learning Objectives
*A student who has met the objectives of the session will be able to:*
- Master the very basics of R
- Navigate the RStudio IDE
- Create, edit and run a basic Quarto document
- Explain why reproducible data analysis is important, as well as identify relevant challenges and explain replicability versus reproducibility
- Describe the components of a reproducible data analysis
## Exercises {#sec-exercises}
Today, we will focus on getting you up and running with the first elements of the course, namely the RStudio IDE (Integrated Development Environment) and Quarto. If the relationship between `R` and RStudio is unclear, think of a car: `R` would be the engine and RStudio would be the rest of the car. Neither is particularly useful on its own, but together they form a functioning unit. Before you continue, make sure you did in fact watch the "RStudio for the Total Beginner" video (see the [Learning Materials](#sec-learning-materials) for today's session).
### Cloud server and the RStudio IDE
Go to the [R for Bio Data Science Cloud Server](https://teaching.healthtech.dtu.dk/22100/rstudio.php) and follow the login procedure. Upon login, you will see this:
![](images/lab01_01.png){fig-align="center" width="800px"}
First, just a quick change of settings to enable graphics. Click `Tools` $\rightarrow$ `Global Options...`. Then under `General`, choose the tab `Graphics` and under `Graphics Device`, choose `Cairo` as your `Backend`. Then click `Apply` and finally `OK`. That was easy, right? Now, please continue.
Welcome to the RStudio IDE. It allows you to consolidate all features needed to develop `R` code for analysis. Now, click `Tools` $\rightarrow$ `Global Options...` $\rightarrow$ `Pane Layout` and you will see this:
![](images/lab01_02.png){fig-align="center" width="600px"}
This outlines the four panes you have in your RStudio IDE and allows you to rearrange them as you please. Now, rearrange them so that they look like this:
![](images/lab01_03.png){fig-align="center" width="600px"}
Click `Apply` $\rightarrow$ `OK` and you should see this:
![](images/lab01_04.png){fig-align="center" width="800px"}
### First steps
#### The Console
Now, in the console, as you saw in the video, you can type commands like:
```{r}
#| eval: false
2+2
1:100
3*3
sample(1:9)
X <- matrix(sample(1:9), nrow = 3, ncol = 3)
X
sum(X)
mean(X)
?sum
sum
nucleotides <- c("a", "c", "g", "t")
nucleotides
sample(nucleotides, size = 100, replace = TRUE)
table(sample(nucleotides, size = 100, replace = TRUE))
paste0(sample(nucleotides, size = 100, replace = TRUE), collapse = "")
replicate(n = 10, expr = paste0(sample(nucleotides, size = 100, replace = TRUE), collapse = ""))
df <- data.frame(id = paste0("seq", 1:10), seq = replicate(n = 10, expr = paste0(sample(nucleotides, size = 100, replace = TRUE), collapse = "")))
df
str(df)
ls()
```
Take some time and play around with these commands and other things you can come up with. Use the `?function` to get help on what that function does. Be sure to discuss what you observe in the console. Do not worry too much about the details for now; we are just getting started. But as you hopefully can see, `R` is very flexible and basically the message is: "If you can think it, you can build it in `R`".
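The long `data.frame()` command above nests several calls inside each other. As a minimal sketch, it can be unpacked into intermediate steps. The variable names `one_draw`, `one_seq` and `seqs`, as well as the call to `set.seed()`, are illustrative additions that are not part of the original commands; the seed only makes the random draws repeatable.

```{r}
#| eval: false
# Fix the random seed so the draws below are repeatable (illustrative choice)
set.seed(42)

nucleotides <- c("a", "c", "g", "t")

# Step 1: draw 100 nucleotides at random, with replacement
one_draw <- sample(nucleotides, size = 100, replace = TRUE)

# Step 2: collapse the 100 single letters into one string
one_seq <- paste0(one_draw, collapse = "")
nchar(one_seq)

# Step 3: repeat steps 1 and 2 ten times to get ten sequences
seqs <- replicate(n = 10, expr = paste0(sample(nucleotides, size = 100, replace = TRUE), collapse = ""))

# Step 4: store the sequences in a data frame together with an id
df <- data.frame(id = paste0("seq", 1:10), seq = seqs)
str(df)
```

Running each step on its own and inspecting the result in the console is often the easiest way to read a nested command from the inside out.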
- Go to [R4DS2e Chapter 2 Workflow: basics](https://r4ds.hadley.nz/workflow-basics) and do the exercises
#### The Terminal
Notice how in the console pane, you also get a `Terminal` tab. Click it and enter:
```{bash}
#| eval: false
ls
mkdir tmp
touch tmp/test.txt
ls tmp
rm tmp/test.txt
rmdir tmp
ls
echo $SHELL
```
Here you have access to a full terminal, which can prove immensely useful! You may or may not be familiar with the concept of a terminal. Simply think of it as a way to interact with the computer using text commands rather than clicking on icons etc. Click back to the console.
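For comparison, the same kind of file handling can also be done from the `R` console. The following is a rough sketch using base `R` functions, with the closest terminal command noted in each comment. It works inside `tempdir()` rather than the current directory, which is a choice made here to avoid touching your own files; the names `tmp_dir`, `test_file` and `files_found` are illustrative.

```{r}
#| eval: false
# Work inside R's temporary directory so nothing of yours is touched
tmp_dir <- file.path(tempdir(), "tmp")
dir.create(tmp_dir)                          # similar to: mkdir tmp

test_file <- file.path(tmp_dir, "test.txt")
file.create(test_file)                       # similar to: touch tmp/test.txt

files_found <- list.files(tmp_dir)           # similar to: ls tmp
files_found

file.remove(test_file)                       # similar to: rm tmp/test.txt
unlink(tmp_dir, recursive = TRUE)            # similar to: rmdir tmp
dir.exists(tmp_dir)                          # check that the folder is gone
```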
#### The Source
The source is where you will write scripts. A script is a series of commands to be executed sequentially, i.e. first line 1, then line 2 and so on. Right now, you should have an open script called `Untitled1`. If not, you can create a new script by clicking the white paper icon with a round green plus sign in the upper left corner.
Taking inspiration from the examples above, try to write a series of commands and include a `print()`-statement at the very end. Click `File` $\rightarrow$ `Save` and save the file as e.g. `my_first_script.R`. Now, go to the console and type in the command `source("my_first_script.R")`. Congratulations! You have now written your very first reproducible R program!
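If you are unsure where to start, a script could look something like the sketch below. The contents are only an illustration built from the console commands above; the variable names `dna` and `counts` and the sequence length of 20 are arbitrary choices.

```{r}
#| eval: false
# Possible contents of my_first_script.R (illustrative only)

# Define the four nucleotides
nucleotides <- c("a", "c", "g", "t")

# Draw a random DNA sequence of 20 nucleotides
dna <- sample(nucleotides, size = 20, replace = TRUE)

# Count how many times each nucleotide was drawn
counts <- table(dna)

# Print the result at the very end of the script
print(counts)
```

Because the sequence is drawn at random, the counts will differ each time the script is sourced, but they will always add up to 20.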
### The Whole Shebang
Enough playing around, let us embark on our modern Bio Data Science in R journey.
- In the `Files` pane, click `New Folder` and create a folder called `projects`
- In the upper right corner, click where it says `Project: (None)` and then click `New Project...`
- Click `New Directory` and then `New Project`
- In the `Directory name:`, enter e.g. `r_for_bio_data_science`
- Click the `Browse...` button and select your newly created `projects` directory and then click `Choose`
- Click `Create Project` and wait a bit for it to get created
#### On Working in Projects
Projects allow you to create fully transferable bio data science projects, meaning that the root of the project will be where the `.Rproj` file is located. You can confirm this by entering `getwd()` in the console. This means that under *no circumstances* should you ever *not* work within a project, nor should you *ever* use absolute paths. Every single path you state in your project *must* be relative to the project root.
But why? Imagine you have created a project, where you have indeed used absolute paths. Now you want to share that project with a colleague. Said colleague gets your project and tests the reproducibility by running the project end-to-end. But it *completely* fails because you have hardcoded your paths to be absolute, meaning that all file and project resource locations point to locations on your laptop.
Projects are a *must* and allow you to create reproducible encapsulated bio data science projects. Note, the concept of reproducibility is absolutely central to this course and must be considered in all aspects of the life cycle of a project!
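As a small sketch of what a relative path looks like in practice, consider the commands below. The folder `data` and the file `my_data.csv` are hypothetical and only serve as an illustration, as is the absolute path shown in the comment.

```{r}
#| eval: false
# Inside a project, the working directory is the project root
getwd()

# A relative path, stated from the project root (hypothetical folder and file)
data_path <- file.path("data", "my_data.csv")
data_path

# Avoid: an absolute path that only exists on one particular computer
# bad_path <- "/home/some_user/projects/r_for_bio_data_science/data/my_data.csv"
```

The relative path stays valid when the whole project folder is moved or shared, whereas the absolute path breaks as soon as the project lives anywhere else.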
#### Quarto
While `.R`-scripts are a perfectly valid way to write scripts, there is another Skywalker:
- In the upper left corner, again, click the white paper with the round green plus, but this time select `Quarto Document`
- Enter a `Title:`, e.g. "Lab 1 Exercises" and enter your name below in the box `Author:`
- Click `Create`
- Important: Save your Quarto document! Click `File` $\rightarrow$ `Save` and name it e.g. `lab_01_exercises.qmd`
- Minimise the `Environment` pane
You should now see something like this:
![](images/lab01_05.png){fig-align="center" width="800px"}
Try clicking the `Render` button just above the Quarto document. This will create the HTML5 output file. Note! You may get a `Connection Refused` message. If so, don't worry, just close the page to return to the cloud server, find the generated `.html` file, left-click it and select `View in Web Browser`.
If you have previously worked with `Rmarkdown`, then many features of `Quarto` will be familiar. Think of `Quarto` as a complete rethinking of `Rmarkdown`, i.e. based on all the experience gained, what would the optimal way of constructing an open-source scientific and technical publishing system?
If you have previously encountered Jupyter notebooks, Quarto is similar. The basic idea is to have one document covering the entire project cycle.
Proceed to [R4DS2e Chapter 28 Quarto](https://r4ds.hadley.nz/quarto) and do the exercises.
### Bio Data Science with a Virtual AI Assistant
I am pretty sure you all know of ChatGPT by now, so let us address the elephant in the room!
#### Getting started
- Go to the [ChatGPT](https://chat.openai.com) site
- Create a user and log in
##### Let us get acquainted
Now, at the bottom it says *"Send a message"*. Let us ask a few simple questions and see if we can find out what is what. Type the following questions in the prompt and read the answers:
1. *Explain in simple terms what you are*
2. *Explain in simple terms how you work*
3. *Explain in simple terms how you can be used to generate value as a virtual AI assistant, when doing Bio Data Science in R*
4. *What is ChatGPT an abbreviation for?*
##### Moving onto R
In the upper left corner, click `ChatGPT` to start a new session.
Again, type the following questions in the prompt and read the answers:
1. *R*
2. *What is R?*
3. *Give a few simple examples*
4. *Explain the difference between base R and Tidyverse R*
If you get some code examples, try copying and pasting them into the console in RStudio to see if they run.
##### Prompt Engineering
Start a new chat and enter:
1. *Give a few simple examples*
Compare the response with the one from before. Is it the same or different, and why?
Discuss in your group what "Prompt Engineering" is and how it relates to the few tests you did above. (Bonus info: Prompt engineer is already a job and people are making money selling "prompts")
##### ...a bit more
Start a new chat and enter:
1. *Tell me about DNA*
Check whether it gives you correct information.
Now, write:
2. *Give a few fun examples on how to get started with the R programming language using DNA*
3. *Copy/paste the code into the Console, do the examples all run?*
4. *Discuss in your group if you understand what is going on*
5. *Earlier you were told to play around with some code snippets. Perhaps you didn't fully catch what was going on? If so, try to type in e.g.:*
```{verbatim}
I'm new to R, please explain in simple terms, what the following code does:
"
nucleotides <- c("a", "c", "g", "t")
replicate(n = 10, expr = paste0(sample(nucleotides, size = 100, replace = TRUE), collapse = ""))
"
```
6. *Now, you could also ask a question like this:*
```{verbatim}
I am a student in the course "R for Bio Data Science" and we have been asked to solve the following problem:
"
Now it is time to take it a step further. A) Create a function, which returns random RNA of length 'l' and B) another function, which performs reverse transcription.
"
Please give me the solution to the problem
```
7. *Try to run the code you were presumably given. Likely, the tasks have been solved, but did you actually learn something? Do you understand the details of the code?*
##### Strawberry Fields Forever
Start yet another new chat and enter:
1. *How many Rs are there in the word strawberry?*
2. *How many Rs are there in the word strawberries?*
3. *What is the difference between strawberry and strawberries?*
4. *Try as above, but using 'raspberry' and 'cranberry'*
5. *Inspect the answers, did ChatGPT get these simple questions right?*
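To check the answers independently, you can let `R` do the counting. The following is a minimal sketch in base `R`; the helper `count_letter()` and the variable `r_counts` are hypothetical names introduced here for illustration.

```{r}
#| eval: false
# Hypothetical helper: count how often a letter occurs in a word
count_letter <- function(word, letter) {
  # Split the lower-cased word into single characters
  chars <- strsplit(tolower(word), split = "")[[1]]
  # Count the characters that match the lower-cased letter
  sum(chars == tolower(letter))
}

# The words used in the questions above
words <- c("strawberry", "strawberries", "raspberry", "cranberry")

# Apply the helper to each word and show the counts
r_counts <- sapply(words, count_letter, letter = "r")
r_counts
```

Compare the counts from `R` with the answers you got in the chat.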
##### Summary
**The following is IMPORTANT, so please read carefully and please discuss in your group**
While `ChatGPT` can be a powerful tool for code productivity, it comes with a major caveat: **When it fails, it fails with confidence** (See `Strawberry Fields Forever` above). This means that it will be equally confident whether it is right or wrong! The optimal yield comes when you already have good knowledge of both the problem you are working on and the tool you are using. Then you can use your experience to evaluate whether you are heading in the right direction or being sent on a wild goose chase. In this situation, ChatGPT can be a powerful sparring partner to augment your bio data science workflow and enhance productivity.
Know the difference between "asking for a solution" and "asking for an explanation". Do not cheat yourself here: it is **very important** that the yield of this course is that **you** learn **Tidyverse R** and **not** how to "use ChatGPT to produce code".
Therefore, **do not use ChatGPT to solve the exercises**, rather use ChatGPT as a learning assistant. Here is an example of the difference:
- **Wrong:** _Give me the tidyverse code for adding a new variable `z` to a dataset, which is the sum of `x` and `y`_
- **Right:** _In simple terms, explain which tidyverse function is used to add variables to datasets and give a very basic example of how it is used_
The **wrong** way will give you the solution directly, robbing you of the learning process of internalising the functionality and translating the general explanation into the specific application to your problem. The **right** way will allow you to work with your understanding of the problem and to translate the general explanation into a specific solution to your problem.