---
title: "How to Organise a Project"
subtitle: "The most important talk you never heard"
author: "Leon Eyrich Jessen"
format:
  revealjs:
    embed-resources: true
    theme: moon
    slide-number: c/t
    width: 1600
    height: 900
    mainfont: avenir
    logo: images/r4bds_logo_small.png
    footer: "R for Bio Data Science"
---
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## The most important talk you never heard!
- Think about it...
- How many courses have you attended?
- How many classes have you taken?
- How many talks have you been to?
- Etc.
- Has anyone ever talked to you about the underlying "machinery"?
- Has anyone ever presented you with a project organisation plan?
- How were the results you are presented with produced?
- I know there is supposed to be a materials and methods section in papers - Have you ever been tasked with deciphering and repeating such a section?
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## The Cornerstone of Research
- In essence - What is it that we do?
- We produce knowledge!
- We disseminate knowledge!
- But…
- You cannot simply say I found ‘Z’
- You HAVE to be able to account for how you got from ‘A’ to ‘Z’
- Reproducible Research!
- Being able to (easily) reproduce every single result from a paper
- Why?
- Basically, we need to be able to see if you are cheating
- Others need to stand on your shoulders
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## So, how are we doing?
![](images/how_to_organise_01.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## So, how are we doing?
![](images/how_to_organise_02.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## So, how are we doing?
- "More than 70% of researchers have tried and failed to reproduce another scientist's experiments"
- "More than half have failed to reproduce their own experiments"
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## So, what can we do?
- Granted, some of the reasons for the reproducibility crisis are beyond our control
- Biology is notoriously messy
- False positives
- Etc.
- However, a step in the right direction is to think about organising and documenting your research
- I have often seen people revisit old projects only to find that they cannot figure the project out, let alone reproduce it, or that understanding the project is so time-consuming that repeating it from scratch is faster
- Why does this happen?
- Admittedly, we’re all storm chasers - always on the hunt for the next publication
- Many see documentation as a waste of valuable time
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Familiar?
![](images/how_to_organise_03.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Familiar? Waste. Of. Time! We can do better!
![](images/how_to_organise_03.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Who has taught you to organise a project?
- _"In practice, the principles behind organizing and documenting … are often learned on the fly"_
![](images/how_to_organise_04.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Let’s dive in...
- The following is inspired by this paper
![](images/how_to_organise_04.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure
![](images/how_to_organise_05.png){fig-align="center" width=90%}
- Raw data should always be pulled from a central source, never from an Excel sheet someone sent you
- You are not allowed to touch or alter the original raw data
- Make sure that every step from the raw data to the data you use for analysis can be repeated
- Save the cleaned data and proceed from that
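<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - A minimal sketch in R
The rules above can be sketched in a few lines of base R. This is only an illustration under assumptions: the project root, the `data/_raw` sub-directory name, the file names and the example values are all hypothetical and not taken from the figure.

```r
# Hypothetical project root; tempdir() keeps this sketch free of side effects
project_root <- file.path(tempdir(), "my_project")

# Assumed directory names: "data/_raw" holds untouched raw data,
# "data" holds the cleaned data derived from it
project_dirs <- c("data/_raw", "doc", "R", "results", "tmp")
for (d in project_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Stand-in for a pull from the central source (made-up example values)
raw_file <- file.path(project_root, "data", "_raw", "measurements.csv")
write.csv(data.frame(id = 1:3, value = c(2.5, NA, 4.1)), raw_file,
          row.names = FALSE)

# Load the raw data, clean it in memory and save the result under a new name;
# the raw file itself is never overwritten
raw_data <- read.csv(raw_file)
clean_data <- raw_data[!is.na(raw_data$value), ]
clean_file <- file.path(project_root, "data", "measurements_clean.csv")
write.csv(clean_data, clean_file, row.names = FALSE)
```

Because the cleaned file is written next to, not on top of, the raw file, the cleaning step can be re-run from the raw data at any time.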
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - With Collaborators
![](images/how_to_organise_06.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - Central source data flow
![](images/how_to_organise_07.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - Happy high five panda applauds you!
![](images/how_to_organise_08.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - However...
![](images/how_to_organise_06.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - However, you may not know the flow!?
![](images/how_to_organise_09.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - Sad and tired panda is disappointed...
![](images/how_to_organise_10.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - doc
![](images/how_to_organise_11.png){fig-align="center" width=90%}
- This is where your manuscript lives
- Notes, presentations, PDFs and the like pertaining to the project
- Basically anything "doc"-like...
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - R (scripts)
![](images/how_to_organise_12.png){fig-align="center" width=90%}
- This is where your analysis scripts are placed
- All scripts must be able to run from start-to-end with no manual intervention (We'll get back to that)
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - results
![](images/how_to_organise_13.png){fig-align="center" width=90%}
- Anything considered a result
- Plots
- Text file with p-value tables
- Etc.
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Directory Structure - tmp
![](images/how_to_organise_14.png){fig-align="center" width=90%}
- Anything you can delete without thinking about it
- Tests
- Stuff you want to check
- Temporary exploratory files
- Etc...
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Building your R (scripts) directory
- Load-clean-func-do philosophy
- The first scripts take your data from raw to analysis-ready
- Raw data is loaded and cleaned
- Clean data and versions thereof are saved for subsequent use
- Project-specific functions are put in a separate file
- A single do file is defined, capable of running the ENTIRE project and producing ALL results
- Collect the results in a markdown file
- Use GitHub for (code) collaboration/sharing, version control and backup
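<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Building your R (scripts) directory - A sketch of the do file
The load-clean-func-do idea above could be sketched as a single "do" file. This is a minimal sketch, not a prescribed layout: the helper `run_project()`, the file name `00_do.R` and the two-digit naming scheme (e.g. `01_load.R`, `02_clean.R`) are assumptions made for illustration.

```r
# Hypothetical "do" file, e.g. R/00_do.R: runs every numbered script in order
run_project <- function(script_dir = "R") {
  # Assumed naming scheme: two-digit prefix, e.g. 01_load.R, 02_clean.R
  scripts <- sort(list.files(script_dir, pattern = "^[0-9]{2}_.*\\.R$",
                             full.names = TRUE))
  # Skip the do file itself so that it does not call itself recursively
  scripts <- scripts[basename(scripts) != "00_do.R"]
  for (script in scripts) {
    message("Running ", basename(script))
    # A fresh environment per script: each step must read its input from disk
    source(script, local = new.env())
  }
  invisible(scripts)
}

# Usage, from the project root:
# run_project("R")
```

Running each script in its own environment is one possible design choice: a step cannot silently depend on objects left behind by an earlier step, so every script stays runnable from start to end on its own.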
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## The essence of data science
![](images/data_science_cycle.png){fig-align="center" width=50%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## The essence of data science
![](images/data_science_cycle.png){fig-align="center" width=50%}
- Repeat the inner cycle until understanding is achieved
- Value generation through understanding!
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project finalisation
- Ideally,
- Once analysis has converged, a technical report should be created using markdown
- Once the paper is published, the project directory should be frozen as read-only (and be on GitHub)
- The directory should contain everything needed to recreate all the exact figures and tables in the paper
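<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project finalisation - A sketch of freezing the directory
Freezing the project directory as read-only could look roughly like this in base R. The helper name `freeze_project()` is hypothetical, and the sketch only changes file permissions; pushing the project to GitHub is a separate step not shown here.

```r
# Hypothetical helper: make every file below project_root read-only
freeze_project <- function(project_root) {
  # Files only (hidden ones included); directories keep their permissions
  files <- list.files(project_root, recursive = TRUE, full.names = TRUE,
                      all.files = TRUE)
  # Mode "0444" grants read permission only; on Windows, Sys.chmod()
  # can only toggle the read-only attribute
  Sys.chmod(files, mode = "0444")
  invisible(files)
}

# Usage, once the paper is published:
# freeze_project("path/to/my_project")
```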
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## In conclusion
- This is not the absolute truth
- This is my (current) take on how to do data science
- Structure takes time up front in order to save time later
- Am I _always_ adhering 110% to this? No, but…
- I strongly believe that striving for structure is better than abandoning it
- A picture speaks a thousand words - Let’s try to visualise it!
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Project Organisation Overview
![](images/viz_bio_data_science_project_organisation_qmd.png){fig-align="center" width=90%}
<!--# ---------------------------------------------------------------------- -->
<!--# SLIDE ---------------------------------------------------------------- -->
<!--# ---------------------------------------------------------------------- -->
## Remember, you are ALWAYS doing collaborative data science!
_Think about readability of your code. Every project you work on is fundamentally collaborative. Even if you are not working with any other person, you are always working with future you and you really do not want to be in a situation where future you has no idea what past you was thinking, because past you will not respond to any emails! [Hadley Wickham]_