-
Notifications
You must be signed in to change notification settings - Fork 1
/
03_project-organization.qmd
303 lines (197 loc) · 13.1 KB
/
03_project-organization.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
---
title: "Project Organization"
abstract: "This section covers the basics of organizing and documenting a reproducible analysis."
format:
html:
toc: true
code-line-numbers: true
---
![Architectural drawing, Canada, 1936](images/Joy_Oil_gas_station_blueprints.jpg)
```{r hidden-here-load}
#| include: false
exercise_number <- 1
```
```{r}
#| echo: false
#| warning: false
library(tidyverse)
library(gt)
source("src/motivation.R")
```
```{r}
#| label: tbl-roadmap
#| tbl-cap: "Opinionated Analysis Development"
#| echo: false
motivation |>
filter(!is.na(Section), Section == "Project Organization") |>
select(-`Analysis Feature`) |>
arrange(Section) |>
gt() |>
tab_header(
title = "Opinionated Analysis Development"
) |>
tab_footnote(
footnote = "Added by Aaron R. Williams",
locations = cells_column_labels(columns = c(Tool, Section))
) |>
tab_source_note(
source_note = md("**Source:** Parker, Hilary. n.d. “Opinionated Analysis Development.” https://doi.org/10.7287/peerj.preprints.3210v1.")
)
```
To achieve reproducibility, we need to capture every step of an analysis from loading and cleaning the data to creating figures and visualizations. Recording and documenting every step ensures that others (and future you!) has a fighting chance and re-creating the analysis. We'll start with three pillars of project organization:
1. Project-Oriented Workflow
2. Code First
3. Documentation
## 1. Project-Oriented Workflow {.unnumbered}
My name is Aaron Williams. I don't have a "tbayes" folder on my computer and I'm guessing most people don't. Yet, it's common to run into the following situation:
```{r}
#| eval: false
read_csv("tbayes/projects/example-analysis/data/data.csv")
```
```
Error: 'tbayes/projects/example-analysis/data/data.csv' does not exist in current working directory ('/Users/aaronwilliams').
```
One approach to avoid this issue is to use `setwd()`. Unfortunately, `setwd()` will require anyone attempting to reproduce an analysis to alter the analysis. Instead, we will use a [project-oriented workflow](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/).
### R Projects
R Projects, proper noun, are the best way to organize an analysis. They have several advantages:
* They make it possible to concurrently run multiple RStudio sessions.
* They allow for project-specific RStudio settings.
* They integrate well with Git version control.
* They work well with `library(renv)` for virtual environments.
* They are the “node” of relative file paths. (more on this in a second) This makes code highly portable.
:::callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
Before setting up an R Project, go to Tools > Global Options and uncheck “Restore most recently opened project at startup”.
:::
Every new analysis in R should start with an R Project. First, create a directory that holds all data, scripts, and files for the analysis. You can do this right in RStudio by clicking the "New Folder" button at the top of the "Files" tab located in the top or bottom right of RStudio. Storing files and data in a sub-directories is encouraged. For example, data can be stored in a folder called `data/`.
Next, click “New Project…” in the top right corner.
![](images/new-project.png){width=400}
When prompted, turn your recently created “Existing Directory” into a project.
![](images/existing-directory.png){width=400}
Upon completion, the name of the R Project should now be displayed in the top right corner of RStudio where it previously displayed “Project: (None)”. Once opened, .RProj files do not need to be saved. Double-clicking .Rproj files in the directory is now the best way to open RStudio. This will allow for the concurrent use of multiple R sessions and ensure the portability of file paths. Once an RStudio project is open, scripts can be opened by double-clicking individual files in the computer directory or clicking files in the “Files” tab.
:::callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Create a folder called `example-analysis/` on your computer.
2. Following the previous steps, create an RStudio project for the folder.
:::
### Filepaths
Windows filepaths are usually delimited with `\`. *nix file paths are usually delimited with `/`. Never use `\` in file paths in R. `\` is an escape character in R and will complicate an analysis. Fortunately, RStudio understands `/` in file paths regardless of operating system.
Avoid spaces when naming folders and files. Spaces will complicate using Git.
Never use `setwd()` in R. It is unnecessary, it makes code unreproducible across machines, and it is rude to collaborators. R Projects create a better framework for file paths. Simply treat the directory where the R Project lives as the working directory and directories inside of that directory as sub-directories.
For example, say there’s a `.Rproj` called `starwars-analysis.Rproj` in a directory called `starwars-analysis/`. If there is a .csv in that folder called `jedi.csv`, the file can be loaded with `read_csv("jedi.csv")` instead of `read_csv("H:/alena/analyses/starwars-analysis/jedi.csv")`. If that file is in a sub-directory of starwars-analysis called data, it can be loaded with `read_csv("data/jedi.csv")`. The same concepts hold for writing data and graphics.
This simplifies code and makes it portable because all relative file paths will be identical on all computers. To share an analysis, simply send the entire directory to a collaborator or share it with GitHub.
Here’s an example directory:
![](images/directory.png){width=200}
Use leading numbers to strategically order files and `-` and `_` to give structure to names.
```{r}
#| code-fold: true
list.files(pattern = "^[01]")
```
## 2. Code First {.unnumbered}
![](images/recipe.png)
Source: [Qr189](https://en.wikipedia.org/wiki/Recipe#/media/File:Pistachio_cake.png)
The Great British Bake Off (GBBO) is a popular show about a baking competition in Great Britain. Each week contestants compete in a three-round baking competition. The second round is always the technical challenge, where bakers try to finish a challenging bake with *intentionally vague* instructions. *Data analysis should never feel like a technical challenge from the GBBO.*
::: {.callout-note}
Our analyses should be captured from start to finish by code stored in a script or scripts. If we think of a script as a recipe, we want to capture every step from buying the ingredients for a bake all the way to how we put the bake on the table. Ideally, we want to capture *every* step with code.
:::
This desire should influence our tool selection. Point-and-click tools like Excel, Tableau, and Power BI are widely used but fall short of this simple standard. Instead, we should use programming languages with robust functionality.
How we use these programming languages matters! R and Python are powerful because they support interactive data analysis that works well with the development and testing of hypotheses. The uncleaned results of interactive data analysis are rarely reproducible. Rather, we need to clean up our scripts so they run from start to finish without intervention (executable) and achieve all of the objectives of our final analysis.
Proprietary tools like SAS and Stata can be used for reproducible workflows! I have a personal preference for R, Python, and Julia because they are open source and promote accessibility because they don't cost any money for users. R and Python used to be much tougher to use. Now they are supported by Integrated Development Environments (IDE) like RStudio, [Positron](https://github.com/posit-dev/positron), and VS Code.
Adopting a code-first approach to data analysis creates many benefits!
1. Code maximizes the chance of catching mistakes when they inevitably happen.
2. Code is the clearest way to document and share an analysis.
3. Reproducible code creates a [single source of truth](https://posit.co/blog/the-advantages-of-code-first-data-science/). Any result can be mapped back to the code and data that created the result.
4. Code allows for robust version control.
5. Code can scale analyses to bigger data and bigger projects.
We'll cover more programming concepts later. For now, let's focus on three.
1. Load external dependencies at the top of an analysis.[^install.packages]
2. When possible, gather your data using code. Many data sets can be accessed through application programming interface (API). For example, if you run a data collection using Qualtrics use the Qualtrics API instead of pointing-and-clicking to download the data from the website. With R, it's possible to download data with `download.file()`. I don't recommend downloading the data every time you run your analysis. Instead, download the data only if the data aren't present.
```{r}
#| eval: false
if (!file.exists("data/data.csv")) {
download.file(url = "web-url.csv", destfile = "data/data.csv")
}
```
3. Finally, capture outputs with code. It's so tempting to jump into a tool like Excel but this undermines our objectives. `library(ggplot2)` is powerful for data visualizations and `library(gt)` is powerful for tables.
[^install.packages]: It is tempting to include `install.packages()` when using R. It can be rude to install software on someone else's computer. Instead, we'll use `library(renv)`.
During development, it's easy to run lines of code out of order or to skip capturing an important step like loading a package or dropping that pesky outlier. Restarting your computing environment early and often can promote good practices and ensure accountability. This is easy in RStudio using Session > Restart R.
:::callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Add a `data/` folder to `example-analysis/`.
2. Add a .R script called `download-data.R`.
3. Download data with the following code:
```{r}
#| eval: false
file <- "data/virginia-injury-deaths.xlsx"
if (!file.exists(file)) {
download.file(
url = "https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx",
destfile = file
)
}
```
:::
## 3. Documentation {.unnumbered}
Humans are frequently the most expensive part of any data analysis. Documentation is essential to helping humans quickly and clearly understand the structure and steps of an analysis.
Next, we'll discuss commenting and READMEs.
### Comments
Comments are important for adding in-the-weeds context for code. R will interpret all text in a `.R` script as R code unless the code follows `#`, the comment symbol. Comments are essential to writing clear code in any programming language.
```{r eval = FALSE}
# Demonstrate the value of a comment and make a simple calculation
2 + 2
```
It should be obvious *what* a line of clear R code accomplishes. It isn't always obvious *why* a clear line of R code is included in a script. Comments should focus on the *why* of R code and not the *what* of R code.
The following comment isn't useful because it just restates the R code, which is clear:
```{r eval = FALSE}
# divide every value by 1.11
cost / 1.11
```
The following comment is useful because it adds context to the R code:
```{r eval = FALSE}
# convert costs from dollars to Euros using the 2020-01-13 exchange rate
cost / 1.11
```
The following is useful because it avoids [magic numbers](https://www.inf.unibz.it/~calvanese/teaching/05-06-ip/lecture-notes/uni04/node17.html).
```{r eval = FALSE}
# dollars to Euros 2020-01-13 exchange rate
exchange_rate <- 1.11
# convert costs from dollars to Euros
cost / exchange_rate
```
### README
README's are useful for providing big-picture context and instructions for a project. We will always include READMEs as a README.md file in the top level of our project directory. File ending in `.md` are [Markdown files](https://www.markdownguide.org/getting-started/).
READMEs should contain big-picture documentation for a project. Sections can include:
- Brief project description
- Instructions for contributions
- [Codes of conduct](https://github.com/UI-Research/mobility-from-poverty)
- Instructions for reproduction
- [Licenses](https://choosealicense.com/licenses/)
- Examples
Here are a few examples:
- [Boosting Upward Mobility from Poverty](https://github.com/UI-Research/mobility-from-poverty)
- [Code and results for 'The Association between Income and Life Expectancy in the United States, 2001-2014'](https://github.com/michaelstepner/healthinequality-code/tree/main)
- [Style Guide for 'The Association between Income and Life Expectancy in the United States, 2001-2014'](https://github.com/michaelstepner/healthinequality-code/tree/main/code)
:::callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Add a README.md to the top level of your project directory.
2. Add a name for your project prefaced by `#` and add your name.
:::
::: {.callout-tip}
We should now have a project directory with a .Rproj and a README.
:::