-
Notifications
You must be signed in to change notification settings - Fork 0
/
research-02-complete.Rmd
216 lines (152 loc) · 9.01 KB
/
research-02-complete.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
---
title: "Reproducible Research in R, Lesson 2"
author: "Christy Garcia, Ph.D. and Christopher Prener, Ph.D."
date: '(`r format(Sys.time(), "%B %d, %Y")`)'
output:
github_document: default
html_notebook: default
---
## Introduction
This is the notebook for the second session of the Reproducible Research in `R` DSS seminar. This notebook covers skills related to managing file paths and importing data from various sources.
## Dependencies
This notebook requires three packages from the `tidyverse` as well as an additional package:
```{r load-packages}
# tidyverse packages
library(dplyr) # for the starwars data
library(ggplot2) # for the mpg data
library(haven) # read statistical data files
library(readr) # read flat files (csv, tsv)
library(readxl) # read excel files
# manage file paths
library(here) # manage file paths
```
## Quick Review
The following was covered during the [first lesson](http://github.com/slu-dss/research-01).
### Importing Objects
Today, we're going to work with the `mpg` data from the `ggplot2` package. Remember that we can load data by assigning it to an object. Once `ggplot2` has been opened with the `library()` function as above, we can assign it to an object in our environment:
```{r load-mpg-data}
auto <- mpg
```
Now try loading the `starwars` data from the `dplyr` package and assigning it to a new object in your global environment. Name your new object `theForce` (using camelCase):
```{r load-starwars-data}
theForce <- starwars
```
## Exploring Data
Remember that you can click on an object in the `Enviornment` tab to view it. You can also use the `str()` function to explore its contents:
```{r str-auto}
str(auto)
```
Now try using the `str()` function to explore the data stored in `theForce`:
```{r str-theForce}
str(theForce)
```
## Loading Data from an R data file
Sometimes data for `R` is distributed as a `.rda` or "R data file". We can load it using the `load()` function:
```{r load-mpg-rda}
load(file = here("data", "raw", "mpg.rda"))
```
Now you try to load `data/raw/starwars.rda`:
```{r load-starwars-rda}
load(file = here("data", "raw", "starwars.rda"))
```
## Loading Data from a csv file
We often see data distributed in plain-text formats (yay!) like "comma delimited text files", or `.csv` files. These look like spreadsheets and can be opened using Microsoft Excel and similar tools. They are lightweight, portable, and plain text, which makes them ideal for working with.
We can use the `readr::read_csv()` function to open them. `readr` is a `tidyverse` package for opening "flat" files like comma and tab separated formats. We can open the most common of these, `.csv` files, with `read_csv()`:
```{r load-mpg-csv}
auto_csv <- read_csv(file = here("data", "raw", "mpg.csv"))
```
Notice how `read_csv` parses the file and makes its best guess about the type of data in each variable. This can be optionally overridden using a technique that we'll cover below with Excel file.
Now you try to load `data/raw/starwars.csv`:
```{r load-starwars-csv}
theForce_csv <- read_csv(file = here("data", "raw", "starwars.csv"))
```
## Loading Data from Statistical Software
The `haven` package from the `tidyverse` comes with tools to load SAS, SPSS, and *older* Stata files (use `saveold`). All of the functions work in relatively identical fashion
* `read_sas()` for SAS `.sas7bdat` files
* `read_sav()` for SPSS `.sav` files
* `read_dta()` for Stata `.dta` files
```{r load-mpg-stats}
auto_sas <- read_sas(data_file = here("data", "raw", "mpg.sas7bdat"))
auto_sav <- read_sav(file = here("data", "raw", "mpg.sav"))
auto_dta <- read_dta(file = here("data", "raw", "mpg.dta"))
```
When metadata for specific variables is present, they will be attached to variables as "attributes".
Now you try to load:
* `data/raw/starwars.sas7bdat`
* `data/raw/starwars.sav`
* `data/raw/starwars.dta`
```{r load-starwars-stats}
theForce_sas <- read_sas(data_file = here("data", "raw", "starwars.sas7bdat"))
theForce_sav <- read_sav(file = here("data", "raw", "starwars.sav"))
theForce_dta <- read_dta(file = here("data", "raw", "starwars.dta"))
```
## Loading Excel Spreadsheets
There are a number of tools for reading Excel spreadsheet files. We'll use the `readxl` package from the `tidyverse`. At its most basic,
```{r load-mpg-xls}
auto_xls <- read_excel(path = here("data", "raw", "mpg.xls"))
```
For multi-sheet Excel files, we can optionally identify the sheet we wish to import. For example, we can import the `mpg` sheet:
```{r load-mpg-xlsx}
auto_xlsx <- read_excel(path = here("data", "raw", "mpg.xlsx"), sheet = "mpg")
```
Now you try importing the `mfr` sheet from `data/raw/mpg.xlsx`:
```{r load-mfr-xlsx}
mfr_xlsx <- read_excel(path = here("data", "raw", "mpg.xlsx"), sheet = "mfr")
```
There are also two sheets - `starwars` and `films` - in `data/raw/starwars.xlsx` that you can practice on as well as an older `.xls` file that has one sheet named `starwars` in `data/raw/starwars.xls`.
```{r load-starwars-excel}
theForce_xls <- read_excel(path = here("data", "raw", "starwars.xls"))
theForce_xlsx <- read_excel(path = here("data", "raw", "starwars.xlsx"), sheet = "starwars")
films_xlsx <- read_excel(path = here("data", "raw", "starwars.xlsx"), sheet = "films")
```
### Problems With Excel
#### Metadata at Top of Sheet
Often we see data released in Excel format with metadata in the first few rows - this is the case with `Sheet1` in both `data/raw/mpg_messy.xlsx` and `data/raw/starwars_messy.xlsx`. We do not want to add this in because it violates "tidy" data principles, and it also prevents `R` from parsing the spreadsheet correctly.
We can skip that metadata second row with `skip = 9`, which will start reading data in at row 10:
```{r fix-meta-mpg}
auto_meta <- read_excel(path = here("data", "raw", "mpg_messy.xlsx"), sheet = "Sheet1", skip = 9)
```
Now you try to fix the same issue with `Sheet` from `starwars_messy.xlsx`. Be sure to identify the appropriate number of rows to skip!
```{r fix-meta-starwars}
theForce_meta <- read_excel(path = here("data", "raw", "starwars_messy.xlsx"), sheet = "Sheet1", skip = 7)
```
#### A Problematic Second Row
Sometimes there is extra data appended to the second row - this is the case with `Sheet2` in both `data/raw/mpg_messy.xlsx` and `data/raw/starwars_messy.xlsx`.
We can skip that second row with `skip = 2`, which will start reading data in at row 3:
```{r fix-second-row-auto}
auto_row2 <- read_excel(path = here("data", "raw", "mpg_messy.xlsx"), sheet = "Sheet2", skip = 2)
str(auto_row2)
```
The values in row 3 get applied as the variable names, which is not ideal. We can use the `col_names` argument to specify how we want our columns named:
```{r fix-second-row-auto-2}
auto_row2 <- read_excel(path = here("data", "raw", "mpg_messy.xlsx"), sheet = "Sheet2", skip = 2,
col_names = c("mfr", "model", "displ", "year", "cyl", "trans", "drv",
"cty", "hwy", "fl", "class"))
```
Now you try. `Sheet2` of `starwars_messy.xlsx` has the same exact problem. The appropriate variable names should be:
"name", "height", "mass", "hair_color", "skin_color", "eye_color", "birth_year", "gender", "homeworld", "species"
```{r fix-second-row-starwars}
theForce_row2 <- read_excel(path = here("data", "raw", "starwars_messy.xlsx"), sheet = "Sheet2", skip = 2,
col_names = c("name", "height", "mass", "hair_color", "skin_color", "eye_color",
"birth_year", "gender", "homeworld", "species"))
```
#### A Summary Row
Sometimes, Excel users will append summary statistics to the bottom of a data table - this is the case with `Sheet3` in both `data/raw/mpg_messy.xlsx` and `data/raw/starwars_messy.xlsx`. We do not want to add this in because it violates "tidy" data principles. We can see this in
We can skip it by using the `range=` argument and specifying the upper-left corner and lower-right corner of the data within the spreadsheet we wish to include. The first cell (`A1`) and the last (`K235`) should be separated by a colon:
```{r fix-summary-mpg}
auto_sum <- read_excel(path = here("data", "raw", "mpg_messy.xlsx"), sheet = "Sheet3", range = "A1:K235")
```
Now you try - `Sheet3` in `starwars_messy.xlsx` has the same problems. Be sure to check for a different range of cells to import, however!
```{r fix-summary-starwars}
theForce_sum <- read_excel(path = here("data", "raw", "starwars_messy.xlsx"), sheet = "Sheet3", range = "A1:J88")
```
## Exporting Data
Finally, we want to export our data. We should always use plain-text, open file formats if we can. The `.csv` is the standard data type in this area, and can be created and saved to `data/clean/` with the `write_csv()` function from `readr`:
```{r write-auto}
write_csv(auto, here("data", "clean", "auto.csv"))
```
Now you try to write `theForce` to a `.csv` file in `data/clean`:
```{r write-theForce}
write_csv(theForce, here("data", "clean", "theForce.csv"))
```
The `haven` package contains exporting functions as well for SAS, Stata, and SPSS, and there are a number of tools for writing Excel spreadsheets if you must!