---
title: "Data Wrangling in R, Lesson 2"
author: "Christy Garcia, Ph.D. and Christopher Prener, Ph.D."
date: '(`r format(Sys.time(), "%B %d, %Y")`)'
output:
  github_document: default
  html_notebook: default
---
## Introduction
This is the notebook for the second data wrangling lesson. Today, we'll be covering how to create and manipulate variables, using various functions in the `dplyr` package.
## Dependencies
This notebook requires three packages from the `tidyverse` as well as one additional package:
```{r load-packages}
# tidyverse packages
library(dplyr) # data wrangling
library(readr) # read and write csv files
library(stringr) # manipulating string data
# other packages
library(here) # manage file paths
```
## Read in Data
Last time we reviewed how to read `.csv` files into `R`. Today we'll use the cleaner versions of the same two files, which are in the `data/` subdirectory of our project:
1. `mpg.csv` - the `mpg` data from `ggplot2`
2. `starwars.csv` - the `starwars` data from `dplyr`
To read `.csv` files into `R`, we'll use the `readr` package's `read_csv()` function.
```{r read-mpg}
mpg <- read_csv(here("data", "mpg.csv"))
```
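As a small sketch of what `here()` contributes to the call above: it builds a file path that starts at the project root, so the same code works no matter where the notebook itself is saved. The object name `mpg_path` is ours, used only for illustration.

```{r here-path}
library(here)

# here() pastes its arguments onto the project root to build a full path
mpg_path <- here("data", "mpg.csv")
mpg_path

# check that the file is where we expect it to be before reading it
file.exists(mpg_path)
```

If `file.exists()` returns `FALSE`, the path or the project root is probably not what you expected.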
Now, review the `read_csv()` syntax to import the `starwars.csv` data from the same source:
```{r read-starwars}
```
*Note: We could also use these data frames directly instead of importing them, since they are available within R packages, but we wanted you to practice reading data in!*
## Quickly Exploring Data
Remember that to quickly explore data, we can use `utils::str()`:
```{r explore}
str(mpg)
str(starwars)
```
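As an aside, `dplyr` also provides `glimpse()`, which gives a similar overview with one line per column. The sketch below uses a small made-up data frame (`explore_toy`, not part of the lesson data) so that it runs on its own.

```{r explore-glimpse}
library(dplyr)

# made-up data, used only for illustration
explore_toy <- data.frame(
  model = c("a4", "civic", "mustang"),
  year  = c(1999, 2008, 2008),
  hwy   = c(29, 36, 26)
)

str(explore_toy)      # base R view of the structure
glimpse(explore_toy)  # dplyr's version: one line per column
```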
Last time we identified and fixed several issues in these data sets. Today we are going to use the already clean data sets to look at creating new variables, combining variables, and other such mutations.
## Creating new variables
### With `mutate`
One of the basic functions of the `dplyr` package is `mutate`, which creates new variables or overwrites existing ones. When using this function, we have to specify three things:
1. the name of the data frame to be modified
2. the name of the new variable that you want to create
3. the value you'd like to assign to this new variable
So let's say we wanted to make a new variable in the `mpg` data frame that reflects the age of the cars instead of which year they were made.
```{r mutate-mpg}
mpg_new <- mutate(mpg, age = 2022-year)
```
Another thing we might want to do is average the city and highway MPG rates for each vehicle.
```{r mutate-mpg-2}
mpg_new <- mutate(mpg, avg = (cty+hwy)/2)
```
NOTE: The `mutate` function does not modify the input data frame. That is why we use the assignment operator to save the result, either to a new data frame or over the existing one.
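A minimal sketch of this behavior, using a made-up two-row data frame (`cars_toy` is our own name, not part of the lesson data):

```{r mutate-no-side-effect}
library(dplyr)

# made-up data, used only for illustration
cars_toy <- data.frame(cty = c(18, 21), hwy = c(29, 30))

# without assignment, the result is printed but not saved anywhere
mutate(cars_toy, avg = (cty + hwy) / 2)
names(cars_toy)  # still only "cty" and "hwy"

# with assignment, the new column is kept in the new object
cars_toy_new <- mutate(cars_toy, avg = (cty + hwy) / 2)
names(cars_toy_new)
```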
Remember that we can also do both of these steps at once by using a *pipeline*.
```{r mutate-mpg-3}
mpg %>%
  mutate(
    age = 2022 - year,
    avg = (cty + hwy) / 2
  ) -> mpg_new
```
This reads as:
1. First, we take the `mpg` data, *then*
2. we make a new variable called `age`, *then*
3. we make another new variable called `avg`, *then*
4. we save the output to `mpg_new`.
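To make the reading above concrete, the sketch below writes the same steps twice on a made-up data frame (`pipe_toy`, our own name): once as a single nested call and once as a pipeline. The two results can then be compared.

```{r pipeline-equivalence}
library(dplyr)

# made-up data, used only for illustration
pipe_toy <- data.frame(
  year = c(1999, 2008),
  cty  = c(18, 21),
  hwy  = c(29, 30)
)

# a single call to mutate(), saved with the usual assignment operator
nested <- mutate(pipe_toy, age = 2022 - year, avg = (cty + hwy) / 2)

# the same steps written as a pipeline
pipe_toy %>%
  mutate(
    age = 2022 - year,
    avg = (cty + hwy) / 2
  ) -> piped

# compare the two results
all.equal(nested, piped)
```

The pipeline does not change what is computed; it only changes the order in which we read the steps.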
Now try making some new variables in the `starwars` data frame using the `mutate` function.
```{r mutate-starwars}
```
### With `mutate` and `case_when`
Another function, `case_when`, can be helpful when embedded in `mutate` to recode an existing variable, creating a new variable with different values.
```{r case_when-mpg}
mpg_new %>%
  mutate(
    highEfficiency = case_when(
      hwy > 23 ~ TRUE,
      hwy < 23 ~ FALSE
    )
  ) -> mpg_new
```
You may have noticed that this leaves out vehicles with a highway mpg of exactly 23; they receive `NA` because neither condition matches. We can fix this by using `<=`, which means 'less than or equal to':
```{r case_when-mpg-2}
mpg_new %>%
  mutate(
    highEfficiency = case_when(
      hwy > 23 ~ TRUE,
      hwy <= 23 ~ FALSE
    )
  ) -> mpg_new
```
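To see what happens to a value that no condition covers, here is a small sketch on a made-up vector (`hwy_toy` is our own name). It also shows one alternative to `<=`: a final `TRUE` condition that acts as a catch-all.

```{r case_when-unmatched}
library(dplyr)

# made-up values: one below, one equal to, and one above the cutoff
hwy_toy <- c(20, 23, 30)

# no condition matches 23, so that element becomes NA
gap <- case_when(
  hwy_toy > 23 ~ TRUE,
  hwy_toy < 23 ~ FALSE
)
gap

# a final TRUE condition catches everything not matched earlier
catch_all <- case_when(
  hwy_toy > 23 ~ TRUE,
  TRUE ~ FALSE
)
catch_all
```

Be careful with the catch-all version: it also assigns `FALSE` to missing values, which may or may not be what you want.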
We can also use `case_when` to recode character variables.
```{r case_when-mpg-3}
mpg_new %>%
  mutate(
    origin = case_when(
      manufacturer %in% c("dodge", "chevrolet", "ford", "jeep", "lincoln", "mercury", "pontiac") ~ "domestic",
      manufacturer %in% c("audi", "hyundai", "honda", "land rover", "nissan", "subaru", "toyota", "volkswagen") ~ "foreign"
    )
  ) -> mpg_new
```
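When recoding a long list of categories, it is easy to leave one out. One way to check, sketched here on made-up data (`makers_toy` and the value `"tesla"` are ours and do not appear in `mpg`), is to count the values of the new variable and look for `NA`.

```{r case_when-check}
library(dplyr)

# made-up data; "tesla" is deliberately left out of the recode below
makers_toy <- data.frame(
  manufacturer = c("ford", "toyota", "ford", "tesla")
)

makers_toy %>%
  mutate(
    origin = case_when(
      manufacturer %in% c("ford") ~ "domestic",
      manufacturer %in% c("toyota") ~ "foreign"
    )
  ) -> makers_toy

# an NA row in the counts flags values missed by the recode
count(makers_toy, origin)
```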
Now try recoding one numeric and one character variable in the `starwars` data set using the `mutate` and `case_when` functions.
```{r case_when-starwars}
```
### Adding an ID column with `mutate`
You'll notice that neither of our data frames has a column of row IDs (although `starwars` does have character names). We can easily add one with the `mutate` function.
```{r column_ID-mpg}
mpg_new <- mutate(mpg_new, id = rownames(mpg_new))
```
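Note that `rownames()` returns character values. If you would rather have a numeric ID, one option is `dplyr`'s `row_number()`. The sketch below compares the two on a made-up data frame (`id_toy` is our own name).

```{r column_ID-row_number}
library(dplyr)

# made-up data, used only for illustration
id_toy <- data.frame(model = c("a4", "civic", "mustang"))

id_toy %>%
  mutate(
    id_chr = rownames(id_toy),  # character IDs
    id_int = row_number()       # integer IDs
  ) -> id_toy_new

str(id_toy_new)
```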
Repeat the above for the `starwars` data frame:
```{r column_ID-starwars}
```
### Splitting a variable with `str_split`
If you have a character variable that is made up of multiple strings, you can split it with the `str_split` function in the `stringr` package. For this to work, the strings within the variable must be separated by the same delimiter (space, comma, slash, etc.).
Within `str_split()`, we first specify the variable that we would like to split (`name`) and then the pattern to split on. In this case, the strings are separated by spaces, so we pass a space in quotation marks (`" "`). Setting `simplify = TRUE` returns the pieces as a character matrix, and the `[, 1]` after the function call selects the piece we want to single out (in this case the first one).
```{r str_split-starwars}
# assumes `starwars_new` was created in the `mutate` exercises above
starwars_new %>%
  mutate(
    first = str_split(name, " ", simplify = TRUE)[, 1]
  ) -> starwars_new
```
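It can help to look at what `str_split()` returns before indexing into it. The sketch below uses a short made-up vector of names (`names_toy` is our own object) rather than the full `starwars` data.

```{r str_split-matrix}
library(stringr)

# a few names, used only for illustration
names_toy <- c("Luke Skywalker", "Yoda", "Obi-Wan Kenobi")

# simplify = TRUE returns a character matrix with one row per name;
# names with fewer pieces are padded with empty strings
parts <- str_split(names_toy, " ", simplify = TRUE)
parts

# the first column holds the first piece of each name
parts[, 1]
```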
Now repeat the use of `str_split` to make a new variable which is the second name of each Star Wars character (if they have one).
```{r str_split-starwars-2}
```
## Write Data
We have made several modifications to our data frames. While not strictly necessary, we might want to save these changes to a `.csv` file so that we can access them outside of R. We use the `write_csv` function from `readr` to do this:
```{r write_csv-mpg}
write_csv(mpg_new, here("data", "mpg_new.csv"))
```
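A quick way to confirm that an export worked is to read the file back in. The sketch below does this with a made-up data frame (`write_toy`) and a temporary file, so nothing in the `data/` folder is touched.

```{r write_csv-roundtrip}
library(readr)

# made-up data, used only for illustration
write_toy <- data.frame(model = c("a4", "civic"), hwy = c(29, 36))

# write to a temporary file rather than the project folder
tmp <- tempfile(fileext = ".csv")
write_csv(write_toy, tmp)

# read the file back in and inspect it
write_toy_back <- read_csv(tmp)
write_toy_back
```

Keep in mind that a `.csv` file stores only text, so `read_csv()` has to guess the column types again when the file is read back in.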
Now export the `starwars_new` data frame to the "data" folder:
```{r write_csv-starwars}
```