generated from jtr13/cctemplate
-
Notifications
You must be signed in to change notification settings - Fork 139
/
dplyr_tutorial_in_r.Rmd
306 lines (245 loc) · 17 KB
/
dplyr_tutorial_in_r.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
# Dplyr tutorial in R
Pinyi Yang
```{r, message = F, warning = F}
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("stringr")
library(dplyr)
library(ggplot2)
library(stringr)
```
## Environment Setup
In order to use dyplr package, we need to first install and library it at the beginning.
Besides the dplyr package, we will use two other packages for better demonstration of the functions in dplyr. Related documentations of the two accessory packages are provided below:
* ggplot2: https://ggplot2.tidyverse.org/reference/
* stringr: https://www.rdocumentation.org/packages/stringr/versions/1.4.0
## Introduction
![](https://www.business-science.io/assets/2020-10-15-must-know-tidyverse-features/tidyverse-icons.png){.center}
Dplyr is a package of tidyverse in R. The key role of dplyr is dataframe manipulation and enables data management. Dplyr contain numerous useful functions that allows for user-friendly data query, data cleaning, data summary, etc. Dplyr is especially important for data visualization, because, an effective and informative visualization requires the structural integrity of the dataframe being used as well as the removal of excessive redundant information and keeping only the information needed.
## Data
One of the most widely used data for tutorial use purpose in R is the 'diamond' dataset. For more information on the 'diamond' dataset, please visit the following link: https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5/topics/diamonds.
The following section showcases the attributes of the 'diamond' dataset to assist the understanding of further data manipulation using dplyr.
```{r}
str(diamonds)
head(diamonds, n = 3)
summary(diamonds[,!(names(diamonds) %in% c("color", "clarity", "cut"))])
```
----
## Column and Row Manipulation
In data analysis, sometimes the original data set contains too much information, more than what we actually need for further analysis or it does not have a good structure for us to use. In order to make the data looked structured, we would use Dplyr to organize and clean Data. Dplyr can effectively help us reorganize the data set based on criteria we commanded. For example, the data set may contain all observations (i.e. rows) from the past 10 years, while we only need to analyze the trend in the recent 5 years. In this case, we need to ignore the data from the previous years in order to generate reasonable visualizations, thus the filter function in Dplyr will come in handy. Also, there might be times when the data contains numerous redundant parameters (i.e. columns) for analysis, and we need to only select those variables that are important for the goal of the study. In this case, we can use Dplyr’s select function to choose desired columns.
### Select
The 'select' in dplyr package is used to select certain columns from the data frame. We can use the 'select' function with the following syntax:
$$select(data, column\ names/column\ position , \dots)$$
#### Column Selection
In the following example, we are selecting three columns of observations only: 'carat', 'clarity', and 'price'. Noted here that we can select by its name or by its position.
```{r}
#only keeping "carat", "clarity", and "price" columns
select_result <- select(diamonds, c(carat, clarity, price))
head(select_result,n = 3)
# same result but use position
select_result <- select(diamonds, c(1,4,7))
head(select_result,n = 3)
# or use both column name and position, not recommended
select_result <- select(diamonds, c(1,4, price))
head(select_result,n = 3)
```
#### Column Deselection
We can also deselect certain columns, or a range of columns using the negate command $!$. In the following example, we want to remove (i.e. deselect) columns from x to z.
```{r}
#deselecting columns from x to z
select_result <- select(diamonds, !(x:z))
head(select_result,n = 3)
```
#### Helper Functions
there are several helper functions for selection that deal with specific cases to be used with select function:
* $contains()$: Use it to select column which contains a specific string
* $starts\_with()$: Use it to select columns which start with given string
* $ends\_with()$: Use it to select columns which end with given string
* $matches()$: Use it to select columns that match a string (often use regular expression)
* $everything()$: Select all columns
#### The matches() Function
As introduced above, the $matches()$ function can select columns with names that match a regular expression (regex). Regular expression is a way to represent certain string patterns, which is extensively used in natural language processing, linguistic analysis, etc. For details in regular expression, such as grammar, and its application, please visit the following link: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet.
For example, if we only want to keep columns with names that start with c, we can use the regular expression of "^c" in the $matches()$ function. (p.s. "^c" is a regular expression that defines strings that start with the letter lower-case c)
```{r}
head(select(diamonds, matches("^c")),n = 3)
```
Notice that only the columns "carat", "cut", "color", and "clarity" are kept after selection, because they all satisfy the pattern that the first letter of the string is the letter "c".
#### The contains() Function
For the sake of demonstration, we will create a pseudo-dataset with names specifically designed to include certain strings. In this case, multiple column names include the string "length"
```{r}
colors <- c("red", "blue", "yellow", "green")
pseudo_car_data <- data.frame(obs = 1:10,
cabin_length = runif(10, 5, 7.5),
door_length = runif(10, 1, 3),
wind_shield_length = runif(10, 4, 6),
color = sample(colors, 10, replace = T))
head(pseudo_car_data,,n = 3)
```
Now, we can use the $contains()$ function to select those names that contain the string "length".
```{r}
head(select(pseudo_car_data, contains("length")),n = 3)
```
### Filter
Similar to the select function that we discussed in the last section, the filter function in dplyr package is also used for selection from the existing data frame. Filter function is used to select rows that satisfies specific conditions/criteria.The second argument here must be logical value. Logical operation would be useful here. For example, the 'is.na' function would return TRUE if a value is NA and FALSE if it is not. We can also use $\&$ and $|$ operators to combine multiple logical statements.
The syntax for the filter function is as follow:
$$filter(data, condition, \dots)$$
#### Filter Numerical Values
The filter function can be used to filter both numerical and character type data. First, lets use the price parameter to demonstrate how to filter numerical values. We will filter out the observations that have diamond price over 500. In other words, we will only keep those rows of observations with $price \leq 500$.
```{r}
#the range of prices for diamonds
range(diamonds$price)
#filter observations that have price less than or equal to 500
filter_result <- filter(diamonds, price <= 500)
head(filter_result,n = 3)
```
#### Filter Character Values
Now, we can filter by character type variables. For example, we can choose those observations that have "Fair" or "Good" cuts:
```{r}
#check levels of cuts
levels(diamonds$cut)
table(diamonds$cut)
#filter by cut, keeping Fair and Good
filter_result <- filter(diamonds, cut %in% c("Fair", "Good"))
head(filter_result,n = 3)
table(filter_result$cut)
```
By checking the counts of each category, we can see that after filtering the dataframe, all observations with cut being "Very Good" or "Premium" or "Ideal" are filtered out, leaving only "Fair" or "Good" observations. Note that the table after filtration still contains the other three categories. This is because the 'cut' variable is stored as a factor in the dataframe. Thus, filtration will not eliminate all the original levels.
### Slice
The slice function in dplyr is also useful for selecting rows based on its position or order. The use of it is similar to head function in base R which would select the top or tail n rows in the data set.
The syntax for the slice function is as follow:
$$slice(data, number\ to\ be\ sliced,... )$$
```{r}
#only get the first five observation of the data frame
slice_result <- slice(diamonds, 1:5)
slice_result
```
There are also other wrapper functions for 'slice' that can be used to get the pieces of data with specified number/range of rows while also filtering with some conditions, such as slice_max,slice_min. The syntax for the slice function is as follow:
$$slice\_max(data, order\_by = column \ name, n = number\ to\ be\ sliced )$$
```{r}
# only get the first five observation of the data frame with highest price
slice_result <- slice_max(diamonds, order_by = price, n = 5)
slice_result
```
## Order Manipulation
Sometimes, we want to get a direct visualization of the data by ranking the observations using one or multiple parameters.
### Ascending Order
We can use the 'arrange' function to arrange data, based on one parameter or multiple parameters, in an ascending order, using the following syntax:
$$arrange(data, column\ for\ arrangement, \dots)$$
```{r}
head(arrange(diamonds, price),n = 3)
```
### Descending Order
We can use the 'desc' function to arrange data, based on one or multiple parameters, in a descending order, using the following syntax:
$$arrange(data, desc(column\ for\ arrangement))$$
```{r}
head(arrange(diamonds, desc(price)),n = 3)
```
## Entry Manipulation
In real life data processing, sometimes, we need to gather data in order to reduce the dimension. For example, if we want to show the mean of number of products sold across all months while the data shows daily sales. In this case, we need to gather observations by combining observations within the same month and calculate their cumulative mean and generate the desirable visualizations. This is when the grouping function and summary function in dplyr comes in handy.
### Grouping
Grouping function in dplyr groups data based on the categories in the assigned column(s). The general syntax is:
$$group\_by(data, column\ used\ for\ grouping, \dots)$$
```{r}
head(group_by(diamonds, cut),n = 3)
head(diamonds,n = 3)
```
Notice that the dataframe before and after grouping looks exactly the same! This is because the 'group_by' function does not change the dataframe itself, rather, it prepares the dataframe for further data processing. As we can see in the next section, we can summarize the data after grouping.
### Summary Statistics
Often times, we want to reduce the complexity of the data by creating summary statistics, such as the mean, median, max, etc. The 'summarise' function in dplyr do just the job. After grouping by categories in certain column(s), we can employ the 'summarise' function to create new column(s) of summary statistics, with the following syntax:
$$summarise(grouped\_data, new\_column\_name = function(column\ to\ be\ summarized), \dots)$$
For example, here we are trying to get the mean price based on the cut of diamond type.
```{r}
summarise(group_by(diamonds, cut), mean_price = mean(price))
```
#### Helper function for summarise()
There are several helper functions that can efficiently find the summary statistics within the 'summarise' function.
* $mean(), median() $: Center of the data
* $sd() $: Spread of the data
* $min(), max() $: Range of the data
### Mutation
The 'mutate' function is extremely useful for data manipulation not only because it allows for greater freedom in transforming the existing dataframe, but also because of its application in missing value imputation. The syntax for the mutate function is:
$$mutate(data, new\_column\_name = operation, \dots)$$
#### Transforming Dataframe
We can add a new parameter by performing operations on existing parameters. In the following example, we will calculate the price per carat diamond for each diamond listed in the dataset. To do so we can use the formula $\frac{price}{carat}$ for each row of observation.
```{r}
head(mutate(diamonds, price_per_carat = price/carat),n = 3)
```
Notice that the mutated dataframe has one more column at the end titled "price_per_carat". The new column is calculated by dividing price by carat within each observation.
#### Imputation -- Using Mean
For the sake of demonstration, we will create a pseudo-dataset that contains missing values to showcase the imputation power of the mutate function.
```{r}
#create new data
pseudo_missing_data <- data.frame(name = c("Tom", "Jay", "Marry", "Lexa", "Garcia"),
age = c(sample(40:80, 3, replace = F), NA, NA))
pseudo_missing_data
```
In the example, we can see that Lexa and Garcia's ages are missing. We will use mutate function to impute the missing values by using the mean of the rest of the ages (i.e. age of Tom, Jay, and Marry). The $is.na()$ function is used to check if the entry contain NA values.
```{r}
imputed_data <- mutate(pseudo_missing_data,
age = ifelse(is.na(age), mean(age, na.rm = T), age))
imputed_data
```
#### Imputation -- Removing Observations
Also, there is a more brutal way of cleaning the data -- removing observations with NA values (i.e. removing the entire row if that row contain NA value). However, this type of cleaning technique is generally not recommended when there are large quantity of missing values, because it will severely depricate the integrity of the original dataset and lose important information. But the general way of removing observations with NA values is by calling the 'na.omit' function:
```{r}
na.omit(pseudo_missing_data)
```
We can see that the latter two rows (i.e. observations of Lexa and Garcia) are completely removed!
#### Useful Helper Functions for mutate()
There are multiple functions that can be embedded in the 'mutate' function which helps perform complex tasks that otherwise requires us to define functions by ourselves:
* $pmin(), pmax()$: Element-wise min and max
* $cummin(), cummax()$: Cumulative min and max
* $between()$: If the value is in a range
* $cumsum(), cumprod()$: Cumulative sum and cumulative product
* $cummean()$: Cumulative mean
example:
```{r}
helper_example <- mutate(diamonds,
#use cummin to find the minimum value up to the current observation
cummin_price = cummin(price),
#use between to determine if the price is in range
in_range = between(price, 330, 340))
head(helper_example)[,c("price", "cummin_price", "in_range")]
```
### Distinct
The 'distinct' function is used to check if there is any duplicates in the data set. Because we wouldn't want our data to have excessive duplicated observations. The distinct function can help us to delete the replicated rows and simplify the dataframe.
$$distinct(data, \dots)$$
```{r}
repeated_diamonds <- slice(diamonds,1:3,1:2)
repeated_diamonds
distinct(repeated_diamonds )
```
## Piping - Tying Everything Together
Piping is an effective way to construct data processing pipelines, especially when the process involves multiple manipulation steps. The grammar used to pipe down commands is the '%>%'. The direction where the arrow is pointing towards is the way that data processing proceeds.
For example, we want to first filter out the observations that have 'Good' cut; then only keep the parameters that are numerical (i.e. 'carat', 'price', 'depth', 'table', 'x', 'y', 'z'), as well as 'clarity'; then, summarize the data by taking the mean of price grouping by 'clarity'.
```{r}
#pass down to the next manipulation command using piping
diamonds %>%
#filter only those observations with Good cuts
filter(cut == "Good") %>%
#select all columns except for cut and color
select(-c(cut, color)) %>%
#summarize the price when grouping by clarity
group_by(clarity) %>%
summarise(mean_price = mean(price)) %>%
#order the mean prices in descending order of clarity
arrange(desc(mean_price))
```
\textbf{NOTE:} be careful when using piping with ggplot2. We know from ggplot2 that different layers of graphical components are connected by $+$ sign. Even though the piping operator $%>%$ has the same function as the plus sign, they are used under different conditions. Don't mix them up!
## Using Dplyr with GGplot
The essence of dplyr is to clean the data before it is used to perform model construction or data visualization. In this section, we will use a simple data visualization example that combines dplyr and ggplot2 and shows how we can tie the functions we have introduced above with graphic design using ggplot2.
```{r, message=FALSE}
diamonds %>%
#data manipulation using dplyr
filter(cut %in% c("Fair", "Good", "Very Good")) %>%
select(cut, carat, color, price) %>%
group_by(cut, color) %>%
summarise(mean_price = mean(price),
mean_carat = mean(carat)) %>%
#data visualization using ggplot2
ggplot(mapping = aes(x = mean_carat, y = mean_price)) +
geom_point(mapping = aes(color = color)) +
ylab("Mean Price") + xlab("Mean Carat") +
labs(title = "Mean Price vs. Mean Carat of Diamonds") +
theme(plot.title = element_text(hjust = 0.5))
```