-
Notifications
You must be signed in to change notification settings - Fork 20
/
2_07_dplyr_summaries_groups.Rmd
156 lines (118 loc) · 13 KB
/
2_07_dplyr_summaries_groups.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
# Grouping and summarising data
This chapter will explore the `summarise` and `group_by` verbs. These two verbs are considered together because they are often used together, and their usage is quite distinct from the other **dplyr** verbs we've encountered:
- The `group_by` function adds information into a data object (e.g. a data frame or tibble), which makes subsequent calculations happen on a group-specific basis.
- The `summarise` function is a data reduction function calculates single-number summaries of one or more variables, respecting the group structure if present.
### Getting ready
We can start a new script by loading and attaching the **dplyr** package:
```{r, eval=FALSE}
library("dplyr")
```
We're going to use both the `storms` and `iris` data sets in the **nasaweather** and **datasets** packages, respectively. The **datasets** package ships is automatically loaded and attached at start up, so we need to make the **nasaweather** package available:
```{r, eval=FALSE}
library("nasaweather")
```
Finally, let's convert both data sets to a tibble so they print to the Console cleanly:
```{r}
storms_tbl <- tbl_df(storms)
iris_tbl <- tbl_df(iris)
```
## Summarising variables with `summarise`
We use `summarise` to __calculate summaries of variables__ in an object containing our data. We do this kind of calculation all the time when analysing data. In terms of pseudo-code, usage of `summarise` looks like this:
```{r, eval=FALSE}
summarise(data_set, <expression1>, <expression2>, ...)
```
The first argument, `data_set`, must be the name of the data frame or tibble containing our data. We then include a series of one or more additional arguments, each of these is a valid R expression involving at least one variable in `data_set`. These are given by the pseudo-code placeholder `<expression1>, <expression2>, ...`, where `<expression1>` and `<expression2>` represent the first two expressions, and the `...` is acting as placeholder for the remaining expressions. These expressions can be any calculation involving R functions. The only constraint is that they must generate __a single value__ when evaluated.
That last sentence was important. It's easy to use `summarise` if can we remember one thing: `summarise` is designed to work with functions that take a vector as their input and return a single value (i.e. a vector of length one). Any calculation that does this can be used with `summarise`.
The `summarise` verb is best understood by example. The R function called `mean` takes a vector of numbers (several numbers) and calculates their arithmetic mean (one number). We can use `mean` with `summarise` to calculate the mean of the `Petal.Length` and `Petal.Width` variables in `iris_tbl` like this:
```{r}
summarise(iris_tbl, mean(Petal.Length), mean(Petal.Width))
```
Notice what kind of object `summarise` returns: it's a tibble with only one row and two columns. There are two columns because we calculated two means, and there is one row containing these means. Simple. There are a few other things to note about how `summarise` works:
* As with all **dplyr** functions, the expression that performs the required summary calculation is not surrounded by quotes because it is an expression that it "does some calculations".
* The order of the expression in the resulting tibble is the same as the order in which they were used as arguments.
* Even though the dimensions of the output object have changed, `summarise` returns the same kind of data object as its input. It returns a data frame if our data was originally in a data frame, or a tibble if it was in a tibble.
Notice that `summarise` used the expressions to name the variables. Variable names like `mean(Petal.Length)` and `mean(Petal.Width)` are not very helpful. They're quite long for one. More problematically, they contain special reserved characters like `(`, which makes referring to columns in the resulting tibble more difficult than it needs to be:
```{r}
# make a summary tibble an assign it a name
iris_means <- summarise(iris_tbl, mean(Petal.Length), mean(Petal.Width))
# extract the mean petal length
iris_means$`mean(Petal.Length)`
```
We have to place 'back ticks' (as above) or ordinary quotes around the name to extract the new column when it includes special characters.
It's better to avoid using the default names. The `summarise` function can name the new variables at the same time as they are created. Predictably, we do this by naming the arguments using `=`, placing the name we require on the left hand side. For example:
```{r}
summarise(iris_tbl, Mean_PL = mean(Petal.Length), Mean_PW = mean(Petal.Width))
```
There are very many base R functions that can be used with `summarise`. A few useful ones for calculating summaries of numeric variables are:
* `min` and `max` calculate the minimum and maximum values of a vector.
* `mean` and `median` calculate the mean and median of a numeric vector.
* `sd` and `var` calculate the standard deviation and variance of a numeric vector.
We can combine more than one function in a `summarise` expression as long as it returns a single number. This means we can do arbitrarily complicated calculations in a single step. For example, if we need to know the ratio of the mean and median values of petal length and petal width in `iris_tbl`, we use:
```{r}
summarise(iris_tbl,
Mn_Md_PL = mean(Petal.Length) / median(Petal.Length),
Mn_Md_PW = mean(Petal.Width) / median(Petal.Width))
```
Notice that we placed each argument on a separate line in this example. This is just a style issue---we don't have to do this, but since R doesn't care about white space, we can use new lines and spaces to keep everything a bit more more human-readable. It pays to organise `summarise` calculations like this as they become longer. It allows us to see the logic of the calculations more easily, and helps us spot potential errors when they occur.
### Helper functions
There are a small number **dplyr** helper functions that can be used with `summarise`. These generally provide summaries that aren't available directly using base R functions. For example, the `n_distinct` function is used to calculate the number of distinct values in a variable:
```{r}
summarise(iris_tbl, Num.PL.Vals = n_distinct(Petal.Length))
```
This tells us that there are 43 unique values of `Petal.Length`. We won't explore any others here. The [handy cheat sheat](http://www.rstudio.com/resources/cheatsheets/) is worth looking over to see what additional options are available.
## Grouped operations using `group_by`
Performing a calculation with one or more variables over the whole data set is useful, but very often we also need to carry out an operation on different subsets of our data. For example, it's probably more useful to know how the mean sepal and petal traits vary among the different species in the `iris_tbl` data set, rather than knowing the overall mean of these traits. We could calculate separate means by using `filter` to create different subsets of `iris_tbl`, and then using `summary` on each of these to calculate the relevant means. This would get the job done, but it's not very efficient and very soon becomes tiresome when we have to work with many groups.
The `group_by` function provides a more elegant solution to this kind of problem. It doesn't do all that much on its own though. All the `group_by` function does is add a bit of grouping information to a tibble or data frame. In effect, it defines subsets of data on the basis of one or more __grouping variables__. The magic happens when the grouped object is used with a **dplyr** verb like `summarise` or `mutate`. Once a data frame or tibble has been tagged with grouping information, operations that involve these (and other) verbs are carried out on separate subsets of the data, where the subsets correspond to the different values of the grouping variable(s).
Basic usage of `group_by` looks like this:
```{r, eval=FALSE}
group_by(data_set, vname1, vname2, ...)
```
The first argument, `data_set` ("data object"), must be the name of the object containing our data. We then have to include one or more additional arguments, where each of these is the name of a variable in `data_set`. I have expressed this as `vname1, vname2, ...`, where `vname1` and `vname2` are names of the first two variables, and the `...` is acting as placeholder for the remaining variables.
As usual, it's much easier to understand how `group_by` works once we've seen it in action. We'll illustrate `group_by` by using it alongside `summarise` with the `storms_tbl` data set. We're aiming to calculate the mean wind speed for every type of storm. The first step is to use `group_by` to add grouping information to `storms_tbl`:
```{r}
group_by(storms_tbl, type)
```
Compare this to the output produced when we print the original `storms_tbl` data set:
```{r}
storms_tbl
```
There is almost no change in the printed information---`group_by` really doesn't do much on its own. The main change is that the tibble resulting from the `group_by` operation has a little bit of additional information printed at the top: `Groups: type [4]`. The `Groups: type` part of this tells us that the tibble is grouped by the `type` variable and nothing else. The `[4]` part tells us that there are 4 different groups.
The only thing `group_by` did was add this grouping information to a copy of `storms_tbl`. The original `storms_tbl` object was not altered in any way. If we actually want to do anything useful useful with the result we need to assign it a name so that we can work with it:
```{r}
storms_grouped <- group_by(storms_tbl, type)
```
Now we have a grouped tibble called `storms_grouped`, where the groups are defined by the values of `type`. Any operations on this tibble will now be performed on a "by group" basis. To see this in action, we use `summarise` to calculate the mean wind speed:
```{r}
summarise(storms_grouped, mean.wind = mean(wind))
```
When we used `summarise` on an ungrouped tibble the result was a tibble with one row: the overall global mean. Now the resulting tibble has four rows, one for each value of `type`: The `type` variable in the new tibble tells us what these values are; the `mean.wind` variable shows the mean wind speed for each value.
### More than one grouping variable
What if we need to calculate summaries using more than one grouping variable? The workflow is unchanged. Let's assume we want to know the mean wind speed and atmospheric pressure associated with each storm type in each year. We first make a grouped copy of the data set with the appropriate grouping variables:
```{r}
# group the storms_tbl data by storm year + assign the result a name
storms_grouped <- group_by(storms_tbl, type, year)
#
storms_grouped
```
We grouped the `storms_tbl` data by `type` and `year` and assigned the grouped tibble the name `storms_grouped`. When we print this to the Console we see `Groups: type, year [24]` near the top, which tells us that the tibble is grouped by two variables with 24 unique combinations of values. We then calculate the mean wind speed and pressure of each storm type in each year:
```{r}
summarise(storms_grouped,
mean_wind = mean(wind),
mean_pressure = mean(pressure))
```
This calculates mean wind speed and atmospheric pressure for different combination of `type` and `year`. The first line shows us that the mean wind speed and pressure associated with extra-tropical storms in 1995 was 38.7 mph and 995 millibars, the second line shows us that the mean wind speed and pressure associated with extra-tropical storms in 1995 was 40.4 mph and 991 millibars, and so on. There are 24 rows in total because there were 24 unique combinations of `type` and `year` in the original `storms_tbl`.
### Using `group_by` with other verbs
The `summarise` function is the only **dplyr** verb we'll use with grouped tibbles in this book. However, all the main verbs alter their behaviour to respect group identity when used with tibbles with grouping information. When `mutate` or `transmute` are used with a grouped object they still add new variables, but now the calculations occur "by group". Here's an example using `transmute`:
```{r}
# group the storms data by storm name + assign the result a name
storms_grouped <- group_by(storms_tbl, name)
# create a data set 'mean centred' wind speed variable
transmute(storms_grouped, wind_centred = wind - mean(wind))
```
In this example we calculated the "group mean-centered" version of the wind variable. The new `wind_centred` variable contains the difference between the wind speed and the mean of whichever storm type is associated with the observation.
## Removing grouping information
On occasion it's necessary to remove grouping information from a data object. This is most often done when working with "pipes" (the topic of the next chapter) when we need to revert back to operating on the whole data set. The `ungroup` function removes grouping information:
```{r}
ungroup(storms_grouped)
```
Looking at the top right of the printed summary, we can see that the `Group:` part is now gone---the `ungroup` function effectively created a copy of `storms_grouped` that is identical to the original `storms_tbl` tibble.