-
Notifications
You must be signed in to change notification settings - Fork 1
/
Plots_and_Charts.Rmd
255 lines (195 loc) · 13.6 KB
/
Plots_and_Charts.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
---
title: "Data Visualisation Masterclass: Plots & Charts"
output:
html_notebook:
toc: true
toc_float: true
---
```{r echo=F}
knitr::opts_chunk$set(message=FALSE, warning=FALSE, fig.width = 4, fig.height = 3)
```
This part of the [Data Visualisation Masterclass](datavismasterclass.org) is intended to discuss some very basic techniques to visualise tabular data, i.e. data arranged in a table. We will refer to columns in the table as dimensions or variables, and to rows of the table as items or data points. To see the code that produces all figures, hit the dropdown box at the top right of this document and pick 'Download Rmd'.
## Setup
To setup, we need to load the required libraries and setup our example data:
```{r Setup, message=FALSE, warning=FALSE}
library(ggplot2)
theme_set(theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank()))
library(dplyr)
data <- mtcars %>%
select(Economy = mpg, Cylinders = cyl, Displacement = disp, Horsepower = hp, Weight = wt, Acceleration = qsec, Transmission = am) %>%
mutate(Transmission = plyr::revalue(factor(Transmission), c("0" = "Automatic", "1" = "Manual"))) %>%
mutate(Name = as.factor(rownames(mtcars)))
rownames(data) <- rownames(mtcars)
```
Our toy data for this course is the classical car data, slightly modified to be more human readable:
```{r}
knitr::kable(data)
```
## Dotplot
The dotplot is probably the most straight-forward plot for a single, quantitative variable and simply maps a point to every data value along a common scale:
```{r Dotplot, fig.height=1, fig.width=4}
dotplot <- ggplot(data, aes(x = Acceleration, y = 5)) + geom_point(show.legend = F) + theme(axis.ticks.y = element_blank(), axis.text.y = element_blank(), axis.title.y = element_blank()) + xlim(14,23)
dotplot
```
## Boxplot
Boxplot or Box-and-whisker-plots show some basic statistical properties of samples from a variable: the median is mapped to a line, the inter-quartile range to the length of a rectangle, and outliers to individual points:
```{r Boxplot, fig.height=1, fig.width=4}
boxplot <- ggplot(data, aes(x = "", y = Acceleration)) + geom_boxplot() + ylim(14,23) + theme(axis.title.x = element_blank(), axis.ticks.y = element_blank(), axis.title = element_blank(), axis.text.y = element_blank()) + coord_flip()
boxplot
```
Boxplots are very useful to visually summarise the *distribution* of values in our sample.
## Histogram
Probably the most common way to visualise the counts and distributions of variables is a histogram. Here, a user-defined number of equally sized bins is used to count all occurences of values in those bins, and visualise the *count* mapped to the height of a bar:
```{r message=FALSE, warning=FALSE}
histogram_default <- ggplot(data, aes(Acceleration)) + geom_histogram(binwidth = 1) + scale_x_continuous(breaks = c(14,16,18,20,24)) + xlim(14,24)
histogram_default
```
This is a very effective way to visualise quantitative data, since humans can most accurately compare values if mapped to a *length on a common scale*. However, it's important to understand that not only the bin-width, but also the position or shift of the bins is crucial in determining the final pattern - and its interpretation:
```{r message=F, warning=F}
histogram <- ggplot(data, aes(Acceleration)) + geom_histogram(breaks = c(seq(14,24))) + scale_x_continuous(breaks = c(14,16,18,20,24)) + xlim(14,24)
histogram
```
## Density Plot
Decreasing the binsize to infinitely narrow bin, we can obtain a continuous model of a density distribution, and map its values to a smooth line:
```{r message=F, warning=F}
histogram_scaled <- ggplot(data, aes(Acceleration, ..density..)) + geom_histogram(breaks = c(seq(14,24))) + scale_x_continuous(breaks = c(14,16,18,20,24)) + xlim(14,24)
histogram_density <- histogram_scaled + geom_density(aes(color = "red"), show.legend = F)
histogram_density
```
These plots are often referred to as *kernel density estimates* (KDEs), because - similar to the binsize - they require a *kernel* (typically Gaussian) to model individual samples as continuous distributions and compute the final density from all samples. In fact, the histogram using rectangular bins is a special case of a KDE, using a rectangular kernel.
## Violin Plot
By mirroring the smooth density distribution along the horizontal axis, we obtain a *violin plot*:
```{r}
violin <- ggplot(data, aes(x = "", y = Acceleration)) + geom_violin() + ylim(14,24) + theme(axis.title.x = element_blank(), axis.ticks.y = element_blank(), axis.title = element_blank(), axis.text.y = element_blank()) + coord_flip()
violin
```
While the example above is a very basic one, you can imagine adding more layers of visual marks for the median or outlieres, or even a dotplot to show the original data. Then, violin plots are more informative than boxplots, as they can be used to show the same information as a boxplot, *plus* the full density distribution.
## Bar charts
Bar charts can be used to visualise one quantitative and one categorical variable, by using spatial regions along the horizontal to seperate bars into the different categories. Here, we compute the average economy of cars by their type of transmission:
```{r, fig.height=3, fig.width=4, message=FALSE, warning=FALSE}
economy_by_transmission <- data %>%
group_by(Transmission) %>%
summarise(Economy = mean(Economy))
bar_chart_transmission <- ggplot(economy_by_transmission, aes(Transmission, Economy)) + geom_col()
bar_chart_transmission
```
## Stacked bar chart
For a single quantitative and two categorical variables, bars can be stacked according the second categorical variable (here further mapped to colour hue):
```{r}
count_by_transmission_and_Cylinders <- data %>%
mutate(Cylinders = factor(Cylinders))
stacked_bar_chart_count <- ggplot(count_by_transmission_and_Cylinders, aes(Transmission)) + geom_bar(aes(fill = Cylinders))
stacked_bar_chart_count
```
The bottom-most bar (blue) is most effective in visual comparison tasks, because their length is mapped to a common scale. It is more difficult to accurately judge the difference in height of the green and orange bars.
Also note that stacked bar charts make no sense for stats that cannot be summed, such as the average economy of cars:
```{r}
economy_by_transmission_and_Cylinders <- data %>%
mutate(Cylinders = factor(Cylinders)) %>%
group_by(Transmission, Cylinders) %>%
summarise(Economy = mean(Economy))
stacked_bar_chart_economy <- ggplot(economy_by_transmission_and_Cylinders, aes(Transmission, Economy)) + geom_col(aes(fill = Cylinders))
stacked_bar_chart_economy
```
In order to get a better sense of the *proportions* of categories, the height of the bars can be normalized to create normalized stacked bar charts:
```{r}
normalized_stacked_bar_chart <- ggplot(count_by_transmission_and_Cylinders, aes(Transmission)) + geom_bar(aes(fill = Cylinders), position = "fill")
normalized_stacked_bar_chart
```
Note that each normalized stacked bar corresponds to a Cartesian version of a pie chart (but is more effective in conveying the data):
```{r}
pie_chart_automatic <- ggplot(filter(count_by_transmission_and_Cylinders, Transmission == "Automatic"), aes(x = factor(1), fill = Cylinders)) + geom_bar(width=1) + coord_polar(theta = "y") + theme(axis.title.y = element_blank(), axis.ticks.y = element_blank(), axis.text.y = element_blank())
pie_chart_automatic
```
```{r}
pie_chart_manual <- ggplot(filter(count_by_transmission_and_Cylinders, Transmission == "Manual"), aes(x = factor(1), fill = Cylinders)) + geom_bar(width=1) + coord_polar(theta = "y") + theme(axis.title.y = element_blank(), axis.ticks.y = element_blank(), axis.text.y = element_blank())
pie_chart_manual
```
```{r}
economy_by_weight <- data %>%
mutate(Weight = cut(Weight, c(seq(1,6)))) %>%
group_by(Weight) %>%
summarise(Economy = mean(Economy))
bar_chart_weight <- ggplot(economy_by_weight, aes(Weight, Economy)) + geom_col()
bar_chart_weight
```
## Line Charts
```{r}
line_chart_transmission <- ggplot(economy_by_transmission, aes(Transmission, Economy, group=1)) + geom_line() + ylim(0,25)
line_chart_transmission
```
```{r}
line_chart_weight <- ggplot(economy_by_weight, aes(Weight, Economy, group=1)) + geom_line()
line_chart_weight
```
## Scatterplot
```{r}
scatterplot <- ggplot(data, aes(Acceleration, Weight)) + geom_point()
scatterplot
```
```{r}
contour <- scatterplot + geom_density2d()
contour
```
## Bubble Charts
```{r}
bubble_size <- ggplot(data, aes(Acceleration, Weight)) + geom_point(aes(size = Economy, fill = Transmission))
bubble_size
```
```{r}
bubble <- ggplot(data, aes(Acceleration, Weight)) + geom_point(aes(size = Economy, color = Transmission))
bubble
```
```{r}
```
## Faceting
Faceting is a technique for the repetitive use of visualisations, for example to visualise multiple variables. By juxtaposing dotplots next to each other, one for each variable in the dataset, we can compare the distribution of samples between variables:
```{r}
data_faceting <- reshape2::melt(select(data, -Transmission), id.vars = c("Name"))
data_faceting$value <- as.numeric(data_faceting$value)
dotplot_facets <- ggplot(data_faceting, aes(x = 0, y = value)) + geom_point(show.legend = F) + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(), axis.title.x = element_blank()) + facet_wrap(~variable, scales = "free", nrow = 1)
dotplot_facets
```
Here, we can see in an instance that there is only three values for `Cylinders`, the outlier in `Horsepower` and `Acceleration`, etc. Obviously we can replace the dotplot with any of the one-dimensional visualisations discussed earlier and obtain a set of boxplots or violins:
```{r}
boxplot_facets <- ggplot(data_faceting, aes(x = 0, y = value)) + geom_boxplot(show.legend = F) + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(), axis.title.x = element_blank()) + facet_wrap(~variable, scales = "free", nrow = 1)
boxplot_facets
```
```{r}
violin_facets <- ggplot(data_faceting, aes(x = 0, y = value)) + geom_violin(show.legend = F) + theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(), axis.title.x = element_blank()) + facet_wrap(~variable, scales = "free", nrow = 1)
violin_facets
```
Note that these visualisation corresponds a projection of our six-dimensional data to six one-dimensional representations, so we can't see any relations between dimensions.
## Scatterplot Matrix (SPLOM)
The classical approach to inspect *all* pairwise relations between a set of variables is the scatterplotmatrix (SPLOM). Here, scatterplots are faceted into a symmetric matrix, where every row and column correspond to a single variable, and each cell is composed of a scatterplot of the corresponding variables. Since it is a symmetric matrix, the diagonal would normally show a dotplot of the variable and is often replaced with an explicit, line-based visualisation of the sample distribution. Similarly, only the lower or upper triangle of the matrix is typically used for scatterplots, because the other triangle would show mirrored versions of all scatterplots.
```{r}
splom <- GGally::scatmat(data) + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
splom
```
Let's store all the plots as pdf files on disk:
```{r}
ggsave(filename = "plots/dotplot.pdf", plot = dotplot, width = 4, height = 1, units = "in")
ggsave(filename = "plots/boxplot.pdf", plot = boxplot, width = 4, height = 1, units = "in")
ggsave(filename = "plots/bar_chart_transmission.pdf", plot = bar_chart_transmission, width = 4, height = 3, units = "in")
ggsave(filename = "plots/bar_chart_weight.pdf", plot = bar_chart_weight, width = 4, height = 3, units = "in")
ggsave(filename = "plots/stacked_bar_chart_economy.pdf", plot = stacked_bar_chart_economy, width = 4, height = 3, units = "in")
ggsave(filename = "plots/stacked_bar_chart_count.pdf", plot = stacked_bar_chart_count, width = 4, height = 3, units = "in")
ggsave(filename = "plots/normalized_stacked_bar_chart.pdf", plot = normalized_stacked_bar_chart, width = 4, height = 3, units = "in")
ggsave(filename = "plots/pie_chart_manual.pdf", plot = pie_chart_manual, width = 4, height = 3, units = "in")
ggsave(filename = "plots/pie_chart_automatic.pdf", plot = pie_chart_automatic, width = 4, height = 3, units = "in")
ggsave(filename = "plots/line_chart_transmission.pdf", plot = line_chart_transmission, width = 4, height = 3, units = "in")
ggsave(filename = "plots/line_chart_weight.pdf", plot = line_chart_weight, width = 4, height = 3, units = "in")
ggsave(filename = "plots/histogram_default.pdf", plot = histogram_default, width = 4, height = 3, units = "in")
ggsave(filename = "plots/histogram.pdf", plot = histogram, width = 4, height = 3, units = "in")
ggsave(filename = "plots/histogram_scaled.pdf", plot = histogram_scaled, width = 4, height = 3, units = "in")
ggsave(filename = "plots/histogram_density.pdf", plot = histogram_density, width = 4, height = 3, units = "in")
ggsave(filename = "plots/violin.pdf", plot = violin, width = 4, height = 3, units = "in")
ggsave(filename = "plots/scatterplot.pdf", plot = scatterplot, width = 4, height = 3, units = "in")
ggsave(filename = "plots/bubble.pdf", plot = bubble, width = 4, height = 3, units = "in")
ggsave(filename = "plots/bubble_size.pdf", plot = bubble_size, width = 4, height = 3, units = "in")
ggsave(filename = "plots/contour.pdf", plot = contour, width = 4, height = 3, units = "in")
ggsave(filename = "plots/dotplot_factes.pdf", plot = dotplot_facets, width = 10, height = 3, units = "in")
ggsave(filename = "plots/splom.pdf", plot = splom, width = 8, height = 6, units = "in")
ggsave(filename = "plots/boxplot_facets.pdf", plot = boxplot_facets, width = 10, height = 3, units = "in")
ggsave(filename = "plots/violin_facets.pdf", plot = violin_facets, width = 10, height = 3, units = "in")
```