# Data Visualisation {#sec-viz}
## Intended Learning Outcomes {#sec-ilo-viz .unnumbered}
* Be able to identify categorical versus continuous data
* Be able to create plots in layers using ggplot
* Be able to choose appropriate plots for data
## Walkthrough video {#sec-walkthrough-viz .unnumbered}
There is a walkthrough video of this chapter available via [Echo360](https://echo360.org.uk/media/457312c7-ae4f-4506-8016-a29df4f47462/public). Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.
## Set-up {#sec-setup-viz}
Create a new project for the work we'll do in this chapter:
- <if>File > New Project...</if>
- Name the project `r path("03-visualisation")`
- Save it inside your ADS directory (**not** inside another project)
Then, create and save a new `r glossary("R Markdown")` document named `plots.Rmd`, get rid of the default template text, and load the packages in the set-up code `r glossary("chunk")`. You should have all of these packages installed already, but if you get the message `Error in library(x) : there is no package called ‘x’`, please refer to @sec-install-package.
```{r setup-viz, message=FALSE, verbatim="r setup, include=FALSE"}
library(tidyverse) # includes ggplot2
library(patchwork) # for multi-part plots
library(ggthemes) # for plot themes
library(lubridate) # for manipulating dates
```
We'd recommend making a new code chunk for each different activity, and using the white space to make notes on any errors you make, things you find interesting, or questions you'd like to ask the course team.
Download the [ggplot2 cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf).
## Variable types
If a spreadsheet is in a `r glossary("tidy data")` format, each row is an `r glossary("observation")`, each column is a `r glossary("variable")`, and the information in each cell is a single `r glossary("value")`. We'll learn more about how to get our data into this format in @sec-tidy, but to get started we'll use datasets with the right format.
For example, the table below lists pets owned by members of the psyTeachR team. Each row is an observation of one pet. There are six variables for each pet: `name`, `owner`, `species`, `birthdate`, `weight` (in kg), and `rating` (on a 5-point scale from "very evil" to "very good").
```{r, echo = FALSE}
pets <- tribble(
~name, ~owner, ~species, ~birthdate, ~weight, ~rating,
"Darwin", "Lisa", "ferret", "1998-04-02", 1.2, "a little evil",
"Oy", "Lisa", "ferret", NA , 2.9, "very good",
"Khaleesi", "Emily", "cat", "2014-10-01", 4.5, "very good",
"Bernie", "Phil", "dog", "2017-06-01", 32.0, "very good"
) %>%
mutate(species = factor(species, c("dog", "cat", "ferret")),
birthdate = as.Date(birthdate),
rating = factor(rating, c("very evil",
"a little evil",
"neutral",
"mostly good",
"very good")))
pets
```
Variables can be classified as `r glossary("continuous")` (numbers) or `r glossary("categorical")` (labels). When you're plotting data, it's important to know what kind of variables you have, which can help you decide what types of plots are most appropriate. Each variable also has a `r glossary("data type")`, such as `r glossary("numeric")` (numbers), `r glossary("character")` (text), or `r glossary("logical")` (TRUE/FALSE values). Some plots can only work on some data types. Make sure you have watched the mini-lecture on types of data from last week before you work through this chapter. Additionally, @sec-data-types has more details, as this concept will be relevant repeatedly.
```{r excel-format-cells, echo = FALSE, fig.cap="Data types are like the categories when you format cells in Excel."}
include_graphics("images/appx/excel-format-cells.png")
```
### Continuous
`r glossary("Continuous")` variables are properties you can measure, like weight. You can use continuous variables in mathematical operations, like calculating the sum total of a column of prices or the average number of social media likes per day. They may be rounded to the nearest whole number, but it should make sense to have a measurement halfway between.
Continuous variables always have a `r glossary("numeric")` data type. They are either `r glossary("integer", "integers")` like `42` or `r glossary("double", "doubles")` like `3.14159`.
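As a quick check of the difference between integers and doubles, the sketch below uses the base R functions `typeof()` and `is.numeric()`. The object names are made up for illustration and are not part of the chapter data.

```{r}
# hypothetical example values, not from the chapter data
whole_number <- 42L        # the L suffix makes an integer
decimal_number <- 3.14159  # numbers with decimals are doubles

typeof(whole_number)    # "integer"
typeof(decimal_number)  # "double"
typeof(42)              # "double", because there is no L suffix

# both integers and doubles count as numeric
is.numeric(whole_number)
is.numeric(decimal_number)
```

In practice you rarely need to care which of the two you have; most plotting functions treat them the same way.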
### Categorical
`r glossary("Categorical")` variables are properties you can count, like the species of pet. Categorical variables can be `r glossary("nominal")`, where the categories don't really have an order, like cats, dogs and ferrets (even though ferrets are obviously best), or `r glossary("ordinal")`, where they have a clear order but the distance between the categories isn't something you could exactly equate, like points on a `r glossary("Likert")` rating scale. Even if a data table uses numbers like 1-7 to represent ordinal variables, you shouldn't treat them like continuous variables.
Categorical data can have a `r glossary("character")` data type, also called `r glossary("string", "strings")`. These are made by putting text inside of quotes. That text can be letters, punctuation, or even numbers. For example, `"January"` is a character string, but so is `"1"` if you put it in quotes. The character data type is best for variables that can have a lot of different values that you can't predict ahead of time.
Categorical data can also be `r glossary("factor", "factors")`, a specific type of integer that lets you specify the category names and their order. This is useful for making plots display with categories in the order you want (otherwise they default to alphabetical order). The factor data type is best for categories that have a specific number of levels.
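To see how level order works, here is a minimal sketch that turns some made-up rating text into a factor, first with the default alphabetical order and then with a custom order. The objects `rating_chr`, `rating_levels` and `rating_fct` are hypothetical names used only for this illustration.

```{r}
# hypothetical ratings stored as text
rating_chr <- c("very good", "neutral", "a little evil", "very good")

# without levels, the categories are ordered alphabetically
levels(factor(rating_chr))

# with levels, you control the category names and their order
rating_levels <- c("very evil", "a little evil", "neutral",
                   "mostly good", "very good")
rating_fct <- factor(rating_chr, levels = rating_levels)
levels(rating_fct)

# under the hood, each value is stored as its position in the levels
as.integer(rating_fct)
```

Note that the custom version keeps levels such as "very evil" even though no pet in this toy vector has that rating, which is useful when you want a plot to show every possible category.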
::: {.callout-caution}
## Do not factor numbers
If you factor numeric data, the values are stored as the integers from 1 to the number of unique values, no matter what the original values were. Additionally, you can no longer use the values as numbers, for example to calculate the mean.
```{r, warning=TRUE, filename="Example"}
x <- c(-3, 0, .5) # numeric vector
f <- factor(x) # convert to factor
x == as.numeric(f) # does not convert back to numeric
```
```{r, warning=TRUE, filename="You cannot average a factor"}
m <- mean(f)
```
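If you do end up with numbers stored as a factor, one common base R idiom is to convert to character first and then to numeric. The sketch below redefines `x` and `f` with the same values as the example above so that it can be run on its own; `recovered` is a hypothetical name.

```{r}
x <- c(-3, 0, .5)  # same numeric vector as in the example above
f <- factor(x)     # same factor as in the example above

as.numeric(f)      # the level positions: 1 2 3

# convert to character first to get the original values back
recovered <- as.numeric(as.character(f))
recovered
mean(recovered)
```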
:::
Sometimes people represent categorical variables with numbers that correspond to names, like 0 = "no" and 1 = "yes", but values in between don't have a clear interpretation. If you have control over how the data are recorded, it's better to use the character names for clarity. You'll learn how to recode columns in @sec-wrangle.
### Dates and times
Dates and times are a special case of variable. They can act like categorical or continuous variables, and there are special ways to plot them. Dates and times can be hard to work with, but the [<pkg>lubridate</pkg>](https://lubridate.tidyverse.org/) package provides functions to help you with this.
```{r}
# the current date
lubridate::today()
```
```{r}
# the current date and time in the GMT timezone
lubridate::now(tzone = "GMT")
```
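To illustrate how a date can act like either kind of variable, here is a small sketch using <pkg>lubridate</pkg> functions. The date values and the object name `example_date` are made up for illustration.

```{r}
# hypothetical date, parsed from year-month-day text
example_date <- lubridate::ymd("2014-10-01")
class(example_date)  # "Date"

# treat the date like a category: pull out its parts
lubridate::year(example_date)
lubridate::month(example_date, label = TRUE)  # month name as an ordered factor

# treat the date like a number: count the days between two dates
as.numeric(lubridate::ymd("2014-10-31") - example_date)
```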
::: {.callout-note .try}
## Test your understanding
Coming back to the pets dataset, what type of variable is in each column? You can use the function `glimpse()` to show a list of the column names, their data types, and the first few values in each column - here is the output of running `glimpse()` on the pets dataset.
```{r}
glimpse(pets)
```
```{r, include = FALSE}
# data type options
num <- c(answer = "numeric", x = "character", x = "factor", x = "date")
chr <- c(x = "numeric", answer = "character", x = "factor", x = "date")
fctr <- c(x = "numeric", x = "character", answer = "factor", x = "date")
date_type <- c(x = "numeric", x = "character", x = "factor", answer = "date")
# variable type options
cont <- c(answer = "continuous", x = "nominal", x = "ordinal", x = "date")
nom <- c(x = "continuous", answer = "nominal", x = "ordinal", x = "date")
ord <- c(x = "continuous", x = "nominal", answer = "ordinal", x = "date")
date_var <- c(x = "continuous", x = "nominal", x = "ordinal", answer = "date")
```
| Column | Variable type | Data type |
|:------------|:------------------|:-------------------|
| `name` | `r mcq(nom)` | `r mcq(chr)` |
| `owner` | `r mcq(nom)` | `r mcq(chr)` |
| `species` | `r mcq(nom)` | `r mcq(fctr)` |
| `birthdate` | `r mcq(date_var)` | `r mcq(date_type)` |
| `weight` | `r mcq(cont)` | `r mcq(num)` |
| `rating` | `r mcq(ord)` | `r mcq(fctr)` |
:::
## Building plots
```{r sim-survey, include = FALSE, eval = FALSE}
# code for simulating the data used in this chapter
# hidden from students and not run on every knit, just here for reference
library(faux)
set.seed(8765309)
issues <- c(
tech = 0,
sales = 2,
returns = 1,
other = 1
)
survey_data <- add_random(employee_id = 10) %>%
add_random(caller_id = sample(50:100, 10),
.nested_in = "employee_id") %>%
add_between("caller_id", issue_category = names(issues),
.prob = c(.4, .1, .3, .1)) %>%
add_ranef("caller_id",
wait_time = 1,
call_time = 1,
.cors = 0.5) %>%
add_ranef("employee_id",
employee_quality = 1,
employee_time = 1,
.cors = -.5) %>%
mutate(caller_id = gsub("caller_id", "C", caller_id),
employee_id = gsub("employee_id", "E", employee_id)) %>%
add_ranef(error = 1) %>%
mutate(call_start = runif(nrow(.), 2020, 2021) %>% date_decimal()) %>%
mutate(wait_time = norm2beta(wait_time, 2, 4, ncp=10) * 5,
call_time = norm2beta(call_time + employee_time, 2, 4, ncp=0) * 2 + 0.1,
# round and add outliers
wait_time = round(wait_time * 60) +
sample(c(0, 100), nrow(.), T, c(99, 1)),
call_time = round(call_time * 60) +
sample(c(0, 100), nrow(.), T, c(99, 1))
) %>%
mutate(satisfaction = (employee_quality +
recode(issue_category, !!!issues) -
(wait_time * .1) +
(month(call_start) * -.05) +
error) %>% norm2likert(prob = c(1,3,4,5,2))) %>%
select(caller_id, employee_id, call_start, wait_time, call_time,
issue_category, satisfaction)
write_csv(survey_data, "data/survey_data.csv")
```
There are multiple approaches to data visualisation in R; in this course we will use the popular package <pkg>ggplot2</pkg>, which is part of the larger `tidyverse` collection of packages. A grammar of graphics (the "gg" in "ggplot") is a standardised way to describe the components of a graphic. <pkg>ggplot2</pkg> uses a layered grammar of graphics, in which plots are built up in a series of layers. It may be helpful to think about any picture as having multiple elements that sit semi-transparently over each other. A good analogy is old Disney movies where artists would create a background and then add moveable elements on top of the background via transparencies.
@fig-layers displays the evolution of a simple scatterplot using this layered approach. First, the plot space is built (layer 1). Next, the variables are specified (layer 2), followed by the type of visualisation (known as a `geom`) that is desired for these variables (layer 3); in this case, `geom_point()` is called to visualise individual data points. A second geom is then added to include a line of best fit (layer 4), the axis labels are edited for readability (layer 5), and finally, a theme is applied to change the overall appearance of the plot (layer 6).
```{r fig-layers, fig.cap="Evolution of a layered plot", echo = FALSE, message=FALSE}
survey_data <- read_csv(file = "data/survey_data.csv",
show_col_types = FALSE)
x_breaks <- seq(from = 0, to = 600, by = 60)
y_breaks <- seq(from = 0, to = 600, by = 30)
a <- ggplot() + labs(subtitle = "Layer 1")
b <- ggplot(survey_data, aes(x = wait_time, y = call_time)) +
labs(subtitle = "Layer 2")
c <- b + geom_point(alpha = 0.2, color = "dodgerblue") +
labs(subtitle = "Layer 3")
d <- c + geom_smooth(method = "lm", color = rgb(0, .5, .8)) +
labs(subtitle = "Layer 4")
e <- d + scale_x_continuous(name = "Wait Time (seconds)", breaks = x_breaks) +
scale_y_continuous(name = "Call time (seconds)", breaks = y_breaks) +
coord_cartesian(xlim = c(0, 360), ylim = c(0, 180)) +
labs(subtitle = "Layer 5")
f <- e + ggthemes::theme_gdocs(base_size = 10) +
theme(axis.line.x = element_blank(),
plot.background = element_blank()) +
labs(subtitle = "Layer 6") +
theme(plot.subtitle = element_text(color = "black"))
a + b + c + d + e + f + plot_layout(nrow = 2)
```
Importantly, each layer is independent and independently customisable. For example, the size, colour and position of each component can be adjusted. You could also visualise only the line of best fit, simply by removing the layer that draws the data points (@fig-remove-layer). The use of layers makes it easy to build up complex plots step-by-step, and to adapt or extend plots from existing code.
```{r fig-remove-layer, fig.cap="Final plot with scatterplot layer removed.", echo = FALSE}
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
#geom_point(alpha = 0.15, color = "dodgerblue") +
geom_smooth(method = "lm", formula = y~x, color = rgb(0, .5, .8)) +
scale_x_continuous(name = "Wait Time (seconds)", breaks = seq(from = 0, to = 600, by = 60)) +
scale_y_continuous(name = "Call time (seconds)", breaks = seq(from = 0, to = 600, by = 30)) +
coord_cartesian(xlim = c(0, 360), ylim = c(0, 180)) +
ggthemes::theme_gdocs(base_size = 11) +
theme(axis.line.x = element_blank(),
plot.background = element_blank())
```
### Plot Data {#sec-plots-loading-data}
Let's build up the plot above, layer by layer. First we need to get the data. We'll learn how to load data from different sources in @sec-data, but this time we'll use the same method as we did in @sec-loading-online and load it from an online source.
When you load the data, `read_csv()` will produce a message that gives you information about the data it has imported and what assumptions it has made. The "column specification" tells you what each column is named and what type of data R has categorised each variable as. The abbreviation "chr" is for `r glossary("character")` columns, "dbl" is for `r glossary("double")` columns, and "dttm" is a date/time column.
```{r}
survey_data <- read_csv("https://psyteachr.github.io/ads-v2/data/survey_data.csv")
```
These data are simulated for a call centre customer satisfaction survey. The first thing you should do when you need to plot data is to get familiar with what all of the rows (observations) and columns (variables) mean. Sometimes this is obvious, and sometimes it requires help from the data provider. Here, each row represents one call to the centre.
* `caller_id` is a unique ID for each caller
* `employee_id` is a unique ID for each employee taking calls
* `call_start` is the date and time that the call arrived
* `wait_time` is the number of seconds the caller had to wait
* `call_time` is the number of seconds the call lasted after the employee picked up
* `issue_category` is whether the issue was tech, sales, returns, or other
* `satisfaction` is the customer satisfaction rating on a scale from 1 (very unsatisfied) to 5 (very satisfied)
Unless you specify the column types, data importing functions will just guess the types and usually default to double for columns with numbers and character for columns with letters. Use the function `spec()` to find out all of the column types and edit them if needed.
```{r}
spec(survey_data)
```
Let's set `issue_category` as a factor and set the order of the levels. By default, R will order the levels of a factor alphanumerically; however, in many cases you will want or need to set your own order. For example, in this data, it makes most sense for the category "other" to come at the end of the list. After you update the column types, you have to re-import the data by adjusting the `read_csv()` code to set the `col_types` argument to the new column types.
::: {.callout-note}
## Define objects before you use them
Because `read_csv()` is going to use the object `survey_col_types`, you must create `survey_col_types` before you run the adjusted `read_csv()` code. If you ever need to adjust your code, try to think about the order that the code will run in if you start from scratch and make sure it's organised appropriately.
:::
```{r}
# updated column types
survey_col_types <- cols(
caller_id = col_character(),
employee_id = col_character(),
call_start = col_datetime(format = ""),
wait_time = col_double(),
call_time = col_double(),
issue_category = col_factor(levels = c("tech", "sales", "returns", "other")),
satisfaction = col_integer()
)
# re-import data with correct column types
survey_data <- read_csv("https://psyteachr.github.io/ads-v2/data/survey_data.csv",
col_types = survey_col_types)
```
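To check that a column specification has done what you expect, you can compare the guessed and specified types. The sketch below does this on a tiny made-up CSV rather than the survey data, so it runs without an internet connection. It assumes readr 2.0 or later, where wrapping text in `I()` tells `read_csv()` to treat it as literal data; the `toy_` object names are hypothetical.

```{r}
library(readr)

# a tiny made-up CSV, wrapped in I() so readr treats it as literal data
toy_csv <- I("caller_id,issue_category\nC001,sales\nC002,other\nC003,tech\n")

# default guess: issue_category is read as character
toy_guess <- read_csv(toy_csv, show_col_types = FALSE)
class(toy_guess$issue_category)

# with col_types, it becomes a factor with levels in the order we chose
toy_types <- cols(
  caller_id = col_character(),
  issue_category = col_factor(levels = c("tech", "sales", "returns", "other"))
)
toy_typed <- read_csv(toy_csv, col_types = toy_types)
levels(toy_typed$issue_category)
```

You can run the same `class()` and `levels()` checks on `survey_data$issue_category` after re-importing the real data.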
### Plot setup
#### Default theme
Plots in this book use the black-and-white theme, not the default grey theme, so set your default theme to the same so your plots will look like the examples below. At the top of your script, in the setup chunk after you've loaded the tidyverse package, add the following code and run it. You'll learn more ways to customise your theme in @sec-themes and @sec-themes-appendix.
```{r}
theme_set(theme_bw()) # set the default theme
```
#### Data {#sec-plot-setup-data}
Every plot starts with the `ggplot()` function and a data table. If your data are not loaded or you have a typo in your code, this will give you an error message. It's best to check your plot after each step, so that you can more easily figure out where any errors are.
```{r fig-build-plot-setup, fig.cap = "A blank ggplot."}
ggplot(data = survey_data)
```
#### Mapping
The next `r glossary("argument")` to `ggplot()` is the `mapping`. This tells the plot which columns in the data should be represented by, or "mapped" to, different aspects of the plot, such as the x-axis, y-axis, line colour, object fill, or line style. These aspects, or "aesthetics", are listed inside the `aes()` function.
Set the arguments `x` and `y` to the names of the columns you want to be plotted on those axes. Here, we want to plot the wait time on the x-axis and the call time on the y-axis.
```{r fig-build-plot-mapping, fig.cap = "A blank plot with x- and y- axes mapped."}
# set up the plot with mapping
ggplot(
data = survey_data,
mapping = aes(x = wait_time, y = call_time)
)
```
::: {.callout-note}
## ggplot argument names
In the example above, we wrote out the names of the `r glossary("argument", "arguments")` `data` and `mapping`, but in practice, almost everyone omits them. Just make sure you put the data and mapping in the right order.
```{r, eval = FALSE}
ggplot(survey_data, aes(x = wait_time, y = call_time))
```
:::
#### Geoms
Now we can add our plot elements in layers. These are referred to as `r glossary("geom", "geoms")` and their functions start with `geom_`. You **add** layers onto the base plot created by `ggplot()` with a plus (`+`).
```{r fig-build-plot-geoms, fig.cap="Adding a scatterplot with geom_point()."}
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point() # scatterplot
```
::: {.callout-warning collapse="true"}
## Location of the +
Somewhat annoyingly, the plus has to be on the end of the previous line, not at the start of the next line. If you do make this mistake, it will run the first line of code to produce the base layer but then you will get the following error message rather than adding on `geom_point()`.
```{r, error = TRUE}
ggplot(survey_data, aes(x = wait_time, y = call_time))
+ geom_point() # scatterplot
```
:::
#### Multiple geoms
Part of the power of <pkg>ggplot2</pkg> is that you can add more than one geom to a plot by adding on extra layers, so it quickly becomes possible to make complex and informative visualisations. Importantly, the layers display in the order you set them up. The code below uses the same geoms to produce a scatterplot with a line of best fit, but orders them differently.
```{r fig-build-plot-geom2-code, eval = FALSE}
# Points first
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point() + # scatterplot
geom_smooth(method = lm) # line of best fit
# Line first
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_smooth(method = lm) + # line of best fit
geom_point() # scatterplot
```
```{r fig-build-plot-geom2, fig.cap="Points first versus line first.", message = FALSE, echo = FALSE}
point_first <-
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point() + # scatterplot
geom_smooth(method = lm) # line of best fit
line_first <-
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_smooth(method = lm) + # line of best fit
geom_point() # scatterplot
# add plots together in 1 row
point_first + line_first + plot_layout(nrow = 1)
```
#### Saving plots
Just like you can save numbers and data tables to objects, you can also save the output of `ggplot()`. The code below produces the same plots we created above but saves them to objects named `point_first` and `line_first`. If you run just this code, the plots won't display like they have done before. Instead, you'll see the object names appear in the environment pane.
```{r, message = FALSE}
point_first <-
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point() + # scatterplot
geom_smooth(method = lm) # line of best fit
line_first <-
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_smooth(method = lm) + # line of best fit
geom_point() # scatterplot
```
To view the plots, call the objects by name. This will output each plot separately.
```{r, eval = FALSE}
point_first # view first plot
line_first # view second plot
```
#### Combining plots
One of the reasons to save your plots to objects is so that you can combine multiple plots using functions from the `patchwork` package. The code below reproduces the plot above by combining the two plots with `+` and then specifying that we want the plots produced on a single row with the `nrow` argument in `plot_layout()`.
```{r, fig-build-plot-geom2b, fig.cap="Combining plots with patchwork.", message = FALSE}
# add plots together in 1 row
point_first + line_first + plot_layout(nrow = 1)
```
::: {.callout-note .try}
## Try changing nrow to 2
:::
### Customising plots
There are nearly endless ways to customise ggplots. We'll cover a few of the basic customisations here.
#### Styling geoms
We should definitely put the line in front of the points, but the points are still a bit dark. If you want to change the overall style of a geom, you can set the arguments `colour`, `alpha`, `shape`, `size` and `linetype` inside the geom function. There are many different values that you can set these to; @sec-plotstyle gives details of these. Play around with different values below and figure out what the `r glossary("default value", "default values")` are for `shape` and `size`.
```{r fig-build-plot-style, fig.cap="Changing geom styles."}
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point(colour = "dodgerblue",
             alpha = 0.2, # 20% opacity (80% transparent)
shape = 18, # solid diamond
size = 2) +
geom_smooth(method = lm,
formula = y~x, # formula used to draw line,
# setting method & formula avoids an annoying message
colour = rgb(0, .5, .8),
linetype = 3)
```
::: {.callout-warning}
## Setting aesthetics overall versus by category
This method is only for changing the style of *all* the shapes made with that geom. If you want, for example, points to have different colours depending on which issue category they are from, you set the argument `colour = issue_category` inside the `aes()` function for the mapping. You can customise the colours used with `scale_` functions, which you will learn about below and in @sec-plotstyle.
:::
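To make the difference concrete, here is a minimal sketch contrasting the two approaches. It uses a small made-up data frame (`toy_calls`, a hypothetical name) with the same column names as the survey data, so the values are for illustration only.

```{r}
library(ggplot2)

# made-up data for illustration only
toy_calls <- data.frame(
  wait_time = c(30, 60, 90, 120, 150, 180),
  call_time = c(20, 35, 30, 50, 45, 60),
  issue_category = c("tech", "sales", "tech", "returns", "sales", "other")
)

# colour inside aes(): one colour per category, plus a legend
by_category <- ggplot(toy_calls, aes(x = wait_time, y = call_time,
                                     colour = issue_category)) +
  geom_point(size = 3)

# colour inside the geom: every point gets the same colour
all_same <- ggplot(toy_calls, aes(x = wait_time, y = call_time)) +
  geom_point(colour = "dodgerblue", size = 3)

by_category
all_same
```

Swapping `toy_calls` for `survey_data` should give the same pattern on the real data, with one colour for each of the four issue categories.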
#### Format axes
Now we need to make the axes look neater. There are several functions you can use to change the axis labels, but the most powerful ones are the `scale_` functions. You need to use a scale function that matches the data you're plotting on that axis and this is where it becomes particularly important to know what type of data you're working with. Both of the axes here are `r glossary("continuous")`, so we'll use `scale_x_continuous()` and `scale_y_continuous()`.
The `name` argument changes the axis label. The `breaks` argument sets the major units and needs a `r glossary("vector")` of possible values, which can extend beyond the range of the data (e.g., `wait_time` only goes up to 350, but we can specify breaks up to 600 to make the maths easier or anticipate updates to the data). The `seq()` function creates a sequence of numbers `from` one `to` another `by` specified steps.
```{r, filename="Example of seq()"}
seq(from = 0, to = 600, by = 60)
```
```{r fig-build-plot-axes, fig.cap="Formatting plot axes with scale_ functions."}
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point(colour = "dodgerblue",
alpha = 0.2) +
geom_smooth(method = lm,
formula = y~x,
colour = rgb(0, .5, .8)) +
# customise axis labels and breaks
scale_x_continuous(name = "Wait Time (seconds)",
breaks = seq(from = 0, to = 600, by = 60)) +
scale_y_continuous(name = "Call time (seconds)",
breaks = seq(from = 0, to = 600, by = 30))
```
::: {.callout-note .try}
## Minor breaks
Check the help for `?scale_x_continuous` to see how you would set the minor units or specify how many breaks you want instead.
:::
#### Axis limits
If you want to change the minimum and maximum values on an axis, use the `coord_cartesian()` function. Many plots make more sense if the minimum and maximum values represent the range of possible values, even if those values aren't present in the data. Here, wait and call times can't be less than 0 seconds, so we'll set the minimum values to 0 and the maximum values to the first break above the highest value.
```{r fig-build-plot-limits, fig.cap="Changing the axis limits."}
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point(colour = "dodgerblue",
alpha = 0.2) +
geom_smooth(method = lm,
formula = y~x,
colour = rgb(0, .5, .8)) +
scale_x_continuous(name = "Wait Time (seconds)",
breaks = seq(from = 0, to = 600, by = 60)) +
scale_y_continuous(name = "Call time (seconds)",
breaks = seq(from = 0, to = 600, by = 30)) +
# set axis limits
coord_cartesian(xlim = c(0, 360),
ylim = c(0, 180))
```
::: {.callout-caution}
## Setting limits with the scale_ function
You can also set the `limits` argument inside the `scale_` functions, but this actually removes any data that falls outside these limits, rather than cropping your plot, and this can change the appearance of certain types of plots like violin plots and density plots.
:::
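The difference between cropping and removing data can be checked with a small experiment. The sketch below uses made-up data (`toy_limits`, a hypothetical name) and the <pkg>ggplot2</pkg> function `layer_data()`, which returns the values a layer will use once the scales have been applied. It relies on the default behaviour of continuous scales, which is to replace out-of-range values with `NA`.

```{r}
library(ggplot2)

# made-up data: two of the five points have x values above 100
toy_limits <- data.frame(x = c(10, 50, 90, 150, 200),
                         y = c(1, 2, 3, 4, 5))

# zooming with coord_cartesian() keeps every value
zoomed <- ggplot(toy_limits, aes(x, y)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 100))

# limits in the scale function turn out-of-range values into NA
censored <- ggplot(toy_limits, aes(x, y)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 100))

# count the missing x values in the data each plot will draw
sum(is.na(layer_data(zoomed)$x))
sum(is.na(layer_data(censored)$x))
```

With points, both plots look the same apart from a warning about removed rows, but for geoms that summarise the data, such as `geom_smooth()`, the censored version is calculated from fewer observations.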
#### Themes {#sec-themes}
<pkg>ggplot2</pkg> comes with several built-in themes, such as `theme_minimal()` and `theme_bw()`, but the [<pkg>ggthemes</pkg>](https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/) package provides even more themes to match different software, such as GoogleDocs or Stata, or publications, such as the Economist or the Wall Street Journal. Let's add the GoogleDocs theme, but change the font size to 20 with the `base_size` argument.
It's also worth highlighting that this code is starting to look quite complicated because of the number of layers, but because we've built it up slowly it should (hopefully!) make sense. If you see examples of <pkg>ggplot2</pkg> code online that you'd like to adapt, build the plot up layer by layer and it will make it easier to understand what each layer adds.
```{r fig-build-plot-theme, fig.cap="Changing the theme to the Google Docs style."}
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point(colour = "dodgerblue",
alpha = 0.2) +
geom_smooth(method = lm,
formula = y~x,
colour = rgb(0, .5, .8)) +
scale_x_continuous(name = "Wait Time (seconds)",
breaks = seq(from = 0, to = 600, by = 60)) +
scale_y_continuous(name = "Call time (seconds)",
breaks = seq(from = 0, to = 600, by = 30)) +
coord_cartesian(xlim = c(0, 360),
ylim = c(0, 180)) +
# change the theme
ggthemes::theme_gdocs(base_size = 20)
```
#### Theme tweaks
If you're still not quite happy with a theme, you can customise it even further with the `theme()` function. Check the help for this function to see all of the possible options. The most common thing you'll want to do is to remove an element entirely. You do this by setting the relevant argument to `element_blank()`. Below, we're getting rid of the x-axis line and the plot background, which removes the line around the plot.
```{r fig-build-plot-custom-theme, fig.cap="Customising the theme to remove the x-axis line and background outline."}
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point(colour = "dodgerblue",
alpha = 0.2) +
geom_smooth(method = lm,
formula = y~x,
colour = rgb(0, .5, .8)) +
scale_x_continuous(name = "Wait Time (seconds)",
breaks = seq(from = 0, to = 600, by = 60)) +
scale_y_continuous(name = "Call time (seconds)",
breaks = seq(from = 0, to = 600, by = 30)) +
coord_cartesian(xlim = c(0, 360),
ylim = c(0, 180)) +
theme_gdocs(base_size = 11) +
# customise theme elements
theme(axis.line.x = element_blank(),
plot.background = element_blank())
```
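Removing elements is not the only option. As a rough sketch of other tweaks, the code below uses `element_text()` and `element_line()` to restyle elements rather than remove them. The `toy` data and the specific styling choices are invented for illustration.

```{r}
library(ggplot2)

# made-up data for illustration
toy <- data.frame(wait = c(30, 90, 150, 210),
                  call = c(20, 35, 50, 65))

restyled <- ggplot(toy, aes(x = wait, y = call)) +
  geom_point() +
  theme_minimal(base_size = 14) +
  theme(
    # make the axis titles bold
    axis.title = element_text(face = "bold"),
    # make the major grid lines lighter and dashed
    panel.grid.major = element_line(colour = "grey85", linetype = "dashed"),
    # remove the minor grid lines entirely
    panel.grid.minor = element_blank()
  )

restyled
```

Each theme element expects a matching element function: text elements take `element_text()`, line elements take `element_line()`, and any element can be removed with `element_blank()`.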
### Figure captions {#sec-captions}
You can add a caption directly to the image using the `labs()` function, which also allows you to add or edit the title, subtitle, and axis labels.
```{r fig-caption, fig.cap="Adding a title, subtitle, and caption."}
ggplot(survey_data, aes(x = wait_time, y = call_time)) +
geom_point(colour = "dodgerblue",
alpha = 0.2) +
geom_smooth(method = lm,
formula = y~x,
colour = rgb(0, .5, .8)) +
scale_x_continuous(name = "Wait Time (seconds)",
breaks = seq(from = 0, to = 600, by = 60)) +
scale_y_continuous(name = "Call time (seconds)",
breaks = seq(from = 0, to = 600, by = 30)) +
coord_cartesian(xlim = c(0, 360),
ylim = c(0, 180)) +
theme_gdocs(base_size = 11) +
theme(axis.line.x = element_blank(),
plot.background = element_blank()) +
labs(title = "The relationship between wait time and call time",
subtitle = "2020 Call Data",
caption = "Figure 1. As wait time increases, call time increases.")
```
However, it is more accessible to include this sort of information in plain text for screen readers. You can add a text caption in the chunk header, and some document types will even automatically number figures for you (you'll learn about this in @sec-linked-docs). You can also add alt-text descriptions for screen readers that describe the image.
```{r, eval = FALSE, verbatim='r fig-wait-vs-call, fig.cap="As wait time increases, call time increases.", fig.alt="A scatterplot showing wait time on the x-axis (range 0-360 seconds) and call time on the y-axis (range 0-180 seconds) with a trend line showing that as wait time increases, call time increases from about 60 wait/30 call to about 300 wait/65 call."'}
# figure code here
```
## Appropriate plots
Now that you know how to build up a plot by layers and customise its appearance, you're ready to learn about some more plot types. Different types of data require different types of plots, so this section is organised by data type.
The [ggplot2 cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf) is a great resource to help you find plots appropriate to your data, based on how many variables you're plotting and what type they are. The examples below all use the same customer satisfaction data, but each plot communicates something different.
We don't expect you to memorise all of the plot types or the methods for customising them, but it will be helpful to try out the code in the examples below for yourself, changing values to test your understanding.
### Counting categories
#### Bar plot
If you want to count the number of things per category, you can use `geom_bar()`. You only need to provide an `x` mapping, because by default `geom_bar()` uses the number of observations in each group of `x` as the value for `y`, so you don't need to tell it what to put on the y-axis.
```{r fig-bar, fig.cap="A basic bar plot."}
ggplot(survey_data, aes(x = issue_category)) +
geom_bar()
```
::: {.callout-note .try}
## Customising bar plot appearance
You probably want to customise some things, like the colours, order of the columns, and their labels. Inspect the code below and try running it layer by layer to figure out where these things change. The functions `scale_fill_manual()` and `scale_x_discrete()` are new, but work in the same way as the other `scale_` functions. You'll learn more about this in @sec-custom-viz.
```{r custom-bar, webex.hide = "Code"}
ggplot(survey_data, aes(x = issue_category,
fill = issue_category)) +
geom_bar() +
scale_x_discrete(
# change axis title
name = "Issue Category",
# change order
limits = c("tech", "returns", "sales", "other"),
# change labels
labels = c("Technical", "Returns", "Sales", "Other")
) +
scale_fill_manual(
# change colours
values = c(tech = "goldenrod",
returns = "darkgreen",
sales = "dodgerblue3",
other = "purple3"),
# remove the legend
guide = "none"
) +
scale_y_continuous(
name = "", # remove axis title
# remove the space above and below the y-axis
expand = expansion(add = 0)
) +
# minimum = 0, maximum = 350
coord_cartesian(ylim = c(0, 350)) +
ggtitle("Number of issues per category") # add a title
```
:::
#### Column plot
If your data already have a column with the number you want to plot, you can use `geom_col()` to plot it. We can use the `count()` function to make a table with a row for each `issue_category` and a column called `n` with the number of observations in that category.
```{r}
count_data <- count(survey_data, issue_category)
```
`r kable(count_data)`
The mapping for `geom_col()` requires you to set both the `x` and `y` aesthetics. Set `y = n` because we want to plot the number of issues in each category, and that information is in the column called `n`.
```{r fig-col, fig.cap="A basic column plot."}
ggplot(count_data, aes(x = issue_category, y = n)) +
geom_col()
```
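As a quick check that the two approaches plot the same numbers, this sketch compares the bar heights that `geom_bar()` calculates with the values that `count()` produces. It uses a small invented table (`toy`) rather than `survey_data`, and `layer_data()` to extract the values behind each plot.

```{r}
library(ggplot2)
library(dplyr)

# made-up data: one row per call
toy <- data.frame(issue = c("tech", "tech", "tech", "sales", "sales", "other"))

# geom_bar() counts the rows in each category for you
bar_plot <- ggplot(toy, aes(x = issue)) +
  geom_bar()

# geom_col() plots a number you have already calculated
toy_counts <- count(toy, issue)
col_plot <- ggplot(toy_counts, aes(x = issue, y = n)) +
  geom_col()

# the bar heights are the same in both plots
bar_heights <- layer_data(bar_plot)$count
col_heights <- layer_data(col_plot)$y
```

Use `geom_bar()` when each row of your table is one observation, and `geom_col()` when each row already holds the total you want to show.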
#### Pie chart
Pie charts are a [misleading form of data visualisation](https://www.data-to-viz.com/caveat/pie.html){target="_blank"}, so we won't cover them. We'll cover options for visualising proportions, like waffle, lollipop and treemap plots, in @sec-other-plots.
::: {.callout-note .try}
## Test your understanding
```{r, include = FALSE}
bar <- c(answer = "geom_bar", x = "geom_col")
col <- c(x = "geom_bar", answer = "geom_col")
```
Here is a small data table.
country | population | island
:-------------------|----------------:|:--------------
Northern Ireland | 1,895,510 | Ireland
Wales | 3,169,586 | Great Britain
Republic of Ireland | 4,937,786 | Ireland
Scotland | 5,466,000 | Great Britain
England | 56,550,138 | Great Britain
* What geom would you use to plot the population for each of the 5 countries? `r mcq(col)`
```{r test-counting-categories, echo = FALSE, results='asis'}
opt <- c(answer = "aes(x = country, y = population)",
x = "aes(x = population, y = country)",
x = "aes(x = country)",
x = "aes(x = island)",
x = "aes(y = population)")
cat("* What mapping would you use? ", longmcq(opt))
```
* What geom would you use to plot the number of countries on each island? `r mcq(bar)`
```{r, echo = FALSE, results='asis'}
opt <- c(x = "aes(x = country, y = population)",
x = "aes(x = population, y = country)",
x = "aes(x = country)",
answer = "aes(x = island)",
x = "aes(y = population)")
cat("* What mapping would you use? ", longmcq(opt))
```
:::
### One continuous variable {#sec-histogram}
If you have a continuous variable, like the number of seconds callers have to wait, you can use `geom_histogram()` to show the distribution. Just like `geom_bar()`, you are only required to specify the `x` variable.
A histogram splits the data into "bins" along the x-axis and shows the count of how many observations are in each bin along the y-axis.
```{r fig-histogram, fig.cap="Histogram of wait times."}
ggplot(survey_data, aes(x = wait_time)) +
geom_histogram()
```
You should always set the `binwidth` or number of `bins` to something meaningful for your data (otherwise you get the annoying message above). You might need to try a few options before you find something that looks good and conveys the meaning of your plot -- try changing the values of `binwidth` and `bins` below to see what works best.
```{r eval = FALSE}
# adjust width of each bar
ggplot(survey_data, aes(x = wait_time)) +
geom_histogram(binwidth = 30)
# adjust number of bars
ggplot(survey_data, aes(x = wait_time)) +
geom_histogram(bins = 5)
```
By default, the bars start *centered* on 0, so if `binwidth` is set to 30, the first bar would include -15 to 15 seconds, which doesn't make much sense. We can set `boundary = 0` so that each bar represents increments of 30 seconds *starting* from 0.
```{r fig-histogram-boundary0, fig.cap="A histogram with the boundary set to 0."}
ggplot(survey_data, aes(x = wait_time)) +
geom_histogram(binwidth = 30, boundary = 0)
```
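If you want to check where the bins actually start, one option is to look at the bin edges with `layer_data()`. This is only a sketch with made-up wait times (the `toy` table is invented for illustration), but the same approach should work with `survey_data`.

```{r}
library(ggplot2)

# made-up wait times in seconds
toy <- data.frame(wait = c(5, 20, 40, 70, 95))

default_bins <- ggplot(toy, aes(x = wait)) +
  geom_histogram(binwidth = 30)

boundary_bins <- ggplot(toy, aes(x = wait)) +
  geom_histogram(binwidth = 30, boundary = 0)

# left edge of the first bin in each plot
first_edge_default  <- min(layer_data(default_bins)$xmin)
first_edge_boundary <- min(layer_data(boundary_bins)$xmin)
```

With the default settings, the first bin should start at -15, whereas with `boundary = 0` it should start at 0.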
Finally, the default style of grey bars is ugly, so you can change that by setting the `fill` and `colour`, as well as using `scale_x_continuous()` to update the axis labels.
```{r fig-histogram-custom, fig.cap="Histogram with custom styles."}
ggplot(survey_data, aes(x = wait_time)) +
geom_histogram(binwidth = 15,
boundary = 0,
fill = "white",
color = "black") +
scale_x_continuous(name = "Wait time (seconds)",
breaks = seq(0, 600, 60))
```
::: {.callout-note .try}
## Test your understanding
Imagine you have a table of the [population for each country in the world](https://population.un.org/wpp/Download/Standard/Population/){target="_blank"} with the columns `country` and `population`. We'll just look at the 76 countries with populations of less than a million.
```{r test-one-continuous, echo = FALSE}
# load data
pop_data <- readxl::read_excel("data/WPP2019_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.xlsx", skip = 16) %>%
filter(Type == "Country/Area") %>%
select(country = 3, population = `2020`) %>%
mutate(population = round(as.numeric(population) * 1000)) %>%
filter(population < 1e6)
# make plots
ggplot(pop_data, aes(x = population)) +
scale_x_continuous(breaks = seq(0, 1e6, 1e5),
labels = c(paste0(0:9*100, "K"), "1M")) +
scale_y_continuous(name = "Number of countries") +
geom_histogram(binwidth = 1e5, boundary = 0, fill = "white", color = "black")
```
```{r, echo = FALSE, results='asis'}
opts <- c(x = "aes(x = country, y = population)",
x = "aes(x = population, y = country)",
answer = "aes(x = population)",
x = "aes(x = population, y = count)")
cat("* How would you set the mapping for this plot? ", longmcq(opts))
```
* What is the `binwidth` of the histogram? `r mcq(c("1", "100", answer = "100K", "1M"))`
:::
::: {.callout-tip collapse="true"}
## Axis label customisation
If you're curious how we got the x-axis labels to read "100K" instead of "100000", you just need to add a vector of `labels` the same length as `breaks`.
```{r, eval = FALSE}
scale_x_continuous(breaks = seq(0, 1e6, 1e5),
labels = c(paste0(0:9*100, "K"), "1M"))
```
:::
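To see how the label vector lines up with the breaks, you can build both outside of a plot. The object names `pop_breaks` and `pop_labels` below are just for illustration.

```{r}
# the breaks: 0 to 1 million in steps of 100,000
pop_breaks <- seq(0, 1e6, 1e5)

# 0:9 * 100 gives 0, 100, ..., 900; paste0() adds "K" to each one
pop_labels <- c(paste0(0:9 * 100, "K"), "1M")

# each break needs exactly one label
length(pop_breaks) == length(pop_labels)

# view the breaks alongside their labels
data.frame(pop_breaks, pop_labels)
```

If the two vectors have different lengths, <pkg>ggplot2</pkg> will stop with an error, so it is worth checking this whenever you write labels by hand.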
### Grouped continuous variables
There are several ways to compare continuous data across groups. Which you choose depends on what point you are trying to make with the plot.
#### Stacked histogram
In previous plots, we have used `fill` purely for visual reasons, e.g., we changed the colour of the histogram bars to make them look nicer. However, you can also use `fill` to represent another variable so that the colours become meaningful.
Setting the `fill` aesthetic **in the mapping** will produce different coloured bars for each category of the `fill` variable, in this case `issue_category`.
```{r, fig.cap="Histogram with categories represented by fill."}
ggplot(survey_data, aes(x = wait_time, fill = issue_category)) +
geom_histogram(boundary = 0,
binwidth = 15,
color = "black")
```
::: {.callout-warning}
## Arguments inside aes()
When you want an aesthetic, such as `fill`, to represent a variable in the data, you set it inside the `aes()` function for the mapping, not as an argument to the geom. If you try to set this in a geom, you'll get the following error (unless you coincidentally have an object named `issue_category` that is a colour word).
```{r, error = TRUE}
ggplot(survey_data, aes(x = wait_time)) +
geom_histogram(boundary = 0,
binwidth = 15,
color = "black",
fill = issue_category)
```
:::
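The sketch below puts the two approaches side by side: `fill` inside `aes()` when the colour should depend on the data, and `fill` as a geom argument when you want one fixed colour. The `toy` table is made up for illustration, and `layer_data()` is used to show which colours end up on the bars.

```{r}
library(ggplot2)

# made-up data
toy <- data.frame(issue = c("tech", "tech", "sales", "other"))

# mapped: fill depends on the data, so it goes inside aes()
mapped <- ggplot(toy, aes(x = issue, fill = issue)) +
  geom_bar()

# set: fill is a single fixed colour, so it goes in the geom
fixed <- ggplot(toy, aes(x = issue)) +
  geom_bar(fill = "dodgerblue")

# the colours ggplot2 ends up using for the bars
mapped_fills <- unique(layer_data(mapped)$fill)  # one colour per category
fixed_fills  <- unique(layer_data(fixed)$fill)   # a single colour
```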
::: {.callout-tip collapse="true"}
## Area plot alternative
The function `geom_area()` gives a similar effect when `stat = "bin"`.
```{r, fig.cap="Stacked area plot."}
# area plot
ggplot(survey_data, mapping = aes(x = wait_time, fill = issue_category)) +
geom_area(stat = "bin",
boundary = 0,
binwidth = 15,
color = "black")
```
:::
#### Dodged histogram
By default, the categories are positioned stacked on top of each other. If you want to compare more than one distribution, you can set the `position` argument of `geom_histogram()` to "dodge" to put the bars for each group next to each other instead of stacking them. However, this can look confusing with several categories.
```{r fig-histogram-dodge, fig.cap = "A histogram with multiple groups."}
# dodged histogram
ggplot(survey_data, aes(x = wait_time,
fill = issue_category,
colour = issue_category))+
geom_histogram(boundary = 0,
binwidth = 15,
position = "dodge") +
scale_x_continuous(name = "Wait time (seconds)",
breaks = seq(0, 600, 60))
```
::: {.callout-tip collapse="true"}
## Frequency plot alternative
Alternatively, you can use `geom_freqpoly()` to plot a line connecting the top of each bin (see @sec-freqpoly).
```{r fig-groups-freqpoly, fig.cap = "A frequency plot with multiple groups."}
# frequency plot
ggplot(survey_data, aes(x = wait_time,
colour = issue_category)) +
geom_freqpoly(binwidth = 15,
boundary = 0,
linewidth = 1) +
scale_x_continuous(name = "Wait time (seconds)",
breaks = seq(0, 600, 60))
```
:::
#### Violin plot
Another way to compare groups of continuous variables is the violin plot. This is like a density plot, but rotated 90 degrees and mirrored - the fatter the violin, the larger the proportion of data points at that value.
```{r fig-violin-plot, fig.width = 8, fig.height = 2.5, fig.cap = "The default violin plot gives each shape the same area. Set scale='count' to make the size proportional to the number of observations."}
violin_area <-
ggplot(survey_data, aes(x = issue_category, y = wait_time)) +
geom_violin() +
ggtitle('scale = "area"')
violin_count <-
ggplot(survey_data, aes(x = issue_category, y = wait_time)) +
geom_violin(scale = "count") +
ggtitle('scale = "count"')
violin_area + violin_count
```
#### Boxplot
Boxplots serve a similar purpose to violin plots (without the giggles from the back row). They don't show you the shape of the distribution, but rather some statistics about it. The middle line represents the `r glossary("median")`; half the data are above this line and half below it. The box encloses the 25th to 75th percentiles of the data, so 50% of the data falls inside the box. The "whiskers" extend above and below the box to the most extreme data points that are within 1.5 times the height of the box, although you can change this multiplier with the `coef` argument. The points show `r glossary("outlier", "outliers")` -- individual data points that fall outside of this range.
Boxplots can be horizontal if you swap the x and y columns, and there are many other customisations you can apply.
```{r fig-box-plot, fig.width = 8, fig.height = 2.5, fig.cap = "Boxplots."}
boxplot <- ggplot(survey_data, aes(x = issue_category, y = wait_time)) +
geom_boxplot() +
ggtitle("Default vertical boxplot")
custom <- ggplot(survey_data, aes(y = issue_category,x = wait_time)) +
geom_boxplot(fill = "grey80",
outlier.colour = "red",
outlier.shape = 8,
coef = 1, # length of whiskers relative to box
varwidth = TRUE, # set width proportional to sample size
notch = TRUE) +
ggtitle("Customised horizontal boxplot")
boxplot + custom
```
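If you'd like to see where the numbers in a boxplot come from, the sketch below calculates them by hand and then extracts the values that `geom_boxplot()` calculates for the same data. The `toy` table is made up for illustration, and the hand calculation assumes the default `coef` of 1.5 and the default method of `quantile()`.

```{r}
library(ggplot2)

# made-up wait times with one unusually long wait
toy <- data.frame(group = "a",
                  wait = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30))

# calculate the boxplot statistics by hand
q <- quantile(toy$wait, c(0.25, 0.5, 0.75))
box_height <- q[["75%"]] - q[["25%"]]          # interquartile range
upper_fence <- q[["75%"]] + 1.5 * box_height   # whiskers cannot go past this
lower_fence <- q[["25%"]] - 1.5 * box_height

# whiskers stop at the most extreme values inside the fences
in_range <- toy$wait >= lower_fence & toy$wait <= upper_fence
whiskers <- range(toy$wait[in_range])
outliers <- toy$wait[!in_range]

# the values geom_boxplot() calculates for the same data
box_stats <- layer_data(
  ggplot(toy, aes(x = group, y = wait)) + geom_boxplot()
)
```

In `box_stats`, the columns `ymin` and `ymax` hold the ends of the whiskers, `lower`, `middle` and `upper` hold the box and median, and `outliers` holds the points drawn individually. Here the whiskers run from 1 to 9 and the wait of 30 is plotted as an outlier.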
#### Combo plots
Violin plots are frequently layered with other geoms that represent the mean or median values in the data. This is a lot of code; to help your understanding, run it layer by layer to see how it builds up and change the values throughout the code.
```{r fig-violin-combos, fig.cap="Violin plots combined with different methods to represent means and medians."}
# add fill and colour to the mapping
ggplot(survey_data, aes(x = issue_category,
y = wait_time,
fill = issue_category,
colour = issue_category)) +
scale_x_discrete(name = "Issue Category") +
scale_y_continuous(name = "Wait Time (seconds)",
breaks = seq(0, 600, 60)) +
coord_cartesian(ylim = c(0, 360)) +
guides(fill = "none", colour = "none") +
# add a violin plot
geom_violin(draw_quantiles = 0.5, # adds a line at median (50%) score
alpha = 0.4) +
# add a boxplot
geom_boxplot(width = 0.25,
fill = "white",
alpha = 0.75,
fatten = 0, # removes the median line
outlier.alpha = 0) +
# add a point that represents the mean
stat_summary(fun = mean,
geom = "point",
size = 2) +
ggtitle("ViolinBox")
```
::: {.callout-caution collapse="true"}
## Misleading Bar Charts
A very common type of plot is a bar chart of means. However, the example below demonstrates just how misleading this is. It communicates the mean value for each category, but the bars hide the distribution of the actual data. You can't tell if most wait times are close to 3 minutes, or spread from 0 to 6 minutes, or if the vast majority are less than 2 minutes, but the mean is pulled up by some very high outliers.
Column plots can also be very misleading. The plot on the left starts the y-axis at 0, which makes the bar heights proportional, showing almost no difference in average wait times. Since the differences are hard to see, you may be tempted to start the y-axis higher, but that makes it look like the average wait time for returns is double that for tech.
```{r fig-col-plot-bad, fig.height = 2.5, fig.width = 8, message=FALSE, echo = FALSE, fig.cap="Don't plot continuous data with column plots. They are only appropriate for count data."}
tall_col <- ggplot(survey_data, aes(x = issue_category,
y = wait_time,
fill = issue_category)) +
scale_x_discrete(name = "Issue Category") +
scale_y_continuous(name = "Wait Time (seconds)",
breaks = seq(0, 600, 60)) +
guides(fill = "none", colour = "none") +
stat_summary(fun = "mean",
geom = "col") # draws a column representing the mean
short_col <- tall_col +
scale_y_continuous(name = "Wait Time (seconds)",
breaks = seq(0, 600, 1)) +
coord_cartesian(ylim = c(185, 189))
tall_col + short_col
```
:::
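To make this concrete, here is a small sketch with made-up wait times for two invented categories, "steady" and "skewed". A bar chart of means would draw two identical bars, even though the underlying data look nothing alike.

```{r}
library(dplyr)

# made-up wait times (seconds) for two categories with the same mean
toy <- data.frame(
  issue = rep(c("steady", "skewed"), each = 4),
  wait  = c(180, 180, 180, 180,   # every wait is exactly 3 minutes
            60, 60, 60, 540)      # mostly short waits plus one very long one
)

toy_summary <- toy %>%
  group_by(issue) %>%
  summarise(mean_wait = mean(wait),
            median_wait = median(wait),
            longest_wait = max(wait))

toy_summary
```

Both categories have a mean wait of 180 seconds, but the median for the skewed category is only 60 seconds. A violin plot or boxplot of the same data would make that difference obvious.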
::: {.callout-note .try}
## Test your understanding
```{r test-grouped-continuous, echo=FALSE, fig.height = 2.5}
box <- c(x = "geom_box()",
answer = "geom_boxplot()",
x = "geom_violin()",
x = "geom_violinplot()")
violin <- c(x = "geom_box()",
x = "geom_boxplot()",
answer = "geom_violin()",