---
output:
bookdown::html_document2:
fig_caption: yes
editor_options:
chunk_output_type: console
---
```{r echo = FALSE, cache = FALSE}
source("utils.R", local = TRUE)
```
Summarized Data Distributions {#CHAPTER-DISTRIBUTION}
=============================
This chapter explores how to visualize summarized distributions of data.
Making a Basic Histogram {#RECIPE-DISTRIBUTION-BASIC-HIST}
------------------------
### Problem
You want to make a histogram.
### Solution
Use `geom_histogram()` and map a continuous variable to x (Figure \@ref(fig:FIG-DISTRIBUTION-HIST-BASIC)):
```{r FIG-DISTRIBUTION-HIST-BASIC, fig.cap="A basic histogram", message=FALSE}
ggplot(faithful, aes(x = waiting)) +
geom_histogram()
```
### Discussion
All `geom_histogram()` requires is one column from a data frame or a single vector of data. For this example we'll use the `faithful` data set, which contains two columns with data about the Old Faithful geyser: `eruptions`, which is the length of each eruption, and `waiting`, which is the length of time to the next eruption. We'll only use the `waiting` variable in this example:
```{r}
faithful
```
If you just want to get a quick look at some data that isn't in a data frame, you can get the same result by passing in `NULL` for the data frame and giving `ggplot()` a vector of values. This would have the same result as the previous code:
```{r eval=FALSE}
# Store the values in a simple vector
w <- faithful$waiting
ggplot(NULL, aes(x = w)) +
geom_histogram()
```
By default, the data is grouped into 30 bins. This number of bins is an arbitrary default value, and may be too fine or too coarse for your data. You can change the size of the bins by specifying the `binwidth`, or you can divide the range of the data into a specific number of bins.
In addition, the default colors -- a dark fill without an outline -- can make it difficult to see which bar corresponds to which value, so we'll also change the colors, as shown in Figure \@ref(fig:FIG-DISTRIBUTION-HIST-WIDTH).
```{r FIG-DISTRIBUTION-HIST-WIDTH, fig.show="hold", fig.cap="Histogram with binwidth = 5 and with different colors (left); With 15 bins (right)"}
# Set the width of each bin to 5 (each bin will span 5 x-axis units)
ggplot(faithful, aes(x = waiting)) +
geom_histogram(binwidth = 5, fill = "white", colour = "black")
# Divide the x range into 15 bins
binsize <- diff(range(faithful$waiting))/15
ggplot(faithful, aes(x = waiting)) +
geom_histogram(binwidth = binsize, fill = "white", colour = "black")
```
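The bin-width arithmetic used above can be checked in base R. This is a minimal sketch that uses only the `faithful` data from the text; the names `waiting_range` and `bin_breaks` are illustrative and are not part of the recipe:

```{r}
# Range of the waiting times in the built-in faithful data set
waiting_range <- range(faithful$waiting)

# Width needed to split that range into 15 equal bins
binsize <- diff(waiting_range) / 15

# 16 equally spaced boundaries delimit 15 bins of that width
bin_breaks <- seq(waiting_range[1], waiting_range[2], length.out = 16)
length(bin_breaks) - 1
diff(bin_breaks)
```

Recent versions of ggplot2 also accept a `bins` argument in `geom_histogram()`, which avoids the manual calculation, although the bin boundaries it chooses may not coincide with the ones computed here.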
Sometimes the appearance of the histogram depends heavily on the width of the bins and on where the boundaries between the bins fall. In Figure \@ref(fig:FIG-DISTRIBUTION-HIST-BOUNDARY), we'll use a bin width of 8. In the version on the left, we'll use the `boundary` parameter to put boundaries at 31, 39, 47, etc., while in the version on the right, we'll shift them over by 4, putting boundaries at 35, 43, 51, etc.:
```{r FIG-DISTRIBUTION-HIST-BOUNDARY, fig.show="hold", fig.cap="Different appearance of histograms with the boundary at 31 and 35"}
# Save a base plot
faithful_p <- ggplot(faithful, aes(x = waiting))
faithful_p +
geom_histogram(binwidth = 8, fill = "white", colour = "black", boundary = 31)
faithful_p +
geom_histogram(binwidth = 8, fill = "white", colour = "black", boundary = 35)
```
The results look quite different, even though they have the same bin size. The `faithful` data set is not particularly small, with 272 observations; with smaller data sets, this can be even more of an issue. When visualizing your data, it's a good idea to experiment with different bin sizes and boundary points.
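If your data has discrete values, it may matter that the histogram bins are asymmetrical: each bin is *closed* on one bound and *open* on the other. Which side is closed is controlled by the `closed` argument of `geom_histogram()`. With `closed = "left"` and bin boundaries at 1, 2, 3, etc., the bins will be [1, 2), [2, 3), and so on. In other words, the first bin contains 1 but not 2, and the second bin contains 2 but not 3. With `closed = "right"` (the default in current versions of ggplot2), the bins are (1, 2], (2, 3], and so on, so a value of 2 falls in the first bin instead.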
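The effect of which bound is closed can be seen with base R's `cut()`, used here only as a stand-in for the binning step. The values and the names `discrete_x`, `left_closed`, and `right_closed` are made up for illustration:

```{r}
# Hypothetical discrete values with bin boundaries at 1, 2, 3, 4
discrete_x <- c(1, 1, 2, 2, 2, 3)
bin_edges <- 1:4

# Bins closed on the lower bound: [1,2), [2,3), [3,4)
left_closed <- table(cut(discrete_x, bin_edges, right = FALSE))

# Bins closed on the upper bound: (1,2], (2,3], (3,4]
# include.lowest = TRUE keeps the value 1 in the first bin
right_closed <- table(cut(discrete_x, bin_edges, right = TRUE, include.lowest = TRUE))

left_closed
right_closed
```

The same six values produce different bar heights under the two conventions, which is why it is worth checking the `closed` setting when plotting integer-valued data.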
### See Also
Frequency polygons provide a better way of visualizing multiple distributions without the bars interfering with each other. See Recipe \@ref(RECIPE-DISTRIBUTION-FREQPOLY).
Making Multiple Histograms from Grouped Data {#RECIPE-DISTRIBUTION-MULTI-HIST}
--------------------------------------------
### Problem
You have grouped data and want to simultaneously make histograms for each data group.
### Solution
Use `geom_histogram()` and use facets for each group, as shown in Figure \@ref(fig:FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET):
```{r FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET, fig.cap="Two histograms with facets", message=FALSE}
library(MASS) # Load MASS for the birthwt data set
# Use smoke as the faceting variable
ggplot(birthwt, aes(x = bwt)) +
geom_histogram(fill = "white", colour = "black") +
facet_grid(smoke ~ .)
```
### Discussion
To make multiple histograms from grouped data, the data must all be in one data frame, with one column containing a categorical variable used for grouping.
For this example, we used the `birthwt` data set. It contains data about birth weights and a number of risk factors for low birth weight:
```{r}
birthwt
```
One problem with the faceted graph is that the facet labels are just 0 and 1, and nothing indicates that those values show whether the smoking risk factor is present. To change the labels, we convert `smoke` to a factor and give its levels descriptive names. Here we use `recode_factor()` from the dplyr package, listing the new names in the same order as the original values, and save the modified data set as `birthwt_mod`:
```{r}
birthwt_mod <- birthwt
# Convert smoke to a factor and reassign new names
birthwt_mod$smoke <- recode_factor(birthwt_mod$smoke, '0' = 'No Smoke', '1' = 'Smoke')
```
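If dplyr is not available, the same relabeling can be sketched in base R with `factor()`. The vector `smoke_codes` below is a hypothetical stand-in for `birthwt$smoke`, not data from the text:

```{r}
# Hypothetical 0/1 indicator standing in for birthwt$smoke
smoke_codes <- c(0, 1, 1, 0, 1)

# Convert to a factor and attach descriptive labels in level order
smoke_labeled <- factor(smoke_codes, levels = c(0, 1), labels = c("No Smoke", "Smoke"))

levels(smoke_labeled)
table(smoke_labeled)
```

Listing `levels` explicitly fixes which label goes with which code, so the result does not depend on the order in which the values happen to appear in the data.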
Now when we plot our modified data frame, our desired labels appear (Figure \@ref(fig:FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET-LABELS)).
```{r FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET-LABELS, fig.cap = "Histograms with new facet labels", message=FALSE}
ggplot(birthwt_mod, aes(x = bwt)) +
geom_histogram(fill = "white", colour = "black") +
facet_grid(smoke ~ .)
```
With facets, the axes have the same *y* scaling in each facet. If your groups have different sizes, it might be hard to compare the *shapes* of the distributions of each one. For example, see what happens when we facet the birth weights by `race` (Figure \@ref(fig:FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET-SCALESFREE), left):
```{r FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET-SCALESFREE-1, eval=FALSE}
ggplot(birthwt, aes(x = bwt)) +
geom_histogram(fill = "white", colour = "black") +
facet_grid(race ~ .)
```
To allow the *y* scales to be resized independently (Figure \@ref(fig:FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET-SCALESFREE), right), use `scales = "free"`. Note that this will only allow the *y* scales to be free -- the *x* scales will still be fixed because the histograms are aligned with respect to that axis:
```{r FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET-SCALESFREE-2, eval=FALSE}
ggplot(birthwt, aes(x = bwt)) +
geom_histogram(fill = "white", colour = "black") +
facet_grid(race ~ ., scales = "free")
```
```{r FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET-SCALESFREE, ref.label=c("FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET-SCALESFREE-1", "FIG-DISTRIBUTION-MULTI-HISTOGRAM-FACET-SCALESFREE-2"), echo=FALSE, fig.show="hold", fig.cap='Histograms with the default fixed scales (left); With scales = "free" (right)', fig.width=4, fig.height=4, message=FALSE}
```
Another approach is to map the grouping variable to `fill`, as shown in Figure \@ref(fig:FIG-DISTRIBUTION-MULTI-HISTOGRAM-FILL). The grouping variable must be a factor or a character vector. In the `birthwt` data set, the desired grouping variable, `smoke`, is stored as a number, so we’ll use the `birthwt_mod` data set we created above, in which `smoke` is a factor:
```{r FIG-DISTRIBUTION-MULTI-HISTOGRAM-FILL, fig.cap="Multiple histograms with different fill colors", message=FALSE}
# Map smoke to fill, make the bars NOT stacked, and make them semitransparent
ggplot(birthwt_mod, aes(x = bwt, fill = smoke)) +
geom_histogram(position = "identity", alpha = 0.4)
```
Specifying `position = "identity"` is important. Without it, ggplot will stack the histogram bars on top of each other vertically, making it much more difficult to see the distribution of each group.
Making a Density Curve {#RECIPE-DISTRIBUTION-BASIC-DENSITY}
----------------------
### Problem
You want to make a kernel density estimate curve.
### Solution
Use `geom_density()` and map a continuous variable to x (Figure \@ref(fig:FIG-DISTRIBUTION-DENSITY-BASIC)):
```{r FIG-DISTRIBUTION-DENSITY-BASIC-1, eval=FALSE}
ggplot(faithful, aes(x = waiting)) +
geom_density()
```
If you don't like the lines along the side and bottom, you can use `geom_line(stat = "density")` (see Figure \@ref(fig:FIG-DISTRIBUTION-DENSITY-BASIC), right):
```{r FIG-DISTRIBUTION-DENSITY-BASIC-2, eval=FALSE}
# expand_limits() increases the y range to include the value 0
ggplot(faithful, aes(x = waiting)) +
geom_line(stat = "density") +
expand_limits(y = 0)
```
(ref:cap-FIG-DISTRIBUTION-DENSITY-BASIC) A kernel density estimate curve with `geom_density()` (left); With `geom_line()` (right)
```{r FIG-DISTRIBUTION-DENSITY-BASIC, ref.label=c("FIG-DISTRIBUTION-DENSITY-BASIC-1", "FIG-DISTRIBUTION-DENSITY-BASIC-2"), echo=FALSE, fig.show="hold", fig.cap="(ref:cap-FIG-DISTRIBUTION-DENSITY-BASIC)", fig.width=4, fig.height=4}
```
### Discussion
Like `geom_histogram()`, `geom_density()` requires just one column from a data frame. For this example, we’ll use the `faithful` data set, which contains two columns of data about the Old Faithful geyser: `eruptions`, which is the length of each eruption, and `waiting`, which is the length of time until the next eruption. We’ll only use the `waiting` column in this example:
```{r}
faithful
```
The second method, `geom_line(stat = "density")`, tells `geom_line()` to use the "density" statistical transformation. This is essentially the same as the first method, except that `geom_density()` draws the curve as a closed polygon.
As with `geom_histogram()`, if you just want to get a quick look at data that isn't in a data frame, you can get the same result by passing in `NULL` for the data and giving ggplot a vector of values. This would have the same result as the first solution:
```{r eval=FALSE}
# Store the values in a simple vector
w <- faithful$waiting
ggplot(NULL, aes(x = w)) +
geom_density()
```
A kernel density curve is an estimate of the population distribution, based on the sample data. The amount of smoothing depends on the *kernel bandwidth*: the larger the bandwidth, the more smoothing there is. The bandwidth can be scaled with the `adjust` parameter, a multiplier that has a default value of 1. Figure \@ref(fig:FIG-DISTRIBUTION-DENSITY-ADJUST) shows what happens with a smaller and larger value of `adjust`:
```{r FIG-DISTRIBUTION-DENSITY-ADJUST, fig.cap="Density curves with adjust set to .25 (red), default value of 1 (black), and 2 (blue)", fig.width=4, fig.height=4}
ggplot(faithful, aes(x = waiting)) +
geom_line(stat = "density") +
geom_line(stat = "density", adjust = .25, colour = "red") +
geom_line(stat = "density", adjust = 2, colour = "blue")
```
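To see how `adjust` relates to the bandwidth itself, the numbers can be inspected with base R's `density()`. This sketch assumes that ggplot2's density stat relies on the same default bandwidth rule (`bw.nrd0()`), as its documentation indicates; the variable names are illustrative:

```{r}
waiting <- faithful$waiting

# Bandwidth chosen by density() with its default rule of thumb
bw_default <- density(waiting)$bw

# adjust multiplies that bandwidth rather than setting it directly
bw_narrow <- density(waiting, adjust = 0.25)$bw
bw_wide <- density(waiting, adjust = 2)$bw

c(narrow = bw_narrow, default = bw_default, wide = bw_wide)
```

Because `adjust` is a multiplier, the same value gives a comparable amount of relative smoothing across data sets measured on different scales.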
In this example, the *x* range is automatically set so that it contains the data, but this results in the edge of the curve getting clipped. To show more of the curve, set the *x* limits (Figure \@ref(fig:FIG-DISTRIBUTION-DENSITY-WIDTH)). We'll also add an 80% transparent fill, with `alpha = .2`:
(ref:cap-FIG-DISTRIBUTION-DENSITY-WIDTH) Density curve with wider x limits and a semitransparent fill (left); In two parts, with `geom_density()` and `geom_line()` (right)
```{r FIG-DISTRIBUTION-DENSITY-WIDTH, fig.show="hold", fig.cap="(ref:cap-FIG-DISTRIBUTION-DENSITY-WIDTH)", fig.width=4, fig.height=4}
ggplot(faithful, aes(x = waiting)) +
geom_density(fill = "blue", alpha = .2) +
xlim(35, 105)
# This draws a blue polygon with geom_density(), then adds a line on top
ggplot(faithful, aes(x = waiting)) +
geom_density(fill = "blue", alpha = .2, colour = NA) +
xlim(35, 105) +
geom_line(stat = "density")
```
If this edge-clipping happens with your data, it might mean that your curve is too smooth. If the curve is much wider than your data, it might not be the best model of your data, or it could be because you have a small data set.
To compare the estimated density with the observed distribution of your data, you can overlay the density curve on the histogram. Since the *y* values for the density curve are small (the area under the curve is always 1), the curve would be barely visible if you overlaid it on a histogram of counts without any transformation. To solve this problem, you can scale down the histogram to match the density curve with the mapping `y = ..density..`. Here we'll add `geom_histogram()` first, and then layer `geom_density()` on top (Figure \@ref(fig:FIG-DISTRIBUTION-DENSITY-HIST)):
```{r FIG-DISTRIBUTION-DENSITY-HIST, fig.cap="Density curve overlaid on a histogram", message=FALSE, fig.width=4, fig.height=4}
ggplot(faithful, aes(x = waiting, y = ..density..)) +
geom_histogram(fill = "cornsilk", colour = "grey60", size = .2) +
geom_density() +
xlim(35, 105)
```
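The density scaling of a histogram can be reproduced by hand in base R. This is a sketch with an arbitrarily chosen set of breaks (every 5 units from 40 to 100), which is an assumption made for the example and not the binning ggplot2 uses by default:

```{r}
waiting <- faithful$waiting

# Bin the data without drawing anything
h <- hist(waiting, breaks = seq(40, 100, by = 5), plot = FALSE)

# Density height of each bar: count / (number of observations * bin width)
manual_density <- h$counts / (length(waiting) * diff(h$breaks))

# The bar areas (height * width) add up to 1, like the area under a density curve
sum(manual_density * diff(h$breaks))
```

Dividing by both the sample size and the bin width is what puts the bars on the same vertical scale as the kernel density estimate.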
### See Also
See Recipe \@ref(RECIPE-DISTRIBUTION-VIOLIN) for information on violin plots, which are another way of representing density curves and may be more appropriate for comparing multiple distributions.
Making Multiple Density Curves from Grouped Data {#RECIPE-DISTRIBUTION-MULTI-DENSITY}
------------------------------------------------
### Problem
You want to make density curves of multiple groups of data.
### Solution
Use `geom_density()`, and map the grouping variable to an aesthetic like `colour` or `fill`, as shown in Figure \@ref(fig:FIG-DISTRIBUTION-MULTI-DENSITY). The grouping variable must be a factor or a character vector. In the `birthwt` data set, the desired grouping variable, `smoke`, is stored as a number, so we have to convert it to a factor first.
```{r FIG-DISTRIBUTION-MULTI-DENSITY, fig.show="hold", fig.cap="Different line colors for each group (left); Different semitransparent fill colors for each group (right)"}
library(MASS) # Load MASS for the birthwt data set
birthwt_mod <- birthwt %>%
mutate(smoke = as.factor(smoke)) # Convert smoke to a factor
# Map smoke to colour
ggplot(birthwt_mod, aes(x = bwt, colour = smoke)) +
geom_density()
# Map smoke to fill and make the fill semitransparent by setting alpha
ggplot(birthwt_mod, aes(x = bwt, fill = smoke)) +
geom_density(alpha = .3)
```
### Discussion
To make these plots, the data must all be in one data frame, with one column containing a categorical variable used for grouping.
For this example, we used the `birthwt` data set. It contains data about birth weights and a number of risk factors for low birth weight:
```{r}
birthwt
```
We looked at the relationship between `smoke` (smoking) and `bwt` (birth weight in grams). The value of `smoke` is either 0 or 1, but since it's stored as a numeric vector, ggplot doesn't know that it should be treated as a categorical variable. To make it so ggplot knows to treat `smoke` as categorical, we can either convert that column of the data frame to a factor, or tell ggplot to treat it as a factor by using `factor(smoke)` inside of the `aes()` statement. For these examples, we converted `smoke` to a factor.
Another method for visualizing the distributions is to use facets, as shown in Figure \@ref(fig:FIG-DISTRIBUTION-MULTI-DENSITY-FACET). We can align the facets vertically or horizontally. Here we'll align them vertically so that it's easy to compare the two distributions:
```{r FIG-DISTRIBUTION-MULTI-DENSITY-FACET-1, eval=FALSE}
ggplot(birthwt_mod, aes(x = bwt)) +
geom_density() +
facet_grid(smoke ~ .)
```
One problem with the faceted graph is that the facet labels are just 0 and 1, and there's no label indicating that those values are for smoke. To change the labels, we need to change the names of the factor levels. First we'll take a look at the factor levels, then we'll assign new factor level names:
```{r FIG-DISTRIBUTION-MULTI-DENSITY-FACET-2, eval=FALSE}
levels(birthwt_mod$smoke)
#> [1] "0" "1"
birthwt_mod$smoke <- recode(birthwt_mod$smoke, '0' = 'No Smoke', '1' = 'Smoke')
```
Now when we plot our modified data frame, our desired labels appear (Figure
\@ref(fig:FIG-DISTRIBUTION-MULTI-DENSITY-FACET), right):
```{r FIG-DISTRIBUTION-MULTI-DENSITY-FACET-3, eval=FALSE}
ggplot(birthwt_mod, aes(x = bwt)) +
geom_density() +
facet_grid(smoke ~ .)
```
```{r FIG-DISTRIBUTION-MULTI-DENSITY-FACET, ref.label=c("FIG-DISTRIBUTION-MULTI-DENSITY-FACET-1", "FIG-DISTRIBUTION-MULTI-DENSITY-FACET-2", "FIG-DISTRIBUTION-MULTI-DENSITY-FACET-3"), echo=FALSE, results = "hide", fig.show="hold", fig.cap="Density curves with facets (left); With different facet labels (right)", fig.width=4, fig.height=4}
```
If you want to see the histograms along with the density curves, the best option is to use facets, since other methods of visualizing both histograms in a single graph can be difficult to interpret. To do this, map `y = ..density..`, so that the histogram is scaled down to the height of the density curves. In this example, we'll also make the histogram bars a little less prominent by changing the colors (Figure \@ref(fig:FIG-DISTRIBUTION-MULTI-DENSITY-HIST)):
```{r FIG-DISTRIBUTION-MULTI-DENSITY-HIST, fig.cap="Density curves overlaid on histograms", fig.width=4, fig.height=4}
ggplot(birthwt_mod, aes(x = bwt, y = ..density..)) +
geom_histogram(binwidth = 200, fill = "cornsilk", colour = "grey60", size = .2) +
geom_density() +
facet_grid(smoke ~ .)
```
Making a Frequency Polygon {#RECIPE-DISTRIBUTION-FREQPOLY}
--------------------------
### Problem
You want to make a frequency polygon.
### Solution
Use `geom_freqpoly()` (Figure \@ref(fig:FIG-DISTRIBUTION-FREQPOLY), left):
```{r FIG-DISTRIBUTION-FREQPOLY-1, eval=FALSE}
ggplot(faithful, aes(x = waiting)) +
geom_freqpoly()
```
### Discussion
A frequency polygon appears similar to a kernel density estimate curve, but it shows the same information as a histogram. That is, like a histogram, it shows what is in the data, whereas a kernel density estimate is just that -- an estimate -- and requires you to pick some value for the bandwidth.
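The link between a frequency polygon and a histogram can be sketched in base R: the polygon's vertices are the bin midpoints paired with the bin counts. The breaks below (every 4 units from 40 to 100) are chosen for illustration and need not match the bins ggplot2 would pick:

```{r}
# Bin the waiting times without drawing a plot
h_poly <- hist(faithful$waiting, breaks = seq(40, 100, by = 4), plot = FALSE)

# One vertex per bin: x at the bin midpoint, y at the bin count
freqpoly_points <- data.frame(x = h_poly$mids, y = h_poly$counts)
head(freqpoly_points)
```

Connecting these points with line segments gives the frequency polygon, and drawing a bar of each height over each bin gives the histogram.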
As with a histogram, you can control the bin width for the frequency polygon (Figure \@ref(fig:FIG-DISTRIBUTION-FREQPOLY), right):
```{r FIG-DISTRIBUTION-FREQPOLY-2, eval=FALSE}
ggplot(faithful, aes(x = waiting)) +
geom_freqpoly(binwidth = 4)
```
```{r FIG-DISTRIBUTION-FREQPOLY, ref.label=c("FIG-DISTRIBUTION-FREQPOLY-1", "FIG-DISTRIBUTION-FREQPOLY-2"), echo=FALSE, fig.show="hold", fig.cap="A frequency polygon (left); With wider bins (right)", fig.width=4, fig.height=4, message=FALSE}
```
Or, instead of setting the width of each bin directly, you can divide the *x* range into a particular number of bins:
```{r eval=FALSE}
# Divide the x-axis range into 15 bins
binsize <- diff(range(faithful$waiting))/15
ggplot(faithful, aes(x = waiting)) +
geom_freqpoly(binwidth = binsize)
```
### See Also
Histograms display the same information, but with bars instead of lines. See Recipe \@ref(RECIPE-DISTRIBUTION-BASIC-HIST).
Making a Basic Box Plot {#RECIPE-DISTRIBUTION-BASIC-BOXPLOT}
-----------------------
### Problem
You want to make a box (or box-and-whiskers) plot.
### Solution
Use `geom_boxplot()`, mapping a continuous variable to y and a discrete variable to x (Figure \@ref(fig:FIG-DISTRIBUTION-BOXPLOT-BASIC)):
```{r FIG-DISTRIBUTION-BOXPLOT-BASIC, fig.cap="A box plot"}
library(MASS) # Load MASS for the birthwt data set
# Use factor() to convert a numeric variable into a discrete variable
ggplot(birthwt, aes(x = factor(race), y = bwt)) +
geom_boxplot()
```
### Discussion
For this example, we used the `birthwt` data set from the `MASS` package. This data set contains data about birth weights (`bwt`) and a number of risk factors for low birth weight:
```{r}
birthwt
```
In Figure \@ref(fig:FIG-DISTRIBUTION-BOXPLOT-BASIC) we have visualized the distributions of `bwt` by each `race` group. Because `race` is stored as a numeric vector with the values of 1, 2, or 3, ggplot doesn't know how to use this numeric version of `race` as a grouping variable. To make this work, we can modify the data frame by converting `race` to a factor, or by telling ggplot to treat `race` as a factor by using `factor(race)` inside of the `aes()` statement. In the preceding example, we used `factor(race)`.
A box plot consists of a box and "whiskers." The box goes from the 25th percentile to the 75th percentile of the data; the distance between these two values is known as the *inter-quartile range* (IQR). There's a line indicating the median, or the 50th percentile of the data. The whiskers start from the edges of the box and extend to the furthest data point that is within 1.5 times the IQR of the box edge. Any data points that are past the ends of the whiskers are considered outliers and displayed with dots. Figure \@ref(fig:FIG-DISTRIBUTION-BOXPLOT-DIAGRAM) shows the relationship between a histogram, a density curve, and a box plot, using a skewed data set.
```{r FIG-DISTRIBUTION-BOXPLOT-DIAGRAM, echo=FALSE, fig.cap="Box plot compared to histogram and density curve", fig.width=7, fig.height=3, warning=FALSE}
set.seed(122)
# Generate skewed data
ds <- data.frame(x = rnorm(1000, mean = 10, sd = 2)^3)
min <- -500
max <- max(ds$x)
sumx <- summary(ds$x)
iqr <- sumx[["3rd Qu."]] - sumx[["1st Qu."]]
p1 <- ggplot(ds, aes(x = x)) +
geom_histogram(aes(y = ..count../140), binwidth = 200, colour = "grey80", fill = "cornsilk", alpha = .5) +
geom_density(aes(y = ..scaled..), adjust = 1.5, colour = "grey70") +
geom_vline(aes(xintercept = sumx[["1st Qu."]]), colour = "grey50") +
geom_vline(aes(xintercept = sumx[["3rd Qu."]]), colour = "grey50") +
geom_vline(aes(xintercept = sumx[["Min."]]), colour = "grey50") +
geom_vline(aes(xintercept = sumx[["3rd Qu."]] + 1.5 * iqr), colour = "grey50") +
geom_vline(aes(xintercept = sumx[["Median"]]), colour = "grey50") +
annotate(
"text", x = sumx[["Min."]], y = 0, label = "Minimum",
angle = 90, vjust = -0.2, hjust = 0, size = 4
) +
annotate(
"text", x = sumx[["1st Qu."]], y = 0, label = "25th percentile",
angle = 90, vjust = -0.2, hjust = 0, size = 4
) +
annotate(
"text", x = sumx[["Median"]], y = 0, label = "Median",
angle = 90, vjust = -0.2, hjust = 0, size = 4
) +
annotate(
"text", x = sumx[["3rd Qu."]], y = 0, label = "75th percentile",
angle = 90, vjust = -0.2, hjust = 0, size = 4
) +
geom_segment(
aes(x = sumx[["Min."]], xend = sumx[["1st Qu."]], y = .75, yend = .75),
size = .2, arrow = arrow(ends = "both", length = unit(0.2,"cm"))
) +
annotate(
"text", x = mean(c(sumx[["Min."]], sumx[["1st Qu."]])), y = .75, label = "To minimum",
vjust = -0.2, size = 4, lineheight = .8
) +
geom_segment(
aes(x = sumx[["1st Qu."]], xend = sumx[["3rd Qu."]], y = .85, yend = .85),
size = .2, arrow = arrow(ends = "both", length = unit(0.2,"cm"))
) +
annotate(
"text", x = mean(c(sumx[["1st Qu."]], sumx[["3rd Qu."]])), y = .85, label = "IQR",
vjust = -0.2, size = 4
) +
geom_segment(
aes(x = sumx[["3rd Qu."]], xend = sumx[["3rd Qu."]] + 1.5 * iqr, y = .75, yend = .75),
size = .2, arrow = arrow(ends = "both", length = unit(0.2,"cm"))
) +
annotate(
"text", x = sumx[["3rd Qu."]] + .75*iqr, y = .75, vjust = -0.2, size = 4, label = "1.5 x IQR"
) +
theme_bw() +
scale_x_continuous(breaks = NULL, limits = c(0,max(ds$x))) +
scale_y_continuous(breaks = NULL) +
theme(axis.title.x = element_blank()) +
theme(axis.title.y = element_blank()) +
theme(panel.border = element_rect(fill = NA, colour = NA)) +
theme(plot.margin = unit(c(0,0,0,0), "lines"))
p2 <- ggplot(ds, aes(x = 1, y = x)) +
geom_boxplot(width = .5, outlier.size = 1.5) +
coord_flip() +
theme_bw() +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL, limits = c(0,max(ds$x))) +
theme(axis.title.x = element_blank()) +
theme(axis.title.y = element_blank()) +
theme(panel.border = element_rect(fill = NA, colour = NA)) +
theme(plot.margin = unit(c(0,0,0,0), "lines"))
library(grid)
grid.newpage()
pushViewport(viewport(layout = grid.layout(4, 1)))
vplayout <- function(x, y)
viewport(layout.pos.row = x, layout.pos.col = y)
print(p1, vp = vplayout(1:3, c(1,1,1)))
print(p2, vp = vplayout(4, 1))
```
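The pieces of a box plot can be computed by hand to check the description above. This is a sketch on a small made-up sample, using `quantile()` with its default settings; all variable names are illustrative:

```{r}
# Hypothetical sample with one extreme value
sample_values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Box edges (25th and 75th percentiles) and the median
box_q <- quantile(sample_values, c(0.25, 0.5, 0.75))
box_iqr <- box_q[[3]] - box_q[[1]]

# Fences sit 1.5 * IQR beyond each edge of the box
lower_fence <- box_q[[1]] - 1.5 * box_iqr
upper_fence <- box_q[[3]] + 1.5 * box_iqr

# Whiskers end at the most extreme data points inside the fences
whisker_ends <- range(sample_values[sample_values >= lower_fence & sample_values <= upper_fence])

# Anything outside the fences would be drawn as an outlier point
outlier_values <- sample_values[sample_values < lower_fence | sample_values > upper_fence]

whisker_ends
outlier_values
```

Note that the whiskers stop at actual data points, not at the fences themselves, so they can be shorter than 1.5 times the IQR.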
To change the width of the boxes, set `width` (Figure \@ref(fig:FIG-DISTRIBUTION-BOXPLOT-WIDTH-POINT), left):
```{r FIG-DISTRIBUTION-BOXPLOT-WIDTH-POINT-1, eval=FALSE}
ggplot(birthwt, aes(x = factor(race), y = bwt)) +
geom_boxplot(width = .5)
```
If there are many outliers and there is overplotting, you can change the size and shape of the outlier points with `outlier.size` and `outlier.shape`. The defaults depend on your version of ggplot2 (the 3.x documentation lists a size of 1.5 and shape 19), so check `?geom_boxplot`. This example sets the size to 1.5 and uses hollow circles, shape 21 (Figure \@ref(fig:FIG-DISTRIBUTION-BOXPLOT-WIDTH-POINT), right):
```{r FIG-DISTRIBUTION-BOXPLOT-WIDTH-POINT-2, eval=FALSE}
ggplot(birthwt, aes(x = factor(race), y = bwt)) +
geom_boxplot(outlier.size = 1.5, outlier.shape = 21)
```
```{r FIG-DISTRIBUTION-BOXPLOT-WIDTH-POINT, ref.label=c("FIG-DISTRIBUTION-BOXPLOT-WIDTH-POINT-1", "FIG-DISTRIBUTION-BOXPLOT-WIDTH-POINT-2"), echo=FALSE, fig.show="hold", fig.cap="Box plot with narrower boxes (left); With smaller, hollow outlier points (right)", fig.width=3.5, fig.height=3.5}
```
To make a box plot of just a single group, we have to provide some arbitrary value for x; otherwise, ggplot won't know what *x* coordinate to use for the box plot. In this case, we'll set it to 1 and remove the x-axis tick markers and label (Figure \@ref(fig:FIG-DISTRIBUTION-BOXPLOT-SINGLE)):
```{r FIG-DISTRIBUTION-BOXPLOT-SINGLE, fig.cap="Box plot of a single group", fig.width=3, fig.height=3.5}
ggplot(birthwt, aes(x = 1, y = bwt)) +
geom_boxplot() +
scale_x_continuous(breaks = NULL) +
theme(axis.title.x = element_blank())
```
> **Note**
>
> The calculation of quantiles works slightly differently from the `boxplot()` function in base R. This can sometimes be noticeable for small sample sizes. See `?geom_boxplot` for detailed information about how the calculations differ.
Adding Notches to a Box Plot {#RECIPE-DISTRIBUTION-BOXPLOT-NOTCH}
----------------------------
### Problem
You want to add notches to a box plot to assess whether the medians are different.
### Solution
Use `geom_boxplot()` and set `notch = TRUE` (Figure
\@ref(fig:FIG-DISTRIBUTION-BOXPLOT-NOTCH)):
```{r FIG-DISTRIBUTION-BOXPLOT-NOTCH, fig.cap="A notched box plot", message=FALSE}
library(MASS) # Load MASS for the birthwt data set
ggplot(birthwt, aes(x = factor(race), y = bwt)) +
geom_boxplot(notch = TRUE)
```
### Discussion
Notches are used in box plots to help visually assess whether the medians of distributions differ. If the notches do not overlap, this is evidence that the medians are different.
With this particular data set, you'll see the following message:
```
Notch went outside hinges. Try setting notch=FALSE.
```
This means that the confidence region (the notch) went past the bounds (or hinges) of one of the boxes. In this case, the upper part of the notch in the middle box goes just barely outside the box body, but it's by such a small amount that you can't see it in the final output. There's nothing inherently wrong with a notch going outside the hinges, but it can look strange in more extreme cases.
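The ggplot2 documentation describes the notches as extending `1.58 * IQR / sqrt(n)` on either side of the median. Under that formula, a small made-up sample shows how a notch can end up past a hinge; the data and variable names here are hypothetical:

```{r}
# Hypothetical small sample; few observations make for wide notches
small_sample <- c(2, 3, 4, 5, 20)

# Hinges (box edges) and median
hinges <- quantile(small_sample, c(0.25, 0.5, 0.75))

# Half-width of the notch around the median
notch_half <- 1.58 * (hinges[[3]] - hinges[[1]]) / sqrt(length(small_sample))
notch_limits <- c(lower = hinges[[2]] - notch_half, upper = hinges[[2]] + notch_half)

# TRUE when either end of the notch lies beyond the box
notch_outside <- notch_limits[["lower"]] < hinges[[1]] || notch_limits[["upper"]] > hinges[[3]]

notch_limits
notch_outside
```

Because the half-width shrinks with the square root of the group size, small groups are the ones most likely to trigger the message.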
Adding Means to a Box Plot {#RECIPE-DISTRIBUTION-BOXPLOT-MEAN}
--------------------------
### Problem
You want to add markers for the mean to a box plot.
### Solution
Use `stat_summary()`. The mean is often shown with a diamond, so we'll use shape 23 with a white fill. We'll also make the diamond slightly larger by setting `size = 3` (Figure \@ref(fig:FIG-DISTRIBUTION-BOXPLOT-MEAN)):
```{r FIG-DISTRIBUTION-BOXPLOT-MEAN, fig.cap="Mean markers on a box plot"}
library(MASS) # Load MASS for the birthwt data set
ggplot(birthwt, aes(x = factor(race), y = bwt)) +
geom_boxplot() +
# `fun` replaces `fun.y`, which is deprecated as of ggplot2 3.3.0
stat_summary(fun = "mean", geom = "point", shape = 23, size = 3, fill = "white")
```
### Discussion
The horizontal line in the middle of a box plot displays the median, not the mean. For data that is normally distributed, the median and mean will be about the same, but for skewed data these values will differ.
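A quick numeric check makes the difference concrete. The two vectors below are made up for illustration and differ only in their largest value:

```{r}
# Hypothetical symmetric and right-skewed samples
symmetric_values <- c(1, 2, 3, 4, 5)
skewed_values <- c(1, 2, 3, 4, 100)

# Mean and median agree for the symmetric sample
c(mean = mean(symmetric_values), median = median(symmetric_values))

# The single large value pulls the mean up but leaves the median alone
c(mean = mean(skewed_values), median = median(skewed_values))
```

On a box plot of the skewed sample, the mean marker would sit well above the median line.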
Making a Violin Plot {#RECIPE-DISTRIBUTION-VIOLIN}
--------------------
### Problem
You want to make a violin plot to compare density estimates of different groups.
### Solution
Use `geom_violin()` (Figure \@ref(fig:FIG-DISTRIBUTION-VIOLIN-BASIC)):
```{r FIG-DISTRIBUTION-VIOLIN-BASIC, fig.cap="A violin plot", fig.width=3.5}
library(gcookbook) # Load gcookbook for the heightweight data set
# Create a base plot using the heightweight data set
hw_p <- ggplot(heightweight, aes(x = sex, y = heightIn))
hw_p +
geom_violin()
```
### Discussion
Violin plots are a way of comparing multiple data distributions. With ordinary density curves, it is difficult to compare more than just a few distributions because the lines visually interfere with each other. With a violin plot, it's easier to compare several distributions since they're placed side by side.
A violin plot is a kernel density estimate, mirrored so that it forms a symmetrical shape. Traditionally, violin plots also have narrow box plots overlaid, with a white dot at the median, as shown in Figure \@ref(fig:FIG-DISTRIBUTION-VIOLIN-BOXPLOT). The box plot outliers are usually not displayed; we hide them by setting `outlier.colour = NA`:
```{r FIG-DISTRIBUTION-VIOLIN-BOXPLOT, fig.cap="A violin plot with box plot overlaid on it", fig.width=3.5}
hw_p +
geom_violin() +
geom_boxplot(width = .1, fill = "black", outlier.colour = NA) +
stat_summary(fun = median, geom = "point", fill = "white", shape = 21, size = 2.5)
```
In this example we layered the objects from the bottom up, starting with the violin, then the box plot, then the white dot at the median, which is calculated using `stat_summary()`.
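To make the idea of a mirrored density estimate concrete, here is a minimal sketch that builds the outline of one violin by hand with base R. It is only an illustration; the exact computation inside `geom_violin()` may differ in its details, and the names `heights_f` and `violin_outline` are ours:

```r
library(gcookbook) # Load gcookbook for the heightweight data set

# Heights for one group
heights_f <- heightweight$heightIn[heightweight$sex == "f"]

# Kernel density estimate, cut off at the range of the data
dens <- density(heights_f, from = min(heights_f), to = max(heights_f))

# Mirror the density around a vertical center line to get the outline
violin_outline <- data.frame(
  x = c(-dens$y, rev(dens$y)),
  y = c(dens$x, rev(dens$x))
)

# Draw the outline with base graphics
plot(violin_outline$x, violin_outline$y, type = "n",
     xlab = "Density (mirrored)", ylab = "heightIn")
polygon(violin_outline$x, violin_outline$y, col = "grey80")
```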
By default, the violins are trimmed to the range of the data, so their flat ends sit at the minimum and maximum data values. To keep the tails, set `trim = FALSE` (Figure \@ref(fig:FIG-DISTRIBUTION-VIOLIN-TAIL)):
```{r FIG-DISTRIBUTION-VIOLIN-TAIL, fig.cap="A violin plot with tails", fig.width=3.5}
hw_p +
geom_violin(trim = FALSE)
```
By default, the violins are scaled so that the total area of each one is the same (if `trim = TRUE`, then it scales what the area *would be* including the tails). Instead of equal areas, you can use `scale = "count"` to scale the areas proportionally to the number of observations in each group (Figure \@ref(fig:FIG-DISTRIBUTION-VIOLIN-SCALECOUNT)). In this example, there are slightly fewer females than males, so the female violin becomes slightly narrower than before:
```{r FIG-DISTRIBUTION-VIOLIN-SCALECOUNT, fig.cap="Violin plot with area proportional to number of observations", fig.width=3.5}
# Scaled area proportional to number of observations
hw_p +
geom_violin(scale = "count")
```
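To check the group sizes that drive this scaling, you can tabulate the grouping variable. This is a small sketch; `group_counts` is just an illustrative name:

```r
library(gcookbook) # Load gcookbook for the heightweight data set

# Number of observations in each group
group_counts <- table(heightweight$sex)
group_counts
```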
To change the amount of smoothing, use the `adjust` parameter, as described in Recipe \@ref(RECIPE-DISTRIBUTION-BASIC-DENSITY). The default value is 1; use larger values for more smoothing and smaller values for less smoothing (Figure \@ref(fig:FIG-DISTRIBUTION-VIOLIN-ADJUST)):
```{r FIG-DISTRIBUTION-VIOLIN-ADJUST, fig.show="hold", fig.cap="Violin plot with more smoothing (left); With less smoothing (right)", fig.width=3.5}
# More smoothing
hw_p +
geom_violin(adjust = 2)
# Less smoothing
hw_p +
geom_violin(adjust = .5)
```
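The effect of `adjust` can be seen in the bandwidth itself. The sketch below uses base R's `density()` on all heights at once; `geom_violin()` estimates the density separately for each group, so the actual bandwidths in the plots will differ, but the multiplicative behavior is the same idea:

```r
library(gcookbook) # Load gcookbook for the heightweight data set

heights <- heightweight$heightIn

# Bandwidth actually used for each value of adjust
bw_default <- density(heights)$bw
bw_more    <- density(heights, adjust = 2)$bw
bw_less    <- density(heights, adjust = .5)$bw

c(default = bw_default, more = bw_more, less = bw_less)
```

With `adjust = 2` the bandwidth is twice the default, and with `adjust = .5` it is half the default.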
### See Also
To create a traditional density curve, see Recipe \@ref(RECIPE-DISTRIBUTION-BASIC-DENSITY).
To use different point shapes, see Recipe \@ref(RECIPE-LINE-GRAPH-POINT-APPEARANCE).
Making a Dot Plot {#RECIPE-DISTRIBUTION-DOT-PLOT}
-----------------
### Problem
You want to make a Wilkinson dot plot, which shows each data point.
### Solution
Use `geom_dotplot()`. For this example (Figure \@ref(fig:FIG-DISTRIBUTION-DOTPLOT-BASIC)), we'll use a subset of the `countries` data set:
```{r FIG-DISTRIBUTION-DOTPLOT-BASIC, fig.cap="A dot plot", message=FALSE}
library(gcookbook) # Load gcookbook for the countries data set
library(dplyr)
# Save a modified data set that only includes 2009 data for countries that
# spent > 2000 USD per capita
c2009 <- countries %>%
filter(Year == 2009 & healthexp > 2000)
# Create a base ggplot object using `c2009`, called `c2009_p` (for c2009 plot)
c2009_p <- ggplot(c2009, aes(x = infmortality))
c2009_p +
geom_dotplot()
```
### Discussion
This kind of dot plot is sometimes called a *Wilkinson* dot plot. It's different from the Cleveland dot plots shown in Recipe \@ref(RECIPE-BAR-GRAPH-DOT-PLOT). In these Wilkinson dot plots, the placement of the bins depends on the data, and the width of each dot corresponds to the maximum width of each bin. The maximum bin size defaults to 1/30 of the range of the data, but it can be changed with `binwidth`.
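As a quick sketch of what that default works out to for this data, the following recomputes the subset with base R's `subset()` and takes 1/30 of the range. The name `default_binwidth` is ours, used only for illustration:

```r
library(gcookbook) # Load gcookbook for the countries data set

# Same subset as in the Solution, built with base R
c2009 <- subset(countries, Year == 2009 & healthexp > 2000)

# 1/30 of the range of the data
default_binwidth <- diff(range(c2009$infmortality, na.rm = TRUE)) / 30
default_binwidth
```

Comparing this value with the `binwidth` you pass to `geom_dotplot()` gives a sense of how much larger or smaller the dots will be than the default.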
By default, `geom_dotplot()` bins the data along the x-axis and stacks on the y-axis. The dots are stacked visually, and due to technical limitations of ggplot2, the resulting graph has y-axis tick marks that aren't meaningful. The y-axis labels can be removed by using `scale_y_continuous()`. In this example, we'll also use `geom_rug()` to show exactly where each data point is (Figure \@ref(fig:FIG-DISTRIBUTION-DOTPLOT-NO-Y-RUG)):
```{r FIG-DISTRIBUTION-DOTPLOT-NO-Y-RUG, fig.cap="Dot plot with no y labels, max bin size of .25, and a rug showing each data point"}
c2009_p +
geom_dotplot(binwidth = .25) +
geom_rug() +
scale_y_continuous(breaks = NULL) + # Remove tick markers
theme(axis.title.y = element_blank()) # Remove axis label
```
You may notice that the stacks aren't regularly spaced in the horizontal direction. With the default dotdensity binning algorithm, the position of each stack is centered above the set of data points that it represents. To use bins that are arranged with a fixed, regular spacing, like a histogram, use `method = "histodot"`. In Figure \@ref(fig:FIG-DISTRIBUTION-DOTPLOT-HISTODOT), you'll notice that the stacks *aren't* centered above the data:
```{r FIG-DISTRIBUTION-DOTPLOT-HISTODOT, fig.cap="Dot plot with histodot (fixed-width) binning"}
c2009_p +
geom_dotplot(method = "histodot", binwidth = .25) +
geom_rug() +
scale_y_continuous(breaks = NULL) +
theme(axis.title.y = element_blank())
```
The dots can also be stacked centered, or centered in such a way that stacks with even and odd quantities stay aligned. This can be done by setting `stackdir = "center"` or `stackdir = "centerwhole"`, as illustrated in Figure \@ref(fig:FIG-DISTRIBUTION-DOTPLOT-CENTER):
```{r FIG-DISTRIBUTION-DOTPLOT-CENTER, fig.show="hold", fig.cap='Dot plot with stackdir = "center" (left); With stackdir = "centerwhole" (right)', fig.width=3.5, fig.height=3.5}
c2009_p +
geom_dotplot(binwidth = .25, stackdir = "center") +
scale_y_continuous(breaks = NULL) +
theme(axis.title.y = element_blank())
c2009_p +
geom_dotplot(binwidth = .25, stackdir = "centerwhole") +
scale_y_continuous(breaks = NULL) +
theme(axis.title.y = element_blank())
```
### See Also
Leland Wilkinson, "Dot Plots," *The American Statistician* 53 (1999): 276–281,
<https://www.cs.uic.edu/~wilkinson/Publications/dotplots.pdf>.
Making Multiple Dot Plots for Grouped Data {#RECIPE-DISTRIBUTION-DOT-PLOT-MULTI}
------------------------------------------
### Problem
You want to make multiple dot plots from grouped data.
### Solution
To compare multiple groups, it's possible to stack the dots along the y-axis, and group them along the x-axis, by setting `binaxis = "y"`. For this example, we'll use the `heightweight` data set (Figure \@ref(fig:FIG-DISTRIBUTION-DOTPLOT-MULTI)):
```{r FIG-DISTRIBUTION-DOTPLOT-MULTI, fig.cap="Dot plot of multiple groups, binning along the y-axis"}
library(gcookbook) # Load gcookbook for the heightweight data set
ggplot(heightweight, aes(x = sex, y = heightIn)) +
geom_dotplot(binaxis = "y", binwidth = .5, stackdir = "center")
```
### Discussion
Dot plots are sometimes overlaid on box plots. In these cases, it may be helpful to make the dots hollow and have the box plots *not* show outliers, since the outlier points will appear to be part of the dot plot (Figure \@ref(fig:FIG-DISTRIBUTION-DOTPLOT-MULTI-BOXPLOT)):
```{r FIG-DISTRIBUTION-DOTPLOT-MULTI-BOXPLOT, fig.cap="Dot plot overlaid on box plot"}
ggplot(heightweight, aes(x = sex, y = heightIn)) +
geom_boxplot(outlier.colour = NA, width = .4) +
geom_dotplot(binaxis = "y", binwidth = .5, stackdir = "center", fill = NA)
```
It's also possible to show the dot plots next to the box plots, as shown in Figure \@ref(fig:FIG-DISTRIBUTION-DOTPLOT-MULTI-SIDE). This requires using a bit of a hack, by treating the *x* variable as a numeric variable and then subtracting or adding a small quantity to shift the box plots and dot plots left and right. When the *x* variable is treated as numeric you must also specify the group, or else the data will be treated as a single group, with just one box plot and dot plot. Finally, since the x-axis is treated as numeric, it will by default show numbers for the x-axis tick labels; they must be modified with `scale_x_continuous()` to show *x* tick labels as text corresponding to the factor levels:
```{r FIG-DISTRIBUTION-DOTPLOT-MULTI-SIDE, fig.cap="Dot plot next to box plot"}
ggplot(heightweight, aes(x = sex, y = heightIn)) +
  geom_boxplot(aes(x = as.numeric(sex) + .2, group = sex), width = .25) +
  geom_dotplot(
    aes(x = as.numeric(sex) - .2, group = sex),
    binaxis = "y",
    binwidth = .5,
    stackdir = "center"
  ) +
  scale_x_continuous(
    breaks = 1:nlevels(heightweight$sex),
    labels = levels(heightweight$sex)
  )
```
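The hack relies on the way R stores factors: each level has an integer code, and `as.numeric()` returns those codes. A small sketch with a made-up factor (not the `heightweight` data) shows the positions that result:

```r
# A made-up factor with two levels, for illustration only
sex <- factor(c("f", "m", "m", "f"))

codes       <- as.numeric(sex) # Integer codes: 1 2 2 1
dot_x       <- codes - .2      # Dot plots shifted left: 0.8 1.8 1.8 0.8
box_x       <- codes + .2      # Box plots shifted right: 1.2 2.2 2.2 1.2
axis_breaks <- 1:nlevels(sex)  # Tick positions: 1 2
axis_labels <- levels(sex)     # Tick labels: "f" "m"
```

The breaks and labels are what `scale_x_continuous()` uses to put the factor level names back on the numeric axis.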
Making a Density Plot of Two-Dimensional Data {#RECIPE-DISTRIBUTION-DENSITY2D}
---------------------------------------------
### Problem
You want to plot the density of two-dimensional data.
### Solution
Use `stat_density2d()`. This makes a 2D kernel density estimate from the data. First we'll plot the density contour along with the data points (Figure \@ref(fig:FIG-DISTRIBUTION-DENSITY2D), left):
```{r FIG-DISTRIBUTION-DENSITY2D-1, eval=FALSE}
# Save a base plot object
faithful_p <- ggplot(faithful, aes(x = eruptions, y = waiting))
faithful_p +
geom_point() +
stat_density2d()
```
It's also possible to map the *height* of the density curve to the color of the contour lines, by using `..level..` (Figure \@ref(fig:FIG-DISTRIBUTION-DENSITY2D), right):
```{r FIG-DISTRIBUTION-DENSITY2D-2, eval=FALSE}
# Contour lines, with "height" mapped to color
faithful_p +
stat_density2d(aes(colour = ..level..))
```
```{r FIG-DISTRIBUTION-DENSITY2D, echo=FALSE, fig.show="hold", fig.cap="Points and density contour (left); With ..level.. mapped to color (right)", fig.width=10}
faithful_p <- ggplot(faithful, aes(x = eruptions, y = waiting))
p1 <- faithful_p +
geom_point() +
stat_density2d()
p2 <- faithful_p +
stat_density2d(aes(colour = ..level..))
library(patchwork)
p1 + plot_spacer() + p2 + plot_layout(widths = c(5, 1, 5))
```
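In recent versions of ggplot2, the `..level..` notation still works but, as far as we are aware, it has been deprecated in favor of `after_stat()`, and `stat_density_2d()` is the preferred spelling of the function name. If you see a deprecation warning, a version along these lines should be equivalent (the name `faithful_contour` is ours):

```r
library(ggplot2)

# Contour lines, with "height" mapped to color, using after_stat()
faithful_contour <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  stat_density_2d(aes(colour = after_stat(level)))

faithful_contour
```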
### Discussion
The two-dimensional kernel density estimate is analogous to the one-dimensional density estimate generated by `stat_density()`, but of course, it needs to be viewed in a different way. The default is to use contour lines, but it's also possible to use tiles and to map the density estimate to the fill color, or to the transparency of the tiles, as shown in Figure \@ref(fig:FIG-DISTRIBUTION-DENSITY2D-TILE):
(ref:cap-FIG-DISTRIBUTION-DENSITY2D-TILE) With `..density..` mapped to fill (left); With points, and `..density..` mapped to alpha (right)
```{r FIG-DISTRIBUTION-DENSITY2D-TILE, fig.show="hold", fig.cap="(ref:cap-FIG-DISTRIBUTION-DENSITY2D-TILE)", fig.width=5, fig.height=4}
# Map density estimate to fill color
faithful_p +
stat_density2d(aes(fill = ..density..), geom = "raster", contour = FALSE)
# With points, and map density estimate to alpha
faithful_p +
geom_point() +
stat_density2d(aes(alpha = ..density..), geom = "tile", contour = FALSE)
```
> **Note**
>
> We used `geom = "raster"` in the first of the preceding examples and `geom = "tile"` in the second. The main difference is that the raster geom renders more efficiently than the tile geom. In theory they *should* appear the same, but in practice they often do not. If you are writing to a PDF file, the appearance depends on the PDF viewer. On some viewers, when tile is used there may be faint lines between the tiles, and when raster is used the edges of the tiles may appear blurry (although it doesn't matter in this particular case).
As with the one-dimensional density estimate, you can control the bandwidth of the estimate. To do this, pass a vector for the *x* and *y* bandwidths to `h`. This argument gets passed on to the function that actually generates the density estimate, `kde2d()`. In this example (Figure \@ref(fig:FIG-DISTRIBUTION-DENSITY2D-BANDWIDTH)), we'll use a smaller bandwidth in the *x* and *y* directions, so that the density estimate is more closely fitted (perhaps overfitted) to the data:
```{r FIG-DISTRIBUTION-DENSITY2D-BANDWIDTH, fig.cap="Density plot with a smaller bandwidth in the x and y directions"}
faithful_p +
  stat_density2d(
    aes(fill = ..density..),
    geom = "raster",
    contour = FALSE,
    h = c(.5, 5)
  )
```
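To get a feel for what "smaller" means here, you can compare these values with the bandwidths that `kde2d()` uses when `h` is not supplied. This sketch calls the MASS functions directly; we assume the documented `kde2d()` default, which is based on `bandwidth.nrd()`, and the grid size `n = 50` is an arbitrary choice for illustration:

```r
library(MASS) # Load MASS for kde2d() and bandwidth.nrd()

# Bandwidths kde2d() uses by default for x and y
default_h <- c(
  bandwidth.nrd(faithful$eruptions),
  bandwidth.nrd(faithful$waiting)
)
default_h

# Density estimate with the smaller bandwidths, on a 50 x 50 grid
dens <- kde2d(faithful$eruptions, faithful$waiting, h = c(.5, 5), n = 50)
dim(dens$z)
```

The result is a list with the grid positions in `x` and `y` and the estimated density in the matrix `z`.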
### See Also
The relationship between `stat_density2d()` and `stat_bin2d()` is the same as the relationship between their one-dimensional counterparts, the density curve and the histogram. The density curve is an *estimate* of the distribution under certain assumptions, while the binned visualization represents the observed data directly. See Recipe \@ref(RECIPE-SCATTER-OVERPLOT) for more about binning data.
If you want to use a different color palette, see Recipe \@ref(RECIPE-COLORS-PALETTE-CONTINUOUS).
`stat_density2d()` passes options to `kde2d()`; see `?kde2d` for information on the available options.