-
Notifications
You must be signed in to change notification settings - Fork 9
/
quantile-dotplots.Rmd
executable file
·164 lines (129 loc) · 7.39 KB
/
quantile-dotplots.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
---
title: "Quantile dotplots"
output:
github_document:
toc: true
---
## Introduction
This document describes *quantile dotplots*, the basic motivation behind them, and how to generate them.
Please cite:
Matthew Kay, Tara Kola, Jessica Hullman, Sean Munson. _When (ish) is My Bus?
User-centered Visualizations of Uncertainty in Everyday, Mobile Predictive Systems_.
CHI 2016. DOI: [10.1145/2858036.2858558](http://dx.doi.org/10.1145/2858036.2858558).
## Setup
### Required libraries
If you are missing any of the packages below, use `install.packages("packagename")` to install them.
The `import::` syntax requires the `import` package to be installed, and provides a simple way to
import specific functions from a package without polluting your entire namespace (unlike `library()`)
```{r setup, results="hide", message=FALSE}
library(ggplot2)
import::from(magrittr, `%>%`, `%<>%`, `%$%`)
import::from(dplyr,
transmute, group_by, mutate, filter, select, data_frame,
left_join, summarise, one_of, arrange, do, ungroup)
```
### Ggplot theme
```{r setup_ggplot}
theme_set(theme_light() + theme(
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
axis.line=element_line(color="black"),
text=element_text(size=14),
axis.text=element_text(size=rel(15/16)),
axis.ticks.length=unit(8, "points"),
line=element_line(size=.75)
))
```
## Representing a continuous probability distribution as discrete outcomes
We might like to represent a continuous probability distribution (say, a prediction for when a bus is going to arrive) as discrete outcomes, since frequency-based (or "discrete outcome") presentations can be easier for people to interpret. How should we do that?
### A problematic solution: random draws
A first pass at representing a continous probability distribution as discrete outcomes might be to generate random draws from that distribution. However, especially for a small number of samples, this tends not to work well. The problem is that any given random sample of (say) 20 isn't always representative.
For example, a log-normal distribution might look like this:
```{r reference_dist, fig.width=7, fig.height=2.5}
mu = log(11.4)
sigma = 0.2
x = seq(from=.01, to=30, length=10001)
d = dlnorm(x, mu, sigma)
density_df = data_frame(x, d)
density_df %>%
ggplot(aes(x = x, y = d)) +
geom_line() +
geom_vline(xintercept = 11, color="red") +
xlim(0, 30)
```
And we could generate random samples of a set size, say 20, to visualize this. But every time we
do that it looks different:
```{r random_draws, fig.width=7, fig.height=11, warning=FALSE}
runs = 5
samples = 20
binwidth = 1.25
data.frame(
run = factor(1:runs),
x = rlnorm(runs * samples, mu, sigma)
) %>%
ggplot(aes(x=x)) +
geom_dotplot(binwidth=binwidth) +
facet_grid(run ~ .) +
xlim(0, 30)
```
Sometimes the distribution shape is obscured, the location shifted, or tails over- or under-represented.
### A consistent solution: a quantile dotplot
One way to get a more consistent representation is to use the quantile function of the distribution instead of random draws. In essence, this encodes the cumulative distribution function
one-dimensionally, and then turns it into a Wilkinsonian dotplot. (Wilkinson originally described dotplots for displaying sample data in [Wilkinson, L, Dot Plots, _The American Statistician_, 1999](dx.doi.org/10.1080/00031305.1999.10474474); we re-use them here for displaying predictive quantiles, hence _quantile dotplot_).
First, we generate evenly-space quantiles in probability space (i.e. from 0 to 1) depending on the number of `samples` we want. If you are familiar with Q-Q plots, this is essentially the same approach used to generate representative quantiles for a Q-Q plot (a variant of this method is implemented in R in the `ppoints` function, but spelled out here for completeness).
```{r quantiles}
quantiles = data_frame(
p_less_than_x = seq(from = 1/samples / 2, to = 1 - (1/samples / 2), length=samples),
x = qlnorm(p_less_than_x, mu, sigma)
)
quantiles
```
The table above shows the probability of drawing a value less than x (i.e. `P(X < x)`) and the corresponding value of x to achieve that probability on the underlying distribution. We generate 20 (or whatever value of `samples` you like) values of `p_less_than_x` evenly-spaced in `(0,1)` (i.e. not including 0 or 1), and then find `x` using the inverse CDF (aka the quantile function) of the predictive distribution at each value of `p_less_than_x`. If we then take those values of `x` and plot them as a dotplot, we get a quantile dotplot:
Visualized with the CDF, that looks like this:
```{r quantile_dotplot_generation, fig.width=7, fig.height=3, fig.show="hold"}
x = seq(0.01, 30, length=1001)
p_less_than_x = plnorm(x, mu, sigma)
qf_df = data_frame(x, p_less_than_x)
cdf_plot = ggplot(quantiles, aes(x=x, y=p_less_than_x)) +
geom_segment(aes(xend=x, yend=p_less_than_x), x = 0, color="gray75", linetype="dashed") +
geom_segment(aes(xend=x, yend=p_less_than_x), y = -0.05, color="gray75", linetype="dashed",
arrow = arrow(ends="first", length=unit(5, "pt"), type="closed")) +
geom_line(data = qf_df, color="red", size=1) +
annotate("text", x = 30, y = 1, label = "cumulative distribution function", color="red", hjust=1, vjust=1.5) +
xlim(c(-3.5, 30))
cdf_plot
quantile_dotplot = quantiles %>%
ggplot(aes(x = x)) +
geom_vline(aes(xintercept = x), color="gray75", linetype="dashed") +
geom_dotplot(binwidth = 1.25, color=NA) +
xlim(-3.5,30)
quantile_dotplot
```
The evenly-spaced probabilities on the y axis are turned into representative quantiles from the distribution on the x axis.
## Properties of the quantile dotplot
### Shape approximates density
The shape a quantile dotplot approximates the shape of the density:
```{r quantile_dotplot_with_density, fig.width=7, fig.height=2}
quantile_dotplot +
geom_line(aes(y = d * 5.1), data = density_df, color="red")
```
### Finding probability intervals reduces to counting
Inferences about one-sided probability intervals on the distribution can
be made through counting. If our example distribution represents a probability
distribution over predicted time to arrive for my bus, and I am willing to miss
the bus 2/20 times, I can count up 2 dots ("busses") from the left to get the
time I should arrive at the bus stop. This works because the x-value for
`P(X < x) = 2/20` will be between the second and third dot. More generally, the x-value for
`P(X < x) = k/20` will be between the `k`th and `k + 1`th dot). Here's how this works
on the CDF and the dotplot:
```{r counting_on_quantile_dotplot, fig.width=7, fig.height=3, fig.show="hold"}
cdf_plot +
geom_segment(x = 0, xend = qlnorm(.1, mu, sigma), y = .1, yend = .1, color = "purple", size = 1) +
geom_segment(x = qlnorm(.1, mu, sigma), xend = qlnorm(.1, mu, sigma), y = -0.05, yend = .1,
color = "purple", size = 1, arrow = arrow(ends="first", length=unit(5, "pt"), type="closed")) +
annotate("text", x = 0, y = 0.1, label = "2/20 = 0.1", color = "purple", hjust = 1.1)
quantile_dotplot +
geom_dotplot(binwidth = 1.25, color=NA, fill="purple", data=filter(quantiles, p_less_than_x < .1)) +
geom_vline(xintercept = qlnorm(.1, mu, sigma), color="purple", size=1)
```
Consequently, this purple line represents the point of time I should arrive at my bus if I want to catch it about 18 times out of 20 (or miss it 2 times out of 20).