# Causal Inference Tutorial: A Quick Glance at Interventional Treatment and Simpson's Paradox
Yizhan Niu
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
```{r}
# remotes::install_github("allisonhorst/palmerpenguins")
library(palmerpenguins)
# remotes::install_github("hrbrmstr/hrbrthemes")
library(hrbrthemes)
library(gridExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
```
## Motivation
Causal inference is increasingly recognized in the data science world. One primary reason is that data scientists often want to estimate the causal effect of one event (X) on another event (Y), and purely associational statistics cannot answer such questions. This short tutorial aims to give readers a quick glance at some applications of causal inference in interpreting and analyzing data. I chose one tool in causal inference, the intervention, and a famous problem, Simpson's Paradox, to show how causal inference can be applied in practice using R.
## Intro to Causal Graphs
To represent causal effects between pairs of observed variables, we need a mathematical language to describe these causal relationships. Graph theory is the natural candidate: typically, a directed acyclic graph (DAG) is used to represent causality among multiple variables. A famous example is shown below:
![Causal Graph](resources/causal_graphs/Causal_Graph.jpg)
This example is famous both as a Bayesian network and as a DAG representing causal relationships. It describes the relationships among the season (X1), whether it rains (X2), whether the sprinkler is turned on (X3), whether the floor is wet (X4), and whether the floor is slippery (X5).
In the context of causal inference, each causal graph like this induces a probability distribution, i.e., a joint distribution over all observed variables, which we call observations. One of our goals is to use these observations to determine the causal effect of one variable on another. This can be accomplished by intervening on the variable (X) we would like to test, which is elaborated on below.
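To make this concrete, here is a minimal sketch of how a causal graph induces a joint distribution: we sample each node given its parents, top to bottom. (The conditional probabilities below are made up purely for illustration.)
```{r}
# Sample the sprinkler DAG: each variable is drawn given its parents.
# All probabilities here are illustrative, not from any real data.
set.seed(1)
n <- 10000
x1 <- sample(c("dry", "rainy"), n, replace = TRUE)   # season
x2 <- rbinom(n, 1, ifelse(x1 == "rainy", 0.8, 0.1))  # rain
x3 <- rbinom(n, 1, ifelse(x1 == "dry", 0.7, 0.2))    # sprinkler on
x4 <- rbinom(n, 1, 0.9 * pmax(x2, x3))               # floor wet
x5 <- rbinom(n, 1, 0.9 * x4)                         # floor slippery
head(data.frame(x1, x2, x3, x4, x5))
```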
## Intro to Interventions
We hear from time to time that "correlation is not causation." With an increase in ice cream consumption (X), shark attacks (Y) also rise. The two seem correlated, but it is hard to imagine any causal effect of eating ice cream on shark attacks.
One explanation is that hot weather (Z) causes people to eat ice cream and also causes people to go to the beaches, increasing the possibility of being attacked by sharks. Here, hot weather is a common cause of both ice cream consumption and shark attacks, but there are intuitively no causal relationships between the two variables we are curious about.
To eliminate the effect of that common cause, one thing we can do is intervene on people's ice cream consumption. We can choose a group of people, have them not consume ice cream, and observe whether the shark attack rate rises in that period, leaving every other variable the same. This is one example of an intervention: we deliberately set one variable to see how the other variable behaves. Even though such an intervention is not always realistic, there are other techniques, such as back-door adjustment and do-calculus, that work with non-experimental data; we will not elaborate on them here.
We will now show some examples that connect causal diagrams with the intervention operation.
### First Causal Graph Example
![Graph1](resources/causal_graphs/Graph1.jpg)
This causal graph represents the ice cream and shark attack case, where Z is the hot weather, X is the ice cream consumption, and Y is the shark attack rate.
```{r}
set.seed(1)
n <- 1000
# Z (hot weather) is a common cause of both X (ice cream) and Y (shark attacks)
z <- rnorm(n, 0, 1)
x <- z + rnorm(n, 0, 0.1)
y <- z + rnorm(n, 0, 0.1)
# Pearson correlations: all close to 1
c(corr_xy = cor(x, y), corr_xz = cor(x, z), corr_yz = cor(y, z))
par(mfrow = c(1, 3))
scatter.smooth(x, y)
scatter.smooth(z, y)
scatter.smooth(z, x)
```
We can easily see that these variables are all highly correlated with each other. We would now like to test the causal effect of intervening on people's ice cream consumption.
### First Causal Graph After Intervention
![Graph1 Intervention](resources/causal_graphs/Graph1_Intervention.jpg)
The right-hand side shows the graph after the intervention. (Z no longer affects X, because we have fixed everyone's ice cream consumption.)
```{r}
# Simulate data from the SCM under do(X = x): the z -> x edge is removed
intervene_x <- function(x, n = 1000) {
  z <- rnorm(n, 0, 1)
  y <- z + rnorm(n, 0, 0.1)
  cbind(x, y, z)
}
set.seed(1)
x_do1 <- intervene_x(x = 2022)
par(mfrow = c(1, 2))  # two panels, one per histogram
hist(
  y, xlab = 'Y', breaks = 30,
  xlim = c(-6, 6), col = 'gray76', main = 'P(Y)'
)
hist(
  x_do1[, 2], xlab = 'Y', breaks = 20,
  xlim = c(-6, 6), col = 'gray76', main = 'P(Y | do(X = 2022))'
)
```
As we can see from the two histograms, having people eat 2022 ice creams during the period does not affect the distribution of the likelihood of being attacked by sharks. Thus, if this were done in the real world, we could conclude that ice cream consumption is correlated with shark attacks but has no causal effect on them.
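As a quick numerical check using the objects defined above, the observational and interventional samples of Y agree up to sampling noise:
```{r}
# Mean and spread of Y are essentially unchanged under do(X = 2022)
rbind(observed = c(mean = mean(y), sd = sd(y)),
      intervened = c(mean = mean(x_do1[, 2]), sd = sd(x_do1[, 2])))
```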
### Second Causal Graph Example
This is still the ice cream and shark attack example, but now imagine a world where eating ice cream makes people smell more delicious, thus causing sharks to attack them.
![Graph2](resources/causal_graphs/Graph2.jpg)
We can find that there is an arrow from X to Y now.
```{r}
set.seed(1)
n <- 1000
# Same as before, but now X also has a direct effect on Y
z <- rnorm(n, 0, 1)
x <- z + rnorm(n, 0, 0.1)
y <- x + z + rnorm(n, 0, 0.1)
# Pearson correlations: again all strongly positive
c(corr_xy = cor(x, y), corr_xz = cor(x, z), corr_yz = cor(y, z))
par(mfrow = c(1, 3))
scatter.smooth(x, y)
scatter.smooth(z, y)
scatter.smooth(z, x)
```
Again, we would like to test the causal effect of intervening on people's ice cream consumption, which yields the following graph.
### Second Causal Graph After Intervention
![Graph2 Intervention](resources/causal_graphs/Graph2_Intervention.jpg)
```{r}
# Simulate data from the SCM under do(X = x): the x -> y edge remains
intervene_x <- function(x, n = 1000) {
  z <- rnorm(n, 0, 1)
  y <- x + z + rnorm(n, 0, 0.1)
  cbind(x, y, z)
}
set.seed(1)
x_do1 <- intervene_x(x = 2)
par(mfrow = c(1, 2))  # two panels, one per histogram
hist(
  y, xlab = 'Y', breaks = 30,
  xlim = c(-6, 6), col = 'gray76', main = 'P(Y)'
)
hist(
  x_do1[, 2], xlab = 'Y', breaks = 20,
  xlim = c(-6, 6), col = 'gray76', main = 'P(Y | do(X = 2))'
)
```
Now, when we have people eat 2 ice creams, do(X = 2), slightly more than the previous average consumption of zero, the probability distribution of Y shifts to the right: sharks are more likely to attack people who consume more ice cream. In this case, we cannot eliminate causality between ice cream consumption and the shark attack rate, since the intervention itself shifts the distribution of Y.
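Numerically, the mean of Y now shifts by roughly the value we set X to:
```{r}
# Under do(X = 2), mean(Y) moves from about 0 to about 2
c(mean_obs = mean(y), mean_do = mean(x_do1[, 2]))
```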
## Intro to Simpson's Paradox
Now I would like to shift our attention to another famous topic in statistics: Simpson's Paradox. Briefly, Simpson's Paradox describes a situation where the data show a positive (or negative) correlation, but that correlation flips sign when we separate the data by another variable. We will use the Palmer Penguins dataset to show an occurrence of Simpson's Paradox and how causal inference helps interpret this phenomenon.
### Example of Palmer Penguins Dataset
![Cute Penguins](resources/causal_graphs/Penguin_image.jpg)
The dataset was collected and published by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, originally for species research purposes. Each row records features of a penguin: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and so on. In our case, we will only look at bill length (X), bill depth (Y), and species (Z), since Simpson's Paradox appears among these three variables.
Here is a brief overview of what the dataset looks like.
```{r}
set.seed(9450)
penguin_df<-
palmerpenguins::penguins %>%
na.omit()
DT::datatable(head(penguin_df))
```
### Relationship Between Bill Length (X) And Bill Depth (Y)
First, not taking species into consideration, we would like to see the correlation between bill lengths and bill depths of penguins in the whole dataset.
```{r}
# Fit a pooled regression of bill depth on bill length, ignoring species
lin_reg <- lm(bill_depth_mm ~ bill_length_mm, data = penguin_df)
penguin_df %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
geom_point() +
geom_abline(slope = lin_reg$coefficients[[2]],
intercept = lin_reg$coefficients[[1]],
color="red") +
labs(x="Length", y="Depth",
title="Regression of Depth as a function of Length") +
theme_classic()
```
The red regression line shows a clear negative correlation.
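We can quantify the pooled relationship directly with `cor()`:
```{r}
# Pooled Pearson correlation between bill length and depth (negative)
cor(penguin_df$bill_length_mm, penguin_df$bill_depth_mm)
```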
### Relationship Between Bill Length (X) And Bill Depth (Y) Separated By Species
Now, if we separate the data into the three species, Adelie, Chinstrap, and Gentoo, we find something different.
```{r}
chin<-
penguin_df %>%
filter(species == "Chinstrap")
adelie<-
penguin_df %>%
filter(species == "Adelie")
gentoo<-
penguin_df %>%
filter(species == "Gentoo")
lm_chin<- lm(data=chin, bill_depth_mm ~ bill_length_mm)
lm_adelie<- lm(data=adelie, bill_depth_mm ~ bill_length_mm)
lm_gentoo<- lm(data=gentoo, bill_depth_mm ~ bill_length_mm)
penguin_df %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm,
color=species)) +
geom_point() +
geom_abline(slope = lm_chin$coefficients[[2]],
intercept = lm_chin$coefficients[[1]],
color="black") +
geom_abline(slope = lm_adelie$coefficients[[2]],
intercept = lm_adelie$coefficients[[1]],
color="black") +
geom_abline(slope = lm_gentoo$coefficients[[2]],
intercept = lm_gentoo$coefficients[[1]],
color="black") +
labs(x="Length", y="Depth",
title="Regression of Depth as a function of Length") +
theme_classic()
```
Surprisingly, the negative correlation in the aggregated data flips once we separate the data by species. This is an instance of Simpson's Paradox.
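The sign flip is easy to confirm numerically with per-species correlations:
```{r}
# Within each species, bill length and depth are positively correlated
penguin_df %>%
  group_by(species) %>%
  summarise(cor_len_depth = cor(bill_length_mm, bill_depth_mm))
```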
The following graphs are plotted separately based on different species.
```{r}
# Given the same Z, there exists some positive correlation to X and Y.
lin_reg_adelie <- lm(bill_depth_mm ~ bill_length_mm, data=adelie)
std_adelie <- sd(adelie$bill_depth_mm)
adelie_plot <- adelie %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
geom_point(color="red") +
geom_abline(slope = lin_reg_adelie$coefficients[[2]],
intercept = lin_reg_adelie$coefficients[[1]],
color="red") +
labs(x="Length", y="Depth",
title="Regression of Depth as a function of Length (Adelie)") +
theme_classic()
lin_reg_chin <- lm(bill_depth_mm ~ bill_length_mm, data=chin)
std_chin <- sd(chin$bill_depth_mm)
chin_plot <- chin %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
geom_point(color="green") +
geom_abline(slope = lin_reg_chin$coefficients[[2]],
intercept = lin_reg_chin$coefficients[[1]],
color="red") +
labs(x="Length", y="Depth",
title="Regression of Depth as a function of Length (Chin)") +
theme_classic()
lin_reg_gentoo <- lm(bill_depth_mm ~ bill_length_mm, data=gentoo)
std_gentoo <- sd(gentoo$bill_depth_mm)
gentoo_plot <- gentoo %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
geom_point(color="blue") +
geom_abline(slope = lin_reg_gentoo$coefficients[[2]],
intercept = lin_reg_gentoo$coefficients[[1]],
color="red") +
labs(x="Length", y="Depth",
title="Regression of Depth as a function of Length (Gentoo)") +
theme_classic()
grid.arrange(adelie_plot, chin_plot, gentoo_plot, nrow = 3, ncol = 1,
             heights = unit(c(1.5, 1.5, 1.5), "in"))
```
### A Closer Look at the Distributions in Two Bill-Length Bins
Now, to better understand the patterns in the data, we take a closer look at the distribution of bill depths given a certain range of bill lengths. We compare two bins: bill lengths between 40 and 41.9 mm (X0) and between 42 and 43.9 mm (X1).
```{r}
# Bin boundaries for the two bill-length groups we compare
x0_lo <- 40
x0_hi <- 41.9
x1_lo <- 42
x1_hi <- 43.9
adelie_x0 <- adelie %>%
  filter(between(bill_length_mm, x0_lo, x0_hi))
chin_x0 <- chin %>%
  filter(between(bill_length_mm, x0_lo, x0_hi))
gentoo_x0 <- gentoo %>%
  filter(between(bill_length_mm, x0_lo, x0_hi))
adelie_x1 <- adelie %>%
  filter(between(bill_length_mm, x1_lo, x1_hi))
chin_x1 <- chin %>%
  filter(between(bill_length_mm, x1_lo, x1_hi))
gentoo_x1 <- gentoo %>%
  filter(between(bill_length_mm, x1_lo, x1_hi))
# Use nrow() (a plain integer) as the repeat count; count() returns a tibble
data_x0 <- data.frame(
  type = c(rep("Adelie_x0", nrow(adelie_x0)), rep("Chin_x0", nrow(chin_x0)), rep("Gentoo_x0", nrow(gentoo_x0))),
  value = c(adelie_x0$bill_depth_mm, chin_x0$bill_depth_mm, gentoo_x0$bill_depth_mm)
)
data_x1 <- data.frame(
  type = c(rep("Adelie_x1", nrow(adelie_x1)), rep("Chin_x1", nrow(chin_x1)), rep("Gentoo_x1", nrow(gentoo_x1))),
  value = c(adelie_x1$bill_depth_mm, chin_x1$bill_depth_mm, gentoo_x1$bill_depth_mm)
)
p1 <- data_x0 %>%
ggplot( aes(x=value, fill=type)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#69b3a2", "#404080","#add61d")) +
theme_ipsum() +
scale_x_continuous(limits = c(13, 22))+
geom_vline(aes(xintercept = mean(value)),col='red',size=2)+
labs(fill="")
p2 <- data_x1 %>%
ggplot( aes(x=value, fill=type)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#69b3a2", "#404080","#add61d")) +
theme_ipsum() +
scale_x_continuous(limits = c(13, 22))+
geom_vline(aes(xintercept = mean(value)),col='red',size=2)+
labs(fill="")
grid.arrange(p1, p2, ncol = 1)
```
X0: bill_length_mm between 40 and 41.9
X1: bill_length_mm between 42 and 43.9
These two histograms give us a sense of how Simpson's Paradox arises from the data distribution.
For a given species such as Gentoo, the data are not balanced across the two bins, and this imbalance within each species directly produces the paradox. The shorter-bill group (X0) has a higher average bill depth not because shorter bills are deeper, but because more penguins in that group are Adelie, which have deeper bills, while more penguins in the longer-bill group (X1) are Gentoo, which have shallower bills.
Thus, such species imbalance makes the aggregated data unreliable if we do not take species into account.
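We can verify this imbalance directly by counting species within each bin:
```{r}
# Species counts per bin: X0 is dominated by Adelie, X1 by Gentoo
table(data_x0$type)
table(data_x1$type)
```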
### Connecting Palmer Penguins Example With Causal Intervention
Now let us connect this example with the intervention operation introduced earlier.
We assume the causal graph is X <- Z -> Y, i.e., there is no causal effect of X on Y.
We can use the back-door adjustment to compute P(Y | do(X = x)) without having to intervene directly on penguins' bills; we will not derive it in this tutorial.
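For reference, the back-door adjustment formula expresses the interventional distribution purely in terms of observational quantities, using species $Z$ as the adjustment set:
$$P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)$$
The data generating function below follows this recipe: a prior-weighted combination of the per-species regression predictions.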
We now need a data generating function that formulates the result of the intervention operation. Based on the linear regressions fitted above and the back-door adjustment equation, we can write out this function. I also assume some random noise in penguins' bill depths, with normally distributed errors.
![Causal Graph Assumption](resources/causal_graphs/Penguin_assumption.jpg)
```{r}
n <- dim(penguin_df)[1]
# Species priors P(Z = z), estimated from the data
prior_adelie <- sum(penguin_df$species == "Adelie") / n
prior_chin <- sum(penguin_df$species == "Chinstrap") / n
prior_gentoo <- sum(penguin_df$species == "Gentoo") / n
# Back-door adjustment: combine the species-specific regression predictions
# (plus normal noise), weighted by the species priors
intervene_x <- function(x, n = 300) {
  y <- (lin_reg_adelie$coefficients[2] * x + lin_reg_adelie$coefficients[1] + rnorm(n, 0, std_adelie)) * prior_adelie +
    (lin_reg_chin$coefficients[2] * x + lin_reg_chin$coefficients[1] + rnorm(n, 0, std_chin)) * prior_chin +
    (lin_reg_gentoo$coefficients[2] * x + lin_reg_gentoo$coefficients[1] + rnorm(n, 0, std_gentoo)) * prior_gentoo
  cbind(x, y)
}
set.seed(1)
x_do1 <- intervene_x(x = mean(penguin_df$bill_length_mm))  # mean bill length is about 44 mm
x_do2 <- intervene_x(x = 60)
par(mfrow = c(1, 3))
hist(
penguin_df$bill_depth_mm, xlab = 'Y', breaks = 20,
xlim = c(10, 25), col = 'gray76', main = 'P(Y)',
prob = TRUE
)
lines(density(penguin_df$bill_depth_mm), col="red", lwd=2) # add a density estimate with defaults
abline(v=mean(penguin_df$bill_depth_mm),col="blue",lwd=2)
hist(
x_do1[, 2], xlab = 'Y', breaks = 20,
xlim = c(10, 25), col = 'gray76', main = 'P(Y | do(X = 44))',
prob = TRUE
)
lines(density(x_do1[,2]), col="red", lwd=2)
abline(v=mean(x_do1[,2]),col="blue",lwd=2)
hist(
x_do2[, 2], xlab = 'Y', breaks = 20,
xlim = c(10, 25), col = 'gray76', main = 'P(Y | do(X = 60))',
prob = TRUE
)
lines(density(x_do2[,2]), col="red", lwd=2)
abline(v=mean(x_do2[,2]),col="blue",lwd=2)
```
There are three deductions we can make from these graphs:
(1) Comparing the histogram of bill depths under the original distribution with the distributions after intervening on bill length (do(X = 44) and do(X = 60)), we find that the result varies as we intervene, i.e., under do(X = x). This implies some causal relationship between the bill length and the bill depth of these penguins.
(2) Looking at the blue lines, Y has a positive causal relationship with X, rather than the negative one the pooled data initially suggested. This is consistent with our assumption that species determines both the penguins' bill lengths and their bill depths. (Species (Z) is a confounder!)
(3) Finally, we can return to our original causal graph assumption. Since Y still responds to X after the intervention, there should be a causal relationship between bill length and bill depth, represented as a direct edge. The corrected graph is shown below:
![Causal Graph Inferred](resources/causal_graphs/Penguin_graph.jpg)
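A side note on the simulation: `intervene_x()` combines the three species-specific draws as a prior-weighted sum, which reproduces the mean of the back-door adjustment but narrows its spread. Here is a sketch of an alternative that simulates the full mixture, sampling the species from its prior first and then drawing Y from that species' regression (the helper name `intervene_x_mixture` is my own, not from the original tutorial):
```{r}
# Alternative sketch: simulate do(X = x) as a true mixture over species,
# sampling Z ~ P(Z) first, then Y | Z from that species' regression
intervene_x_mixture <- function(x, n = 300) {
  z <- sample(c("Adelie", "Chinstrap", "Gentoo"), n, replace = TRUE,
              prob = c(prior_adelie, prior_chin, prior_gentoo))
  fits <- list(Adelie = lin_reg_adelie, Chinstrap = lin_reg_chin, Gentoo = lin_reg_gentoo)
  sds <- c(Adelie = std_adelie, Chinstrap = std_chin, Gentoo = std_gentoo)
  y <- vapply(z, function(s) {
    b <- fits[[s]]$coefficients
    unname(b[2] * x + b[1] + rnorm(1, 0, sds[[s]]))
  }, numeric(1))
  cbind(x, y)
}
set.seed(1)
hist(intervene_x_mixture(x = 60)[, 2], breaks = 20, col = 'gray76',
     xlab = 'Y', main = 'Mixture P(Y | do(X = 60))')
```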
## Final Remarks
Causal inference is increasingly recognized in both the industry and academia of data science. It is valuable and meaningful to learn some causal inference techniques to broaden our skill sets.
Even though I am still taking the course, I wanted to share some of what I have just learned. Concepts in causality such as intervention and causal diagrams are very interesting, and if you would like to learn more, please read the works of Judea Pearl: books such as "Causality" and "The Book of Why" are excellent materials for learning more about causal inference.
## References
1. An Introduction to Causal Inference: https://www.r-bloggers.com/2019/11/an-introduction-to-causal-inference/
2. Simpson's Paradox by RStudio: https://rpubs.com/shampjeff/blog_post_2
3. Pearl, J., Glymour, M., & Jewell, N. P. (2019). Causal Inference in Statistics: A Primer. John Wiley & Sons.
4. Pearl, J. (2009). Causality. Cambridge University Press.
5. Bareinboim, E., Correa, J. D., Ibeling, D., & Icard, T. (2022). On Pearl’s hierarchy and the foundations of causal inference. Probabilistic and Causal Inference, 507–556. https://doi.org/10.1145/3501714.3501743
6. Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
7. palmerpenguins dataset: https://cran.r-project.org/web/packages/palmerpenguins/readme/README.html