# Ceteris-paribus Oscillations
**Learning objectives:**
- Describe a measure to identify the most interesting or important profiles.
## Basic idea {-}
1. If an explanatory variable has a **large influence** on prediction for a particular instance, then its corresponding CP profile must present **large fluctuations**.
2. If an explanatory variable has **little influence** on prediction for a particular instance, then its corresponding CP profile must present **barely any fluctuations** and stay close to the **original prediction of the model**.
   - The **sum of differences** between the **profile** and the **prediction** _(across all possible values of the explanatory variable)_ should be close to zero.
## Graphical representation {-}
The sum of differences can be represented by the **area** between the **profile** and the **line at prediction level**.
<br>
![Source: Figure 11.1](img/11-cp-oscilations/01-visual-example.png){width=75% height=75%}
## Method {-}
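**Remember**, the one-dimensional CP profile of the $j$-th explanatory variable gives the model's prediction for every possible value $z$ of that variable, with all other variables fixed at the values of the observation of interest $\underline{x}_*$.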
$$
h^{j}_{\underline{x}_*}(z) \equiv f\left(\underline{x}_*^{j|=z}\right).
$$
$vip_{CP}^j(\underline{x}_*)$ is the **expected absolute deviation** of the **CP profile** $h^{j}_{\underline{x}_*}()$ from the **model's prediction** $f(\underline{x}_*)$, computed over the distribution $g^j(z)$ of the $j$-th explanatory variable.
$$
vip_{CP}^j(\underline{x}_*) = \int_{\mathcal R} |h^{j}_{\underline{x}_*}(z) - f(\underline{x}_*)| g^j(z)dz=E_{X^j}\left\{|h^{j}_{\underline{x}_*}(X^j) - f(\underline{x}_*)|\right\}.
$$
## Method challenge {-}
The **true distribution** of the $j$-th explanatory variable is usually **unknown**, so there are two alternatives:
1. Assume that $g^j(z)$ is a **uniform distribution** over the range of variable $X^j$ and average over $k$ selected values _(all unique values or an equidistant grid)_ of the $j$-th explanatory variable.
$$
\widehat{vip}_{CP}^{j,uni}(\underline{x}_*) = \frac 1k \sum_{l=1}^k |h^{j}_{\underline{x}_*}(z_l) - f(\underline{x}_*)|
$$
2. Use **all $n$ observations** in the dataset to estimate the **empirical distribution** of $X^{j}$, although this might require more **computation time**.
$$
\widehat{vip}_{CP}^{j,emp}(\underline{x}_*) = \frac 1n \sum_{i=1}^n |h^{j}_{\underline{x}_*}(x^{j}_i) - f(\underline{x}_*)|
$$
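The two estimators can be sketched in base R as follows. This is a minimal illustration on a made-up model and dataset, not the DALEX implementation: the helper names `cp_profile()` and `vip_cp()`, the toy model `f_toy()`, and the simulated data are all assumptions introduced here.

```{r}
# Hypothetical toy model: strong dependence on x1, weak dependence on x2.
f_toy <- function(data) 2 * data$x1^2 + 0.1 * data$x2

# CP profile h^j(z): predictions when variable j is set to each value in z
# and all other variables stay at the values of x_star.
cp_profile <- function(f, x_star, j, z) {
  grid <- x_star[rep(1, length(z)), , drop = FALSE]
  grid[[j]] <- z
  f(grid)
}

# Mean absolute deviation of the CP profile from f(x_star) over the points z.
vip_cp <- function(f, x_star, j, z) {
  mean(abs(cp_profile(f, x_star, j, z) - f(x_star)))
}

set.seed(1)
toy_data   <- data.frame(x1 = rexp(200), x2 = runif(200))  # simulated data
x_star_toy <- data.frame(x1 = 1, x2 = 0.5)                 # instance to explain

# Uniform estimator: k = 101 equidistant points over the observed range of x1.
z_uni   <- seq(min(toy_data$x1), max(toy_data$x1), length.out = 101)
vip_uni <- vip_cp(f_toy, x_star_toy, "x1", z_uni)

# Empirical estimator: every observed value of x1 enters the average once.
vip_emp <- vip_cp(f_toy, x_star_toy, "x1", toy_data$x1)

c(uniform = vip_uni, empirical = vip_emp)
```

The only difference between the two calls is the vector of points `z`: a grid for the uniform version, the observed column for the empirical one.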
## Local vs global importance {-}
Let's assume a simple model defined as the interaction of $X^1$ and $X^2$, both taking values in $[0,1]$.
$$
f(x^1, x^2) = x^1 \cdot x^2
$$
- Globally both variables are **equally important**, because the model is symmetrical.
- Locally, if the instance we want to explain $\underline{x}_*$ has $x^1 = 0$ and $x^2 = 1$, then the importance of $X^1$ is **larger** than that of $X^2$:
  - $h^1_{\underline{x}_*}(z) = z$, because $x^2 = 1$, so the profile changes with $z$.
  - $h^2_{\underline{x}_*}(z) = 0$ for any value of $z$, because $x^1 = 0$, so the profile is flat.
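The local importance in this example can be checked numerically. The sketch below assumes an equidistant grid on $[0,1]$ with step 0.01 (a choice made here for illustration) and uses the uniform estimator.

```{r}
# The model from this slide: f(x1, x2) = x1 * x2 on [0, 1] x [0, 1].
f_prod <- function(x1, x2) x1 * x2

x_star_prod <- c(x1 = 0, x2 = 1)                  # instance to explain
pred_prod   <- f_prod(x_star_prod[["x1"]],
                      x_star_prod[["x2"]])        # f(x_star) = 0
z_grid      <- seq(0, 1, by = 0.01)               # assumed equidistant grid

h1 <- f_prod(z_grid, x_star_prod[["x2"]])         # CP profile for X^1: equals z
h2 <- f_prod(x_star_prod[["x1"]], z_grid)         # CP profile for X^2: always 0

# Uniform estimator: mean absolute deviation from the prediction.
vip_x1 <- mean(abs(h1 - pred_prod))
vip_x2 <- mean(abs(h2 - pred_prod))
c(vip_x1 = vip_x1, vip_x2 = vip_x2)
```

On this grid the measure for $X^1$ is the mean of the grid points, which matches $\int_0^1 |z - 0|\,dz = 1/2$, while the measure for $X^2$ is exactly zero.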
## Example: Henry - random forest model {-}
Both alternatives suggest that the most important variables are **gender** and **age**, followed by **class**.
![Source: Figure 11.2](img/11-cp-oscilations/02-henry-results.png){width=75% height=75%}
## Example: Henry - random forest model {-}
The **sibsp** variable gains in relative importance for estimator $\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)$ as it has a very **skewed distribution**.
![Source: Figure 4.3](img/04-datasets-models/01-sibsp-distribution.png){width=40% height=40%}
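The effect of a skewed distribution on the two estimators can be illustrated with a small sketch. The data and the CP profile below are made up for illustration (they are not the Titanic data or the random forest model): rare large values receive the same weight as common ones under the uniform estimator, but very little weight under the empirical one.

```{r}
# Made-up skewed variable: most observations are 0, a few are large.
x_skew <- c(rep(0, 80), rep(1, 15), rep(2, 3), 5, 8)

# Hypothetical CP profile: the prediction drops only for values >= 2.
h_skew    <- function(z) ifelse(z >= 2, 0.1, 0.4)
pred_skew <- h_skew(0)            # instance of interest has value 0

# Uniform over unique values: each of 0, 1, 2, 5, 8 gets weight 1/5.
vip_uni_skew <- mean(abs(h_skew(unique(x_skew)) - pred_skew))

# Empirical: weights follow the observed frequencies.
vip_emp_skew <- mean(abs(h_skew(x_skew) - pred_skew))

c(uniform = vip_uni_skew, empirical = vip_emp_skew)
```

In this toy setting three of the five unique values deviate from the prediction, but they account for only 5 of the 100 observations, so the uniform estimate is much larger than the empirical one.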
## Example: Henry - random forest model {-}
The measure does not describe **how** the variables influence the prediction. That information comes from the CP profiles themselves:
- If Henry were older, his predicted probability of survival would be significantly lower.
- If Henry were not travelling alone, his chances of survival would increase.
![Source: Figure 10.4](img/04-datasets-models/02-CP-RF-numeric.png){width=80% height=80%}
## Pros and cons {-}
|**Pros**|**BD Plots**|**iBD plots**|**Shapley values**|**LIME**|**CP profiles + oscillations**|
|:----------------------------------------------------|:-:|:-:|:-:|:-:|:-:|
|Good for correlated explanatory variables | | | | | |
|Not time-consuming for large models |X| | |X|-|
|Sum up to the instance prediction |X|X|X| | |
|Good for models including interactions | |X| | | |
|Helps to avoid false-positive findings | | |X| |X|
|Easy to understand with large number of variables | | | |X|X|
|Useful tool for sensitivity analysis | | | | |X|
## Loading the data and model {-}
```{r}
# Retrieve the data, the fitted random forest model, and Henry's observation
titanic_imputed <- archivist::aread("pbiecek/models/27e5c")
titanic_rf <- archivist::aread("pbiecek/models/4e0fc")
(henry <- archivist::aread("pbiecek/models/a6538"))
```
## Creating the explainer {-}
```{r, message=FALSE, warning=FALSE}
library("randomForest")
library("DALEX")
explain_rf <- DALEX::explain(model = titanic_rf,
                             data = titanic_imputed[, -9],
                             y = titanic_imputed$survived == "yes",
                             label = "Random Forest")
predict(explain_rf, henry)
```
## Creating oscillations uniform {-}
```{r}
oscillations_uniform <- predict_parts(explainer = explain_rf,
                                      new_observation = henry,
                                      type = "oscillations_uni")
oscillations_uniform
```
## Plotting uniform results {-}
```{r}
oscillations_uniform$`_ids_` <- "Henry"
plot(oscillations_uniform) +
  ggplot2::ggtitle("Ceteris-paribus Oscillations",
                   "Expectation over uniform distribution (unique values)")
```
## Plotting empirical results {-}
```{r}
predict_parts(explainer = explain_rf,
              new_observation = henry,
              type = "oscillations_emp") |>
  dplyr::mutate(`_ids_` = "Henry") |>
  plot() +
  ggplot2::ggtitle("Ceteris-paribus Oscillations",
                   "Expectation over empirical distribution")
```
## Applying a custom grid {-}
```{r}
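# Supply the evaluation points per variable: dense equidistant grids for the
# continuous variables, all observed levels for the categorical ones
oscillations_equidist <- predict_parts(explain_rf, henry,
  variable_splits = list(age = seq(0, 65, 0.1),
                         fare = seq(0, 200, 0.1),
                         gender = unique(titanic_imputed$gender),
                         class = unique(titanic_imputed$class)),
  type = "oscillations")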
oscillations_equidist
```
## Plotting custom grid {-}
```{r}
oscillations_equidist$`_ids_` <- "Henry"
plot(oscillations_equidist) +
  ggplot2::ggtitle("Ceteris-paribus Oscillations",
                   "Expectation over specified grid of points")
```
## Meeting Videos {-}
### Cohort 1 {-}
`r knitr::include_url("https://www.youtube.com/embed/URL")`
<details>
<summary> Meeting chat log </summary>
```
LOG
```
</details>