# Logistic Regression
In this chapter, we continue our discussion of classification. We introduce our first model for classification, logistic regression. To begin, we return to the `Default` dataset from the previous chapter.
```{r}
library(ISLR)
library(tibble)
as_tibble(Default)
```
We also repeat the test-train split from the previous chapter.
```{r}
set.seed(42)
default_idx = sample(nrow(Default), 5000)
default_trn = Default[default_idx, ]
default_tst = Default[-default_idx, ]
```
## Linear Regression
Before moving on to logistic regression, why not try plain, old linear regression?
```{r}
default_trn_lm = default_trn
default_tst_lm = default_tst
```
Since linear regression expects a numeric response variable, we coerce the response to be numeric. (Notice that we also shift the results, as we require `0` and `1`, not the `1` and `2` that `as.numeric()` produces from the factor.) We have also copied the dataset so that we can return to the original data, with its factor response, later.
```{r}
default_trn_lm$default = as.numeric(default_trn_lm$default) - 1
default_tst_lm$default = as.numeric(default_tst_lm$default) - 1
```
Why would we think this should work? Recall that,
$$
\hat{\mathbb{E}}[Y \mid X = x] = X\hat{\beta}.
$$
Since $Y$ is limited to values of $0$ and $1$, we have
$$
\mathbb{E}[Y \mid X = x] = P(Y = 1 \mid X = x).
$$
It would then seem that $\mathbf{X}\hat{\beta}$ is a reasonable estimate of $P(Y = 1 \mid X = x)$. We test this on the `Default` data.
```{r}
model_lm = lm(default ~ balance, data = default_trn_lm)
```
Everything seems to be working, until we plot the results.
```{r, fig.height=5, fig.width=7}
plot(default ~ balance, data = default_trn_lm,
col = "darkorange", pch = "|", ylim = c(-0.2, 1),
main = "Using Linear Regression for Classification")
abline(h = 0, lty = 3)
abline(h = 1, lty = 3)
abline(h = 0.5, lty = 2)
abline(model_lm, lwd = 3, col = "dodgerblue")
```
Two issues arise. First, all of the predicted probabilities are below 0.5. That means we would classify every observation as a `"No"`. This is certainly possible, but not what we would expect.
```{r}
all(predict(model_lm) < 0.5)
```
The next, and bigger, issue is predicted probabilities less than 0.
```{r}
any(predict(model_lm) < 0)
```
## Bayes Classifier
Why are we using a predicted probability of 0.5 as the cutoff for classification? Recall the Bayes Classifier, which minimizes the classification error:
$$
C^B(x) = \underset{g}{\mathrm{argmax}} \ P(Y = g \mid X = x)
$$
So, in the binary classification problem, we will use predicted probabilities
$$
\hat{p}(x) = \hat{P}(Y = 1 \mid { X = x})
$$
and
$$
\hat{P}(Y = 0 \mid { X = x})
$$
and then classify to the larger of the two. We actually only need to consider a single probability, usually $\hat{P}(Y = 1 \mid { X = x})$. Since we use it so often, we give it the shorthand notation, $\hat{p}(x)$. Then the classifier is written,
$$
\hat{C}(x) =
\begin{cases}
1 & \hat{p}(x) > 0.5 \\
0 & \hat{p}(x) \leq 0.5
\end{cases}
$$
This classifier essentially estimates the Bayes Classifier, and thus seeks to minimize classification errors.
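As a tiny sketch of this rule (not run, using made-up probabilities rather than anything estimated from data), classification is just a comparison against 0.5:
```{r, eval = FALSE}
# hypothetical estimated probabilities of Y = 1 for five observations
p_hat = c(0.10, 0.45, 0.51, 0.80, 0.99)
# classify to the more probable class by comparing to the 0.5 cutoff
ifelse(p_hat > 0.5, 1, 0)
```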
## Logistic Regression with `glm()`
To better estimate the probability
$$
p(x) = P(Y = 1 \mid {X = x})
$$
we turn to logistic regression. The model is written
$$
\log\left(\frac{p(x)}{1 - p(x)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.
$$
Rearranging, we see the probabilities can be written as
$$
p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}} = \sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)
$$
Notice that we use the sigmoid function, $\sigma$, as shorthand notation; it appears often in the deep learning literature. It takes any real input and outputs a number between 0 and 1. How useful! (This is actually a particular sigmoid function called the logistic function, but since it is by far the most popular sigmoid function, the term "sigmoid function" is often used to refer to the logistic function.)
$$
\sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}
$$
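As a small sketch (not run), we can define this function in `R` and check that it squashes any real input into the interval $(0, 1)$:
```{r, eval = FALSE}
sigmoid = function(x) {
  1 / (1 + exp(-x))
}
# large negative inputs map near 0, large positive inputs map near 1
sigmoid(c(-10, -1, 0, 1, 10))
```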
The model is fit by numerically maximizing the likelihood, which we will let `R` take care of.
We start with a single predictor example, again using `balance` as our single predictor.
```{r}
model_glm = glm(default ~ balance, data = default_trn, family = "binomial")
```
Fitting this model looks very similar to fitting a simple linear regression. Instead of `lm()` we use `glm()`. The only other difference is the use of `family = "binomial"` which indicates that we have a two-class categorical response. Using `glm()` with `family = "gaussian"` would perform the usual linear regression.
First, we can obtain the fitted coefficients the same way we did with linear regression.
```{r}
coef(model_glm)
```
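To see what "maximizing the likelihood" refers to, the following sketch (not run) computes the Bernoulli log-likelihood at the fitted probabilities by hand and compares it to the value `R` stores with the fitted model. (This only checks the objective; internally `glm()` maximizes it via iteratively reweighted least squares.)
```{r, eval = FALSE}
# Bernoulli log-likelihood evaluated at the fitted probabilities
p_hat = fitted(model_glm)
y = as.numeric(default_trn$default) - 1
sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))
# agrees with the log-likelihood reported by R
logLik(model_glm)
```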
The next thing we should understand is how the `predict()` function works with `glm()`. So, let's look at some predictions.
```{r}
head(predict(model_glm))
```
By default, `predict.glm()` uses `type = "link"`.
```{r}
head(predict(model_glm, type = "link"))
```
That is, `R` is returning
$$
\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
$$
for each observation.
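As a quick check (not run), these values are indeed the linear combination of the estimated coefficients and `balance`:
```{r, eval = FALSE}
# manual linear predictor: beta_0_hat + beta_1_hat * balance
head(coef(model_glm)[1] + coef(model_glm)[2] * default_trn$balance)
# the same values returned by the default predict() call
head(predict(model_glm, type = "link"))
```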
Importantly, these are **not** predicted probabilities. To obtain the predicted probabilities
$$
\hat{p}(x) = \hat{P}(Y = 1 \mid X = x)
$$
we need to use `type = "response"`
```{r}
head(predict(model_glm, type = "response"))
```
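These probabilities are just the link-scale predictions pushed through the logistic function, as a quick sketch (not run) confirms:
```{r, eval = FALSE}
# apply the logistic (sigmoid) function to the link-scale predictions
head(1 / (1 + exp(-predict(model_glm, type = "link"))))
# the same values returned by type = "response"
head(predict(model_glm, type = "response"))
```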
Note that these are probabilities, **not** classifications. To obtain classifications, we will need to compare to the correct cutoff value with an `ifelse()` statement.
```{r}
model_glm_pred = ifelse(predict(model_glm, type = "link") > 0, "Yes", "No")
# model_glm_pred = ifelse(predict(model_glm, type = "response") > 0.5, "Yes", "No")
```
The line that is run is performing
$$
\hat{C}(x) =
\begin{cases}
1 & \hat{f}(x) > 0 \\
0 & \hat{f}(x) \leq 0
\end{cases}
$$
where
$$
\hat{f}(x) =\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p.
$$
The commented line, which would give the same results, is performing
$$
\hat{C}(x) =
\begin{cases}
1 & \hat{p}(x) > 0.5 \\
0 & \hat{p}(x) \leq 0.5
\end{cases}
$$
where
$$
\hat{p}(x) = \hat{P}(Y = 1 \mid X = x).
$$
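Because the logistic function is strictly increasing, comparing $\hat{f}(x)$ to $0$ and comparing $\hat{p}(x)$ to $0.5$ define the same rule, which a quick sketch (not run) verifies:
```{r, eval = FALSE}
# cutoff on the link scale versus cutoff on the probability scale
pred_link = ifelse(predict(model_glm, type = "link") > 0, "Yes", "No")
pred_prob = ifelse(predict(model_glm, type = "response") > 0.5, "Yes", "No")
all(pred_link == pred_prob)
```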
Once we have classifications, we can calculate metrics such as the training classification error rate.
```{r}
calc_class_err = function(actual, predicted) {
mean(actual != predicted)
}
```
```{r}
calc_class_err(actual = default_trn$default, predicted = model_glm_pred)
```
As we saw previously, the `table()` and `confusionMatrix()` functions can be used to quickly obtain many more metrics.
```{r, message = FALSE, warning = FALSE}
train_tab = table(predicted = model_glm_pred, actual = default_trn$default)
library(caret)
train_con_mat = confusionMatrix(train_tab, positive = "Yes")
c(train_con_mat$overall["Accuracy"],
train_con_mat$byClass["Sensitivity"],
train_con_mat$byClass["Specificity"])
```
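As a reminder of what these metrics measure, the following sketch (not run, treating `"Yes"` as the positive class) recovers them directly from the counts in `train_tab`:
```{r, eval = FALSE}
# sensitivity: proportion of actual "Yes" predicted as "Yes"
train_tab["Yes", "Yes"] / sum(train_tab[, "Yes"])
# specificity: proportion of actual "No" predicted as "No"
train_tab["No", "No"] / sum(train_tab[, "No"])
# accuracy: proportion of all observations classified correctly
sum(diag(train_tab)) / sum(train_tab)
```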
We could also write a custom function to compute this error for trained logistic regression models.
```{r}
get_logistic_error = function(mod, data, res = "y", pos = 1, neg = 0, cut = 0.5) {
probs = predict(mod, newdata = data, type = "response")
preds = ifelse(probs > cut, pos, neg)
calc_class_err(actual = data[, res], predicted = preds)
}
```
This function will be useful later when calculating train and test errors for several models at the same time.
```{r}
get_logistic_error(model_glm, data = default_trn,
res = "default", pos = "Yes", neg = "No", cut = 0.5)
```
To see how much better logistic regression is for this task, we create the same plot we used for linear regression.
```{r, fig.height=5, fig.width=7}
plot(default ~ balance, data = default_trn_lm,
col = "darkorange", pch = "|", ylim = c(-0.2, 1),
main = "Using Logistic Regression for Classification")
abline(h = 0, lty = 3)
abline(h = 1, lty = 3)
abline(h = 0.5, lty = 2)
curve(predict(model_glm, data.frame(balance = x), type = "response"),
add = TRUE, lwd = 3, col = "dodgerblue")
abline(v = -coef(model_glm)[1] / coef(model_glm)[2], lwd = 2)
```
This plot contains a wealth of information.
- The orange `|` characters are the data, $(x_i, y_i)$.
- The blue "curve" is the predicted probabilities given by the fitted logistic regression. That is,
$$
\hat{p}(x) = \hat{P}(Y = 1 \mid { X = x})
$$
- The solid vertical black line represents the **[decision boundary](https://en.wikipedia.org/wiki/Decision_boundary)**, the `balance` that obtains a predicted probability of 0.5. In this case `balance` = `r -coef(model_glm)[1] / coef(model_glm)[2]`.
The decision boundary is found by solving for points that satisfy
$$
\hat{p}(x) = \hat{P}(Y = 1 \mid { X = x}) = 0.5
$$
This is equivalent to points that satisfy
$$
\hat{\beta}_0 + \hat{\beta}_1 x_1 = 0.
$$
Thus, for logistic regression with a single predictor, the decision boundary is given by the *point*
$$
x_1 = \frac{-\hat{\beta}_0}{\hat{\beta}_1}.
$$
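A quick check (not run) that the predicted probability at this `balance` value really is 0.5:
```{r, eval = FALSE}
# the balance value where the estimated probability crosses 0.5
boundary = unname(-coef(model_glm)[1] / coef(model_glm)[2])
boundary
predict(model_glm, data.frame(balance = boundary), type = "response")
```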
The following is not run, but shows an alternative way to add the logistic curve to the plot.
```{r, eval = FALSE}
grid = seq(0, max(default_trn$balance), by = 0.01)
sigmoid = function(x) {
1 / (1 + exp(-x))
}
lines(grid, sigmoid(coef(model_glm)[1] + coef(model_glm)[2] * grid), lwd = 3)
```
Using the usual formula syntax, it is easy to add or remove complexity from logistic regressions.
```{r}
model_1 = glm(default ~ 1, data = default_trn, family = "binomial")
model_2 = glm(default ~ ., data = default_trn, family = "binomial")
model_3 = glm(default ~ . ^ 2 + I(balance ^ 2),
data = default_trn, family = "binomial")
```
Note that using polynomial transformations of predictors allows a linear model to have non-linear decision boundaries.
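As a small illustration (not run, using a hypothetical model that is quadratic in `balance` only, fit purely for this sketch), the estimated probability curve from such a model crosses 0.5 wherever a quadratic, rather than a linear, equation in `balance` equals zero:
```{r, eval = FALSE}
# a hypothetical quadratic-in-balance model, for illustration only
model_quad = glm(default ~ balance + I(balance ^ 2),
                 data = default_trn, family = "binomial")
# the decision boundary is the set of balance values where the
# estimated probability equals 0.5, i.e. where the (now quadratic)
# linear predictor equals zero
curve(predict(model_quad, data.frame(balance = x), type = "response"),
      from = 0, to = max(default_trn$balance),
      xlab = "balance", ylab = "estimated probability", lwd = 2)
abline(h = 0.5, lty = 2)
```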
```{r}
model_list = list(model_1, model_2, model_3)
train_errors = sapply(model_list, get_logistic_error, data = default_trn,
res = "default", pos = "Yes", neg = "No", cut = 0.5)
test_errors = sapply(model_list, get_logistic_error, data = default_tst,
res = "default", pos = "Yes", neg = "No", cut = 0.5)
```
Here we see the misclassification error rates for each model. The train error decreases as the models become more flexible, while the test error decreases at first, then begins to increase. Everything we learned about the bias-variance tradeoff for regression also applies here.
```{r}
diff(train_errors)
diff(test_errors)
```
We call `model_2` the **additive** logistic model, which we will use quite often.
## ROC Curves
Let's return to our simple model with only balance as a predictor.
```{r}
model_glm = glm(default ~ balance, data = default_trn, family = "binomial")
```
We write a function which allows us to make predictions based on different probability cutoffs.
```{r}
get_logistic_pred = function(mod, data, res = "y", pos = 1, neg = 0, cut = 0.5) {
probs = predict(mod, newdata = data, type = "response")
ifelse(probs > cut, pos, neg)
}
```
That is, the function implements
$$
\hat{C}(x) =
\begin{cases}
1 & \hat{p}(x) > c \\
0 & \hat{p}(x) \leq c
\end{cases}
$$
Let's use this to obtain predictions using a low, medium, and high cutoff (0.1, 0.5, and 0.9).
```{r}
test_pred_10 = get_logistic_pred(model_glm, data = default_tst, res = "default",
pos = "Yes", neg = "No", cut = 0.1)
test_pred_50 = get_logistic_pred(model_glm, data = default_tst, res = "default",
pos = "Yes", neg = "No", cut = 0.5)
test_pred_90 = get_logistic_pred(model_glm, data = default_tst, res = "default",
pos = "Yes", neg = "No", cut = 0.9)
```
Now we evaluate accuracy, sensitivity, and specificity for these classifiers.
```{r}
test_tab_10 = table(predicted = test_pred_10, actual = default_tst$default)
test_tab_50 = table(predicted = test_pred_50, actual = default_tst$default)
test_tab_90 = table(predicted = test_pred_90, actual = default_tst$default)
test_con_mat_10 = confusionMatrix(test_tab_10, positive = "Yes")
test_con_mat_50 = confusionMatrix(test_tab_50, positive = "Yes")
test_con_mat_90 = confusionMatrix(test_tab_90, positive = "Yes")
```
```{r}
metrics = rbind(
c(test_con_mat_10$overall["Accuracy"],
test_con_mat_10$byClass["Sensitivity"],
test_con_mat_10$byClass["Specificity"]),
c(test_con_mat_50$overall["Accuracy"],
test_con_mat_50$byClass["Sensitivity"],
test_con_mat_50$byClass["Specificity"]),
c(test_con_mat_90$overall["Accuracy"],
test_con_mat_90$byClass["Sensitivity"],
test_con_mat_90$byClass["Specificity"])
)
rownames(metrics) = c("c = 0.10", "c = 0.50", "c = 0.90")
metrics
```
We see that sensitivity decreases as the cutoff is increased. Conversely, specificity increases as the cutoff increases. This is useful if we are more interested in one particular type of error, instead of giving both errors equal weight.
Note that usually the best accuracy will be seen near $c = 0.50$.
Instead of manually checking cutoffs, we can create an ROC curve (receiver operating characteristic curve) which will sweep through all possible cutoffs, and plot the sensitivity and specificity.
```{r, message = FALSE, warning = FALSE}
library(pROC)
test_prob = predict(model_glm, newdata = default_tst, type = "response")
test_roc = roc(default_tst$default ~ test_prob, plot = TRUE, print.auc = TRUE)
as.numeric(test_roc$auc)
```
A good model will have a high AUC; that is, it achieves high sensitivity and high specificity for as many cutoffs as possible.
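To demystify what `roc()` is doing, the following sketch (not run) sweeps a grid of cutoffs by hand and records the sensitivity and false positive rate at each one; these pairs trace out the ROC curve:
```{r, eval = FALSE}
cutoffs = seq(0.01, 0.99, by = 0.01)
actual = default_tst$default
roc_points = t(sapply(cutoffs, function(cutoff) {
  pred = ifelse(test_prob > cutoff, "Yes", "No")
  c(fpr = mean(pred[actual == "No"] == "Yes"),   # 1 - specificity
    tpr = mean(pred[actual == "Yes"] == "Yes"))  # sensitivity
}))
plot(roc_points, type = "l", xlab = "False Positive Rate",
     ylab = "True Positive Rate", main = "ROC curve, computed by hand")
```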
## Multinomial Logistic Regression
What if the response contains more than two categories? For that we need multinomial logistic regression.
$$
P(Y = k \mid { X = x}) = \frac{e^{\beta_{0k} + \beta_{1k} x_1 + \cdots + \beta_{pk} x_p}}{\sum_{g = 1}^{G} e^{\beta_{0g} + \beta_{1g} x_1 + \cdots + \beta_{pg} x_p}}
$$
We will omit the details, as ISL has as well. If you are interested, the [Wikipedia page](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) provides a rather thorough coverage. Also note that the above is an example of the [softmax function](https://en.wikipedia.org/wiki/Softmax_function).
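As a small sketch of the softmax (not run), it simply exponentiates a vector of linear predictors, one per class, and normalizes so the resulting probabilities sum to 1:
```{r, eval = FALSE}
softmax = function(eta) {
  exp(eta) / sum(exp(eta))
}
# hypothetical linear predictors for three classes
softmax(c(2, 0.5, -1))
sum(softmax(c(2, 0.5, -1)))
```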
As an example of a dataset with a three category response, we use the `iris` dataset, which is so famous, it has its own [Wikipedia entry](https://en.wikipedia.org/wiki/Iris_flower_data_set). It is also a default dataset in `R`, so no need to load it.
Before proceeding, we test-train split this data.
```{r}
set.seed(430)
iris_obs = nrow(iris)
iris_idx = sample(iris_obs, size = trunc(0.50 * iris_obs))
iris_trn = iris[iris_idx, ]
iris_test = iris[-iris_idx, ]
```
To perform multinomial logistic regression, we use the `multinom` function from the `nnet` package. Training using `multinom()` is done using similar syntax to `lm()` and `glm()`. We add the `trace = FALSE` argument to suppress information about updates to the optimization routine as the model is trained.
```{r}
library(nnet)
model_multi = multinom(Species ~ ., data = iris_trn, trace = FALSE)
summary(model_multi)$coefficients
```
Notice we are only given coefficients for two of the three classes, much like only needing coefficients for one class in logistic regression.
A difference between `glm()` and `multinom()` is how the `predict()` function operates.
```{r}
head(predict(model_multi, newdata = iris_trn))
head(predict(model_multi, newdata = iris_trn, type = "prob"))
```
Notice that by default, classifications are returned. When obtaining probabilities, we are given the predicted probability for **each** class.
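A quick sketch (not run) confirming that each row of the probability predictions sums to 1, and that the default classifications are simply the most probable class for each observation:
```{r, eval = FALSE}
probs = predict(model_multi, newdata = iris_trn, type = "prob")
# each row is a probability distribution over the three species
head(rowSums(probs))
# the default classifications pick the most probable class
all(predict(model_multi, newdata = iris_trn) == colnames(probs)[max.col(probs)])
```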
Interestingly, you've just fit a neural network, and you didn't even know it! (Hence the `nnet` package.) Later we will discuss the connections between logistic regression, multinomial logistic regression, and simple neural networks.
## `rmarkdown`
The `rmarkdown` file for this chapter can be found [**here**](10-logistic.Rmd). The file was created using `R` version `r paste0(version$major, "." ,version$minor)`. The following packages (and their dependencies) were loaded when knitting this file:
```{r, echo = FALSE}
names(sessionInfo()$otherPkgs)
```