-
Notifications
You must be signed in to change notification settings - Fork 62
/
09-classification.Rmd
298 lines (215 loc) · 11.2 KB
/
09-classification.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
# (PART) Classification {-}
# Overview {#classification-overview}
**Classification** is a form of **supervised learning** where the response variable is categorical, as opposed to numeric for regression. *Our goal is to find a rule, algorithm, or function which takes as input a feature vector, and outputs a category which is the true category as often as possible.*
![](images/classification.png)
That is, the classifier $\hat{C}(x)$ returns the predicted category $\hat{y}(X)$.
$$
\hat{y}(x) = \hat{C}(x)
$$
To build our first classifier, we will use the `Default` dataset from the `ISLR` package.
```{r}
library(ISLR)
library(tibble)
as_tibble(Default)
```
Our goal is to properly classify individuals as defaulters based on student status, credit card balance, and income. Be aware that the response `default` is a factor, as is the predictor `student`.
```{r}
is.factor(Default$default)
is.factor(Default$student)
```
As we did with regression, we test-train split our data. In this case, using 50% for each.
```{r}
set.seed(42)
default_idx = sample(nrow(Default), 5000)
default_trn = Default[default_idx, ]
default_tst = Default[-default_idx, ]
```
## Visualization for Classification
Often, some simple visualizations can suggest simple classification rules. To quickly create some useful visualizations, we use the `featurePlot()` function from the `caret()` package.
```{r, message = FALSE, warning = FALSE}
library(caret)
```
A density plot can often suggest a simple split based on a numeric predictor. Essentially this plot graphs a density estimate
$$
\hat{f}_{X_i}(x_i \mid Y = k)
$$
for each numeric predictor $x_i$ and each category $k$ of the response $y$.
```{r, fig.height = 5, fig.width = 10}
featurePlot(x = default_trn[, c("balance", "income")],
y = default_trn$default,
plot = "density",
scales = list(x = list(relation = "free"),
y = list(relation = "free")),
adjust = 1.5,
pch = "|",
layout = c(2, 1),
auto.key = list(columns = 2))
```
Some notes about the arguments to this function:
- `x` is a data frame containing only **numeric predictors**. It would be nonsensical to estimate a density for a categorical predictor.
- `y` is the response variable. It needs to be a factor variable. If coded as `0` and `1`, you will need to coerce to factor for plotting.
- `plot` specifies the type of plot, here `density`.
- `scales` defines the scale of the axes for each plot. By default, the axis of each plot would be the same, which often is not useful, so the arguments here, a different axis for each plot, will almost always be used.
- `adjust` specifies the amount of smoothing used for the density estimate.
- `pch` specifies the **p**lot **ch**aracter used for the bottom of the plot.
- `layout` places the individual plots into rows and columns. For some odd reason, it is given as (col, row).
- `auto.key` defines the key at the top of the plot. The number of columns should be the number of categories.
It seems that the income variable by itself is not particularly useful. However, there seems to be a big difference in default status at a `balance` of about 1400. We will use this information shortly.
```{r, fig.height = 5, fig.width = 10, message = FALSE, warning = FALSE}
featurePlot(x = default_trn[, c("balance", "income")],
y = default_trn$student,
plot = "density",
scales = list(x = list(relation = "free"),
y = list(relation = "free")),
adjust = 1.5,
pch = "|",
layout = c(2, 1),
auto.key = list(columns = 2))
```
Above, we create a similar plot, except with `student` as the response. We see that students often carry a slightly larger balance, and have far lower income. This will be useful to know when making more complicated classifiers.
```{r, fig.height = 6, fig.width = 6, message = FALSE, warning = FALSE}
featurePlot(x = default_trn[, c("student", "balance", "income")],
y = default_trn$default,
plot = "pairs",
auto.key = list(columns = 2))
```
We can use `plot = "pairs"` to consider multiple variables at the same time. This plot reinforces using `balance` to create a classifier, and again shows that `income` seems not that useful.
```{r, fig.height = 6, fig.width = 6, message = FALSE, warning = FALSE}
library(ellipse)
featurePlot(x = default_trn[, c("balance", "income")],
y = default_trn$default,
plot = "ellipse",
auto.key = list(columns = 2))
```
Similar to `pairs` is a plot of type `ellipse`, which requires the `ellipse` package. Here we only use numeric predictors, as essentially we are assuming multivariate normality. The ellipses mark points of equal density. This will be useful later when discussing LDA and QDA.
## A Simple Classifier
A very simple classifier is a rule based on a boundary $b$ for a particular input variable $x$.
$$
\hat{C}(x) =
\begin{cases}
1 & x > b \\
0 & x \leq b
\end{cases}
$$
Based on the first plot, we believe we can use `balance` to create a reasonable classifier. In particular,
$$
\hat{C}(\texttt{balance}) =
\begin{cases}
\text{Yes} & \texttt{balance} > 1400 \\
\text{No} & \texttt{balance} \leq 1400
\end{cases}
$$
So we predict an individual is a defaulter if their `balance` is above 1400, and not a defaulter if the balance is 1400 or less.
```{r}
simple_class = function(x, boundary, above = 1, below = 0) {
ifelse(x > boundary, above, below)
}
```
We write a simple `R` function that compares a variable to a boundary, then use it to make predictions on the train and test sets with our chosen variable and boundary.
```{r}
default_trn_pred = simple_class(x = default_trn$balance,
boundary = 1400, above = "Yes", below = "No")
default_tst_pred = simple_class(x = default_tst$balance,
boundary = 1400, above = "Yes", below = "No")
head(default_tst_pred, n = 10)
```
## Metrics for Classification
In the classification setting, there are a large number of metrics to assess how well a classifier is performing.
One of the most obvious things to do is arrange predictions and true values in a cross table.
```{r}
(trn_tab = table(predicted = default_trn_pred, actual = default_trn$default))
```
```{r}
(tst_tab = table(predicted = default_tst_pred, actual = default_tst$default))
```
Often we give specific names to individual cells of these tables, and in the predictive setting, we would call this table a [**confusion matrix**](https://en.wikipedia.org/wiki/Confusion_matrix). Be aware, that the placement of Actual and Predicted values affects the names of the cells, and often the matrix may be presented transposed.
In statistics, we label the errors Type I and Type II, but these are hard to remember. False Positive and False Negative are more descriptive, so we choose to use these.
![](images/confusion.png)
The `confusionMatrix()` function from the `caret` package can be used to obtain a wealth of additional information, which we see output below for the test data. Note that we specify which category is considered "positive."
```{r}
trn_con_mat = confusionMatrix(trn_tab, positive = "Yes")
(tst_con_mat = confusionMatrix(tst_tab, positive = "Yes"))
```
The most common, and most important metric is the **classification error rate**.
$$
\text{err}(\hat{C}, \text{Data}) = \frac{1}{n}\sum_{i = 1}^{n}I(y_i \neq \hat{C}(x_i))
$$
Here, $I$ is an indicator function, so we are essentially calculating the proportion of predicted classes that match the true class.
$$
I(y_i \neq \hat{C}(x)) =
\begin{cases}
1 & y_i \neq \hat{C}(x) \\
0 & y_i = \hat{C}(x) \\
\end{cases}
$$
It is also common to discuss the **accuracy**, which is simply one minus the error.
Like regression, we often split the data, and then consider Train (Classification) Error and Test (Classification) Error will be used as a measure of how well a classifier will work on unseen future data.
$$
\text{err}_{\texttt{trn}}(\hat{C}, \text{Train Data}) = \frac{1}{n_{\texttt{trn}}}\sum_{i \in \texttt{trn}}^{}I(y_i \neq \hat{C}(x_i))
$$
$$
\text{err}_{\texttt{tst}}(\hat{C}, \text{Test Data}) = \frac{1}{n_{\texttt{tst}}}\sum_{i \in \texttt{tst}}^{}I(y_i \neq \hat{C}(x_i))
$$
Accuracy values can be found by calling `confusionMatrix()`, or, if stored, can be accessed directly. Here, we use them to obtain error rates.
```{r}
1 - trn_con_mat$overall["Accuracy"]
```
```{r}
1 - tst_con_mat$overall["Accuracy"]
```
Sometimes guarding against making certain errors, FP or FN, are more important than simply finding the best accuracy. Thus, sometimes we will consider **sensitivity** and **specificity**.
$$
\text{Sens} = \text{True Positive Rate} = \frac{\text{TP}}{\text{P}} = \frac{\text{TP}}{\text{TP + FN}}
$$
```{r}
tst_con_mat$byClass["Sensitivity"]
```
$$
\text{Spec} = \text{True Negative Rate} = \frac{\text{TN}}{\text{N}} = \frac{\text{TN}}{\text{TN + FP}}
$$
```{r}
tst_con_mat$byClass["Specificity"]
```
Like accuracy, these can easily be found using `confusionMatrix()`.
When considering how well a classifier is performing, often, it is understandable to assume that any accuracy in a binary classification problem above 0.50, is a reasonable classifier. This however is not the case. We need to consider the **balance** of the classes. To do so, we look at the **prevalence** of positive cases.
$$
\text{Prev} = \frac{\text{P}}{\text{Total Obs}}= \frac{\text{TP + FN}}{\text{Total Obs}}
$$
```{r}
trn_con_mat$byClass["Prevalence"]
tst_con_mat$byClass["Prevalence"]
```
Here, we see an extremely low prevalence, which suggests an even simpler classifier than our current based on `balance`.
$$
\hat{C}(\texttt{balance}) =
\begin{cases}
\text{No} & \texttt{balance} > 1400 \\
\text{No} & \texttt{balance} \leq 1400
\end{cases}
$$
This classifier simply classifies all observations as negative cases.
```{r}
pred_all_no = simple_class(default_tst$balance,
boundary = 1400, above = "No", below = "No")
table(predicted = pred_all_no, actual = default_tst$default)
```
The `confusionMatrix()` function won't even accept this table as input, because it isn't a full matrix, only one row, so we calculate error rates directly. To do so, we write a function.
```{r}
calc_class_err = function(actual, predicted) {
mean(actual != predicted)
}
```
```{r}
calc_class_err(actual = default_tst$default,
predicted = pred_all_no)
```
Here we see that the error rate is exactly the prevalence of the minority class.
```{r}
table(default_tst$default) / length(default_tst$default)
```
This classifier does better than the previous. But the point is, in reality, to create a good classifier, we should obtain a test error better than 0.033, which is obtained by simply manipulating the prevalences. Next chapter, we'll introduce much better classifiers which should have no problem accomplishing this task.
## `rmarkdown`
The `rmarkdown` file for this chapter can be found [**here**](09-classification.Rmd). The file was created using `R` version `r paste0(version$major, "." ,version$minor)`. The following packages (and their dependencies) were loaded when knitting this file:
```{r, echo = FALSE}
names(sessionInfo()$otherPkgs)
```