This repository has been archived by the owner on Aug 16, 2020. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy path05-3-Scoring.Rmd
565 lines (383 loc) · 18.2 KB
/
05-3-Scoring.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
<!------------------------------------------------------------------------------------>
<!-- CLEANUP AND RELOAD DATASETS -->
```{r echo=FALSE,message=FALSE}
library(tidyverse)
rm(loansTraining,
allFactorsAsCharacteristics,
allFactorsAsBins,
loanSampleBins,
loanSampleCharacteristics)
knitr::opts_chunk$set(
# Chunks
eval = TRUE,
cache = TRUE,
echo = TRUE,
message = FALSE,
warning = FALSE)
```
```{r 05-03-load-models,echo=FALSE,message=FALSE}
# Saved at the end of the modeling file
SGLM_B_train <- readRDS("datasets/SGLM_B_train.rds")
SGLM_B_retrain <- readRDS("datasets/SGLM_B_retrain.rds")
SGLM_B_reretrain <- readRDS("datasets/SGLM_B_reretrain.rds")
```
<!-- END -->
<!------------------------------------------------------------------------------------->
## Model results
### Final list of variables
We selected the model fitted during the second training.
$$\text{ }$$
```{r echo=TRUE,message=FALSE}
# Model to use
GLModel <- SGLM_B_retrain
```
$$\text{ }$$
We collect the list of selected bins.
$$\text{ }$$
```{r 05-03-model-names,echo=TRUE,message=FALSE}
# List of model variables
modelNames <-
attr(GLModel$coefficients, "names") %>%
enframe(x = .) %>%
rename(variableName = "value") %>%
select(variableName) %>%
mutate(variableName = str_remove_all(variableName, "\`"))
```
$$\text{ }$$
And we collect the model summary (coefficients, significance, etc.).
$$\text{ }$$
```{r 05-03-extract-coefficients,echo=TRUE,message=FALSE}
# and their coefficients in the model (the summary function for speedglm objects is not exported and
# bookdown seems to have a problem with that).
GLMCoefficients <-
speedglm:::summary.speedglm(GLModel)$coefficients %>%
as_tibble() %>%
cbind(modelNames) %>%
rename(
zValue = "z value",
pValue = "Pr(>|z|)",
stdErr = "Std. Error"
) %>%
# reorder columns to have names first
select(variableName, everything())
```
$$\text{ }$$
### Scoring
Scoring expresses the coefficients that were estimated during the logistic regression into points on a scale. The model estimates the log-odds that a loan will not default. We will provide a detailed
description of the calculations to, hopefully, provide guidance to others. (The references we have found in the course of our research have been surprisingly limited.)
Let us recall that our model will be in the form of a linear regression to model the logodds of the probability of default. That is:
$$\text{logodds} (p) = \text{intercept} + \sum_{\text{Variable} = k} \sum_{\text{bin} = i} \alpha_{k, i} x_{k, i}$$
where:
$$\text{logodds} (p) = \log \frac{p}{1-p}$$
Thanks to the properties of the logodds transformation, the model can take any value between $-\infty$ and $+\infty$ and yield valid probability values.
The idea of scoring is to replace this linear relationship with a points system: if an applicant ticks a box on a particular question/variable, he/she gets so many points. For example, asked for an age band, an applicant would receive 10 points if between 18 and 25 year-old, and 25 points if between 25 and 32. The age bands (the variable bins) were calculated above.
Basic convenience and marketing common sense dictates that:
+ points should be rounded (who wants to see 14.99432529875 on an application form?)
+ those points should not be negative.
+ Scorecards do not start with an initial piggybank of points. In terms of modeling, that means no intercept.
The idea of scorecard applies to model similar to this one. Marketing consideration might be relevant to our context. They would be completely irrelevant to scorecard for a disease risk assessment or probability of failure of a mechanical part.
Scoring will replace the $\alpha_{k, i}$ coefficients with scores $S_{k, i}$. In addition, we will get rid of the intercept. To remove the intercept, we will apportion it equally across all variables:
$$\text{logodds} (p) = \sum_{\text{Variable} = k} { \left( \frac{\text{intercept}}{K} + \sum_{\text{bin} = i} \alpha_{k, i} x_{k, i} \right) }$$
where $K$ is the number of variables. And if $I_k$ is the number of bins for variable $k$:
$$\text{logodds} (p) = \sum_{\text{Variable k} = 1}^K { \sum_{\text{bin i} = 1}^{I_k} \left( \frac{\text{intercept}}{K . I_k} + \alpha_{k, i} x_{k, i} \right) }$$
Then we select a factor $F$ that will dilate/contract the coefficients such that:
$$\text{logodds} (p) = \sum_{\text{Variable} = k} { \left( \frac{\text{intercept}}{K} + \sum_{\text{bin} = i} \alpha_{k, i} x_{k, i} \right) }$$
### Model scorecard
The conversion is done using three parameters that are chosen somewhat arbitrarily and which define a line:
+ the number of points increase / decrease that would reflect halving / doubling the odds of defaulting;
+ an _anchoring_ score reflecting a particular odd.
For our purpose, we will choose 5,000 points ($\text{Score}_{anchor}$) being equivalent to 1 in a 20 to default ($\text{Odds}_{anchor}$), i.e. $\text{Score}_{anchor}$ 5,000 points <=> $\text{Odds}_{Anchor} = \frac{1 / 20}{1 - 1/20}$. We will also choose 100 to reflect _times 2_ change in odds ($DoubleOdds$). Those choices are completely arbitrary. Basically, the score is a linear representation of the odds. The score is defined by a point (the anchor) and the slope of the line going through that point.
Those values will reflect the total estimated score for a borrower (i.e. loan sample). The number of characteristics (information points gathered in a credit application), or number of bins, have to be irrelevant in calculating this score. In other words, if LendingClub were to gather 5 more information points, the score should, __mutatis mutandis__, be unchanged. (However, we would hope that the quality of the estimated score would improve.)
The score per variable needs to be adjusted using the number of information points, that is the number of characteristics. Here, the model has been trained on the number of bins. We first need to determine how many characteristics are used in the model.
$$\text{ }$$
```{r 05-03-calculate-n-characteristics,echo=TRUE,message=FALSE}
# The list of characteristics is extracted from the list of bins names.
numberOfCharacteristics <-
modelNames %>%
# List of all names excluding the intercept which is not part of the scoring calculations (it
# would mean giving points for free as a base line)
filter(variableName != "(Intercept)") %>%
# Extract strings ("[a-zA-Z0-9-_]*") preceded by an opening bracket ("(?<=\\()") and followed by a
# closing bracket ("(?=\\))"). (The column names were formatted this way for that purpose.) See
# "https://github.com/rstudio/cheatsheets/blob/master/strings.pdf" for details on the regex.
mutate(characteristic = str_match(variableName, "(?<=\\()[a-zA-Z0-9-_]*(?=\\))")) %>%
# We are only interested in how many distinct characteristic there are.
distinct(characteristic) %>%
nrow()
numberOfCharacteristics
```
$$\text{ }$$
We can now perform the scoring calculation.
$$\text{ }$$
```{r 05-03-scoring-parameters,echo=TRUE,message=FALSE}
ProbDefaultAtAnchor <- 1/20
# The model is trained so that the 'Good' outcome is that a loan does not default. The odds that
# therefore calculated on the probability of no default.
ProbAtAnchor <- 1 - ProbDefaultAtAnchor
OddsAnchor <- ProbAtAnchor / (1 - ProbAtAnchor)
ScoreAnchor <- 2000
# Doubling odds = 100 points
DoubleOdds <- 100
ScoreFactor <- OddsAnchor / log(2)
# 5,000 points is 20:1 odds of default
ScoreOffset <- ScoreAnchor - ScoreFactor * log(OddsAnchor)
```
$$\text{ }$$
Then we apportion across characteristics.
$$\text{ }$$
```{r 05-03-parameters-per-chr,echo=TRUE,message=FALSE}
# Score at the intercept
Intercept <- summary(GLModel)$coefficients["(Intercept)", "Estimate"]
ScorePerVariable <- (ScoreFactor * Intercept + ScoreOffset) / numberOfCharacteristics
InterceptPerVariable <- Intercept * ScoreFactor / numberOfCharacteristics
GLMScores <-
GLMCoefficients %>%
mutate(
weight = Estimate * ScoreFactor,
weightScaled = weight + ScorePerVariable,
points = round(weightScaled)
)
```
$$\text{ }$$
__Very important__: The intercept points have been allocated across all characteristics. Therefore the regression coefficient estimated for the intercept becomes redundant and needs removing.
$$\text{ }$$
```{r echo=TRUE,message=FALSE}
GLMScores[1, "points"] <- 0
```
$$\text{ }$$
The `speedglm` package does not have a `predict` function once models have been trained. However, using the model is a simple matrix multiplication: $\text{Loan matrix} \times \text{Scorecard weights}$
## Training set
```{r 05-03-load-factored-datasets,echo=FALSE,message=FALSE}
# Reload the right datasets
allFactorsAsBins <- readRDS("datasets/allBins100.rds")
loansTraining <- readRDS("datasets/LoansTraining.rds")
loansTest <- readRDS("datasets/LoansTest.rds")
```
We first review how the fitted model performed on the training set.
We remove every variable that is not in the list of variables in the model then convert into a matrix.
$$\text{ }$$
```{r 05-03-convert-to-matrix,echo=TRUE,message=FALSE}
# Remove every variable that is not in the list of variables in the model then convert
# into a matrix
allMatrix <-
allFactorsAsBins[, !is.na(match(
names(allFactorsAsBins),
str_remove_all(GLMCoefficients$variableName, "\`")
))] %>%
as.matrix()
```
$$\text{ }$$
The coefficients of the model also include the interept (set at zero in the scores). We add a column of 1's to the training data.
$$\text{ }$$
```{r 05-03-matrix-add-1-col,echo=TRUE,message=FALSE}
# Add a column of 1s for the intercept
allMatrix <-
cbind(as.vector(rep.int(
x = 1, times = dim(allMatrix)[1]
)), allMatrix)
dim(allMatrix)
```
$$\text{ }$$
```{r message=FALSE,echo=FALSE}
# Done with this dataset
rm(allFactorsAsBins)
```
The coefficients of the model, then the scores, are converted to a vector format.
$$\text{ }$$
```{r 05-03-logit-estimates,echo=TRUE,message=FALSE}
CoefficientsVector <- GLMCoefficients$Estimate %>% as.matrix()
# Score per variable
TrainingScorecard <- allMatrix %*% ( GLMScores$points %>% as.matrix() )
```
$$\text{ }$$
We can now multiply the matrix of sample with the vector of coefficients.
$$\text{ }$$
```{r echo=TRUE,message=FALSE}
TrainingLogit <- allMatrix %*% CoefficientsVector
TrainingLogit <-
enframe(TrainingLogit[, 1]) %>%
mutate(oddsGood = exp(value),
p = 1 / (1 + oddsGood)) %>%
cbind(TrainingScorecard)
```
$$\text{ }$$
### Densities of the training results
We plot the results of the training model and group the results by rating ("A" to "G") in Figure \@ref(fig:05-03-training-hists-value).
$$\text{ }$$
```{r 05-03-training-hists-value,fig.cap="Model results on the training set",echo=TRUE,message=FALSE}
gridExtra::grid.arrange(
loansTraining %>%
cbind(TrainingLogit) %>%
filter(between(value, -2, 5)) %>%
ggplot(aes(value, col = grade)) +
geom_density(adjust = 0.5) +
ggtitle("Logit value"),
loansTraining %>%
cbind(TrainingLogit) %>%
ggplot(aes(p, col = grade)) +
geom_density(adjust = 0.5) +
scale_x_log10() +
ggtitle("Probability of being GOOD"),
loansTraining %>%
cbind(TrainingLogit) %>%
filter(between(oddsGood, 0.1, 100)) %>%
ggplot(aes(oddsGood, col = grade)) +
geom_density(adjust = 0.5) +
scale_x_log10() +
ggtitle("Odds of being GOOD"),
loansTraining %>%
cbind(TrainingLogit) %>%
ggplot(aes(TrainingScorecard, col = grade)) +
geom_density(adjust = 1) +
ggtitle("Scorecards"),
ncol = 2, nrow = 2
)
```
$$\text{ }$$
## Test set
We replicate the exact same steps on the test set loans, also converted into a matrix.
```{r echo=FALSE,message=FALSE}
# Prepare the full test set
bestBins <- readRDS("datasets/bestBins100.rds")
```
$$\text{ }$$
```{r 05-03-test-bins,echo=TRUE,message=FALSE}
predictionCategories <- loansTest[, "loanID"]
for (index in 1:length(bestBins$variable)) {
binned <-
binner::categoriseFromWoE.Wide(
df = loansTest,
varName = bestBins$variable[index],
woeTable = bestBins$WoE[[index]]
)
predictionCategories <- cbind(predictionCategories, binned)
}
```
$$\text{ }$$
$$\text{ }$$
```{r 05-03-test-matrix-select-variables,echo=TRUE,message=FALSE}
# Retain only the relevant scorecard categories
predictionMatrix <-
predictionCategories[, !is.na(match(
names(predictionCategories),
str_remove_all(GLMCoefficients$variableName, "\`")
))] %>%
as.matrix()
```
$$\text{ }$$
$$\text{ }$$
```{r 05-03-test-matrix-add-1-col,echo=TRUE,message=FALSE}
predictionMatrix <- cbind(as.vector(rep.int(x = 1, times = dim(predictionMatrix)[1])),
predictionMatrix)
```
$$\text{ }$$
$$\text{ }$$
```{r 05-03-test-matrix,echo=TRUE,message=FALSE}
TestLogit <- predictionMatrix %*% CoefficientsVector
TestLogit <-
tibble::enframe(TestLogit[, 1]) %>%
mutate(
p = 1 / (1 + exp(-value)),
oddsGood = if_else(is.infinite(p / (1 - p)), 1e10, p / (1 - p)))
predictionScorecard <- predictionMatrix %*% ( GLMScores$points %>% as.matrix() )
```
$$\text{ }$$
$$\text{ }$$
```{r 05-03-test-estimates,fig.cap="Density of loans by scorecard",echo=TRUE,message=FALSE}
loansTest %>%
cbind(TestLogit) %>%
cbind(predictionScorecard) %>%
filter(predictionScorecard > 0) %>%
ggplot(aes(predictionScorecard)) +
geom_density(col = "blue", fill = "lightblue", adjust = 3)
```
$$\text{ }$$
We can now see the same downward dynamics as training set
$$\text{ }$$
```{r 05-03-test-viz-value,fig.cap="Logit value predicted by the model",echo=TRUE,message=FALSE}
loansTest %>%
cbind(TestLogit) %>%
filter(between(value, -2, 5)) %>%
ggplot(aes(value, col = grade)) +
geom_density(adjust = 2)
```
$$\text{ }$$
$$\text{ }$$
```{r 05-03-test-viz-p,fig.cap="Probability of a loan being GOOD predicted by the model",echo=TRUE,message=FALSE}
loansTest %>%
cbind(TestLogit) %>%
ggplot(aes(p, col = grade)) +
geom_density(adjust = 2) +
scale_x_log10()
```
$$\text{ }$$
$$\text{ }$$
```{r 05-03-test-viz-odds,fig.cap="Odds of a loan being GOOD predicted by the model",echo=TRUE,message=FALSE}
loansTest %>%
cbind(TestLogit) %>%
filter(between(oddsGood, 0.1, 100)) %>%
ggplot(aes(oddsGood, col = grade)) +
geom_density(adjust = 2) +
scale_x_log10()
```
$$\text{ }$$
## Confusion matrix
Given a probability $p$ from the model, we use a $p = 0.50$ cut-off point to decide whether a loan is Good or Bad. The Confusion Matrix results are:
$$\text{ }$$
```{r 05-03-test-confusion-matrix,echo=TRUE,message=FALSE}
tCM <- loansTest %>%
cbind(TestLogit) %>%
select(p, isGoodLoan) %>%
mutate(p = if_else(p >= 0.50, "GOOD", "BAD"),
isGoodLoan = if_else(isGoodLoan, "GOOD", "BAD")) %>%
rename(Predicted = p,
Actual = isGoodLoan) %>%
table() %>%
caret::confusionMatrix(positive = "GOOD")
tCM
```
$$\text{ }$$
The results suggest that the model is effrective at predicting good loans (measured by the sensitivity = $\frac{TP}{TP + FN}$ = `r round(100*tCM$byClass["Sensitivity"], digits = 2)`%). However, this is deceptive since the dataset is unbalanced. The model is performing poorly at detecting bad loans (measured by the specificity = $\frac{TN}{TN + FP}$ = `r round(100*tCM$byClass["Specificity"], digits = 2)`%) The dataset is unbalanced and measuring the model's performance with a confusion matrix is imprecise. A much better approach would be to train many models on balanced datasets (by sampling a reduced 'good loans' dataset) and study the distribution of that resulting models and their parameters.
More critically, the confusion matrix does not (and cannot) reflect the consequence of getting predictions wrong. At the end of the day, the only relevant consequence is estimating the number of dollars lost on a loan. A misqualified loan might lead to a loss of a single dollar, or a million. Predicting a probability of default is not enough. We need to subsequently predict the expected loss when a particular loan defaults. In the conclusion, we suggest one possible avenue.
## ROC Curve
A popular measure for the performance of such a logistic regression model is to consider its _Receiver Operating Characteristic_ curve (_ROC_) and calculate its area under curve (_AUC_). We use the `ROCR` package ([@10.1093/bioinformatics/bti623]). We recommend \@fawcett2004roc for a very good overview of Receiver Operating Characteristics graphs.
We first create a prediction object that will be used for plotting.
$$\text{ }$$
```{r 05-03-test-ROC,echo=TRUE,message=FALSE}
ROCRPrediction <- ROCR::prediction(TestLogit$p, loansTest$isGoodLoan)
```
$$\text{ }$$
Figure \@ref(fig:05-03-ROC-curve) plots the Receiver Operating Characteristic curve of the regression. The area under the ROC curve is `r round(100 * ROCR::performance(ROCRPrediction, "auc")@y.values[[1]], 2)`%.
$$\text{ }$$
```{r 05-03-ROC-curve,echo=TRUE,message=FALSE,fig.cap="Receiver Operating Characteristic"}
ROCR::performance(ROCRPrediction,
measure = "tpr",
x.measure = "fpr") %>%
ROCR::plot(colorize = TRUE)
```
$$\text{ }$$
The AUC has an important statistical property: the AUC of a classifer is equivalent to the probability that the classifer will rank a randomly chosen positive instance higher than a randomly chosen negative instance. This is equivalent to the Wilcoxon test of ranks.The AUC is also closely related to the Gini index, which is twice the area between the diagonal and the ROC curve ($\text{Gini} + 1 = 2 \times \text{AUC}$).
$$\text{ }$$
```{r 05-03-precision-recall-curve,echo=TRUE,message=FALSE,fig.cap="Precision/Recall curve"}
ROCR::performance(ROCRPrediction,
measure = "prec",
x.measure = "rec") %>%
ROCR::plot(colorize = TRUE)
```
$$\text{ }$$
$$\text{ }$$
```{r 05-03-sens-spec-curve,echo=TRUE,message=FALSE,fig.cap="Sensitivity/Specificity curve"}
ROCR::performance(ROCRPrediction,
measure = "sens",
x.measure = "spec") %>%
ROCR::plot(colorize = TRUE)
```
$$\text{ }$$
$$\text{ }$$
```{r 05-03-lift-chart,echo=TRUE,message=FALSE,fig.cap="Lift Chart"}
ROCRPerformance <- ROCR::performance(ROCRPrediction,
measure = "lift",
x.measure = "rpp") %>%
ROCR::plot(colorize = TRUE)
```
$$\text{ }$$