05-3-Scoring.Rmd


<!------------------------------------------------------------------------------------>
<!-- CLEANUP AND RELOAD DATASETS                                                    -->

```{r echo=FALSE,message=FALSE}
library(tidyverse)
rm(loansTraining, 
   allFactorsAsCharacteristics, 
   allFactorsAsBins, 
   loanSampleBins, 
   loanSampleCharacteristics)

knitr::opts_chunk$set(
  # Chunks
  eval = TRUE,
  cache = TRUE,
  echo = TRUE,
  message = FALSE,
  warning = FALSE)
```


```{r 05-03-load-models,echo=FALSE,message=FALSE}
# Saved at the end of the modeling file
SGLM_B_train <- readRDS("datasets/SGLM_B_train.rds")
SGLM_B_retrain <- readRDS("datasets/SGLM_B_retrain.rds")
SGLM_B_reretrain <- readRDS("datasets/SGLM_B_reretrain.rds")
```


<!-- END                                                                             -->
<!------------------------------------------------------------------------------------->


## Model results


### Final list of variables


We selected the model fitted during the second training.

$$\text{ }$$
```{r echo=TRUE,message=FALSE}
# Model to use
GLModel <- SGLM_B_retrain
```
$$\text{ }$$


We collect the list of selected bins.

$$\text{ }$$
```{r 05-03-model-names,echo=TRUE,message=FALSE}
# List of model variables
modelNames <- 
  attr(GLModel$coefficients, "names") %>%
  enframe(x = .) %>%
  rename(variableName = "value") %>%
  select(variableName) %>%
  mutate(variableName = str_remove_all(variableName, "\`"))
```
$$\text{ }$$


And we collect the model summary (coefficients, significance, etc.). 

$$\text{ }$$
```{r 05-03-extract-coefficients,echo=TRUE,message=FALSE}
# and their coefficients in the model (the summary function for speedglm objects is not exported and
# bookdown seems to have a problem with that).

GLMCoefficients <- 
  speedglm:::summary.speedglm(GLModel)$coefficients %>%
  as_tibble() %>%
  cbind(modelNames) %>%
  rename(
    zValue = "z value",
    pValue = "Pr(>|z|)",
    stdErr = "Std. Error"
  ) %>%

  # reorder columns to have names first
  select(variableName, everything())

```
$$\text{ }$$


### Scoring

Scoring expresses the coefficients that were estimated during the logistic regression into points on a scale. The model estimates the log-odds that a loan will not default. We will provide a detailed 
description of the calculations to, hopefully, provide guidance to others. (The references we have found in the course of our research have been surprisingly limited.)

Let us recall that our model will be in the form of a linear regression to model the logodds of the probability of default. That is:

$$\text{logodds} (p) = \text{intercept} + \sum_{\text{Variable} = k} \sum_{\text{bin} = i} \alpha_{k, i} x_{k, i}$$

where:

$$\text{logodds} (p) = \log \frac{p}{1-p}$$

Thanks to the properties of the logodds transformation, the model can take any value between $-\infty$ and $+\infty$ and yield valid probability values. 


The idea of scoring is to replace this linear relationship with a points system: if an applicant ticks a box on a particular question/variable, he/she gets so many points. For example, asked for an age band, an applicant would receive 10 points if between 18 and 25 year-old, and 25 points if between 25 and 32. The age bands (the variable bins) were calculated above. 


Basic convenience and marketing common sense dictates that:  
 
  + points should be rounded (who wants to see 14.99432529875 on an application form?) 
  
  + those points should not be negative.
  
  + Scorecards do not start with an initial piggybank of points. In terms of modeling, that means no intercept.
  

The idea of scorecard applies to model similar to this one. Marketing consideration might be relevant to our context. They would be completely irrelevant to scorecard for a disease risk assessment or probability of failure of a mechanical part.

Scoring will replace the $\alpha_{k, i}$ coefficients with scores $S_{k, i}$. In addition, we will get rid of the intercept. To remove the intercept, we will apportion it equally across all variables: 


$$\text{logodds} (p) = \sum_{\text{Variable} = k} { \left( \frac{\text{intercept}}{K} + \sum_{\text{bin} = i} \alpha_{k, i} x_{k, i} \right) }$$
where $K$ is the number of variables. And if $I_k$ is the number of bins for variable $k$:

$$\text{logodds} (p) = \sum_{\text{Variable k} = 1}^K { \sum_{\text{bin i} = 1}^{I_k} \left( \frac{\text{intercept}}{K . I_k} + \alpha_{k, i} x_{k, i} \right) }$$


Then we select a factor $F$ that will dilate/contract the coefficients such that:

$$\text{logodds} (p) = \sum_{\text{Variable} = k} { \left( \frac{\text{intercept}}{K} + \sum_{\text{bin} = i} \alpha_{k, i} x_{k, i} \right) }$$


### Model scorecard

The conversion is done using  three parameters that are chosen somewhat arbitrarily and which define a line:

+ the number of points increase / decrease that would reflect halving / doubling the odds of defaulting;

+ an _anchoring_ score reflecting a particular odd.

For our purpose, we will choose 5,000 points ($\text{Score}_{anchor}$) being equivalent to 1 in a 20 to default ($\text{Odds}_{anchor}$), i.e. $\text{Score}_{anchor}$ 5,000 points <=> $\text{Odds}_{Anchor} = \frac{1 / 20}{1 - 1/20}$. We will also choose 100 to reflect _times 2_ change in odds ($DoubleOdds$). Those choices are completely arbitrary. Basically, the score is a linear representation of the odds. The score is defined by a point (the anchor) and the slope of the line going through that point.

Those values will reflect the total estimated score for a borrower (i.e. loan sample). The number of characteristics (information points gathered in a credit application), or number of bins, have to be irrelevant in calculating this score. In other words, if LendingClub were to gather 5 more information points, the score should, __mutatis mutandis__, be unchanged. (However, we would hope that the quality of the estimated score would improve.)

The score per variable needs to be adjusted using the number of information points, that is the number of characteristics. Here, the model has been trained on the number of bins. We first need to determine how many characteristics are used in the model.


$$\text{ }$$
```{r 05-03-calculate-n-characteristics,echo=TRUE,message=FALSE}

# The list of characteristics is extracted from the list of bins names.
numberOfCharacteristics <-

  modelNames %>%

  # List of all names excluding the intercept which is not part of the scoring calculations (it
  # would mean giving points for free as a base line)
  filter(variableName != "(Intercept)") %>%

  # Extract strings ("[a-zA-Z0-9-_]*") preceded by an opening bracket ("(?<=\\()") and followed by a
  # closing bracket ("(?=\\))"). (The column names were formatted this way for that purpose.) See
  # "https://github.com/rstudio/cheatsheets/blob/master/strings.pdf" for details on the regex.
  mutate(characteristic = str_match(variableName, "(?<=\\()[a-zA-Z0-9-_]*(?=\\))")) %>%

  # We are only interested in how many distinct characteristic there are.
  distinct(characteristic) %>%
  nrow()

numberOfCharacteristics
```
$$\text{ }$$


We can now perform the scoring calculation. 


$$\text{ }$$
```{r 05-03-scoring-parameters,echo=TRUE,message=FALSE}

ProbDefaultAtAnchor <- 1/20

# The model is trained so that the 'Good' outcome is that a loan does not default. The odds that
# therefore calculated on the probability of no default.
ProbAtAnchor <- 1 - ProbDefaultAtAnchor
OddsAnchor <- ProbAtAnchor / (1 - ProbAtAnchor)
ScoreAnchor <- 2000

# Doubling odds = 100 points
DoubleOdds <- 100

ScoreFactor <- OddsAnchor / log(2)

# 5,000 points is 20:1 odds of default
ScoreOffset <- ScoreAnchor - ScoreFactor * log(OddsAnchor)
```
$$\text{ }$$

Then we apportion across characteristics.

$$\text{ }$$
```{r 05-03-parameters-per-chr,echo=TRUE,message=FALSE}

# Score at the intercept
Intercept <- summary(GLModel)$coefficients["(Intercept)", "Estimate"]

ScorePerVariable <- (ScoreFactor * Intercept + ScoreOffset) / numberOfCharacteristics 
InterceptPerVariable <- Intercept * ScoreFactor / numberOfCharacteristics

GLMScores <-
  GLMCoefficients %>%
  mutate(
    weight = Estimate * ScoreFactor,
    weightScaled = weight + ScorePerVariable,
    points = round(weightScaled)
  )
```
$$\text{ }$$

__Very important__: The intercept points have been allocated across all characteristics. Therefore the regression coefficient estimated for the intercept becomes redundant and needs removing.

$$\text{ }$$
```{r echo=TRUE,message=FALSE}
GLMScores[1, "points"] <-  0
```
$$\text{ }$$


The `speedglm` package does not have a `predict` function once models have been trained. However, using the model is a simple matrix multiplication: $\text{Loan matrix} \times \text{Scorecard weights}$

## Training set

```{r 05-03-load-factored-datasets,echo=FALSE,message=FALSE}
# Reload the right datasets
allFactorsAsBins <- readRDS("datasets/allBins100.rds")
loansTraining <- readRDS("datasets/LoansTraining.rds")
loansTest <- readRDS("datasets/LoansTest.rds")
```


We first review how the fitted model performed on the training set.

We remove every variable that is not in the list of variables in the model then convert into a matrix.

$$\text{ }$$
```{r 05-03-convert-to-matrix,echo=TRUE,message=FALSE}
# Remove every variable that is not in the list of variables in the model then convert
# into a matrix
allMatrix <-
  allFactorsAsBins[, !is.na(match(
    names(allFactorsAsBins),
    str_remove_all(GLMCoefficients$variableName, "\`")
  ))] %>%
  as.matrix()
```
$$\text{ }$$

The coefficients of the model also include the interept (set at zero in the scores). We add a column of 1's to the training data.

$$\text{ }$$
```{r 05-03-matrix-add-1-col,echo=TRUE,message=FALSE}
# Add a column of 1s for the intercept
allMatrix <-
  cbind(as.vector(rep.int(
    x = 1, times = dim(allMatrix)[1]
  )), allMatrix)
dim(allMatrix)
```
$$\text{ }$$


```{r message=FALSE,echo=FALSE}
# Done with this dataset
rm(allFactorsAsBins)
```


The coefficients of the model, then the scores, are converted to a vector format.


$$\text{ }$$
```{r 05-03-logit-estimates,echo=TRUE,message=FALSE}
CoefficientsVector <- GLMCoefficients$Estimate %>% as.matrix()

# Score per variable
TrainingScorecard <- allMatrix %*% ( GLMScores$points %>% as.matrix() ) 
```
$$\text{ }$$
 

We can now multiply the matrix of sample with the vector of coefficients.

 
$$\text{ }$$
```{r echo=TRUE,message=FALSE}
TrainingLogit <- allMatrix %*% CoefficientsVector
TrainingLogit <-
  enframe(TrainingLogit[, 1]) %>%
  mutate(oddsGood = exp(value),
         p = 1 / (1 + oddsGood)) %>% 
  cbind(TrainingScorecard)

```
$$\text{ }$$


### Densities of the training results

We plot the results of the training model and group the results by rating ("A" to "G") in Figure \@ref(fig:05-03-training-hists-value).


$$\text{ }$$
```{r 05-03-training-hists-value,fig.cap="Model results on the training set",echo=TRUE,message=FALSE}

gridExtra::grid.arrange(
  loansTraining %>%
    cbind(TrainingLogit) %>%
    filter(between(value, -2, 5)) %>%
    ggplot(aes(value, col = grade)) +
    geom_density(adjust = 0.5) +
    ggtitle("Logit value"),
  
  loansTraining %>%
    cbind(TrainingLogit) %>%
    ggplot(aes(p, col = grade)) +
    geom_density(adjust = 0.5) +
    scale_x_log10() +
    ggtitle("Probability of being GOOD"),
  
  loansTraining %>%
    cbind(TrainingLogit) %>%
    filter(between(oddsGood, 0.1, 100)) %>%
    ggplot(aes(oddsGood, col = grade)) +
    geom_density(adjust = 0.5) +
    scale_x_log10() +
    ggtitle("Odds of being GOOD"),
  
  loansTraining %>%
    cbind(TrainingLogit) %>%
    ggplot(aes(TrainingScorecard, col = grade)) +
    geom_density(adjust = 1) +
    ggtitle("Scorecards"),

  ncol = 2, nrow = 2
)

```
$$\text{ }$$


## Test set

We replicate the exact same steps on the test set loans, also converted into a matrix.

```{r echo=FALSE,message=FALSE}
# Prepare the full test set
bestBins <- readRDS("datasets/bestBins100.rds")
```

$$\text{ }$$
```{r 05-03-test-bins,echo=TRUE,message=FALSE}
predictionCategories <- loansTest[, "loanID"]

for (index in 1:length(bestBins$variable)) {
  binned <-
    binner::categoriseFromWoE.Wide(
      df = loansTest,
      varName = bestBins$variable[index],
      woeTable = bestBins$WoE[[index]]
    )

  predictionCategories <- cbind(predictionCategories, binned)
}
```
$$\text{ }$$


$$\text{ }$$
```{r 05-03-test-matrix-select-variables,echo=TRUE,message=FALSE}
# Retain only the relevant scorecard categories
predictionMatrix <-
  predictionCategories[, !is.na(match(
    names(predictionCategories),
    str_remove_all(GLMCoefficients$variableName, "\`")
  ))] %>%
  as.matrix()
```
$$\text{ }$$


$$\text{ }$$
```{r 05-03-test-matrix-add-1-col,echo=TRUE,message=FALSE}
predictionMatrix <- cbind(as.vector(rep.int(x = 1, times = dim(predictionMatrix)[1])), 
                          predictionMatrix)
```
$$\text{ }$$


$$\text{ }$$
```{r 05-03-test-matrix,echo=TRUE,message=FALSE}
TestLogit <- predictionMatrix %*% CoefficientsVector
TestLogit <-
  tibble::enframe(TestLogit[, 1]) %>%
  mutate(
    p = 1 / (1 + exp(-value)),
    oddsGood = if_else(is.infinite(p / (1 - p)), 1e10, p / (1 - p)))

predictionScorecard <- predictionMatrix %*% ( GLMScores$points %>% as.matrix() ) 
```
$$\text{ }$$


$$\text{ }$$
```{r 05-03-test-estimates,fig.cap="Density of loans by scorecard",echo=TRUE,message=FALSE}
loansTest %>%
  cbind(TestLogit) %>%
  cbind(predictionScorecard) %>%
  filter(predictionScorecard > 0) %>%
  ggplot(aes(predictionScorecard)) +
  geom_density(col = "blue", fill = "lightblue", adjust = 3)

```
$$\text{ }$$


We can now see the same downward dynamics as training set


$$\text{ }$$
```{r 05-03-test-viz-value,fig.cap="Logit value predicted by the model",echo=TRUE,message=FALSE}
loansTest %>%
  cbind(TestLogit) %>%
  filter(between(value, -2, 5)) %>% 
  ggplot(aes(value, col = grade)) +
  geom_density(adjust = 2)
```
$$\text{ }$$


$$\text{ }$$
```{r 05-03-test-viz-p,fig.cap="Probability of a loan being GOOD predicted by the model",echo=TRUE,message=FALSE}
loansTest %>%
  cbind(TestLogit) %>%
  ggplot(aes(p, col = grade)) +
  geom_density(adjust = 2) +
  scale_x_log10()
```
$$\text{ }$$


$$\text{ }$$
```{r 05-03-test-viz-odds,fig.cap="Odds of a loan being GOOD predicted by the model",echo=TRUE,message=FALSE}
loansTest %>%
  cbind(TestLogit) %>%
  filter(between(oddsGood, 0.1, 100)) %>% 
  ggplot(aes(oddsGood, col = grade)) +
  geom_density(adjust = 2) + 
  scale_x_log10()
  
```
$$\text{ }$$


## Confusion matrix

Given a probability $p$ from the model, we use a $p = 0.50$ cut-off point to decide whether a loan is Good or Bad. The Confusion Matrix results are:

$$\text{ }$$
```{r 05-03-test-confusion-matrix,echo=TRUE,message=FALSE}
tCM <- loansTest %>%
  cbind(TestLogit) %>%
  select(p, isGoodLoan) %>% 
  
  mutate(p = if_else(p >= 0.50, "GOOD", "BAD"), 
         isGoodLoan = if_else(isGoodLoan, "GOOD", "BAD")) %>% 
  rename(Predicted = p, 
         Actual = isGoodLoan) %>% 
  
  table() %>% 
  caret::confusionMatrix(positive = "GOOD")

tCM

```
$$\text{ }$$


The results suggest that the model is effrective at predicting good loans (measured by the sensitivity = $\frac{TP}{TP + FN}$ = `r round(100*tCM$byClass["Sensitivity"], digits = 2)`%). However, this is deceptive since the dataset is unbalanced. The model is performing poorly at detecting bad loans (measured by the specificity = $\frac{TN}{TN + FP}$ = `r round(100*tCM$byClass["Specificity"], digits = 2)`%) The dataset is unbalanced and measuring the model's performance with a confusion matrix is imprecise. A much better approach would be to train many models on balanced datasets (by sampling a reduced 'good loans' dataset) and study the distribution of that resulting models and their parameters.

More critically, the confusion matrix does not (and cannot) reflect the consequence of getting predictions wrong. At the end of the day, the only relevant consequence is estimating the number of dollars lost on a loan. A misqualified loan might lead to a loss of a single dollar, or a million. Predicting a probability of default is not enough. We need to subsequently predict the expected loss when a particular loan defaults. In the conclusion, we suggest one possible avenue.


## ROC Curve

A popular measure for the performance of such a logistic regression model is to consider its _Receiver Operating Characteristic_ curve (_ROC_) and calculate its area under curve (_AUC_). We use the `ROCR` package ([@10.1093/bioinformatics/bti623]). We recommend \@fawcett2004roc for a very good overview of Receiver Operating Characteristics graphs.

We first create a prediction object that will be used for plotting.

$$\text{ }$$
```{r 05-03-test-ROC,echo=TRUE,message=FALSE}
ROCRPrediction <- ROCR::prediction(TestLogit$p, loansTest$isGoodLoan)
```
$$\text{ }$$

Figure \@ref(fig:05-03-ROC-curve) plots the Receiver Operating Characteristic curve of the regression. The area under the ROC curve is `r round(100 * ROCR::performance(ROCRPrediction, "auc")@y.values[[1]], 2)`%. 


$$\text{ }$$
```{r 05-03-ROC-curve,echo=TRUE,message=FALSE,fig.cap="Receiver Operating Characteristic"}

ROCR::performance(ROCRPrediction, 
                  measure = "tpr", 
                  x.measure = "fpr") %>% 
  ROCR::plot(colorize = TRUE) 

```
$$\text{ }$$
The AUC has an important statistical property: the AUC of a classifer is equivalent to the probability that the classifer will rank a randomly chosen positive instance higher than a randomly chosen negative instance. This is equivalent to the Wilcoxon test of ranks.The AUC is also closely related to the Gini index, which is twice the area between the diagonal and the ROC curve ($\text{Gini} + 1 = 2 \times \text{AUC}$). 


$$\text{ }$$
```{r 05-03-precision-recall-curve,echo=TRUE,message=FALSE,fig.cap="Precision/Recall curve"}

ROCR::performance(ROCRPrediction, 
                  measure = "prec",
                  x.measure = "rec") %>% 
  ROCR::plot(colorize = TRUE) 


```
$$\text{ }$$


$$\text{ }$$
```{r 05-03-sens-spec-curve,echo=TRUE,message=FALSE,fig.cap="Sensitivity/Specificity curve"}

ROCR::performance(ROCRPrediction, 
                  measure = "sens", 
                  x.measure = "spec") %>% 
  ROCR::plot(colorize = TRUE) 

```
$$\text{ }$$


$$\text{ }$$
```{r 05-03-lift-chart,echo=TRUE,message=FALSE,fig.cap="Lift Chart"}

ROCRPerformance <- ROCR::performance(ROCRPrediction, 
                  measure = "lift", 
                  x.measure = "rpp") %>% 
  ROCR::plot(colorize = TRUE) 


```
$$\text{ }$$