Nick Howlett 15 December 2016
This report details the analysis and construction of a logistic regression model for predicting loan application success. The data are sourced from: https://s3-eu-west-1.amazonaws.com/logicalgluebucket/interview/LendingClub.csv
The first task is to load the data into R and to examine its nature. In particular it is crucial to understand the nature of the independent variables that will be used to predict loan application success.
A broad overview of the data set is produced via a convenience function. The data summary is shown below.
df <- read.csv("LendingClub.csv")
df_sum <- makeDataSummary(df)
kable(df_sum)
Var | Class | NumLev | NumNA | ProcNA | ProcNum |
---|---|---|---|---|---|
Loan.Amount | integer | 804 | 0 | 0.0000 | 0.07 |
Loan.Term | factor | 2 | 0 | 0.0000 | 0.83 |
Employment.Length | numeric | 12 | 0 | 0.0000 | 0.22 |
Home.Ownership | factor | 5 | 0 | 0.0000 | 0.48 |
Annual.Income | numeric | 3615 | 0 | 0.0000 | 0.04 |
Loan.Purpose | factor | 15 | 0 | 0.0000 | 0.47 |
Address.State | factor | 51 | 0 | 0.0000 | 0.18 |
Debt.To.Income.Ratio | numeric | 2903 | 1 | 0.0000 | 0.01 |
No..Delinquencies.In.Last.2.Years | factor | 11 | 1 | 0.0000 | 0.89 |
Earliest.Credit.Line.Opened | numeric | 23843 | 0 | 0.0000 | 0.03 |
FICO.Credit.Score | integer | 41 | 1 | 0.0000 | 0.06 |
No..Inquiries.In.Last.6.Months | integer | 10 | 1 | 0.0000 | 0.45 |
Months.Since.Last.Delinquency | integer | 89 | 0 | 0.0000 | 0.65 |
No..Of.Credit.Lines | integer | 43 | 1 | 0.0000 | 0.10 |
No..Adverse.Public.Records | factor | 6 | 1 | 0.0000 | 0.95 |
Total.Credit.Balance | integer | 15483 | 1 | 0.0000 | 0.03 |
Use.Of.Credit.Line | numeric | 1035 | 52 | 0.0021 | 0.03 |
Total.Number.Of.Credit.Lines | integer | 81 | 1 | 0.0000 | 0.04 |
Loan.Application.Description | integer | 993 | 0 | 0.0000 | 0.28 |
No..Of.Public.Record.Bankruptcies | factor | 3 | 0 | 0.0000 | 0.96 |
Class | factor | 2 | 0 | 0.0000 | 0.82 |
The summary table contains six fields; 'Var', 'Class', 'NumLev', 'ProcNA' and 'ProcNum'. 'Var' and 'Class' detail the variable name and class of the variable, 'NumLev' details the number of unique values taken by 'Var' and 'ProcNA'/'ProcNum' detail the proportion of records which are NA and contain the modal value respectively.The data are a mixture of numeric and categorical variables.
From the summary table it is seen that missingness isn't a large issue for the data with the most affected variable, 'Use.Of.Credit.Line', missing less than 0.25% of its records. A EDA plotting function is used here to examine the nature of the distribution of this variable.
simple.eda(df$Use.Of.Credit.Line)
Looking at the symmetry of the distribution and considering that only a very small fraction of the values are missing it seems reasonable in this case to impute the mean value for those cases where it is missing.
df[is.na(df$Use.Of.Credit.Line), "Use.Of.Credit.Line"] <-
mean(df$Use.Of.Credit.Line, na.rm = TRUE)
A few other variables contain one NA which might indicate that one record is missing several values. If this is the case this record should be located and pruned. To search for this record the number of NAs across all columns was calculated as follows.
df <- df %>% mutate(sumNA = apply(df, 1, function(x) sum(is.na(x))))
Upon examination it is found that the record on row 7667 is missing 11 of the 21 dependent variables (9 are NAs and 2 are erroneous entries). Hence this record is removed from the data and any further analyses.
df <- df[-7667, ]
Some further inspection of the variables reveals that some factor variables are actually meant to be numerics; the variables 'No..Delinquencies.In.Last.2.Years', 'No..Adverse.Public.Records', 'No..Of.Public.Record.Bankruptcies' represent numbers (as the names suggest) but the data have been coded as unordered factors "one", "two", etc. To correct for this a function was written and applied to the data.
# Recoding factor columns into numerics
string.to.num <- function(s) {
s <- tolower(s)
# substituions to be made
old <- c("none","one", "two", "three", "four", "five",
"six", "seven", "eight", "nine", "ten")
new <- c(0:10)
for(i in seq_along(old)) s <- gsub(old[i], new[i], s, fixed = TRUE)
s <- as.numeric(s)
s
}
df <- df %>%
mutate(No..Delinquencies.In.Last.2.Years =
string.to.num(No..Delinquencies.In.Last.2.Years),
No..Adverse.Public.Records =
string.to.num(No..Adverse.Public.Records),
No..Of.Public.Record.Bankruptcies =
string.to.num(No..Of.Public.Record.Bankruptcies))
Below the top 3 factors in terms of level count are examined (note: this is counted after the transformation of the factors to numerics detailed earlier); these factors are the loan purpose, state address, and home ownership.
## Exploring the factor varibles
# Loan purpose
# Tabulate purposes
df_loanPurpose <- table(df$Loan.Purpose, df$Class) %>% as.data.frame()
names(df_loanPurpose)[1:2] <- c("Loan.Purpose", "Outcome")
# Calculate proptions
df_loanPurpose <- df_loanPurpose %>%
group_by(Outcome) %>%
mutate(Prop = 100*Freq/sum(Freq),
Loan.Purpose = fct_reorder(Loan.Purpose, Prop))
# Create barplot
ggpurp <- ggplot(df_loanPurpose, aes(x = Loan.Purpose, y = Prop))
ggpurp <- ggpurp + geom_bar(stat = "identity")
ggpurp <- ggpurp + facet_wrap(~Outcome)
ggpurp <- ggpurp + coord_flip()
ggpurp
# Home ownership
# Tabulate Ownership
df_homeOwnership <- table(df$Home.Ownership, df$Class) %>% as.data.frame()
names(df_homeOwnership)[1:2] <- c("Home.Ownership", "Outcome")
df_homeOwnership <- df_homeOwnership %>%
group_by(Outcome) %>%
mutate(Prop = 100*Freq/sum(Freq),
Home.OwnerShip = fct_reorder(Home.Ownership, Prop))
# Create barplot
gghome <- ggplot(df_homeOwnership, aes(x = Home.Ownership, y = Prop))
gghome <- gghome + geom_bar(stat = "identity")
gghome <- gghome + facet_wrap(~Outcome)
gghome <- gghome + coord_flip()
gghome
# State Address
# Tabulate Address
df_state <- table(df$Address.State, df$Class) %>% as.data.frame()
names(df_state)[1:2] <- c("Address.State", "Outcome")
df_state <- df_state %>%
group_by(Outcome) %>%
mutate(Prop = 100*Freq/sum(Freq),
Address.State = fct_reorder(Address.State, Prop))
# Create barplot
ggstate <- ggplot(df_state, aes(x = Address.State, y = Prop))
ggstate <- ggstate + geom_bar(stat = "identity")
ggstate <- ggstate + facet_wrap(~Outcome)
ggstate <- ggstate + coord_flip()
ggstate
The above plots illustrate two things. Firstly the overall proportion of the separate levels in each factor; for instance it is seen that "Debt Consolidation" is by far the most popular purpose for the loan, and California is the most frequent State found in the data. Secondly by faceting the plots over the outcome any apparent differences in acceptance for each level is shown; for instance it can be seen that for the 'Small Business' loan purpose there is a larger portion of 'Uncreditworthy' outcomes than 'Creditworthy'.
Overall these plots show that these three factors don't appear to contain a lot of predictive value; for the most part the proportions of the levels are homogeneous over the outcome variable.
Variables are examined in a univariate manner for their predictive capabilities on the target variable 'Class'.
The target variable is a binary factor which modal value appears in 82% of records. This is useful information as it will reveal the absolute worst case of performance that a model could make; the case where the model always predicts a success or failure (depending on which appears most often). To verify which is the modal value ('Creditworthy' or 'Uncreditworthy') the following code is executed;
sum(df$Class == "Creditworthy")/nrow(df)
## [1] 0.8182846
which confirms that loans are accepted in 82% of the data. Hence the null model of always predicting 'Creditworthy' would be correct 82% of the time.
Weight of Evidence [WoE] and Information Value [IV] are calculated for each variable. When the variable is categorical the WoE is the log-density ratio for each level, where the density refers to the density of target outcomes for each level of the factor. WoE is useful to recode the correlation between a factor level and the target outcome and is often used to substitute the original variable in model building. However WoE can be misleading when the distribution of levels is inhomogeneous. The IV takes into account the total distribution of the target variable over all levels of the factor and serves as a good metric for predictive power.
The outcome variable 'Class' is projected onto a binary flag 'Class_bin' and using the 'Information' package WoE and IV values are calculated. A summary table is shown below.
# Project the target variable onto a binary flag
df$Class_binNum <- ifelse(df$Class == "Creditworthy", 1, 0)
df$Class_bin <- as.factor(df$Class_binNum)
# Calculate WoE and IV
df_infotable <- create_infotables(df[ ,setdiff(names(df),
c("Class", "sumNA", "Class_bin"))],
y = "Class_binNum")
kable(df_infotable$Summary)
Variable | IV | |
---|---|---|
2 | Loan.Term | 0.1883109 |
11 | FICO.Credit.Score | 0.1653211 |
17 | Use.Of.Credit.Line | 0.1200644 |
6 | Loan.Purpose | 0.0922819 |
5 | Annual.Income | 0.0672703 |
7 | Address.State | 0.0540908 |
12 | No..Inquiries.In.Last.6.Months | 0.0425058 |
8 | Debt.To.Income.Ratio | 0.0400810 |
18 | Total.Number.Of.Credit.Lines | 0.0226000 |
3 | Employment.Length | 0.0193641 |
1 | Loan.Amount | 0.0186575 |
15 | No..Adverse.Public.Records | 0.0169235 |
10 | Earliest.Credit.Line.Opened | 0.0154134 |
20 | No..Of.Public.Record.Bankruptcies | 0.0105380 |
14 | No..Of.Credit.Lines | 0.0099140 |
4 | Home.Ownership | 0.0099040 |
19 | Loan.Application.Description | 0.0093618 |
16 | Total.Credit.Balance | 0.0038017 |
9 | No..Delinquencies.In.Last.2.Years | 0.0018038 |
13 | Months.Since.Last.Delinquency | 0.0011922 |
The table above shows the IV for each variable. A good rule of thumb for IV is as follows:
Information.Value | Predictive.Power |
---|---|
<0.02 | Useless for Prediction |
0.02 to 0.1 | Weak Predictor |
0.1 to 0.3 | Medium Predictor |
0.3 to 0.5 | Strong Predictor |
It follows that the best predictors in the data are 'Loan.Term', 'FICO.Credit.Score', and 'Use.Of.Credit.Line'. Concerning 'Loan.Term' it is a binary categorical factor recording the term of the loan; either 36 or 60 months. Using WoE we can understand the correlation of this variable with the outcome.
kable(df_infotable$Tables$Loan.Term)
Loan.Term | N | Percent | WOE | IV |
---|---|---|---|---|
36 Months | 20585 | 0.8308779 | 0.2301186 | 0.0408284 |
60 Months | 4190 | 0.1691221 | -0.8312470 | 0.1883109 |
From the table it can be surmised that there is a negative correlation (negative WoE) with moderate predictive power (IV (\approx) 0.2) for the 60 month loan term. To verify this a contingency table is generated.
kable(table(df$Loan.Term, df$Class_bin))
0 | 1 | |
---|---|---|
36 Months | 3087 | 17498 |
60 Months | 1415 | 2775 |
This contingency table shows that loans of 36 months are accepted with a fraction of (\frac{17498}{3087 + 17498} = 0.85), similar to the overall distribution of 0.82 for the entire set. Whereas loans of 60 months are accepted with a fraction of (\frac{2775}{1415 + 2775} = 0.66), which is notably lower than the overall distribution. The inference being that 60 month loans are less likely to be accepted than 36 month loans most likely due to the increased risk associated with longer term loan repayment schemes.
From earlier it was seen that the variables 'Address.State' and 'Loan.Purpose' where factors with numerous levels and they rank 6(^{\textrm{th}}) and 4(^{\textrm{th}}) respectively in terms of IV.
In testing it was found that WoE substitution made little difference to model quality and hence for the sake of interpretability the models presented here use the standard variables. Logistic regression models can be built using the 'glmnet' package and is facilitated using functions from the 'caret' package. The algorithms used by glmnet require dummy coding on the variables.
#Drop uneeded Levels
df <- droplevels(df)
# Save Outcome
outCome <- df$Class_bin
# Create dummy coded variables
dfDummy <- dummyVars("~.",
data = dplyr::select(df,
-Class_bin,
-Class_binNum,
-Class,
-sumNA),
fullRank = FALSE)
# Create dummy df
df_dum <- as.data.frame(predict(dfDummy,df))
df_dum$Class_bin <- outCome
levels(df_dum$Class_bin) <- make.names(levels(factor(df_dum$Class_bin)))
# Generalise outcome and predictor names
outcomeName <- 'Class_bin'
predictorsNames <- names(df_dum)[names(df_dum) != outcomeName]
# split data into training and testing chunks
set.seed(1234)
splitIndex <- createDataPartition(df_dum[,outcomeName], p = .70, list = FALSE, times = 1)
df_train <- df_dum[ splitIndex,]
df_test <- df_dum[-splitIndex,]
In many use-cases the interpretability of a model can be more important than the raw predictive performance. Variable selection allows for the model to be as simple as possible (i.e. to have as few coefficients for interpretation) while maximising performance.
The glmnet algorithms use the lasso technique for variable selection and regularization. Also, to further mitigate overfitting, the model is validated in a 10-fold cross-validation scheme. For tuning the parameters of the model two optimisation metrics were chosen to see how they effect the model's predictive power; Area Under the ROC Curve [AUC], and misclassification rate. Plots of the (\lambda)-path on these metrics are shown, where the second (unlabeled) horizontal axis is the effective number of variables in the model as a consequence of the lasso penalty.
# Build Models
model_glm_auc = cv.glmnet(as.matrix(df_train[,predictorsNames]),
df_train[,outcomeName],
family="binomial",
type.measure = "auc")
plot(model_glm_auc)
model_glm_MCR = cv.glmnet(as.matrix(df_train[,predictorsNames]),
df_train[,outcomeName],
family="binomial",
type.measure = "class")
plot(model_glm_MCR)
Interpreting these plots we see the best performing model (indicated by the left-most dashed vertical line) had an AUC of 0.72 for the first model and minimum misclassification error of 0.179. The right-most dashed line shows the model with the highest (\lambda) that is within one standard error of the optimum model (typically this model is called the '1-se' model). As this model is evaluated at a larger (\lambda) typically this model has few coefficients and is simpler. The optimal AUC model had 54 variables and the 1-se model had 26. The optimal misclassification error model had 24 variables and the 1-se model had 8. Lets examine the predictions made by the 1-se models on the test data.
predAUC <- predict(model_glm_auc,
newx = as.matrix(df_test[ ,predictorsNames]),
s = "lambda.1se",
type = "class")
predMCR <- predict(model_glm_MCR,
newx = as.matrix(df_test[ ,predictorsNames]),
s = "lambda.1se",
type = "class")
df_test$predAUC <- predAUC
df_test$predMCR <- predMCR
confusionMatrix(df_test$predAUC,
df_test$Class_bin,
positive = "X1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 52 28
## X1 1298 6053
##
## Accuracy : 0.8216
## 95% CI : (0.8127, 0.8302)
## No Information Rate : 0.8183
## P-Value [Acc > NIR] : 0.2403
##
## Kappa : 0.0535
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99540
## Specificity : 0.03852
## Pos Pred Value : 0.82343
## Neg Pred Value : 0.65000
## Prevalence : 0.81833
## Detection Rate : 0.81456
## Detection Prevalence : 0.98923
## Balanced Accuracy : 0.51696
##
## 'Positive' Class : X1
##
confusionMatrix(df_test$predMCR,
df_test$Class_bin,
positive = "X1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 17 4
## X1 1333 6077
##
## Accuracy : 0.8201
## 95% CI : (0.8112, 0.8288)
## No Information Rate : 0.8183
## P-Value [Acc > NIR] : 0.3545
##
## Kappa : 0.0193
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99934
## Specificity : 0.01259
## Pos Pred Value : 0.82011
## Neg Pred Value : 0.80952
## Prevalence : 0.81833
## Detection Rate : 0.81779
## Detection Prevalence : 0.99717
## Balanced Accuracy : 0.50597
##
## 'Positive' Class : X1
##
Examining the confusion matrices it is seen that both models perform similarly to each other however overall the performance is not spectacular only providing small boosts to the accuracy over the No Information Rate. At this stage usually a a decision would be made with the stakeholder to establish which business objective this model is hoping to achieve. Does the model want to auto-reject cases where it is certain the credit will not be awarded or auto-accept those where it will?
The (non-zero) coefficients for the AUC model and the ROC curve are shown below.
coef <- round(as.matrix(coef(model_glm_auc, s = model_glm_auc$lambda.1se))[
as.matrix(coef(model_glm_auc, s = model_glm_auc$lambda.1se)) !=0, ], 4)
coef <- data.frame(names(coef), coef) %>% arrange(desc(coef))
coef
## names.coef. coef
## 1 Loan.Term.36 Months 0.9716
## 2 Address.State.WY 0.2491
## 3 Loan.Purpose.Credit Card 0.2228
## 4 Address.State.LA 0.0794
## 5 Loan.Purpose.Car 0.0718
## 6 FICO.Credit.Score 0.0073
## 7 Total.Number.Of.Credit.Lines 0.0026
## 8 Loan.Application.Description 0.0001
## 9 Loan.Amount 0.0000
## 10 Annual.Income 0.0000
## 11 Debt.To.Income.Ratio -0.0048
## 12 Use.Of.Credit.Line -0.0067
## 13 Address.State.NJ -0.0306
## 14 Address.State.NV -0.1168
## 15 No..Adverse.Public.Records -0.1237
## 16 No..Inquiries.In.Last.6.Months -0.1246
## 17 Loan.Purpose.Other -0.1593
## 18 Address.State.CA -0.1767
## 19 Address.State.FL -0.2958
## 20 Loan.Purpose.Small Business -0.9360
## 21 (Intercept) -4.3061
predAUCres <- predict(model_glm_auc,
newx = as.matrix(df_test[ ,predictorsNames]),
s = "lambda.1se",
type = "response")
predObj <- prediction(predAUCres, df_test$Class_bin)
ROC <- performance(predObj, measure = "tpr", x.measure = "fpr")
plot(ROC);lines(seq(0, 1, 0.1), seq(0, 1, 0.1))
From the coefficients it is seen that the most important factors in a credit success are; 36 month loan term, living in Wyoming and having the loan purpose 'Credit Car'. On the other hand the most important factors in credit failure were; having the loan purpose 'Small Business', and living in Florida or California.
The performance of the model was not spectacular. The chief limitation are the data themselves. Currently there are no particularly strong predictor variables. In order to improve this model feature engineering could be performed to enrich the feature space of the data. It seems that geographic location, in terms of the State lived in, had a few significant correlations with the target outcome. Potentially the data could be joined to census data to explore if this is actually a manifestation of an underlying combination of socioeconomic predictors. Loan purpose also seemed to show promise, perhaps the current categories could be projected onto a new set which better describes the underlying reasoning for accepting/rejecting a credit application.
It would also be worth training other models such as a Random Forest or Neural Net to see if model performance could not be increased. This would have the downside of losing the interpretability of the model coefficients.