---
title: "Classification Using GAM (Spline Based Model)"
subtitle: "Assignment"
author:
- "Name : Saheli Datta"
- "Roll No : MD2213"
date: "2023-11-18"
toc: true
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
\newpage
# Introduction
A generalized additive model (GAM) is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.
The model relates a univariate response variable, $Y$, to some predictor variables, $X_i$. An exponential family distribution is specified for $Y$ (for example the normal, binomial or Poisson distributions) along with a link function $g$ (for example the identity or log functions) relating the expected value of $Y$ to the predictor variables via a structure such as:
$$g[E(Y)] = \alpha+f_{1}(X_{1})+\cdots+f_{p}(X_{p})$$
The functions $f_i$ may be functions with a specified parametric form or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means.
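As a small illustration of this structure (a sketch on simulated data, separate from the analysis below), a Gaussian GAM with identity link can be fit in R using the `mgcv` package:
```{r, eval=FALSE}
# Illustrative sketch only: fit a Gaussian GAM to simulated data.
# s() requests a smooth function of each predictor; the family and
# link default to gaussian/identity.
library(mgcv)
set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.2)
toy_fit <- gam(y ~ s(x1) + s(x2))
summary(toy_fit)
```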
# Objective
Here, we will demonstrate classification using GAM. Below is the mathematical formulation of a logistic-regression GAM for a binary classification problem:
$$\log\left(\frac{p(X)}{1-p(X)}\right) = \alpha+f_{1}(X_{1})+\cdots+f_{p}(X_{p})$$
Here, $f_j$ is a non-linear function (smoothing spline) of the predictor $X_j$.
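Inverting the logit link gives the fitted probability used for classification:
$$p(X) = \frac{\exp\{\alpha+f_{1}(X_{1})+\cdots+f_{p}(X_{p})\}}{1+\exp\{\alpha+f_{1}(X_{1})+\cdots+f_{p}(X_{p})\}}$$
which can then be thresholded (for example at 0.5) to assign class labels.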
To perform the classification task we use the **Pima Indian Diabetes Dataset**. The objective is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
# Data Description
The **Pima Indian Diabetes Dataset** is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. All patients are females at least 21 years old of Pima Indian heritage.
The variables are described below:
- **Outcome :** Binary variable taking value 0 if the patient does not have diabetes and 1 otherwise.
- **Pregnancies :** Number of times pregnant.
- **Glucose :** Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- **BloodPressure :** Diastolic blood pressure (mm Hg).
- **SkinThickness :** Triceps skin fold thickness (mm).
- **Insulin :** 2-Hour serum insulin (mu U/ml).
- **BMI :** Body mass index (weight in kg/(height in m)^2).
- **DiabetesPedigreeFunction :** Diabetes pedigree function.
- **Age :** Age (years).
The dataset has a total of 768 observations.
\newpage
# Analysis of the Dataset
Here, we will use some visualization tools to see what our data actually looks like and what the key relations between the features are.
Let us first visualize the data:
```{r, message=FALSE, warning=FALSE}
# Required libraries
library(ggplot2)
library(dplyr)
# Load the data
rm(list = ls())
data = read.csv('diabetes.csv')
# Set seed for reproducibility of the train/test split made later
set.seed(123)
head(data, 5)
# Treat the response as a factor for plotting and classification
data$Outcome = as.factor(data$Outcome)
```
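As a quick sanity check (an optional addition to the original workflow), we can confirm the dimensions and look for missing values:
```{r, message=FALSE, warning=FALSE}
dim(data)             # expect 768 rows and 9 columns
colSums(is.na(data))  # missing values per column
```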
Count plot of the response variable:
```{r, message=FALSE, warning=FALSE}
data %>% count(Outcome) %>%
  ggplot(aes(x = Outcome, y = n/sum(n))) +
  geom_bar(stat = 'identity', width = 0.3,
           colour = 'black', fill = 'yellow') +
  labs(title = 'Frequency distribution of the target variable: Outcome',
       x = 'Diabetes status', y = 'Proportion')
```
```{r, message=FALSE,warning=FALSE}
data %>% count(Outcome)
```
So, there are 268 patients who have diabetes, while the remaining 500 do not.
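Equivalently, the class proportions (an optional check) make the imbalance explicit: roughly 65\% of patients are non-diabetic and 35\% are diabetic.
```{r, message=FALSE, warning=FALSE}
prop.table(table(data$Outcome))
```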
Now we draw a boxplot of each covariate with respect to each outcome group.
```{r,include=TRUE}
par(mfrow = c(2, 2))
for (i in 1:4) {
  boxplot(data[, i] ~ data$Outcome, xlab = "Diabetes Status",
          ylab = colnames(data)[i], pch = 19, col = "skyblue")
}
par(mfrow = c(1, 1))
```
```{r,include=TRUE, echo=FALSE}
par(mfrow = c(2, 2))
for (i in 5:8) {
  boxplot(data[, i] ~ data$Outcome, xlab = "Diabetes Status",
          ylab = colnames(data)[i], pch = 19, col = "skyblue")
}
par(mfrow = c(1, 1))
```
Observe that patients with diabetes tend to have higher glucose levels and higher BMI, and are usually older than those who do not have diabetes.
\newpage
# Classification using GAM (spline based functions)
Let us first load the necessary packages for model fitting, training and evaluation:
```{r, message=FALSE,warning=FALSE}
library(mgcv) # library for GAM
library(ggplot2) # for beautiful plots
library(cdata) # data wrangling
library(sigr) # AUC calculation
```
## Train Test Split:
Now, split the data into a training set (80%) and a test set (20%):
```{r, message=FALSE,warning=FALSE}
randn = runif(nrow(data))   # one uniform draw per row
train_idx = randn <= 0.8    # roughly 80% of rows go to training
train = data[train_idx, ]
test = data[!train_idx, ]
```
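Since the split is driven by uniform random draws, the training fraction is only approximately 80%. The realized sizes can be checked as follows (an optional sanity check):
```{r, message=FALSE, warning=FALSE}
c(train = nrow(train), test = nrow(test))
```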
## Model Fitting:
The `gam()` function in the `mgcv` package is used to train the GAM. Since the relationship between each predictor and the logit is not known beforehand, we wrap every attribute in `s()` to request a smooth (spline) basis.
```{r, message=FALSE, warning=FALSE}
form_gam=as.formula(Outcome~s(Pregnancies)+s(Glucose)+s(BloodPressure)+
s(SkinThickness)+s(Insulin)+s(BMI)+s(DiabetesPedigreeFunction)+
s(Age))
gam_model=gam(form_gam,data=train,family = 'binomial')
gam_model$converged # check if the algorithm converges
summary(gam_model)
```
**Observations :**
- The model converged, meaning the penalized likelihood optimization reached a stable fit, so the estimates can be trusted.
- The smoothness of the function $f_j$ for variable $X_j$ is summarized by its effective degrees of freedom (edf) in the output above. An edf close to one indicates that the variable has an approximately linear relationship with the logit. Predictors like Pregnancies, BloodPressure, SkinThickness and Insulin have edf equal or close to one (see the snippet after this list).
- The "deviance explained" is the proportion of the null deviance accounted for by the model; in this case, 36.6%.
- From the training data, it appears that Glucose, BloodPressure, BMI, DiabetesPedigreeFunction and Age are key factors in determining diabetes; a high value of any of these suggests that the patient might have diabetes.
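The edf values can be extracted directly from the summary object; a minimal sketch, assuming `mgcv`'s standard `s.table` layout:
```{r, message=FALSE, warning=FALSE}
# Effective degrees of freedom per smooth term; values near 1
# suggest an approximately linear effect on the logit
round(summary(gam_model)$s.table[, "edf"], 2)
```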
## Visualization of the Basis Functions:
The following code snippet gives another way of visualizing the basis-function mapping for each predictor.
```{r, message=FALSE, warning=FALSE}
# Visualize s() output
terms=predict(gam_model,type='terms')
terms=cbind(Outcome=train$Outcome,terms)
# convert terms into data frame
frame1=as.data.frame(terms)
# remove the parentheses that s() adds to the column names
colnames(frame1)=gsub('[()]','',colnames(frame1))
vars=setdiff(colnames(train),'Outcome')
# remove the leading character 's' from the smooth-term names
colnames(frame1)[-1]=sub(".","",colnames(frame1)[-1])
# convert the wide form to long form
library(cdata)
frame1_long=unpivot_to_blocks(frame1,nameForNewKeyColumn = "basis_function",
nameForNewValueColumn = "basis_value",
columnsToTakeFrom = vars)
# get the predictor values from training data
data_long=unpivot_to_blocks(train,nameForNewKeyColumn = "predictor",
nameForNewValueColumn = "value",
columnsToTakeFrom = vars)
frame1_long=cbind(frame1_long,value=data_long$value)
ggplot(data=frame1_long,aes(x=value,y=basis_value)) +
geom_smooth() +
facet_wrap(~basis_function,ncol = 3,scales = "free")
```
This figure shows how each basis function contributes to the log-odds of the response. The top left-hand panel shows that the log-odds of being diagnosed with diabetes increase from age 21 to around 45, plateau, and finally decline after age 50. This interpretation holds provided all other predictors are held fixed. Together with the model summary, this covariate-level view should be taken into account when interpreting the GAM.
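To cross-check the age effect described above, the corresponding smooth can also be plotted on its own; a sketch, assuming `Age` is the eighth `s()` term in `form_gam`:
```{r, message=FALSE, warning=FALSE}
# Plot only the smooth of Age with a shaded confidence band;
# 'select' picks the smooth term by its position in the formula
plot(gam_model, select = 8, shade = TRUE)
```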
\newpage
# Model Performance
Using the training and test sets separately, we evaluate the model performance. Here, we use accuracy, precision, recall and AUC as the model metrics.
```{r, message=FALSE, warning=FALSE}
# Function to compute accuracy, precision and recall from
# predicted probabilities, classifying at the 0.5 threshold
performance=function(y,pred){
  confmat_test=table(truth=y,predict=pred>0.5)
  acc=sum(diag(confmat_test))/sum(confmat_test)
  precision=confmat_test[2,2]/sum(confmat_test[,2])
  recall=confmat_test[2,2]/sum(confmat_test[2,])
  c(acc,precision,recall)
}
train$pred=predict(gam_model,newdata = train,type = "response")
test$pred=predict(gam_model,newdata=test,type="response")
perf_train=performance(train$Outcome,train$pred)
perf_test=performance(test$Outcome,test$pred)
perf_mat=rbind(perf_train,perf_test)
colnames(perf_mat)=c("accuracy","precision","recall")
round(perf_mat,3)
```
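The `sigr` package loaded earlier is meant for the AUC metric mentioned above; a minimal sketch computing it on both sets, assuming `calcAUC` takes predicted probabilities and a logical truth vector:
```{r, message=FALSE, warning=FALSE}
# AUC on training and test sets; Outcome == 1 marks the positive class
c(train_auc = sigr::calcAUC(train$pred, train$Outcome == 1),
  test_auc  = sigr::calcAUC(test$pred,  test$Outcome == 1))
```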
\newpage
# Selection of Important Covariates:
Now, we will show the procedure of selecting important covariates.
From the model summary above, we have already seen that Glucose, BloodPressure, BMI, Age and DiabetesPedigreeFunction are important features. Now we will use Partial Dependence Plots (PDPs) to visualize the effect of these covariates on the response.
Here, we generate PDPs for the spline terms of the covariates.
```{r,include=TRUE}
par(mfrow=c(3,3))
plot(gam_model)  # partial effect of each smooth term
par(mfrow=c(1,1))
```
Again, we see that \textbf{BMI, Glucose, Age, BloodPressure, DiabetesPedigreeFunction} are significant covariates.
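The approximate significance of the smooth terms from the model summary supports this; a sketch, again assuming `mgcv`'s `s.table` layout:
```{r, message=FALSE, warning=FALSE}
# Approximate p-values for each smooth term
round(summary(gam_model)$s.table[, "p-value"], 4)
```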
## Model Evaluation with Selected Features:
After identifying important features, we retrain the model with only those features and evaluate its performance.
```{r,include=TRUE}
form_gam_new=as.formula(Outcome~s(Glucose)+s(BloodPressure)+
s(BMI)+s(DiabetesPedigreeFunction)+
s(Age))
gam_model_new=gam(form_gam_new,data=train,family = 'binomial')
train$pred_new=predict(gam_model_new,newdata = train,type = "response")
test$pred_new=predict(gam_model_new,newdata=test,type="response")
# model performance evaluated on the training and test data
perf_train_new=performance(train$Outcome,train$pred_new)
perf_test_new=performance(test$Outcome,test$pred_new)
perf_mat_new=rbind(perf_train_new,perf_test_new)
colnames(perf_mat_new)=c("accuracy","precision","recall")
round(perf_mat_new,3)
```
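The full and reduced models can also be compared on a likelihood basis; a short sketch using AIC, where a lower value indicates a better trade-off between fit and complexity:
```{r, message=FALSE, warning=FALSE}
# Compare the full model with the reduced (selected-features) model
AIC(gam_model, gam_model_new)
```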
## Conclusion:
We can see that accuracy, precision and recall have all improved on both the training and test sets after feature selection. The test accuracy turns out to be approximately 75\%, which is reasonably good. Hence, we conclude that the features \textbf{BMI, Glucose, Age, BloodPressure, DiabetesPedigreeFunction} serve well in diabetes classification, while the remaining features are of less importance and can be ignored in this case.