---
title: "Classification Using GAM (Spline Based Model)"
subtitle: "Assignment"
author:
- "Name : Saheli Datta"
- "Roll No : MD2213"
date: "2023-11-18"
toc: true
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
\newpage
# Introduction
A generalized additive model (GAM) is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.
The model relates a univariate response variable, $Y$, to some predictor variables, $X_i$. An exponential family distribution is specified for $Y$ (for example the normal, binomial or Poisson distributions) along with a link function $g$ (for example the identity or log functions) relating the expected value of $Y$ to the predictor variables via a structure such as:
$$g[E(Y)] = \alpha+f_{1}(X_{1})+\cdots+f_{p}(X_{p})$$
The functions $f_i$ may be functions with a specified parametric form or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means.
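As a small illustration of this structure (a sketch on simulated data, separate from the analysis below), a Gaussian GAM with identity link can be fit in R using the `mgcv` package:
```{r, eval=FALSE}
# Illustrative sketch only: fit a Gaussian GAM to simulated data.
# s() requests a smooth function of each predictor; the family and
# link default to gaussian/identity.
library(mgcv)
set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.2)
toy_fit <- gam(y ~ s(x1) + s(x2))
summary(toy_fit)
```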
# Objective
Here, we will demonstrate classification using GAM. Below is the mathematical formulation of a logistic-regression GAM for a binary classification problem:
$$\log\left(\frac{p(X)}{1-p(X)}\right) = \alpha+f_{1}(X_{1})+\cdots+f_{p}(X_{p})$$
Here, $f_j$ is a non-linear function (smoothing spline) of the predictor $X_j$.
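Inverting the logit link gives the fitted probability used for classification:
$$p(X) = \frac{\exp\{\alpha+f_{1}(X_{1})+\cdots+f_{p}(X_{p})\}}{1+\exp\{\alpha+f_{1}(X_{1})+\cdots+f_{p}(X_{p})\}}$$
which can then be thresholded (for example at 0.5) to assign class labels.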
To perform the classification task we use the **Pima Indian Diabetes Dataset**. The objective is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
# Data Description
The **Pima Indian Diabetes Dataset** is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. All patients are females at least 21 years old of Pima Indian heritage.
The variables are described below:
- **Outcome :** Binary variable taking value 0 if the patient does not have diabetes and 1 otherwise.
- **Pregnancies :** Number of times pregnant.
- **Glucose :** Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- **BloodPressure :** Diastolic blood pressure (mm Hg).
- **SkinThickness :** Triceps skin fold thickness (mm).
- **Insulin :** 2-Hour serum insulin (mu U/ml).
- **BMI :** Body mass index (weight in kg/(height in m)^2).
- **DiabetesPedigreeFunction :** Diabetes pedigree function.
- **Age :** Age (years).
The dataset has a total of 768 observations.
\newpage
# Analysis of the Dataset
Here, we will use some visualization tools to see what our data actually looks like and what the key relations between the features are.
Let us first visualize the data:
```{r, message=FALSE, warning=FALSE}
# Required libraries
library(ggplot2)
library(dplyr)
# Load the data
rm(list = ls())
data = read.csv('diabetes.csv')
# Set seed for reproducibility of the train/test split made later
set.seed(123)
head(data, 5)
# Treat the response as a factor for plotting and classification
data$Outcome = as.factor(data$Outcome)
```
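As a quick sanity check (an optional addition to the original workflow), we can confirm the dimensions and look for missing values:
```{r, message=FALSE, warning=FALSE}
dim(data)             # expect 768 rows and 9 columns
colSums(is.na(data))  # missing values per column
```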
Count plot of the response variable:
```{r, message=FALSE, warning=FALSE}
data %>% count(Outcome) %>%
  ggplot(aes(x = Outcome, y = n/sum(n))) +
  geom_bar(stat = 'identity', width = 0.3,
           colour = 'black', fill = 'yellow') +
  labs(title = 'Frequency distribution of the target variable: Outcome',
       x = 'Diabetes status', y = 'Proportion')
```
```{r, message=FALSE,warning=FALSE}
data %>% count(Outcome)
```
So, there are 268 patients who have diabetes, while the remaining 500 do not.
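Equivalently, the class proportions (an optional check) make the imbalance explicit: roughly 65\% of patients are non-diabetic and 35\% are diabetic.
```{r, message=FALSE, warning=FALSE}
prop.table(table(data$Outcome))
```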
Now we draw a boxplot of each covariate with respect to each outcome group.
```{r,include=TRUE}
par(mfrow = c(2, 2))
for (i in 1:4) {
  boxplot(data[, i] ~ data$Outcome, xlab = "Diabetes Status",
          ylab = colnames(data)[i], pch = 19, col = "skyblue")
}
par(mfrow = c(1, 1))
```
```{r,include=TRUE, echo=FALSE}
par(mfrow = c(2, 2))
for (i in 5:8) {
  boxplot(data[, i] ~ data$Outcome, xlab = "Diabetes Status",
          ylab = colnames(data)[i], pch = 19, col = "skyblue")
}
par(mfrow = c(1, 1))
```
Observe that patients with diabetes tend to have higher glucose levels and higher BMI, and are usually older than those who do not have diabetes.
\newpage
# Classification using GAM (spline based functions)
Let us first load the necessary packages for model fitting, training and evaluation:
```{r, message=FALSE,warning=FALSE}
library(mgcv) # library for GAM
library(ggplot2) # for beautiful plots
library(cdata) # data wrangling
library(sigr) # AUC calculation
```
## Train Test Split:
Now, split the data into a training set (80%) and a test set (20%):
```{r, message=FALSE,warning=FALSE}
randn = runif(nrow(data))   # one uniform draw per row
train_idx = randn <= 0.8    # roughly 80% of rows go to training
train = data[train_idx, ]
test = data[!train_idx, ]
```
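Since the split is driven by uniform random draws, the training fraction is only approximately 80%. The realized sizes can be checked as follows (an optional sanity check):
```{r, message=FALSE, warning=FALSE}
c(train = nrow(train), test = nrow(test))
```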
## Model Fitting:
The `gam()` function in the `mgcv` package is used to train the GAM. Since the relationship between each predictor and the logit is not known beforehand, we wrap every attribute in `s()` to request a smooth (spline) basis.
```{r, message=FALSE, warning=FALSE}
form_gam=as.formula(Outcome~s(Pregnancies)+s(Glucose)+s(BloodPressure)+
s(SkinThickness)+s(Insulin)+s(BMI)+s(DiabetesPedigreeFunction)+
s(Age))
gam_model=gam(form_gam,data=train,family = 'binomial')
gam_model$converged # check if the algorithm converges
summary(gam_model)
```
**Observations :**
- The model converged, meaning the penalized likelihood optimization reached a stable fit, so the estimates can be trusted.
- The smoothness of the function $f_j$ for variable $X_j$ is summarized by its effective degrees of freedom (edf) in the output above. An edf close to one indicates that the variable has an approximately linear relationship with the logit. Predictors like Pregnancies, BloodPressure, SkinThickness and Insulin have edf equal or close to one (see the snippet after this list).
- The "deviance explained" is the proportion of the null deviance accounted for by the model; in this case, 36.6%.
- From the training data, it appears that Glucose, BloodPressure, BMI, DiabetesPedigreeFunction and Age are key factors in determining diabetes; a high value of any of these suggests that the patient might have diabetes.
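The edf values can be extracted directly from the summary object; a minimal sketch, assuming `mgcv`'s standard `s.table` layout:
```{r, message=FALSE, warning=FALSE}
# Effective degrees of freedom per smooth term; values near 1
# suggest an approximately linear effect on the logit
round(summary(gam_model)$s.table[, "edf"], 2)
```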
## Visualization of the Basis Functions:
The following code snippet gives another way of visualizing the basis-function mapping for each predictor.
```{r, message=FALSE, warning=FALSE}
# Visualize s() output
terms=predict(gam_model,type='terms')
terms=cbind(Outcome=train$Outcome,terms)
# convert terms into data frame
frame1=as.data.frame(terms)
# remove the parentheses that s() adds to the column names
colnames(frame1)=gsub('[()]','',colnames(frame1))
vars=setdiff(colnames(train),'Outcome')
# remove the leading character 's' from the smooth-term names
colnames(frame1)[-1]=sub(".","",colnames(frame1)[-1])
# convert the wide form to long form
library(cdata)
frame1_long=unpivot_to_blocks(frame1,nameForNewKeyColumn = "basis_function",
nameForNewValueColumn = "basis_value",
columnsToTakeFrom = vars)
# get the predictor values from training data
data_long=unpivot_to_blocks(train,nameForNewKeyColumn = "predictor",
nameForNewValueColumn = "value",
columnsToTakeFrom = vars)
frame1_long=cbind(frame1_long,value=data_long$value)
ggplot(data=frame1_long,aes(x=value,y=basis_value)) +
geom_smooth() +
facet_wrap(~basis_function,ncol = 3,scales = "free")
```
This figure shows how each basis function contributes to the log-odds of the response. The top left-hand panel shows that the log-odds of being diagnosed with diabetes increase from age 21 to around 45, plateau, and finally decline after age 50. This interpretation holds provided all other predictors are held fixed. Together with the model summary, this covariate-level view should be taken into account when interpreting the GAM.
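To cross-check the age effect described above, the corresponding smooth can also be plotted on its own; a sketch, assuming `Age` is the eighth `s()` term in `form_gam`:
```{r, message=FALSE, warning=FALSE}
# Plot only the smooth of Age with a shaded confidence band;
# 'select' picks the smooth term by its position in the formula
plot(gam_model, select = 8, shade = TRUE)
```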
\newpage
# Model Performance
Using the training and test sets separately, we evaluate the model performance. Here, we use accuracy, precision, recall and AUC as the model metrics.
```{r, message=FALSE, warning=FALSE}
# Function to compute accuracy, precision and recall from
# predicted probabilities, classifying at the 0.5 threshold
performance=function(y,pred){
  confmat_test=table(truth=y,predict=pred>0.5)
  acc=sum(diag(confmat_test))/sum(confmat_test)
  precision=confmat_test[2,2]/sum(confmat_test[,2])
  recall=confmat_test[2,2]/sum(confmat_test[2,])
  c(acc,precision,recall)
}
train$pred=predict(gam_model,newdata = train,type = "response")
test$pred=predict(gam_model,newdata=test,type="response")
perf_train=performance(train$Outcome,train$pred)
perf_test=performance(test$Outcome,test$pred)
perf_mat=rbind(perf_train,perf_test)
colnames(perf_mat)=c("accuracy","precision","recall")
round(perf_mat,3)
```
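The `sigr` package loaded earlier is meant for the AUC metric mentioned above; a minimal sketch computing it on both sets, assuming `calcAUC` takes predicted probabilities and a logical truth vector:
```{r, message=FALSE, warning=FALSE}
# AUC on training and test sets; Outcome == 1 marks the positive class
c(train_auc = sigr::calcAUC(train$pred, train$Outcome == 1),
  test_auc  = sigr::calcAUC(test$pred,  test$Outcome == 1))
```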
\newpage
# Selection of Important Covariates:
Now, we will show the procedure of selecting important covariates.
From the model summary above, we have already seen that Glucose, BloodPressure, BMI, Age and DiabetesPedigreeFunction are important features. Now we will use Partial Dependence Plots (PDPs) to visualize the effect of these covariates on the response.
Here, we generate PDPs for the spline terms of the covariates.
```{r,include=TRUE}
par(mfrow=c(3,3))
plot(gam_model)  # partial effect of each smooth term
par(mfrow=c(1,1))
```
Again, we see that \textbf{BMI, Glucose, Age, BloodPressure, DiabetesPedigreeFunction} are significant covariates.
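The approximate significance of the smooth terms from the model summary supports this; a sketch, again assuming `mgcv`'s `s.table` layout:
```{r, message=FALSE, warning=FALSE}
# Approximate p-values for each smooth term
round(summary(gam_model)$s.table[, "p-value"], 4)
```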
## Model Evaluation with Selected Features:
After identifying important features, we retrain the model with only those features and evaluate its performance.
```{r,include=TRUE}
form_gam_new=as.formula(Outcome~s(Glucose)+s(BloodPressure)+
s(BMI)+s(DiabetesPedigreeFunction)+
s(Age))
gam_model_new=gam(form_gam_new,data=train,family = 'binomial')
train$pred_new=predict(gam_model_new,newdata = train,type = "response")
test$pred_new=predict(gam_model_new,newdata=test,type="response")
# model performance evaluated on the training and test data
perf_train_new=performance(train$Outcome,train$pred_new)
perf_test_new=performance(test$Outcome,test$pred_new)
perf_mat_new=rbind(perf_train_new,perf_test_new)
colnames(perf_mat_new)=c("accuracy","precision","recall")
round(perf_mat_new,3)
```
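The full and reduced models can also be compared on a likelihood basis; a short sketch using AIC, where a lower value indicates a better trade-off between fit and complexity:
```{r, message=FALSE, warning=FALSE}
# Compare the full model with the reduced (selected-features) model
AIC(gam_model, gam_model_new)
```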
## Conclusion:
We can see that accuracy, precision and recall have all improved on both the training and test sets after feature selection. The test accuracy turns out to be approximately 75\%, which is reasonably good. Hence, we conclude that the features \textbf{BMI, Glucose, Age, BloodPressure, DiabetesPedigreeFunction} serve well in diabetes classification, while the remaining features are of less importance and can be ignored in this case.