-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathmodel_selection.Rmd
227 lines (134 loc) · 8.82 KB
/
model_selection.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
---
title: "Model Selection"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Automatic Model Selection
When we have limited number of variables in the data set, it's relatively easy to construct a model and compare the performance between them. There also exist some techniques to deal with large amount of variables and determine which to include in the model.
This tutorial introduces three ways of automatic model selection, which are useful when you want to decide what variables to include in your model. We will talk about backward, forward, and stepwise selection in `stats` package and demonstrate the R code respectively.
```{r load-package, warning = FALSE, message= FALSE, error = FALSE}
library(tidyverse)
library(here)
library(GGally)
library(leaps)
```
### Regression model
We will use a NYC restaurant dataset in Sheather (2009) that has information about 150 Italian restaurants in Manhattan that were open in 2001 (some of them are closed now). The variables are:
- Case: case-indexing variable
- Restaurant: name of the restaurant
- Price: average price of meal and a drink
- Food: average Zagat rating of the quality of the food (from 0 to 25)
- Decor: same as above, but with quality of the decor
- Service: same as above, but with quality of service
- East: it is equal to East if the restaurant is on the East Side (i.e. east of Fifth Ave) and West otherwise
First, let's load the data:
```{r load-data}
nyc <- read_csv('https://raw.githubusercontent.com/YuxiaoLuo/r_analysis_dri_2022/main/data/nyc.csv')
```
Let's explore the data by looking at the variables' type and samples.
```{r explore_variables}
glimpse(nyc)
```
We can also explore the summary and scatter plot matrices of hte variables.
```{r explore_summary}
summary(nyc)
```
Note that scatterplot matrices can only show numerical variables (continuous & discrete). So we remove **Case**, **Restaurant**, and **East**.
```{r explore_ggpairs}
nycplot <- nyc %>% select(-Case, -Restaurant, -East)
ggpairs(nycplot)
```
Now, let's construct a regression model to predict **Price** using all the other variables as predictors.
```{r regression}
lm_nyc <- lm(Price ~ Food + Decor + Service + East, data = nyc)
summary(lm_nyc)
```
### Backward selection
Backward selection uses the `step` function in `stats` package and select a formula-based model by [AIC (Akaike Information Criterion)](https://en.wikipedia.org/wiki/Akaike_information_criterion). The rule of thumb is the less the AIC score, the better the model performs.
Let's try backward selection, which starts with the full model and removes the variable step by step while examining the change of AIC score as the performance measure.
```{r backward}
step(lm_nyc, direction = 'backward')
```
From the summary result, the model that uses **Food**, **Decor**, and **East** as the independent variables to predict the **Price** is more preferred.
### Forward selection
We can also do forward selection, which needs a starting model and a larger model that includes all the variables that we might want to include in our model.
```{r forward}
# create a null model
nullnyc <- lm(Price ~ 1, data = nyc) # no variables
fwd <- step(nullnyc,
scope = list(upper = lm_nyc),
direction = 'forward')
```
From the result, you can see the best model is `Price ~ Decor + Food + East`, which has an AIC score of 524.64.
### Stepwise regression
We can also do stepwise regression with `direction = 'both'`.In stepwise regression, variables can get in or out of the model. We can specify the smallest and biggest model in our search with `scope` argument in the `step` function.
For instance, let's start the stepwise search with a model that has **Service** as a predictor and restrict our search to models that include **Service** and all the other possible independent variables:
```{r stepwise}
lm2_nyc <- lm(Price ~ Service, data = nyc)
step(lm2_nyc,
scope= list(lower = lm2_nyc, upper = lm_nyc),
direction = 'both')
```
### BIC criterion
Let's change selection criterion to BIC using the command `k = log(n)`, where n is the number of observations in the dataset. According to the documentation, `k` means the multiple of the number of degrees of freedom used for the penalty. Only k = 2 gives the genuine AIC: `k = log(n)` is sometimes referred to as BIC or SBC.
```{r BIC}
n <- nrow(nyc)
step(lm_nyc, direction = 'backward', k = log(n))
```
### All subsets selection using `leaps`
If you want to find the 'best' model among a set of variables based on BIC or adjusted $R^2$, you can use library `leaps`.
```{r leaps}
#library(leaps)
allsubs <- regsubsets(Price ~ Food + Decor + Service + East, data = nyc)
summary(allsubs)
```
From the table, we know that if the model only contains one predictor, the one with `Decor` will has the smallest RSS (residual sum of squares). If the model only contains two predictors, the one with `Food` and `Decor` will have the smallest RSS. If the model only contains three predictors, the one with `Food`, `Decor`, and `East` will have the smallest RSS. In each group of models that have a fixed number of predictors, the best model is selected based on AIC, BIC, and RSS, which will coincide with the sample goal, that is, smallest residual sum of squares. ([AIC](https://www.statisticshowto.com/akaikes-information-criterion/), [Adjusted R^2](https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/adjusted-r2/))
The overall "best" model is the one selected from the 4 "best" model for a fixed number of independent variables. We can use either AIC or BIC to select the "best" model because these two measures penalize model sizes (# of variables) differently. Specifically, the penalty for AIC is smaller and it prefers model with more variables.
Let's visualize the BIC scores of the best models in each group with a fixed number of predictors. The model at the top is the best model with the lowest BIC score among all the best models in different groups. From the plot below, we know the best model is the one with two predictors, **Food** and **Decor**.
```{r bestmodel}
plot(allsubs)
```
We can also visualize the adjusted $R^2$ for different best models. From the plot below, we know the "best" model with highest adjusted $R^2$ is the one with three predictors, **Food**, **Decor**, and **East**.
```{r r-squared}
plot(allsubs, scale = 'adjr')
```
### Prediction
Now we can try to use the model to predict the **Price** in a testing dataset, which has some data of Italian restaurants not included in the nyc restaurant dataset.
```{r read_testdata}
nyctest <- read_csv('https://raw.githubusercontent.com/YuxiaoLuo/r_analysis_dri_2022/main/data/nyctest.csv', col_select = 3:8)
glimpse(nyctest)
```
Let's first create a model using the `nyc` restaurant data (training dataset). Then, we can find point predictions and 99% prediction intervals using the model.
```{r model_nyctest}
lm_nyc <- lm(Price ~ Food + Decor + East, data = nyc)
preds <- predict(lm_nyc, newdata = nyctest, interval = 'prediction',
level = 0.99)
preds
```
The output of the `predict` function is a matrix, we can transfer it to the data structure in `tidyverse` with `as_tibble` and add the actual price column to it.
```{r model_prediciton}
preds <- as_tibble(preds) %>% mutate(actualPrice = nyctest$Price)
preds
```
Then, we can plot both the predictions and actual data. The result shows a positive relation between the predicted value and actual price.
```{r plot_prediction, message=FALSE}
ggplot(preds, mapping = aes(x = fit, y = actualPrice)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
```
### Interaction terms
An interaction occurs when an independent variable has a different
effect on the outcome depending on the values of another independent variable. More details can be found [here](https://statisticsbyjim.com/regression/interaction-effects/).
If you want to fit interaction terms in R, you can try the following code. For example, I formed an interaction term of `Decor` and `Service` in the model. The interaction term posits the interaction between `Decor` and `Service` have an effect on `Price` that is distinctive from `Decor` and `Service` separately.
```{r interaction}
lm_nyc_int <- lm(Price ~ Food + Decor + Service + East + Decor*Service,
data = nyc)
summary(lm_nyc_int)
```
From the result, we see the interaction term `Decor:Service` (p = 0.4549) is not significant. We can reach the conclusion that there is no interaction effect of `Decor` and `Service` on `Price`.
## Reference
- Many of content are adapted from the Workshops by [Victor Pena](https://vicpena.github.io/).
- Sheather, Simon. *A modern approach to regression with R*. Springer Science & Business Media, 2009.
- Multiple and logistic regression, Datacamp course by [Ben Baumer](https://beanumber.github.io/www/).