-
Notifications
You must be signed in to change notification settings - Fork 15
/
2-MultipleLinearRegression.R
172 lines (157 loc) · 7.54 KB
/
2-MultipleLinearRegression.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
#'
#' #### Example exercise: Trip production of 57 Traffic Assignment Zones of Chicago in 1960's.
#'
#' **Your task**: Estimate a linear regression model that predicts trips per occupied dwelling unit.
#'
#' #### Variables:
#'
#' * `TODU`: Motorized Trips (private car or Public Transportation) per occupied dwelling unit;
#' * `ACO`: Average car ownership (cars per dwelling);
#' * `AHS`: Average household size;
#' * `SRI`: Social Rank Index:
#' 1. proportion of blue-collar workers (e.g., construction, mining);
#' 2. proportion of people with age higher than 25 years that have completed at least 8 year of education;
#' (_**Note:** The SRI has its maximum value when there are no blue-collar workers and all adults have education of at least 8 years_)
#' * `UI`: Urbanization Index:
#' 1. fertility rate, defined as the ratio of children under 5 years of age to the female population of childbearing age;
#' 2. female labor force participation rate, meaning the % of women who are in the labor force;
#' 3. % of single family units to total dwelling units.
#'
#' The degree of urbanization index would be increased by
#' a) lower fertility rate,
#' b) higher female labor force participation rate, and
#' c) higher proportion of single dwelling units.
#' (_**Note:** High values for this index imply less attachment to the home_)
#'
#' * `SI`:Segregation Index
#' It measures the proportion of an area to which minority groups (e.g: non-whites, foreign-born, Eastern Europeans)
#' live in isolation.
#' (_**Note:** High values for this index imply that those communities are less prone to leaving their living areas and
#' as such to having lower levels of mobility_)
#'
#' ## Let's begin!
#'
#' ##### Import Libraries
library(readxl) #Library used to import excel files
library(tidyverse) # Pack of most used libraries
library(skimr) # Library used for providing a summary of the data
library(DataExplorer) # Library used in data science to perform exploratory data analysis
library(corrplot) # Library used for correlation plots
library(car) # Library used for testing autocorrelation (Durbin Watson)
library(olsrr) # Library used for testing multicollinearity (VIF, TOL, etc.)
#'
#' ##### Import dataset
dataset <- read_excel("Data/TDM_Class3_MLR_Chicago_Example.xls")
class(dataset)
#'
#' ##### Transform the dataset into a dataframe
df <- data.frame(dataset)
#'
#' ##### Show summary statistics
skim(df)
summary(df)
#'
#' ## Multiple Linear Regression
#' Equation with `TODU` as the dependent variable:
#'
#' #### Checking assumptions
#' Before running the model, you need to check if the assumptions are met.
#'
#' ##### Linear relation
#' For instance, let's take a look if the independent variables have linear relation with the dependent variable.
#'
par(mfrow=c(2,3)) #set plot area as 2 rows and 3 columns
plot(x = df$TODU, y = df$ACO, xlab = "TODU", ylab = "ACO")
plot(x = df$TODU, y = df$AHS, xlab = "TODU", ylab = "AHS")
plot(x = df$TODU, y = df$SI, xlab = "TODU", ylab = "SI")
plot(x = df$TODU, y = df$SRI, xlab = "TODU", ylab = "SRI")
plot(x = df$TODU, y = df$UI, xlab = "TODU", ylab = "UI")
#'
#' Or you could execute a pairwise scatterplot matrix, that compares every variable with each other:
#'
pairs(df[,1:6], pch = 19, lower.panel = NULL)
#'
#' > **Note:** SRI and TODU do not have a linear relationship. This should interfere on the model.
#'
#' ##### Normal distribution of the dependent variable
#' Check if the dependent variable is normally distributed.
#' If the sample is smaller than 2000 observations, use Shapiro-Wilk test:
#'
shapiro.test(df$TODU)
#'
#' If not, use the Kolmogorov-Smirnov test
#'
ks.test(df$TODU, "pnorm", mean=mean(df$TODU), sd = sd(df$TODU))
#'
#' > **Note:** Regarding the warning that appears in the Kolmogorov-Smirnov test
#' "ties should not be present for the Kolmogorov-Smirnov test", what most likely happened is that this test
#' is only reliable with continuous variables.
#'
#' Although `TODU` is a continuous variable, the small sample size (n=57) makes it likely to have repeated values.
#' Consequently, the test considers `TODU` as a categorical variable.
#' Therefore, this is another evidence, that for small samples it is more appropriate to use the Shapiro-Wilk Test.
#' The null hypothesis of both tests is that the distribution is normal.
#' Therefore, for the distribution to be normal, the pvalue > 0.05 and you should not reject the null hypothesis.
#'
#' ### Multiple linear regression model
#'
model <- lm(TODU ~ ACO + AHS + SI + SRI + UI, data = df)
summary(model)
#'
#' **Assessing the model**:
#'
#' 1. First check the **pvalue** and the **F statistics** of the model to see if there is any statistical relation
#' between the dependent variable and the independent variables.
#' If pvalue < 0.05 and the F statistics > Fcritical = 2,39, then the model is statistically acceptable.
#'
#' 2. The **R-square** and **Adjusted R-square** evaluate the amount of variance that is explained by the model.
#' The difference between one and another is that the R-square does not consider the number of variables.
#' If you increase the number of variables in the model, the R-square will tend to increase which can lead to overfitting.
#' On the other hand, the Adjusted R-square adjust to the number of independent variables.
#'
#' 3. Take a look a the **t-value** and the Pr(>|t|).
#' If the t-value > 1,96 or Pr(>|t|) < 0,05, then the IV is statistically significant to the model.
#'
#' 4. To analyze the **estimates** of the variables, you should first check the **signal**
#' and evaluate if the independent variable has a direct or inverse relationship with the dependent variable.
#' It is only possible to evaluate the **magnitude** of the estimate if all variables are continuous and standardized
#' or by calculating the elasticities. The elasticities are explained and demonstrated in chapter 4.
#'
#'
#' ##### Residuals
#' Let's see how do the residuals behave by plotting them.
#'
#' * **Residuals vs Fitted:** This plot is used to detect non-linearity, heteroscedasticity, and outliers.
#' * **Normal Q-Q:** The quantile-quantile (Q-Q) plot is used to check if the dependent variable follows a normal distribution.
#' * **Scale-Location:** This plot is used to verify if the residuals are spread equally (homoscedasticity) or not
#' (heteroscedasticity) through the sample.
#' * **Residuals vs Leverage:** This plot is used to detect the impact of the outliers in the model.
#' If the outliers are outside the Cook-distance, this may lead to serious problems in the model.
#'
#' Try analyzing the plots and check if the model meets the assumptions.
par(mfrow=c(2,2))
plot(model)
#'
#'
#' ##### Autocorrelation
#' Execute the Durbin-Watson test to evaluate autocorrelation of the residuals
durbinWatsonTest(model)
#'
#' > **Note:** In the Durbin-Watson test, values of the D-W Statistic vary from 0 to 4.
#' If the values are from 1.8 to 2.2 this means that there is no autocorrelation in the model.
#'
#' ##### Multicollinearity
#' Calculate the VIF and TOL to test for multicollinearity.
#'
ols_vif_tol(model)
#'
#' > **Note:** Values of VIF > 5, indicate multicollinearity problems.
#'
#' Calculate the Condition Index to test for multicollinearity
ols_eigen_cindex(model)
#'
#' > **Note:** Condition index values > 15 indicate multicollinearity problems,
#' and values > 30 indicate serious problems of multicollinearity.
#'
#' To test both simultaneously, you can run the code below:
ols_coll_diag(model)