---
title: "Coursera Paractical Machine Learning Course Project"
author: "Dmitry A. Grechka"
date: "May 21, 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = TRUE)
```
# Study design
1. Error rate definition
2. Splitting the data
3. Picking features
    + Cleaning data
    + Transforming covariates
4. Picking prediction function
    + Setting cross-validation options for parameter tuning
    + Tuning different prediction functions
5. Final model choice
# Study
## Initial preparation
```{r data_loading, message=FALSE}
library(caret)
set.seed(12321)
barbell_lifts <- read.csv('pml-training.csv')
```
## 1. Choosing the error rate
Let's first evaluate what kind of prediction we need.
```{r}
str(barbell_lifts$classe)
```
The outcome variable `classe` is a factor with five levels.
This means we need a classifier that assigns each observation to one of five classes. An appropriate error measure for this kind of classification is **accuracy**, which weights false positives and false negatives equally. Another possible measure is the area under the ROC curve (AUC), but we will use accuracy.
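As a minimal illustration of the metric (using made-up label vectors, not the project data), accuracy is simply the fraction of predictions that match the true labels:
```{r accuracy_toy}
# hypothetical true labels and predictions, for illustration only
truth <- c('A', 'B', 'A', 'C', 'B')
predicted <- c('A', 'B', 'C', 'C', 'B')
mean(predicted == truth) # 4 of 5 match, so accuracy is 0.8
```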
## 2. Splitting the data
To choose proper training, testing and validation data sets we first need to evaluate the amount of available data.
```{r}
nrow(barbell_lifts)
```
We have thousands of observations in pml-training.csv, which is large enough to follow a typical splitting scheme:

* 70% training
* 30% testing

The 20-question final quiz will serve as the validation set.
We will use 10-fold cross-validation on the training set to tune the prediction functions. The testing set will be used for the final out-of-sample error evaluation.
```{r splitting_data}
inTrain <- createDataPartition(y = barbell_lifts$classe, p = 0.7, list = FALSE)
training <- barbell_lifts[inTrain, ]
testing <- barbell_lifts[-inTrain, ]
```
## 3. Picking features
### 3.1. Cleaning data
#### Removing unrelated variables
The dataset contains several variables that we do not want to use as predictors:

* user_name - the model should not depend on which particular person performed the dumbbell lifts
* cvtd_timestamp - the model should not depend on the date and time of the lifts
* X - the dataset row number is not a meaningful predictor
```{r echo=F}
possPredictorNames <- colnames(training)
predictorNamesToOmit <- c('user_name', 'cvtd_timestamp', 'X', 'classe')
# keep only the candidate predictors, dropping the unrelated variables and the outcome
filteredPredictorNames <- setdiff(possPredictorNames, predictorNamesToOmit)
possPredictorNames <- filteredPredictorNames
```
After this filtering there are **`r length(possPredictorNames)`** possible predictors left.
#### Converting factors with 'numeric' levels to real numerics
The data set contains several variables that are numeric in nature but were treated as factors during import. We convert them back to numeric:
```{r echo=F, warning=F}
for (posPredName in possPredictorNames){
  if (class(training[[posPredName]]) == 'factor') {
    # converting factor to numeric via character to recover the printed values
    training[[posPredName]] <- as.numeric(as.character(training[[posPredName]]))
  }
}
training <- training[, names(training) %in% c('classe', filteredPredictorNames)]
```
#### Identifying and eliminating near-zero-variance predictors
```{r}
nzvars <- nearZeroVar(training)
training <- training[,-nzvars]
```
After this stage there are **`r ncol(training)-1`** predictors left.
#### Eliminating variables with a large fraction of NA values
```{r echo=F}
naAllowedFraction <- 0.1
# fraction of NA values in each column
naFractions <- colMeans(is.na(training))
namesToDrop <- setdiff(names(naFractions)[naFractions > naAllowedFraction], 'classe')
eliminatedDueNACount <- length(namesToDrop)
training <- training[, !(names(training) %in% namesToDrop)]
```
After excluding the variables which have more than 10% NA values (there were `r eliminatedDueNACount` such variables), there are **`r ncol(training)-1`** predictors left.
### 3.2. Transforming covariates
#### Handling correlated predictors
Let's look at the correlation matrix of all `r ncol(training)-1` predictors:
```{r}
predictorsOnly <- training[, names(training) != 'classe']
varM <- cor(predictorsOnly)
varMcol <- colorRampPalette(c("red", "white", "blue"))(20)
heatmap(x = varM, col = varMcol, symm = TRUE)
```
We can see several clusters of correlated variables.
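To put a rough number on these clusters, an optional sketch (using caret's `findCorrelation` with an arbitrary illustrative cutoff of 0.8, not part of the original analysis) counts the predictors that are highly correlated with at least one other predictor:
```{r correlated_count}
# indices of predictors with a pairwise |correlation| above the cutoff
highlyCorrelated <- findCorrelation(varM, cutoff = 0.8)
length(highlyCorrelated)
```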
We will reduce the dimensionality of the predictor space by applying PCA with the default explained-variance threshold of 95%.
```{r}
predictorsOnly <- training[, names(training) != 'classe']
pcaPreProc <- preProcess(predictorsOnly, method = "pca")
pcaPredictedTraining <- predict(pcaPreProc, predictorsOnly)
pcaPredictedTraining$classe <- training$classe
```
After this stage we have **`r ncol(pcaPredictedTraining)-1`** linearly independent principal-component predictors.
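As a quick optional sanity check (not part of the original analysis), printing the fitted `preProcess` object summarizes the PCA step, including how many components were needed to capture 95% of the variance:
```{r pca_inspect}
# the printed summary reports the number of retained principal components
pcaPreProc
```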
## 4. Picking the prediction function
### Prediction function parameter tuning with cross-validation
We will use 10-fold cross-validation, as our dataset is large enough.
This should keep both bias and variance from becoming extreme.
```{r}
trainOptions <- trainControl(
  method = "cv",
  number = 10 # number of folds
)
```
### Trying different prediction function families and tuning their parameters
We will train a set of models from different families, all with the same cross-validation options for parameter tuning:
```{r results='hide', warning=F}
rpartModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "rpart")
rfModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "rf")
gbmModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "gbm")
ldaModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "lda")
nbModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "nb")
svmRadialModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "svmRadial")
nnetModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "nnet")
trained <- list(
  rpart = rpartModel,
  rf = rfModel,
  gbm = gbmModel,
  lda = ldaModel,
  nb = nbModel,
  svm = svmRadialModel,
  nnet = nnetModel
)
```
### Evaluating in-sample & out-of-sample errors
```{r echo=F, warning=F}
# preprocessing the testing data the same way as the training data
for (posPredName in possPredictorNames){
  if (class(testing[[posPredName]]) == 'factor') {
    # converting factor to numeric via character
    testing[[posPredName]] <- as.numeric(as.character(testing[[posPredName]]))
  }
}
pcaPredictedTesting <- predict(pcaPreProc, testing)
# calculating the accuracy of each trained model
methods <- c()
inSampleAccuracies <- c()
outOfSampleAccuracies <- c()
for (i in 1:length(trained)) {
  currentMethod <- trained[[i]]
  methods <- c(methods, currentMethod$modelInfo$label)
  inSamplePrediction <- predict(currentMethod, newdata = pcaPredictedTraining)
  confM <- confusionMatrix(pcaPredictedTraining$classe, inSamplePrediction)
  inSampleAccuracies <- c(inSampleAccuracies, confM$overall[["Accuracy"]])
  outOfSamplePrediction <- predict(currentMethod, newdata = pcaPredictedTesting)
  confM <- confusionMatrix(pcaPredictedTesting$classe, outOfSamplePrediction)
  outOfSampleAccuracies <- c(outOfSampleAccuracies, confM$overall[["Accuracy"]])
}
errorRates <- data.frame(method = methods, in_sample_accuracy = inSampleAccuracies, out_of_sample_accuracy = outOfSampleAccuracies)
errorRates
```
We can see that the random forest is 100% accurate on the training sample, which suggests overfitting.
Its out-of-sample accuracy, however, is still higher than that of the other methods.
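For a closer look at where the random forest errs out of sample, an optional check (a sketch added here, not part of the original analysis) is its full confusion matrix on the testing set:
```{r rf_confusion}
# per-class breakdown of the random forest's out-of-sample predictions
confusionMatrix(predict(rfModel, newdata = pcaPredictedTesting), pcaPredictedTesting$classe)
```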
### Combining the models
Let's check how consistent the trained models are in their classifications:
```{r}
library(corrplot)
corrplot(modelCor(resamples(trained)),method='pie')
```
So the models are not strongly correlated with each other. This is good: we can combine them to try to get an even better prediction that takes advantage of each of them.
#### Ensembling the models using simple voting (mode calculation)
Let's try two combined predictors: one that combines all the models and one that combines only the most accurate ones.
The combination is done by simple voting via mode calculation: the value predicted most often across the individual models is chosen as the final prediction.
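As a minimal illustration of the voting rule (using made-up votes, not the project models), the most frequent per-model prediction wins:
```{r voting_toy}
# hypothetical predictions from three models for one observation
votes <- c('B', 'A', 'B')
names(which.max(table(votes))) # the mode, "B", becomes the final prediction
```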
```{r echo=F, warning=F}
calcmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
# predicts the class of each observation as the mode of the individual models' predictions
predictMode <- function(models, newdata, level_labels){
  # factor predictions are stored in the matrix as their integer level codes
  predictions <- matrix(ncol = length(models), nrow = nrow(newdata))
  results <- c()
  for (i in 1:length(models)) {
    currentMethod <- models[[i]]
    prediction <- predict(currentMethod, newdata = newdata)
    predictions[,i] <- prediction
  }
  for (i in 1:nrow(newdata)) {
    results <- c(results, calcmode(predictions[i,]))
  }
  return(factor(results, labels = level_labels))
}
inSampleCombinedPrediction <- predictMode(trained, pcaPredictedTraining, levels(pcaPredictedTraining$classe))
confM <- confusionMatrix(pcaPredictedTraining$classe, inSampleCombinedPrediction)
inSampleAcc <- confM$overall[["Accuracy"]]
outOfSampleCombinedPrediction <- predictMode(trained, pcaPredictedTesting, levels(pcaPredictedTesting$classe))
confM <- confusionMatrix(pcaPredictedTesting$classe, outOfSampleCombinedPrediction)
outOfSampleAcc <- confM$overall[["Accuracy"]]
combinedErrorRates <- data.frame(method = 'all combined', in_sample_accuracy = inSampleAcc, out_of_sample_accuracy = outOfSampleAcc)
# the second ensemble keeps only the most accurate models
trained.best <- trained[!(names(trained) %in% c('rpart', 'lda', 'nb'))]
inSampleCombinedPrediction <- predictMode(trained.best, pcaPredictedTraining, levels(pcaPredictedTraining$classe))
confM <- confusionMatrix(pcaPredictedTraining$classe, inSampleCombinedPrediction)
inSampleAcc <- confM$overall[["Accuracy"]]
outOfSampleCombinedPrediction <- predictMode(trained.best, pcaPredictedTesting, levels(pcaPredictedTesting$classe))
confM <- confusionMatrix(pcaPredictedTesting$classe, outOfSampleCombinedPrediction)
outOfSampleAcc <- confM$overall[["Accuracy"]]
combinedErrorRates2 <- data.frame(method = 'rf,gbm,svm,nnet combined', in_sample_accuracy = inSampleAcc, out_of_sample_accuracy = outOfSampleAcc)
rbind(combinedErrorRates, combinedErrorRates2)
```
## 5. Final model choice
It is reasonable to choose the "rf,gbm,svm,nnet combined" model, as the voting ensemble should protect it from overfitting.
At the same time this predictor is accurate enough, with an out-of-sample accuracy of **`r outOfSampleAcc`**.
#### Final quiz prediction
```{r echo=F, results='hide', warning=F}
# assessing the final quiz questions
barbell_lifts_quiz <- read.csv('pml-testing.csv')
for (posPredName in possPredictorNames){
  if (class(barbell_lifts_quiz[[posPredName]]) == 'factor') {
    # converting factor to numeric via character
    barbell_lifts_quiz[[posPredName]] <- as.numeric(as.character(barbell_lifts_quiz[[posPredName]]))
  }
}
pcaPredictedQuiz <- predict(pcaPreProc, barbell_lifts_quiz)
quizPrediction <- predictMode(trained.best, pcaPredictedQuiz, levels(pcaPredictedTesting$classe))
```
I got **"17/20 points earned (85%)"** for the final quiz with the chosen model.
An accuracy of 85% is plausible, as it is only a little lower than the out-of-sample accuracy estimate.