---
title: "Coursera Paractical Machine Learning Course Project"
author: "Dmitry A. Grechka"
date: "May 21, 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache = TRUE)
```
# Study design
1. Error rate definition
2. Splitting the data
3. Picking features
    + Cleaning data
    + Transforming covariates
4. Picking prediction function
    + Setting cross-validation options for parameter tuning
    + Tuning different prediction functions
5. Final model choice
# Study
## Initial preparation
```{r data_loading, message=FALSE}
library(caret)
set.seed(12321)
barbell_lifts <- read.csv('pml-training.csv')
```
## 1. Choosing the error rate
Let's first evaluate what kind of prediction we need.
```{r}
str(barbell_lifts$classe)
```
The outcome variable `classe` is a factor with five levels.
This means we need a classifier that assigns each observation to one of five classes. An appropriate error measure for this kind of classification is **accuracy**, which weights false positives and false negatives equally. Another possible measure is the area under the ROC curve (AUC), but we will use accuracy.
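As a minimal illustration of the metric (using made-up label vectors, not the project data), accuracy is simply the fraction of predictions that match the true labels:
```{r accuracy_toy}
# hypothetical true labels and predictions, for illustration only
truth <- c('A', 'B', 'A', 'C', 'B')
predicted <- c('A', 'B', 'C', 'C', 'B')
mean(predicted == truth) # 4 of 5 match, so accuracy is 0.8
```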
## 2. Splitting the data
To choose proper training, testing and validation data sets we first need to evaluate the amount of available data.
```{r}
nrow(barbell_lifts)
```
We have thousands of observations in pml-training.csv, which is large enough to follow a typical splitting scheme:

* 70% training
* 30% testing

The 20-question final quiz will serve as the validation set.
We will use 10-fold cross-validation on the training set to tune the prediction functions. The testing set will be used for the final out-of-sample error evaluation.
```{r splitting_data}
inTrain <- createDataPartition(y = barbell_lifts$classe, p = 0.7, list = FALSE)
training <- barbell_lifts[inTrain, ]
testing <- barbell_lifts[-inTrain, ]
```
## 3. Picking features
### 3.1. Cleaning data
#### Removing unrelated variables
The dataset contains several variables that we do not want to use as predictors:

* user_name - the model should not depend on which particular person performed the dumbbell lifts
* cvtd_timestamp - the model should not depend on the date and time of the lifts
* X - the dataset row number is not a meaningful predictor
```{r echo=F}
possPredictorNames <- colnames(training)
predictorNamesToOmit <- c('user_name', 'cvtd_timestamp', 'X', 'classe')
# keep only the candidate predictors, dropping the unrelated variables and the outcome
filteredPredictorNames <- setdiff(possPredictorNames, predictorNamesToOmit)
possPredictorNames <- filteredPredictorNames
```
After this filtering there are **`r length(possPredictorNames)`** possible predictors left.
#### Converting factors with 'numeric' levels to real numerics
The data set contains several variables that are numeric in nature but were treated as factors during import. We convert them back to numeric:
```{r echo=F, warning=F}
for (posPredName in possPredictorNames){
  if (class(training[[posPredName]]) == 'factor') {
    # converting factor to numeric via character to recover the printed values
    training[[posPredName]] <- as.numeric(as.character(training[[posPredName]]))
  }
}
training <- training[, names(training) %in% c('classe', filteredPredictorNames)]
```
#### Identifying and eliminating near-zero-variance predictors
```{r}
nzvars <- nearZeroVar(training)
training <- training[,-nzvars]
```
After this stage there are **`r ncol(training)-1`** predictors left.
#### Eliminating variables with a large fraction of NA values
```{r echo=F}
naAllowedFraction <- 0.1
# fraction of NA values in each column
naFractions <- colMeans(is.na(training))
namesToDrop <- setdiff(names(naFractions)[naFractions > naAllowedFraction], 'classe')
eliminatedDueNACount <- length(namesToDrop)
training <- training[, !(names(training) %in% namesToDrop)]
```
After excluding the variables which have more than 10% NA values (there were `r eliminatedDueNACount` such variables), there are **`r ncol(training)-1`** predictors left.
### 3.2. Transforming covariates
#### Handling correlated predictors
Let's look at the correlation matrix of all `r ncol(training)-1` predictors:
```{r}
predictorsOnly <- training[, names(training) != 'classe']
varM <- cor(predictorsOnly)
varMcol <- colorRampPalette(c("red", "white", "blue"))(20)
heatmap(x = varM, col = varMcol, symm = TRUE)
```
We can see several clusters of correlated variables.
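To put a rough number on these clusters, an optional sketch (using caret's `findCorrelation` with an arbitrary illustrative cutoff of 0.8, not part of the original analysis) counts the predictors that are highly correlated with at least one other predictor:
```{r correlated_count}
# indices of predictors with a pairwise |correlation| above the cutoff
highlyCorrelated <- findCorrelation(varM, cutoff = 0.8)
length(highlyCorrelated)
```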
We will reduce the dimensionality of the predictor space by applying PCA with the default explained-variance threshold of 95%.
```{r}
predictorsOnly <- training[, names(training) != 'classe']
pcaPreProc <- preProcess(predictorsOnly, method = "pca")
pcaPredictedTraining <- predict(pcaPreProc, predictorsOnly)
pcaPredictedTraining$classe <- training$classe
```
After this stage we have **`r ncol(pcaPredictedTraining)-1`** linearly independent principal-component predictors.
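As a quick optional sanity check (not part of the original analysis), printing the fitted `preProcess` object summarizes the PCA step, including how many components were needed to capture 95% of the variance:
```{r pca_inspect}
# the printed summary reports the number of retained principal components
pcaPreProc
```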
## 4. Picking the prediction function
### Prediction function parameter tuning with cross-validation
We will use 10-fold cross-validation, as our dataset is large enough.
This should keep both bias and variance from becoming extreme.
```{r}
trainOptions <- trainControl(
  method = "cv",
  number = 10 # number of folds
)
```
### Trying different prediction function families and tuning their parameters
We will train a set of models from different families, all with the same cross-validation options for parameter tuning:
```{r results='hide', warning=F}
rpartModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "rpart")
rfModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "rf")
gbmModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "gbm")
ldaModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "lda")
nbModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "nb")
svmRadialModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "svmRadial")
nnetModel <- train(classe ~ ., data = pcaPredictedTraining, trControl = trainOptions, method = "nnet")
trained <- list(
  rpart = rpartModel,
  rf = rfModel,
  gbm = gbmModel,
  lda = ldaModel,
  nb = nbModel,
  svm = svmRadialModel,
  nnet = nnetModel
)
```
### Evaluating in-sample & out-of-sample errors
```{r echo=F, warning=F}
# preprocessing the testing data the same way as the training data
for (posPredName in possPredictorNames){
  if (class(testing[[posPredName]]) == 'factor') {
    # converting factor to numeric via character
    testing[[posPredName]] <- as.numeric(as.character(testing[[posPredName]]))
  }
}
pcaPredictedTesting <- predict(pcaPreProc, testing)
# calculating the accuracy of each trained model
methods <- c()
inSampleAccuracies <- c()
outOfSampleAccuracies <- c()
for (i in 1:length(trained)) {
  currentMethod <- trained[[i]]
  methods <- c(methods, currentMethod$modelInfo$label)
  inSamplePrediction <- predict(currentMethod, newdata = pcaPredictedTraining)
  confM <- confusionMatrix(pcaPredictedTraining$classe, inSamplePrediction)
  inSampleAccuracies <- c(inSampleAccuracies, confM$overall[["Accuracy"]])
  outOfSamplePrediction <- predict(currentMethod, newdata = pcaPredictedTesting)
  confM <- confusionMatrix(pcaPredictedTesting$classe, outOfSamplePrediction)
  outOfSampleAccuracies <- c(outOfSampleAccuracies, confM$overall[["Accuracy"]])
}
errorRates <- data.frame(method = methods, in_sample_accuracy = inSampleAccuracies, out_of_sample_accuracy = outOfSampleAccuracies)
errorRates
```
We can see that the random forest is 100% accurate on the training sample, which suggests overfitting.
Its out-of-sample accuracy, however, is still higher than that of the other methods.
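For a closer look at where the random forest errs out of sample, an optional check (a sketch added here, not part of the original analysis) is its full confusion matrix on the testing set:
```{r rf_confusion}
# per-class breakdown of the random forest's out-of-sample predictions
confusionMatrix(predict(rfModel, newdata = pcaPredictedTesting), pcaPredictedTesting$classe)
```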
### Combining the models
Let's check how consistent the trained models are in their classifications:
```{r}
library(corrplot)
corrplot(modelCor(resamples(trained)),method='pie')
```
So the models are not strongly correlated with each other. This is good: we can combine them to try to get an even better prediction that takes advantage of each of them.
#### Ensembling the models using simple voting (mode calculation)
Let's try two combined predictors: one that combines all the models and one that combines only the most accurate ones.
The combination is done by simple voting via mode calculation: the value predicted most often across the individual models is chosen as the final prediction.
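As a minimal illustration of the voting rule (using made-up votes, not the project models), the most frequent per-model prediction wins:
```{r voting_toy}
# hypothetical predictions from three models for one observation
votes <- c('B', 'A', 'B')
names(which.max(table(votes))) # the mode, "B", becomes the final prediction
```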
```{r echo=F, warning=F}
calcmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
# predicts the class of each observation as the mode of the individual models' predictions
predictMode <- function(models, newdata, level_labels){
  # factor predictions are stored in the matrix as their integer level codes
  predictions <- matrix(ncol = length(models), nrow = nrow(newdata))
  results <- c()
  for (i in 1:length(models)) {
    currentMethod <- models[[i]]
    prediction <- predict(currentMethod, newdata = newdata)
    predictions[,i] <- prediction
  }
  for (i in 1:nrow(newdata)) {
    results <- c(results, calcmode(predictions[i,]))
  }
  return(factor(results, labels = level_labels))
}
inSampleCombinedPrediction <- predictMode(trained, pcaPredictedTraining, levels(pcaPredictedTraining$classe))
confM <- confusionMatrix(pcaPredictedTraining$classe, inSampleCombinedPrediction)
inSampleAcc <- confM$overall[["Accuracy"]]
outOfSampleCombinedPrediction <- predictMode(trained, pcaPredictedTesting, levels(pcaPredictedTesting$classe))
confM <- confusionMatrix(pcaPredictedTesting$classe, outOfSampleCombinedPrediction)
outOfSampleAcc <- confM$overall[["Accuracy"]]
combinedErrorRates <- data.frame(method = 'all combined', in_sample_accuracy = inSampleAcc, out_of_sample_accuracy = outOfSampleAcc)
# the second ensemble keeps only the most accurate models
trained.best <- trained[!(names(trained) %in% c('rpart', 'lda', 'nb'))]
inSampleCombinedPrediction <- predictMode(trained.best, pcaPredictedTraining, levels(pcaPredictedTraining$classe))
confM <- confusionMatrix(pcaPredictedTraining$classe, inSampleCombinedPrediction)
inSampleAcc <- confM$overall[["Accuracy"]]
outOfSampleCombinedPrediction <- predictMode(trained.best, pcaPredictedTesting, levels(pcaPredictedTesting$classe))
confM <- confusionMatrix(pcaPredictedTesting$classe, outOfSampleCombinedPrediction)
outOfSampleAcc <- confM$overall[["Accuracy"]]
combinedErrorRates2 <- data.frame(method = 'rf,gbm,svm,nnet combined', in_sample_accuracy = inSampleAcc, out_of_sample_accuracy = outOfSampleAcc)
rbind(combinedErrorRates, combinedErrorRates2)
```
## 5. Final model choice
It is reasonable to choose the "rf,gbm,svm,nnet combined" model, as the voting ensemble should protect it from overfitting.
At the same time this predictor is accurate enough, with an out-of-sample accuracy of **`r outOfSampleAcc`**.
#### Final quiz prediction
```{r echo=F, results='hide', warning=F}
# assessing the final quiz questions
barbell_lifts_quiz <- read.csv('pml-testing.csv')
for (posPredName in possPredictorNames){
  if (class(barbell_lifts_quiz[[posPredName]]) == 'factor') {
    # converting factor to numeric via character
    barbell_lifts_quiz[[posPredName]] <- as.numeric(as.character(barbell_lifts_quiz[[posPredName]]))
  }
}
pcaPredictedQuiz <- predict(pcaPreProc, barbell_lifts_quiz)
quizPrediction <- predictMode(trained.best, pcaPredictedQuiz, levels(pcaPredictedTesting$classe))
```
I got **"17/20 points earned (85%)"** for the final quiz with the chosen model.
An accuracy of 85% is plausible, as it is only a little lower than the out-of-sample accuracy estimate.