---
title: "R Notebook"
output:
  pdf_document: default
  html_document:
    df_print: paged
editor_options:
  chunk_output_type: console
---

```{r}
# Start from a clean workspace, then pick the working directory of the current user
rm(list = ls())
user <- Sys.info()[["effective_user"]]  # same element as Sys.info()[8]
if (user == "leoczajka") {
  setwd("/Users/leoczajka/Dropbox (Leo_and_Co)/Research/learning/Doctoral Training/machine-learning/ULB-CHM-Machine-Learners")
}
if (user == "??") {
  setwd("~/BioInformatique/MA1/Q2/machine learning/projet")
}

if (user == "??") {
  setwd("~/RProjects/Pumpit")
}

```


**Part 1** 

First we import the data and name our main (training) dataset *df*.

```{r}

train_values <- read.csv("trainingsetvalues.csv")
train_labels <- read.csv("trainingsetlabels.csv")
df <- merge(train_values, train_labels)
test_values <- read.csv("testsetvalues.csv")
```
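As a hedged sanity check on the merge, the sketch below uses toy data to show a merge with an explicit key, followed by two assertions that no row was lost or duplicated. The key name `id` and all values are assumptions made for illustration; the call above relies on `merge()` picking up the shared columns by itself.

```{r}
# Toy stand-ins for the two training files (hypothetical values)
toy_values <- data.frame(id = c(1, 2, 3), gps_height = c(10, 0, 250))
toy_labels <- data.frame(
  id = c(3, 1, 2),
  status_group = c("functional", "non functional", "functional")
)

# Naming the key explicitly avoids merging on another shared column by accident
toy_merged <- merge(toy_values, toy_labels, by = "id")

# Every row should survive the merge, and ids should stay unique
stopifnot(nrow(toy_merged) == nrow(toy_values))
stopifnot(!anyDuplicated(toy_merged$id))
```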

Then we load the packages needed for this part.
```{r}
library(Hmisc)
library(tidyverse)
library(dplyr) 
```
Let's have a general look at all features. This will allow us to quickly classify the different variables of our data set, given their type and structure.
```{r}
describe(df)
```
We remove three variables: *num_private* is almost always equal to 0 and there is no documentation about it; *recorded_by* always takes the same value; and *wpt_name* is the name of the water point, so it differs for nearly every water point and should not help to predict anything about water points outside our training set.
```{r}
describe(df$num_private)
describe(df$recorded_by)
df <- df[, !(colnames(df) %in% c("recorded_by", "num_private", "wpt_name"))]

```
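The same screening can be done for all columns at once. The following is a minimal sketch on toy data, not the procedure used above: the helper `top_share` and the 80% cut-off are hypothetical choices made for illustration.

```{r}
# Toy data frame (hypothetical values) standing in for df
toy_df <- data.frame(
  recorded_by = rep("same", 6),
  num_private = c(0, 0, 0, 0, 0, 7),
  gps_height  = c(10, 250, 0, 1300, 45, 80)
)

# Share of rows taken by the most frequent value of a column
top_share <- function(x) max(table(x, useNA = "ifany")) / length(x)

shares <- sapply(toy_df, top_share)

# Columns where a single value covers at least 80% of the rows (arbitrary cut-off)
names(shares)[shares >= 0.8]
```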
There are five strictly numerical variables (we consider dates separately): *latitude*, *longitude*, *gps_height*, *population* and *amount_tsh*. All are relevant to the prediction problem we try to solve here. A closer inspection of *longitude* reveals that some observations have the value 0, which is not possible given the geographical situation of Tanzania. A thorough correction of the data set would imply inferring the *longitude* from the categorical geographical variables available in the data set. However, this would take time and, as the share of observations for which this occurs is small (3%), we simply remove those observations from the data set. By contrast, this is not an issue for *latitude*, which seems to be relatively well distributed. About 1/3 of the *gps_height* values, on the other hand, are equal to 0. This value is of course possible, but its frequency, together with the fact that zero probably corresponds to a missing value in the *longitude* variable, suggests that *gps_height* is a relatively badly measured variable.
```{r}
hist(df$longitude)
sum(df$longitude==0)/nrow(df)
hist(df$latitude)
sum(df$latitude==0)
hist(df$gps_height)
sum(df$gps_height==0)
df <- df %>% filter(longitude != 0)
```
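For reference, the more thorough correction mentioned above (inferring the longitude from a categorical geographical variable) could look like the sketch below. This is not what the notebook does: the toy data are hypothetical, and using the regional mean as a fill-in value is an assumption. A region whose longitudes are all missing would get `NaN` and would need separate handling.

```{r}
# Toy data (hypothetical values): a longitude of 0 marks a missing coordinate
toy_geo <- data.frame(
  region    = c("A", "A", "A", "B", "B"),
  longitude = c(35.0, 0, 36.0, 31.0, 0)
)

# Treat zeros as missing, then fill each gap with the mean longitude of its region
toy_geo$longitude[toy_geo$longitude == 0] <- NA
region_mean <- ave(toy_geo$longitude, toy_geo$region,
                   FUN = function(x) mean(x, na.rm = TRUE))
missing <- is.na(toy_geo$longitude)
toy_geo$longitude[missing] <- region_mean[missing]
toy_geo
```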
The variables *population* and *amount_tsh* also seem to have been badly measured. 36% of observations have a population equal to 0, and an additional 12% have a population equal to 1. Both values could be correct, but water pumps should not be too far from the people benefiting from them, so we would have expected such values to be less frequent. The relative weight of these values is also in stark contrast with the rest of the distribution of the *population* variable. About 70% of the *amount_tsh* values are equal to 0, meaning that no water was available at the waterpoint. This figure is puzzling given that more than half of the pumps are classified as functional: we would expect some water to be available at the water point for the pump to be tried and ultimately assessed as functional. A closer inspection of the distribution of non-zero values further illustrates how irregular this variable is.

```{r}
sum(df$population==0)/nrow(df)
sum(df$population==1)/nrow(df)
hist(df$population)
hist(df$population[df$population>1 & df$population <500], breaks = seq(from=1, to=501, by=10))
hist(df$amount_tsh)
hist(df$amount_tsh[df$amount_tsh > 0 & df$amount_tsh < 1000])
sum(df$amount_tsh==0)/nrow(df)
table(df$status_group)
```
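The zero shares quoted above can be computed for several columns in one call. A minimal sketch on toy data (all values are hypothetical):

```{r}
# Toy numeric columns standing in for population and amount_tsh
toy_num <- data.frame(
  population = c(0, 0, 1, 150, 300),
  amount_tsh = c(0, 0, 0, 0, 500)
)

# Proportion of exact zeros in each column: mean() of a logical vector
zero_share <- sapply(toy_num, function(x) mean(x == 0))
zero_share
```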
The data set contains two pieces of information about time: the year when the pump was constructed (*construction_year*) and the exact date when the pump was tested (*date_recorded*). *construction_year* equals zero for about 35% of the observations. This is obviously a measurement error, so we set these observations to the median (2000 in the code below). *date_recorded* could be used as a factor variable with 356 different values, capturing daily fixed effects. However, given that this would considerably increase the dimension of the problem, and therefore the computing time of several machine learning procedures, we decide to exploit the information in this variable differently. First, we transform it into a number to capture a time trend at a daily rate. We delete 31 observations with strictly implausible values. Then we create three variables capturing the year (numeric), the month and the day of the week (both factors) during which the measurement was taken. Finally, we generate the variable *age*, equal to the difference between the year of the observation and the construction year.

```{r}
sum(df$construction_year==0)/nrow(df)
describe(df$construction_year[df$construction_year!=0])
# Replace zeros with 2000, the median referred to in the text (hard-coded)
df$construction_year[df$construction_year == 0] <- 2000

# Daily time trend: number of days since 1970-01-01
df$m_date <- as.Date(df$date_recorded)
df$daily_time_trend <- as.numeric(df$m_date)
hist(df$daily_time_trend)
table(df$daily_time_trend)
# Drop implausibly early records (14977 corresponds to 2011-01-03)
df <- df %>% filter(daily_time_trend >= 14977)
hist(df$daily_time_trend[df$daily_time_trend > 14942], breaks = seq(from = 14942, to = 16042, by = 10))

# Month and day of the week as factors, year as a number
df$m_months <- as.factor(as.numeric(format(df$m_date, '%m')))
df$m_day <- as.factor(weekdays(df$m_date))
df$m_year <- as.numeric(format(df$m_date, '%Y'))

# Age of the pump at the time of the record
df$age <- df$m_year - df$construction_year

df <- df[, !(colnames(df) %in% c("date_recorded"))]

```
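The date handling can be illustrated on two toy dates. This is a sketch with hypothetical values, and it departs from the chunk above on one point: it uses the `"%u"` format (ISO weekday number, 1 = Monday) instead of `weekdays()`, because the names returned by `weekdays()` depend on the session locale.

```{r}
toy_dates <- as.Date(c("2011-03-14", "2013-02-04"))

# Days since 1970-01-01, usable as a numeric daily time trend
toy_trend <- as.numeric(toy_dates)

# Calendar components extracted with format()
toy_year    <- as.numeric(format(toy_dates, "%Y"))
toy_month   <- factor(as.numeric(format(toy_dates, "%m")))
toy_weekday <- factor(format(toy_dates, "%u"))  # locale-independent weekday

# Age at the time of each record for a pump built in 2000 (hypothetical year)
toy_age <- toy_year - 2000
```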

The remaining variables are all categorical. One subgroup provides information on the geographical location (*basin*, *subvillage*, *region*, *region_code*, *district_code*, *lga*, *ward*). The rest provide diverse information about the pump. The geographical variables can be divided into two groups depending on the number of categories: few (*basin*, *region*, *region_code*, *district_code*, *lga*) or many (*subvillage*, *ward*). We remove those with too many levels from our data set.

```{r}
table(df$basin)
table(df$region, df$region_code)
table(df$district_code)
table(df$region_code)
table(df$lga)
describe(df$ward)
describe(df$subvillage)
df <- df[, !(colnames(df) %in% c("ward", "subvillage"))]
```
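Counting distinct values per column gives a quick way to separate the "few" from the "many" group. A minimal sketch on toy data; the values and the cut-off of 3 are hypothetical.

```{r}
# Toy categorical columns standing in for the geographical variables
toy_cat <- data.frame(
  basin      = c("Pangani", "Rufiji", "Pangani", "Rufiji"),
  subvillage = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column
n_levels <- sapply(toy_cat, function(x) length(unique(x)))

# Drop the columns with more than 3 distinct values (arbitrary cut-off)
too_many <- names(n_levels)[n_levels > 3]
toy_cat_reduced <- toy_cat[, !(colnames(toy_cat) %in% too_many), drop = FALSE]
colnames(toy_cat_reduced)
```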
Next comes the group of variables giving information about the pump at different levels of detail (*extraction_type*, *extraction_type_group*, *extraction_type_class*, *management*, *management_group*, *payment*, *payment_type*, *water_quality*, *quality_group*, *quantity*, *quantity_group*, *source*, *source_type*, *source_class*, *waterpoint_type*, *waterpoint_type_group*). For instance, *extraction_type* is strictly more precise than *extraction_type_group*, which is itself strictly more precise than *extraction_type_class*. As the computing time of machine learning algorithms varies a lot with the number of features, we decide to keep both the precise and the general versions of these features, which allows us to pick one or the other depending on the model in use.

```{r}
table(df$extraction_type, df$extraction_type_group)
table(df$extraction_type_group, df$extraction_type_class)
table(df$management, df$management_group)
table(df$payment, df$payment_type)
table(df$water_quality, df$quality_group)
table(df$quantity, df$quantity_group)
table(df$source, df$source_type)
table(df$source_type, df$source_class)
table(df$waterpoint_type, df$waterpoint_type_group)
```
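The claim that one variable is strictly more precise than another can be checked programmatically rather than by reading the cross-tables: a fine variable is nested in a coarse one when each fine level appears with exactly one coarse level. The helper `is_nested` and the toy values below are hypothetical.

```{r}
# Toy data standing in for a fine and a coarse version of the same feature
toy_pump <- data.frame(
  extraction_type       = c("gravity", "nira/tanira", "swn 80", "swn 80"),
  extraction_type_class = c("gravity", "handpump", "handpump", "handpump")
)

# TRUE when every level of `fine` co-occurs with exactly one level of `coarse`
is_nested <- function(fine, coarse) {
  tab <- table(fine, coarse)
  all(rowSums(tab > 0) == 1)
}

is_nested(toy_pump$extraction_type, toy_pump$extraction_type_class)
```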
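Finally, we create an indicator variable, *funder_is_installer*, which is TRUE whenever the *funder* is the same as the *installer*, FALSE when they differ, and "MISSING" when either name is empty. We then remove these two features, as they contain too many levels.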
```{r}
df$installer <- tolower(df$installer)
df$funder <- tolower(df$funder)
df$funder_is_installer <- df$funder == df$installer & df$installer!=""
df$funder_is_installer[df$installer=="" | df$funder==""] <- "MISSING"
table(df$funder_is_installer)

df<- df[, !(colnames(df) %in% c("funder", "installer"))]

```
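Assigning "MISSING" into a logical vector silently converts it to character. An alternative that makes the three levels explicit is sketched below on toy data; the names and values are hypothetical, and storing the result as a factor is a design choice rather than what the chunk above does.

```{r}
# Toy funder and installer names, lower-cased as in the chunk above
toy_funder    <- tolower(c("Danida", "World Bank", "", "RWSSP"))
toy_installer <- tolower(c("DANIDA", "DWE", "DWE", ""))

# Start from "MISSING" and overwrite only where both names are known
toy_same <- rep("MISSING", length(toy_funder))
known <- toy_funder != "" & toy_installer != ""
toy_same[known] <- ifelse(toy_funder[known] == toy_installer[known], "TRUE", "FALSE")

# A factor with fixed levels keeps the coding identical across data sets
toy_same <- factor(toy_same, levels = c("FALSE", "TRUE", "MISSING"))
table(toy_same)
```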
Last, we run a chi-squared test of each categorical variable with relatively few levels against our dependent variable. All p-values are below the usual significance levels; therefore, if any discrimination can be made between these categorical variables, it should be on the basis of the magnitude of the test statistic. In decreasing order, the most relevant categorical variables are the following: *quantity*, *quantity_group*, *waterpoint_type*, *extraction_type*, *extraction_type_group*, *extraction_type_class*, *waterpoint_type_group*, *region_code*, *region*, *payment_type*, *payment*, *source*, *water_quality*, *quality_group*, *management*, *scheme_management*, *basin*, *source_type*, *district_code*, *source_class*, *public_meeting*, *management_group*, *permit*.

```{r}
chisq.test(df$status_group, df$basin)
chisq.test(df$status_group, df$region)
chisq.test(df$status_group, df$region_code)
chisq.test(df$status_group, df$district_code)
chisq.test(df$status_group, df$public_meeting)
chisq.test(df$status_group, df$scheme_management)
chisq.test(df$status_group, df$permit)
chisq.test(df$status_group, df$extraction_type)
chisq.test(df$status_group, df$extraction_type_group)
chisq.test(df$status_group, df$extraction_type_class)
chisq.test(df$status_group, df$management)
chisq.test(df$status_group, df$management_group)
chisq.test(df$status_group, df$payment)
chisq.test(df$status_group, df$payment_type)
chisq.test(df$status_group, df$water_quality)
chisq.test(df$status_group, df$quality_group)
chisq.test(df$status_group, df$quantity)
chisq.test(df$status_group, df$quantity_group)
chisq.test(df$status_group, df$source)
chisq.test(df$status_group, df$source_type)
chisq.test(df$status_group, df$source_class)
chisq.test(df$status_group, df$waterpoint_type)
chisq.test(df$status_group, df$waterpoint_type_group)
```
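The repeated calls can also be collected into one table and sorted by the test statistic. The sketch below uses randomly generated toy data (so no real association is expected) and hypothetical column choices. Note that the raw statistic tends to grow with the degrees of freedom, so ranking variables with different numbers of levels by it is only a rough guide.

```{r}
# Toy data: random categories, hypothetical stand-ins for the real columns
set.seed(1)
n <- 300
toy_chi <- data.frame(
  status_group = sample(c("functional", "non functional"), n, replace = TRUE),
  basin        = sample(c("a", "b", "c"), n, replace = TRUE),
  quantity     = sample(c("dry", "enough"), n, replace = TRUE)
)
predictors <- setdiff(colnames(toy_chi), "status_group")

# One row per predictor: test statistic, degrees of freedom and p-value
chi_summary <- t(sapply(predictors, function(v) {
  test <- chisq.test(toy_chi$status_group, toy_chi[[v]])
  c(statistic = unname(test$statistic),
    df        = unname(test$parameter),
    p_value   = test$p.value)
}))

# Sort by decreasing test statistic
chi_summary[order(-chi_summary[, "statistic"]), , drop = FALSE]
```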

We apply the same modifications to the *test_values* dataset. The two row filters are kept commented out, so no test observation is dropped.
```{r}

test_values <- test_values[, !(colnames(test_values) %in% c("recorded_by", "num_private", "wpt_name"))]

# test_values <- test_values %>% filter(test_values$longitude!=0)
test_values$construction_year[test_values$construction_year == 0] <- 2000
test_values$m_date <- as.Date(test_values$date_recorded)
test_values$daily_time_trend <- as.numeric(test_values$m_date)
# test_values <- test_values %>% filter(test_values$daily_time_trend>=14977)

test_values$m_months <- as.factor(as.numeric(format(test_values$m_date, '%m')))
test_values$m_day <- as.factor(weekdays(test_values$m_date))

test_values$m_year <- as.numeric(format(test_values$m_date, '%Y'))
test_values$age <- test_values$m_year - test_values$construction_year

test_values <- test_values[, !(colnames(test_values) %in% c("date_recorded"))]

test_values <- test_values[, !(colnames(test_values) %in% c("ward", "subvillage"))]

test_values$installer <- tolower(test_values$installer)
test_values$funder <- tolower(test_values$funder)
test_values$funder_is_installer <- test_values$funder == test_values$installer & test_values$installer != ""
test_values$funder_is_installer[test_values$installer == "" | test_values$funder == ""] <- "MISSING"

test_values <- test_values[, !(colnames(test_values) %in% c("funder", "installer"))]

```
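Since the column-wise steps are identical for both data sets, one way to keep them in sync is to wrap them in a single function and call it twice. The sketch below covers only a few of the steps, on toy data; the function name `prepare_features`, its `fill_year` argument and all values are hypothetical. Row filters are deliberately left out of the function so that it can be applied to the test set unchanged.

```{r}
# Hypothetical helper bundling a few of the column-wise steps used above
prepare_features <- function(data, fill_year = 2000) {
  # Replace the zero construction years with the chosen fill-in year
  data$construction_year[data$construction_year == 0] <- fill_year
  # Derive the time trend, the year of the record and the age of the pump
  date <- as.Date(data$date_recorded)
  data$daily_time_trend <- as.numeric(date)
  data$m_year <- as.numeric(format(date, "%Y"))
  data$age <- data$m_year - data$construction_year
  # Drop the raw date column once the derived features exist
  data[, !(colnames(data) %in% c("date_recorded")), drop = FALSE]
}

# Toy train and test sets (hypothetical values)
toy_train <- data.frame(date_recorded = c("2011-03-14", "2013-02-04"),
                        construction_year = c(0, 1995))
toy_test  <- data.frame(date_recorded = "2012-10-01", construction_year = 2008)

toy_train_prepared <- prepare_features(toy_train)
toy_test_prepared  <- prepare_features(toy_test)
```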