Description.txt

# PRE-PROCESSING

* Processing of missing values

Before starting, note that we removed the feature in the fifth column of the X_train and X_test matrices: its covariance with the other features was small, so it carried little information about the problem. We also verified that including the fifth variable increases the validation error for every algorithm tested.
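As an illustration of this step, the column removal could look like the sketch below. It assumes the data is held in NumPy arrays; the helper name drop_fifth_column and the toy 5-column matrix are hypothetical and only stand in for X_train and X_test.

```python
import numpy as np

def drop_fifth_column(X):
    """Return a copy of X without its fifth column (index 4)."""
    return np.delete(X, 4, axis=1)

# Toy matrix with 5 columns, standing in for X_train / X_test (assumed layout).
X_demo = np.arange(10, dtype=float).reshape(2, 5)
X_reduced = drop_fifth_column(X_demo)
```

The same call would be applied to both X_train and X_test so that the two matrices keep matching columns.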

To estimate the missing values (nan), we used a Gaussian Process Regression model. The procedure divided the X_train data matrix into three sub-matrices: X_train_aux, X_test_aux and y_train_aux.

X_train_aux contains the X2, X3 and X4 values of the samples for which X1 is known. X_test_aux contains the X2, X3 and X4 values of the samples for which X1 is NOT known (nan). Finally, y_train_aux contains the known X1 values.

Once this split is made, all the datasets are normalized with respect to the mean and standard deviation of the X_train_aux data.
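A minimal sketch of the split and the normalization, under some assumptions that the text does not state: X1 is stored in column 0 and X2, X3, X4 in columns 1 to 3, the columns X2 to X4 contain no nan, and only the input matrices (not y_train_aux) are standardized. The function names are hypothetical.

```python
import numpy as np

def split_for_imputation(X):
    """Split X (columns X1..X4) by whether X1 (column 0) is known."""
    known = ~np.isnan(X[:, 0])
    X_train_aux = X[known, 1:]    # X2, X3, X4 where X1 is known
    y_train_aux = X[known, 0]     # the known X1 values
    X_test_aux = X[~known, 1:]    # X2, X3, X4 where X1 is nan
    return X_train_aux, y_train_aux, X_test_aux

def standardize(X_train_aux, X_test_aux):
    """Scale both matrices with the mean and std of X_train_aux only."""
    mu = X_train_aux.mean(axis=0)
    sd = X_train_aux.std(axis=0)
    return (X_train_aux - mu) / sd, (X_test_aux - mu) / sd

# Toy data: rows 1 and 3 have an unknown X1.
X_demo = np.array([[1.0,    2.0, 3.0, 4.0],
                   [np.nan, 1.0, 0.0, 2.0],
                   [3.0,    4.0, 5.0, 8.0],
                   [np.nan, 3.0, 1.0, 6.0]])

X_train_aux, y_train_aux, X_test_aux = split_for_imputation(X_demo)
X_train_norm, X_test_norm = standardize(X_train_aux, X_test_aux)
```

Reusing the X_train_aux statistics for X_test_aux keeps both matrices on the same scale, which is what the regression model expects at prediction time.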

With these matrices, the regression model is constructed using Gaussian Process Regression, in which the hyperparameters L and sigma_eps were selected by cross-validation. Once these hyperparameters were optimized, we estimated the unknown X1 values (the model output for X_test_aux).

The same procedure is used to estimate the unknown values of X_test. The only thing that changes is X_test_aux, which is now built from X_test; the matrices X_train_aux and y_train_aux remain the same.
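The imputation described above could be sketched as follows. This is not the original implementation: it is a plain NumPy version of the Gaussian Process posterior mean, and it makes several assumptions. The kernel is a Squared Exponential one (the text does not say which kernel was used for imputation), sigma_eps is treated as the noise standard deviation, the targets are centered by their mean, X1 sits in column 0, and the values L=2.0 and sigma_eps=0.1 are arbitrary placeholders rather than cross-validated results.

```python
import numpy as np

def se_kernel(A, B, L):
    """Squared Exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / L**2)

def gp_predict(X_tr, y_tr, X_te, L, sigma_eps):
    """GP posterior mean at X_te; sigma_eps is assumed to be the noise std."""
    K = se_kernel(X_tr, X_tr, L) + sigma_eps**2 * np.eye(len(X_tr))
    y_mean = y_tr.mean()                      # centering is an assumption
    alpha = np.linalg.solve(K, y_tr - y_mean)
    return se_kernel(X_te, X_tr, L) @ alpha + y_mean

def impute_x1(X_target, X_train, L, sigma_eps):
    """Fill nan entries of column 0 in X_target using a GP fitted on X_train."""
    known = ~np.isnan(X_train[:, 0])
    X_train_aux, y_train_aux = X_train[known, 1:], X_train[known, 0]
    mu, sd = X_train_aux.mean(axis=0), X_train_aux.std(axis=0)
    missing = np.isnan(X_target[:, 0])
    X_out = X_target.copy()
    if missing.any():
        X_test_aux = (X_target[missing, 1:] - mu) / sd
        X_out[missing, 0] = gp_predict((X_train_aux - mu) / sd, y_train_aux,
                                       X_test_aux, L, sigma_eps)
    return X_out

# Synthetic data, only for illustration: X1 depends on X2, X3 and X4.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 4))
X_train[:, 0] = X_train[:, 1:].sum(axis=1) + 0.1 * rng.normal(size=60)
X_train[::5, 0] = np.nan                      # hide every fifth X1 value
X_test = rng.normal(size=(20, 4))
X_test[::4, 0] = np.nan

# X_train_aux and y_train_aux always come from X_train; only the target changes.
X_train_filled = impute_x1(X_train, X_train, L=2.0, sigma_eps=0.1)
X_test_filled = impute_x1(X_test, X_train, L=2.0, sigma_eps=0.1)
```

Both calls pass X_train as the second argument, which mirrors the point that X_train_aux and y_train_aux stay fixed while X_test_aux changes.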


# MODEL / ALGORITHM CHOSEN

The model chosen to make the predictions was, once again, Gaussian Process Regression. The regression model was fitted on the X_train and y_train matrices, and the optimal values of L and sigma_eps were found by cross-validation.

As a first approximation, we used the SE (Squared Exponential) kernel. Next, we tested the Matérn kernel with ν = 5/2 and found that the results improved slightly, so we selected it as the final kernel.
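For reference, the two kernels compared above can be written as below. This is a NumPy sketch of the standard textbook definitions with unit signal variance, where L is assumed to be the length scale; the function names are hypothetical and the sample points are arbitrary.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2)

def se_kernel(A, B, L):
    """Squared Exponential kernel: exp(-r^2 / (2 L^2))."""
    r = pairwise_dist(A, B)
    return np.exp(-0.5 * (r / L) ** 2)

def matern52_kernel(A, B, L):
    """Matern kernel with nu = 5/2: (1 + s + s^2/3) exp(-s), s = sqrt(5) r / L."""
    s = np.sqrt(5.0) * pairwise_dist(A, B) / L
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

# Three arbitrary points in a 3-dimensional feature space.
pts = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 2.0, 0.0]])
K_se = se_kernel(pts, pts, L=1.0)
K_matern = matern52_kernel(pts, pts, L=1.0)
```

Both kernels equal 1 at zero distance and decay with distance; the Matérn 5/2 kernel yields less smooth functions than the SE kernel, which may explain why it fits some datasets slightly better.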

The Gaussian Process model was used twice: first, to estimate the unknown values (nan) of the variable X1 from the matrices X_train_aux, y_train_aux and X_test_aux described above; and second, to estimate the output 'y' from X_train, X_test and y_train.


# SELECTION OF HYPERPARAMETERS

The hyperparameters L and sigma_eps of the Gaussian Processes were selected by 10-fold cross-validation. To find the optimal values, we started by testing 10 linearly spaced values within a wide range for both parameters (exploration). We then narrowed the range around the values that produced the lowest validation root mean square error (exploitation).

To find the optimal values, we defined a square matrix in which each element corresponds to a specific tuple (sigma_eps, L). We then searched this matrix for the minimum root mean square error (averaged over the 10 folds) and selected the values of L and sigma_eps that produced it.
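The search described in this section could be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the original code: the GP uses a Matérn 5/2 kernel with sigma_eps as the noise standard deviation, the targets are centered by their mean, the refinement step narrows each range to one coarse grid step on either side of the best value (the text does not give the exact rule), and the search ranges and synthetic data are arbitrary. All function names are hypothetical.

```python
import numpy as np

def matern52(A, B, L):
    r = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
    s = np.sqrt(5.0) * r / L
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

def gp_predict(X_tr, y_tr, X_te, L, sigma_eps):
    """GP posterior mean; sigma_eps is assumed to be the noise std."""
    K = matern52(X_tr, X_tr, L) + sigma_eps**2 * np.eye(len(X_tr))
    y_mean = y_tr.mean()
    return matern52(X_te, X_tr, L) @ np.linalg.solve(K, y_tr - y_mean) + y_mean

def cv_rmse(X, y, L, sigma_eps, folds):
    """Validation RMSE averaged over the given folds."""
    errs = []
    for val_idx in folds:
        tr = np.ones(len(X), dtype=bool)
        tr[val_idx] = False
        pred = gp_predict(X[tr], y[tr], X[val_idx], L, sigma_eps)
        errs.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))
    return np.mean(errs)

def grid_search(X, y, sigmas, Ls, folds):
    """Error matrix E[i, j] for the tuple (sigmas[i], Ls[j]) and its minimizer."""
    E = np.array([[cv_rmse(X, y, L, s, folds) for L in Ls] for s in sigmas])
    i, j = np.unravel_index(np.argmin(E), E.shape)
    return sigmas[i], Ls[j], E

def coarse_to_fine(X, y, sigma_range, L_range, n_values=10, n_folds=10, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    # Exploration: linearly spaced values over a wide range.
    sigmas = np.linspace(*sigma_range, n_values)
    Ls = np.linspace(*L_range, n_values)
    s0, L0, _ = grid_search(X, y, sigmas, Ls, folds)
    # Exploitation: one coarse step around the best tuple (assumed rule).
    ds, dL = sigmas[1] - sigmas[0], Ls[1] - Ls[0]
    sigmas_fine = np.linspace(max(s0 - ds, sigma_range[0]), s0 + ds, n_values)
    Ls_fine = np.linspace(max(L0 - dL, L_range[0]), L0 + dL, n_values)
    return grid_search(X, y, sigmas_fine, Ls_fine, folds)

# Synthetic data, only for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=60)

best_sigma, best_L, E = coarse_to_fine(X, y, sigma_range=(0.05, 2.0),
                                       L_range=(0.1, 5.0))
```

Here E is the square matrix of fold-averaged errors from the refined grid, with rows indexed by sigma_eps and columns by L. The same folds are reused for every tuple so that the errors are comparable across the matrix.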