A collection of machine learning projects developed while following some of the courses I took at Eurecom: Algorithmic Machine Learning, Digital Image Processing (Notes), Advanced Statistical Inference, Distributed Systems and Cloud Computing (Notes) and Deep Learning.
- Denoising.
After loading the image in Matlab, I added some noise and compared the results after applying three different filters (a Python sketch of this comparison follows the list below):
- Averaging filter
- Median filter
- Wiener filter
- Low level feature detection.
In order to highlight the edges, I compared the performance of three approaches:
- Gradient filter
- Laplacian filter's zero crossings
- Canny edge detector
- High level detection and interpretation. I displayed and analyzed the Radon and Hough transforms. After interpreting the points of the Radon transform, I extracted the image orientation and rotated the image accordingly.
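The lab itself is in Matlab; the following is a minimal Python sketch of the denoising comparison above, assuming a grayscale image loaded as a float array (the file name and noise level are placeholders).

```python
import numpy as np
from scipy import ndimage, signal
from skimage import io, img_as_float

# Load a grayscale image and corrupt it with additive Gaussian noise
# (the file name and the noise standard deviation are illustrative).
img = img_as_float(io.imread("lena.png", as_gray=True))
noisy = np.clip(img + np.random.normal(0, 0.05, img.shape), 0, 1)

# The three denoising filters compared in the lab.
averaged = ndimage.uniform_filter(noisy, size=3)   # averaging (box) filter
median = ndimage.median_filter(noisy, size=3)      # median filter
wiener = signal.wiener(noisy, mysize=3)            # adaptive Wiener filter

def psnr(reference, estimate):
    """Peak signal-to-noise ratio, used to compare the filtered results."""
    mse = np.mean((reference - estimate) ** 2)
    return 10 * np.log10(1.0 / mse)

for name, result in [("averaging", averaged), ("median", median), ("wiener", wiener)]:
    print(f"{name:>9}: PSNR = {psnr(img, result):.2f} dB")
```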
I studied and implemented the k-means algorithm using Python and numpy. I analyzed the convergence of the algorithm on several 2D datasets of different shapes, which make the results easy to visualize. I implemented the smart centroid initialization of k-means++ to improve the clustering of unconventional shapes, and I used the elbow method to find the optimal number of clusters. Finally, I implemented a distributed version of k-means with PySpark and compared it with the serial implementation.
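A minimal numpy sketch of k-means with k-means++ seeding, in the spirit of the notebook (function and variable names are illustrative, not the ones used in the project).

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: spread initial centroids proportionally to squared distances."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None] - np.array(centroids)) ** 2).sum(-1), axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        # Assignment step: closest centroid for every point.
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Toy example: two Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```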
I analyzed a flight dataset using the DataFrame API, which provides a higher-level abstraction over distributed data than RDDs (Resilient Distributed Datasets), the basic abstraction in Spark. I performed data exploration and analysis using pandas and SparkSQL, and I built insightful visualizations using seaborn and matplotlib. Finally, I integrated the original dataset with airport, carrier, plane and weather data to answer additional exploratory questions.
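A short PySpark sketch of the kind of DataFrame analysis described above, assuming a CSV of flights with origin, dest and arr_delay columns (the file path and column names are illustrative).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flights-exploration").getOrCreate()

# Illustrative path and schema: a flights CSV with origin, dest and arr_delay columns.
flights = spark.read.csv("hdfs:///data/flights.csv", header=True, inferSchema=True)

# Average arrival delay per route, keeping only reasonably frequent routes.
delays = (flights
          .groupBy("origin", "dest")
          .agg(F.count("*").alias("n_flights"),
               F.avg("arr_delay").alias("avg_arr_delay"))
          .filter("n_flights > 100")
          .orderBy(F.desc("avg_arr_delay")))

# Convert a small aggregate to pandas for plotting with seaborn/matplotlib.
delays_pd = delays.limit(20).toPandas()
print(delays_pd.head())
```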
I performed classification over the Fashion MNIST and CIFAR10 datasets using different models. I implemented the naive Bayes classifier and Bayesian regression using Python and numpy, following the scikit-learn declaration style. I analyzed the performance of these models in terms of error, accuracy and confusion matrices, and I highlighted the strengths and weaknesses of the two approaches (clearly, Bayesian regression does not fit the classification task). Furthermore, I implemented a convolutional neural network using pytorch, trained it on the Fashion MNIST dataset and analyzed the convolutional filters. Finally, I explored possible ways to integrate the naive Bayes features into the convolutional neural network.
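A minimal sketch of a Gaussian naive Bayes classifier with a scikit-learn-style fit/predict interface, along the lines described above (class and variable names are illustrative, not the project's own).

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian naive Bayes with a scikit-learn-like fit/predict interface."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Per-class priors, feature means and variances (with a small variance floor).
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        # Log-posterior up to a constant: log prior plus per-feature Gaussian log-likelihoods.
        log_post = []
        for p, mu, var in zip(self.priors_, self.means_, self.vars_):
            ll = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
            log_post.append(np.log(p) + ll)
        return self.classes_[np.argmax(np.stack(log_post, axis=1), axis=1)]

# Toy usage on random data shaped like flattened Fashion MNIST images.
rng = np.random.default_rng(0)
X, y = rng.random((100, 784)), rng.integers(0, 10, 100)
print(GaussianNaiveBayes().fit(X, y).predict(X[:5]))
```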
Digital image processing - Image filtering (Matlab), Stereo images (Matlab), Image filtering (OpenCV)

I analyzed the behaviour and performance of different filters in the spatial and frequency domains using Matlab and OpenCV:
- Linear filtering in the frequency domain and analysis of the cutoff frequency (see the sketch after this list)
- Linear filtering in the spatial domain using the averaging filter, varying its size and the noise intensity
- Non-linear filtering with the median filter, and its performance compared to the linear one.
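As an illustration of the frequency-domain filtering above, a minimal numpy/OpenCV sketch of an ideal low-pass filter with a tunable cutoff frequency (the image path and cutoff values are placeholders).

```python
import cv2
import numpy as np

# Illustrative input: a grayscale image loaded with OpenCV.
img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

def ideal_lowpass(image, cutoff):
    """Ideal low-pass filtering in the frequency domain with a circular cutoff."""
    F = np.fft.fftshift(np.fft.fft2(image))
    rows, cols = image.shape
    y, x = np.ogrid[:rows, :cols]
    # Keep only frequencies within `cutoff` pixels of the spectrum centre.
    mask = (y - rows / 2) ** 2 + (x - cols / 2) ** 2 <= cutoff ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# Smaller cutoff -> stronger smoothing (and more ringing for an ideal filter).
for cutoff in (10, 30, 60):
    filtered = ideal_lowpass(img, cutoff)
    print(cutoff, float(np.abs(img - filtered).mean()))
```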
Using Matlab, I processed stereo images and implemented a stereo matching function that creates the disparity map with the Sum of Absolute intensity Differences (SAD). I analyzed the results with different kernel sizes and with images taken at different baselines. I computed the depth of the scene and performed segmentation using the histogram of the distribution of grey values. Finally, I produced an anaglyph for 3D vision with colour-filter glasses, given the source images.
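A minimal numpy sketch of SAD-based block matching along the lines described above, assuming rectified grayscale left/right images as float arrays (the window size and disparity range are illustrative).

```python
import numpy as np

def sad_disparity(left, right, max_disp=32, win=5):
    """Block matching: for each pixel, pick the disparity that minimizes the SAD over a window."""
    rows, cols = left.shape
    half = win // 2
    disparity = np.zeros((rows, cols), dtype=np.float32)
    for r in range(half, rows - half):
        for c in range(half, cols - half):
            patch = left[r - half:r + half + 1, c - half:c + half + 1]
            best, best_d = np.inf, 0
            # Search candidate disparities along the same scanline (rectified images).
            for d in range(0, min(max_disp, c - half) + 1):
                cand = right[r - half:r + half + 1, c - d - half:c - d + half + 1]
                sad = np.abs(patch - cand).sum()
                if sad < best:
                    best, best_d = sad, d
            disparity[r, c] = best_d
    return disparity

# Toy usage with a synthetic horizontal shift; real inputs would be an actual stereo pair.
rng = np.random.default_rng(0)
left = rng.random((60, 80)).astype(np.float32)
right = np.roll(left, -4, axis=1)
print(sad_disparity(left, right, max_disp=8).mean())
```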
I estimated the financial risk of a portfolio of stocks including GSPC, IXIC, the return of crude oil and the return of treasury bonds. To do so, I calculated the Value at Risk (VaR) over a two-week interval with a 95% confidence level, together with the associated VaR confidence interval. After data analysis and cleaning, I approximated the distributions of the market factors using Kernel Density Estimation, and I used featurization to improve the performance of a simple linear regression model. Finally, I used Monte Carlo sampling and p-value testing to estimate the risk, and I backtested on historical data to validate the results.

I worked with a house prices dataset made of 81 features for 1200 entries (train data). I started by performing univariate analysis of the sale price and of the individual features (after splitting them into categorical, ordinal and numerical). I cleaned and imputed missing values, removed outliers, and encoded the ordinal and categorical features. To finalize the preprocessing, I performed feature engineering, normalization and scaling, and I selected the most relevant features using mutual information and the associated p-value.

I started predicting the sale price with a simple linear regression without regularization to get a performance baseline. Subsequently, I explored the effects of different regularizations (Lasso, Ridge and ElasticNet), using a grid search approach to find the best hyper-parameters. Moreover, I trained a more complex model, XGBoost, and fine-tuned its parameters using both grid search and Bayesian optimization. Finally, I evaluated two model ensembling approaches: averaging and stacking.
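A brief sketch of the regularization grid search mentioned above, using scikit-learn as an assumed implementation and random data in place of the preprocessed house-prices features (the parameter grids are illustrative).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import GridSearchCV

# Illustrative data standing in for the preprocessed feature matrix and target.
rng = np.random.default_rng(0)
X = rng.random((1200, 50))
y = X @ rng.random(50) + rng.normal(0, 0.1, 1200)

# One grid per regularized model; the values shown are placeholders.
searches = {
    "lasso": GridSearchCV(Lasso(max_iter=10000), {"alpha": [1e-3, 1e-2, 1e-1, 1.0]},
                          cv=5, scoring="neg_root_mean_squared_error"),
    "ridge": GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0, 100.0]},
                          cv=5, scoring="neg_root_mean_squared_error"),
    "elasticnet": GridSearchCV(ElasticNet(max_iter=10000),
                               {"alpha": [1e-3, 1e-2, 1e-1], "l1_ratio": [0.2, 0.5, 0.8]},
                               cv=5, scoring="neg_root_mean_squared_error"),
}

for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(-search.best_score_, 4))
```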
I initially loaded the MNIST dataset and trained a logistic regression model using tensorflow. Using tensorboard, I visualized the computational graph to get an overview of the model, and I monitored the accuracy and loss learning curves. Then, I implemented the LeNet-5 architecture and compared the performance obtained using standard gradient descent and Adam. Finally, I added dropout layers to the original architecture to improve the robustness of the model and its performance on test data.

I manually implemented a Multi Layer Perceptron using the sigmoid transfer function and the Mean Squared Error loss function. Initially, I wrote down the calculations needed to execute one forward and one backward step on some artificial values, in order to see how the weights are updated. Then, I used numpy to implement a vectorized version of the feedforward and backpropagation algorithms, and I added the corresponding methods to the NeuralNetwork class.

In the second part of the notebook, I loaded the MNIST dataset and defined a train method for the NeuralNetwork class. In order to be more flexible, I defined a general mini-batch gradient descent training loop, so that I could compare the performance of stochastic, batch and mini-batch gradient descent by changing the batch size (1 for stochastic; len(dataset) for batch; len(minibatch) for mini-batch). Furthermore, I tested the accuracy of neural networks with different hidden layer sizes, and I compared and explained the results. Finally, I changed the transfer function of the output layer to the softmax function and implemented the cross-entropy loss to improve the neural network performance.
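A compact numpy sketch of the vectorized feedforward and backpropagation steps for a one-hidden-layer MLP with sigmoid activations and MSE loss, in the spirit of the notebook (names and shapes are illustrative, not the project's NeuralNetwork class).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, lr = 784, 30, 10, 0.1

# Weight matrices and biases for a single hidden layer.
W1, b1 = rng.normal(0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.1, (n_hidden, n_out)), np.zeros(n_out)

def train_step(X, Y):
    """One mini-batch step of feedforward plus backpropagation with MSE loss."""
    global W1, b1, W2, b2
    # Feedforward.
    a1 = sigmoid(X @ W1 + b1)          # hidden activations
    a2 = sigmoid(a1 @ W2 + b2)         # output activations
    # Backpropagation of the MSE gradient through the sigmoids.
    delta2 = (a2 - Y) * a2 * (1 - a2)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
    # Gradient descent update, averaged over the mini-batch.
    m = len(X)
    W2 -= lr * a1.T @ delta2 / m; b2 -= lr * delta2.mean(axis=0)
    W1 -= lr * X.T @ delta1 / m;  b1 -= lr * delta1.mean(axis=0)
    return np.mean((a2 - Y) ** 2)

# Toy usage: random inputs and one-hot targets shaped like MNIST mini-batches.
X = rng.random((32, n_in))
Y = np.eye(n_out)[rng.integers(0, n_out, 32)]
print(train_step(X, Y))
```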
The aim of this introductory lab is to get familiar with Jupyter notebooks, Python and its modules (pandas, matplotlib, numpy). Finally, the notebook contains a presentation of PySpark and of how to interact with HDFS, together with two examples of distributed code: word count and an analysis of night flights.
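A minimal PySpark word count, the classic example mentioned above, assuming a plain-text file already stored on HDFS (the path is a placeholder).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

# Illustrative HDFS path; any plain-text file works.
lines = sc.textFile("hdfs:///data/sample.txt")

counts = (lines
          .flatMap(lambda line: line.lower().split())   # split lines into words
          .map(lambda word: (word, 1))                  # one (word, 1) pair per occurrence
          .reduceByKey(lambda a, b: a + b))             # sum counts per word

# Top 10 most frequent words, collected on the driver.
for word, count in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)
```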