Three movie recommendation systems using 1) K-Nearest Neighbour Algorithm, 2) Correlation Analysis and 3) Matrix Factorization.
Jupyter notebook
Python
SciPy library
NumPy
Scikit-learn
Pandas
Fuzzywuzzy
MovieLens Dataset: https://grouplens.org/datasets/movielens/100k/
This dataset consists of:
- 100,000 ratings (1-5) from 943 users on 1682 movies.
- Each user has rated at least 20 movies.
- Simple demographic info for the users (age, gender, occupation, zip)
- Movie information in the format MovieID::Title::Genres
- Titles are identical to titles provided by the IMDB (including year of release)
- Genres are pipe-separated and are selected from the following genres:
- Action
- Adventure
- Animation
- Children's
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
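As a minimal sketch of how one movie record could be read, assuming the `MovieID::Title::Genres` layout with pipe-separated genres described above: the function name `parse_movie_line` and the sample line are illustrative and are not taken from the notebook.

```python
def parse_movie_line(line):
    """Split one 'MovieID::Title::Genres' record into its three fields."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    return {
        "movie_id": int(movie_id),
        "title": title,                # includes the year of release
        "genres": genres.split("|"),   # genres are pipe-separated
    }


# Illustrative sample record (assumed, for demonstration only).
sample = "1::Toy Story (1995)::Animation|Children's|Comedy"
movie = parse_movie_line(sample)
print(movie)
```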
Because these recommender systems use unsupervised learning, we do not have the class labels needed to compute accuracy, precision, recall, and F-measure directly.
● Accuracy refers to the ratio of correct predictions made.
○ (# Correct Predictions) / (Total # of samples)
● Precision identifies the proportion of correct positive identifications
○ (True Positives) / (True Positives+False Positives)
● Recall is the proportion of actual positives identified correctly
○ (True Positives) / (True Positives+False Negatives)
● F1 measure is the harmonic mean of precision and recall
○ (2*Precision*Recall) / (Precision+Recall)
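If labelled data were available, the four formulas above could be computed as in the sketch below. The counts are hypothetical numbers chosen for illustration, not results from this project, and `classification_metrics` is an assumed helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 6 relevant movies recommended, 4 irrelevant ones
# recommended, 2 relevant ones missed, 8 irrelevant ones correctly left out.
metrics = classification_metrics(tp=6, fp=4, fn=2, tn=8)
print(metrics)
```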
We have explored three different, simple algorithms for collaborative filtering-based movie recommendation systems.
- Correlation Analysis identifies the relation between the ratings of a watched movie and those of every other movie in the system.
- KNN finds movies with the least distance (most similarity) to the watched movie.
- Matrix Factorization identifies movies that hold a similarity with the list of movies that the user has previously viewed, resulting in a more personalized list with more inter-list dissimilarity.
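The three approaches can be sketched on a tiny, made-up ratings table. This is an illustration under stated assumptions rather than the notebook's implementation: the ratings, the movie labels `A` to `D`, the cosine metric for KNN, the rank `k = 2`, and the use of `numpy.linalg.svd` as the factorization are all choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy ratings (rows: users, columns: movies); 0 means "not rated".
ratings = pd.DataFrame(
    [[5, 4, 1, 0],
     [4, 5, 1, 1],
     [1, 1, 5, 4],
     [0, 1, 4, 5],
     [5, 5, 0, 1]],
    index=["u1", "u2", "u3", "u4", "u5"],
    columns=["A", "B", "C", "D"],
    dtype=float,
)

# 1) Correlation analysis: Pearson correlation between the ratings of the
#    watched movie "A" and every other movie, ignoring missing ratings.
observed = ratings.replace(0, np.nan)
correlations = observed.corrwith(observed["A"]).drop("A")
most_correlated = correlations.sort_values(ascending=False).index[0]

# 2) KNN: each movie is a vector of user ratings; the nearest neighbours
#    under cosine distance are the most similar movies.
item_vectors = ratings.T
knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(item_vectors.to_numpy())
_, indices = knn.kneighbors(item_vectors.loc[["A"]].to_numpy(), n_neighbors=3)
knn_neighbours = [item_vectors.index[i] for i in indices[0]
                  if item_vectors.index[i] != "A"]

# 3) Matrix factorization: a rank-k SVD approximation fills in scores for
#    unrated movies. Treating 0 as a real rating is a simplification.
U, s, Vt = np.linalg.svd(ratings.to_numpy(), full_matrices=False)
k = 2
predicted = pd.DataFrame((U[:, :k] * s[:k]) @ Vt[:k, :],
                         index=ratings.index, columns=ratings.columns)


def recommend_for_user(user, n=2):
    """Return up to n unrated movies with the highest predicted score."""
    unseen = ratings.columns[ratings.loc[user] == 0]
    ranked = predicted.loc[user, unseen].sort_values(ascending=False)
    return ranked.head(n).index.tolist()


print(most_correlated, knn_neighbours, recommend_for_user("u1"))
```

Note that the first two methods start from a movie, while the third starts from a user, which is why its output is the more personalized one.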
Although none of the algorithms yields perfect results, they can be combined into a more complex recommender system through ensemble learning, which incorporates all of these algorithms into a single model with potentially higher performance. One such possible system could be:
- KNN takes a watched movie as input and outputs a list of related movies.
- Matrix Factorization takes a user as input and outputs a list of movies similar to what the user has viewed in the past.
- On a streaming platform, a user with an account ID who has previously watched a movie can be recommended a list made up mostly of the movies suggested by both algorithms. This would likely yield higher accuracy, as it satisfies the requirements of both movie similarity and personalization.
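One way such a combination could look is sketched below. It assumes the two models have already produced ranked lists; movies suggested by both come first, and single-model suggestions only fill the remaining slots. The function name `combine_recommendations` and the placeholder movie labels are hypothetical.

```python
def combine_recommendations(knn_movies, mf_movies, n=5):
    """Return up to n movies, preferring those suggested by both models."""
    mf_set = set(mf_movies)
    # Movies on both lists, kept in the KNN ranking order.
    combined = [m for m in knn_movies if m in mf_set]
    seen = set(combined)
    # Fall back to single-model suggestions if the overlap is too small.
    for movie in knn_movies + mf_movies:
        if movie not in seen:
            combined.append(movie)
            seen.add(movie)
    return combined[:n]


# Hypothetical model outputs for one user and one watched movie.
knn_output = ["A", "B", "C", "D"]
mf_output = ["C", "E", "B"]
final_list = combine_recommendations(knn_output, mf_output, n=4)
print(final_list)  # ['B', 'C', 'A', 'D']
```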
Another possible system could feed the output of the KNN model into the Matrix Factorization (SVD) model, which would further filter the recommendations based on the user’s preferences.
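A rough sketch of this cascade, under the assumption that the KNN stage returns a candidate list and that a rank-2 SVD approximation of the ratings matrix stands in for the user's preferences. The ratings, the movie labels, the candidate list and the helper `rerank_candidates` are all made up for illustration.

```python
import numpy as np

# Hypothetical ratings (rows: users, columns: movies A to E); 0 = not rated.
movies = ["A", "B", "C", "D", "E"]
ratings = np.array([[5, 4, 0, 1, 0],
                    [4, 5, 1, 0, 1],
                    [1, 0, 5, 4, 5],
                    [0, 1, 4, 5, 4]], dtype=float)

# Rank-2 SVD approximation gives a predicted score for every user/movie pair.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2
predicted = (U[:, :k] * s[:k]) @ Vt[:k, :]


def rerank_candidates(candidates, user_index):
    """Order KNN candidates by the user's predicted score, highest first."""
    scores = {m: predicted[user_index, movies.index(m)] for m in candidates}
    return sorted(candidates, key=lambda m: scores[m], reverse=True)


# Assumed KNN output for a watched movie; re-ranked for the first user.
knn_candidates = ["C", "B", "E"]
reranked = rerank_candidates(knn_candidates, user_index=0)
print(reranked)
```

A cut-off on the re-ranked list (for example keeping only the top few entries) would turn the re-ranking into the filtering step described above.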
Sanjana Rai
Naveena Koneru
https://www.github.com/SaurusXI/Movie-Recommender
https://www.ee.columbia.edu/~cylin/course/bigdata/EECS6893-BigDataAnalytics-Lecture4.pdf
https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial
https://blog.imarticus.org/data-analytics-popular-algorithms-explained/
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/
https://towardsdatascience.com/evaluation-metrics-for-recommender-systems-df56c6611093
https://grouplens.org/datasets/movielens/
https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb