Project for the Insight Data Engineering Fellowship (2018B)
The purpose of this project is to build a recommendation system based on the ratings users have given in the past. The core algorithm is implemented in Spark.
The project started with a subset of the Yelp dataset and was then scaled up to the full dataset, which contains 5.2 million review records. Finally, I generated a simulated dataset 4 times the size of the original to train the recommendation model at larger scale.
The recommendation model used in this project is item-based collaborative filtering, which calculates a similarity matrix between every pair of items from the ratings given by users who rated both items. The similarity formula is as follows:
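The original formula is not reproduced here, but a common choice for item-based collaborative filtering, consistent with the description above, is cosine similarity computed over the users two items have in common. A minimal sketch (plain Python rather than the project's Spark code; all names are illustrative):

```python
from collections import defaultdict
from math import sqrt

def item_similarity(ratings):
    """Cosine similarity between every pair of items, using only the
    ratings from users who rated both items.

    ratings: list of (user_id, item_id, rating) tuples.
    Returns: dict mapping (item_a, item_b) -> similarity score.
    """
    # Group ratings by item: item -> {user: rating}
    by_item = defaultdict(dict)
    for user, item, rating in ratings:
        by_item[item][user] = rating

    sims = {}
    items = sorted(by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            # Users who rated both item a and item b
            common = set(by_item[a]) & set(by_item[b])
            if not common:
                continue
            dot = sum(by_item[a][u] * by_item[b][u] for u in common)
            norm_a = sqrt(sum(by_item[a][u] ** 2 for u in common))
            norm_b = sqrt(sum(by_item[b][u] ** 2 for u in common))
            sims[(a, b)] = sims[(b, a)] = dot / (norm_a * norm_b)
    return sims
```

In the actual project this pairwise computation would be expressed as Spark transformations over the review records rather than nested Python loops.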
With the similarity matrix, we can predict a user's rating for an item using the following formula:
A simple example follows to illustrate the algorithm.
To optimize performance, the user_id and item_id columns were converted from String to Integer.
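The prediction formula itself is not reproduced above; a standard form for item-based CF, assumed here, is a similarity-weighted average of the user's ratings on other items. A minimal sketch (names are illustrative):

```python
def predict_rating(user_ratings, sims, target_item):
    """Predict a user's rating for target_item as the similarity-weighted
    average of that user's existing ratings.

    user_ratings: dict item -> rating, for a single user.
    sims: dict (item_a, item_b) -> similarity score.
    Returns the predicted rating, or None if no similar item was rated.
    """
    numerator = denominator = 0.0
    for item, rating in user_ratings.items():
        s = sims.get((target_item, item), 0.0)
        numerator += s * rating
        denominator += abs(s)
    return numerator / denominator if denominator else None
```

For example, if a user rated two items that have similarities 1.0 and 0.5 to the target item, the prediction weights the first rating twice as heavily as the second.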
Increased the value of "spark.memory.fraction" and decreased the value of "spark.memory.storageFraction", leaving more memory for shuffle writes.
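These are real Spark configuration properties and can be set at submission time; the values below are illustrative assumptions, not the ones used in the project:

```shell
# spark.memory.fraction: share of heap for execution + storage (default 0.6).
# spark.memory.storageFraction: share of that region reserved for cached
# data (default 0.5); lowering it frees memory for shuffle/execution.
spark-submit \
  --conf spark.memory.fraction=0.8 \
  --conf spark.memory.storageFraction=0.3 \
  train_model.py   # hypothetical driver script name
```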
Instead of joining the big "review" table with the small "avg_star" table twice, the "avg_star" table is broadcast first and saved as a variable, and then used in two map-side joins, which saves time. To be specific, a full copy of the small table is held on each executor, while the large table is left untouched in its existing partitions, so nothing is shuffled. Each partition of the big table is then scanned linearly and joined on its keys against the in-memory copy of the small table.
This project implements a large-scale recommendation system. The data skew problem was solved with a broadcast variable and map-side joins, which also avoid shuffling. Even after more data was simulated, the business recommendation system still handled the load, with the model training step taking 1.6 hours.
This project was made by Xiaojin (Ruby) Liu. If you have any questions, please feel free to contact me by email: [email protected]