Project for the Insight Data Engineering Fellowship (2018B)
The purpose of this project is to build a recommendation system based on the ratings users have given in the past. The core algorithm is implemented in Spark.
The project started with a subset of the Yelp dataset and was then scaled up to the full dataset, which contains 5.2 million review records. Finally, I generated a simulated dataset 4 times the size of the original to train the recommendation model at larger scale.
The recommendation model used in this project is item-based collaborative filtering, which calculates a similarity matrix between every pair of items from the ratings given by users who rated both items. The similarity formula is as follows:
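The original formula is not reproduced here, but a common choice for item-based collaborative filtering, consistent with the description above, is cosine similarity computed over the users two items have in common. A minimal sketch (plain Python rather than the project's Spark code; all names are illustrative):

```python
from collections import defaultdict
from math import sqrt

def item_similarity(ratings):
    """Cosine similarity between every pair of items, using only the
    ratings from users who rated both items.

    ratings: list of (user_id, item_id, rating) tuples.
    Returns: dict mapping (item_a, item_b) -> similarity score.
    """
    # Group ratings by item: item -> {user: rating}
    by_item = defaultdict(dict)
    for user, item, rating in ratings:
        by_item[item][user] = rating

    sims = {}
    items = sorted(by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            # Users who rated both item a and item b
            common = set(by_item[a]) & set(by_item[b])
            if not common:
                continue
            dot = sum(by_item[a][u] * by_item[b][u] for u in common)
            norm_a = sqrt(sum(by_item[a][u] ** 2 for u in common))
            norm_b = sqrt(sum(by_item[b][u] ** 2 for u in common))
            sims[(a, b)] = sims[(b, a)] = dot / (norm_a * norm_b)
    return sims
```

In the actual project this pairwise computation would be expressed as Spark transformations over the review records rather than nested Python loops.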
With the similarity matrix, we can predict a user's rating for an item using the following formula:
A simple example follows to illustrate the algorithm.
To optimize performance, the user_id and item_id columns were converted from String to Integer.
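The prediction formula itself is not reproduced above; a standard form for item-based CF, assumed here, is a similarity-weighted average of the user's ratings on other items. A minimal sketch (names are illustrative):

```python
def predict_rating(user_ratings, sims, target_item):
    """Predict a user's rating for target_item as the similarity-weighted
    average of that user's existing ratings.

    user_ratings: dict item -> rating, for a single user.
    sims: dict (item_a, item_b) -> similarity score.
    Returns the predicted rating, or None if no similar item was rated.
    """
    numerator = denominator = 0.0
    for item, rating in user_ratings.items():
        s = sims.get((target_item, item), 0.0)
        numerator += s * rating
        denominator += abs(s)
    return numerator / denominator if denominator else None
```

For example, if a user rated two items that have similarities 1.0 and 0.5 to the target item, the prediction weights the first rating twice as heavily as the second.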
Increased the value of "spark.memory.fraction" and decreased the value of "spark.memory.storageFraction", leaving more memory for shuffle writes.
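These are real Spark configuration properties and can be set at submission time; the values below are illustrative assumptions, not the ones used in the project:

```shell
# spark.memory.fraction: share of heap for execution + storage (default 0.6).
# spark.memory.storageFraction: share of that region reserved for cached
# data (default 0.5); lowering it frees memory for shuffle/execution.
spark-submit \
  --conf spark.memory.fraction=0.8 \
  --conf spark.memory.storageFraction=0.3 \
  train_model.py   # hypothetical driver script name
```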
Instead of joining the big "review" table with the small "avg_star" table twice, the "avg_star" table is broadcast first and saved as a variable, and then used in two map-side joins, which saves time. To be specific, a full copy of the small table is held on each executor, while the large table is left untouched in its existing partitions, so nothing is shuffled. Each partition of the big table is then scanned linearly and joined on its keys against the in-memory copy of the small table.
This project implements a large-scale recommendation system. The data skew problem was solved with a broadcast variable and map-side joins, which also avoid shuffling. Even after more data was simulated, the business recommendation system still handled the load, with the model training step taking 1.6 hours.
This project was made by Xiaojin (Ruby) Liu. If you have any questions, please feel free to contact me by email: [email protected]