Skip to content

Xiaojin1215/RecommendationSystemOpt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Recommendation System Optimization

Project of Insight Data Engineering Fellow 2018B

Table of Contents

  1. Introduction
  2. Data
  3. Algorithm
  4. Spark Tuning
  5. Data Skew Solution
  6. Pipeline
  7. Performance

Introduction

The purpose of this project is to build a recommendation system according to the rates what users gave before. The core algorithm was implemented by Spark.

Data

The project started with a subset of Yelp dataset. After that, scale up to the whole dataset, which contains 5.2 million review records. Finally, I simulated 4 times of this dataset to train the recommendation system model.

Algorithm

The recommendation system model used in this project is item-based collaborative filtering, which calculate the similarity matrix between every two items according to the rates given by common users of those two items. The formular of similarity matrix calculation shown as follow:

With the similartiy matrix, we can predict the rates of a user to an item by using the formular as follow:
A simple example is shown as follow to understand the algorithm better:

Data Engineering Challenge and Optimization

Data Preprocessing

Change the user_id and item_id from String to Integer.

Spark Tuning

Increase the value of "Spark.memory.fraction", decreased the value of "Spark.memory.storageFraction". Save memory for shuffle write.

Data Skew Solution

Instead of joining the big table "review" with the small table "avg_star" twice, broadcast the "avg_star" first, and saved it as a variable. Using "map-side join" twice to save time. To be specific, save the small table on each executor, and leave the large table untouched, doesn't shuffle anything. So the small table is saved on executors and we can do a linear scan through all the small table to the big partitions and then you can join on the keys.

Pipeline

Performance

Conclusion

A large-scale recommendation system is implemented in this project. Data skew problem was solved by using broadcast and map-side join, which also avoid the shuffle. After simulated more data, the business recommendation system can still handle it, model training part cost 1.6 hours.

Author

This project was made by Xiaojin(Ruby)Liu. If you have any questions, please feel free ton contact me through email: [email protected]

About

Project of Insight Data Engineering Fellow 2018B

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published