🎬 Utilizing PySpark's Alternating Least Squares (ALS) algorithm for model-based movie recommendations.

ShanSabri/PySpark-Movie-Recommender

PySpark-Movie-Recommender

For any given user, we would like to use their movie ratings, in combination with all the existing user ratings, to determine which movies they might prefer. For example, if a user rates Annie Hall and The Purple Rose of Cairo highly (both Woody Allen movies, although our database has no director information), can we infer from other users' ratings that they might also like Zelig (another Woody Allen movie)? Such inferences might also capture affinities for an actor, director, genre, and so on.

Motivation

  • Familiarize myself with Apache Spark via PySpark for big data applications
  • Familiarize myself with training and tuning an Alternating Least Squares (ALS) model for movie recommendations
  • Netflix Prize

Input

User movie ratings:

ratings_raw_RDD = sc.textFile('data/ratings.csv')
# Parse each CSV line into a (user_id, movie_id, rating) tuple.
ratings_RDD = (ratings_raw_RDD
               .map(lambda line: line.split(","))
               .map(lambda tokens: (int(tokens[0]), int(tokens[1]), float(tokens[2]))))
ratings_RDD.take(3) # [(user_id, movie_id, rating)]
[(1, 31, 2.5), (1, 1029, 3.0), (1, 1061, 3.0)]

New user movie ratings:

new_user = [
     (0,100,4), # City Hall (1996)
     (0,237,1), # Forget Paris (1995)
     (0,44,4),  # Mortal Kombat (1995)
     (0,25,5),  # etc....
     (0,456,3),
     (0,849,3),
     (0,778,2),
     (0,909,3),
     (0,478,5),
     (0,248,4)
    ]
new_user_RDD = sc.parallelize(new_user)
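
The new user's ratings are combined with the existing ratings before the ALS model is fit. As a rough illustration of the "alternating" idea only, here is a toy rank-1 version in plain Python. It is not Spark's implementation and not the code used in this repository. The function `als_rank1`, its `reg` parameter, and the tiny `existing` and `new_user` lists are all hypothetical.

```python
def als_rank1(ratings, n_iters=10, reg=0.1):
    """Toy rank-1 ALS: one latent factor per user and per movie.

    ratings: list of (user_id, movie_id, rating) tuples.
    Returns (user_factors, movie_factors) as dicts.
    """
    users = {u for u, _, _ in ratings}
    movies = {m for _, m, _ in ratings}
    x = {u: 1.0 for u in users}   # user factors
    y = {m: 1.0 for m in movies}  # movie factors
    for _ in range(n_iters):
        # Hold movie factors fixed; solve regularized least squares per user.
        for u in users:
            rated = [(m, r) for uu, m, r in ratings if uu == u]
            x[u] = sum(r * y[m] for m, r in rated) / (reg + sum(y[m] ** 2 for m, _ in rated))
        # Hold user factors fixed; solve regularized least squares per movie.
        for m in movies:
            rated = [(u, r) for u, mm, r in ratings if mm == m]
            y[m] = sum(r * x[u] for u, r in rated) / (reg + sum(x[u] ** 2 for u, _ in rated))
    return x, y

# Made-up ratings for illustration (not taken from data/ratings.csv).
existing = [(1, 10, 2.0), (1, 20, 1.0), (2, 10, 4.0), (2, 20, 2.0)]
new_user = [(0, 10, 3.0), (0, 20, 1.5)]

# Near-zero regularization so the toy data is reproduced almost exactly.
x, y = als_rank1(existing + new_user, reg=1e-9)
print(x[0] * y[10])  # predicted rating of movie 10 for user 0
```

A real model uses several factors per user and movie, and Spark distributes the per-user and per-movie solves across the cluster.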

Movie lookup (to map movie_id to movie_title):

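movies_raw_RDD = sc.textFile('data/movies.csv')
# Parse each CSV line into a (movie_id, title) tuple.
# Note: a plain split(",") truncates quoted titles that contain commas.
movies_RDD = (movies_raw_RDD
              .map(lambda line: line.split(","))
              .map(lambda tokens: (int(tokens[0]), tokens[1])))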
movies_RDD.take(3)
[(1, u'Toy Story (1995)'), (2, u'Jumanji (1995)'), (3, u'Grumpier Old Men (1995)')] 
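
Splitting on every comma cuts off titles that contain a comma themselves, which is why entries such as `"Broken Hearts Club` and `"Goat` appear truncated in the output below. A minimal sketch of a quote-aware parser using Python's standard `csv` module follows. The helper name `parse_movie` and the second sample line are hypothetical.

```python
import csv

def parse_movie(line):
    """Parse one movies.csv line into (movie_id, title), honoring quoted fields."""
    tokens = next(csv.reader([line]))
    return int(tokens[0]), tokens[1]

print(parse_movie('1,Toy Story (1995),Adventure|Animation'))
# Hypothetical line with a quoted title that contains a comma:
print(parse_movie('11,"American President, The (1995)",Comedy|Drama'))
```

Such a function could be passed straight to the RDD, e.g. `movies_raw_RDD.map(parse_movie)`, in place of the two chained lambdas.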

Output

Top recommendations for new user with predicted ratings:

# +--------------------+--------------------+--------------------+
# |               movie|              rating|       scaled_rating|
# +--------------------+--------------------+--------------------+
# | Hear My Song (1991)| [6.768762414140875]|               [5.0]|
# |    Novocaine (2001)| [6.082847646559083]| [4.692583046687709]|
# |    Let It Be (1970)| [5.960487606934326]| [4.637743068943494]|
# | "Broken Hearts Club| [5.607092763004114]| [4.479356675514017]|
# |Evangelion: 1.0 Y...| [5.519627831466998]|[4.4401561741607605]|
# |Six-String Samura...| [5.486386537669752]| [4.425257913113933]|
# |         Cops (1922)|  [5.46740481439733]| [4.416750582738764]|
# |               "Goat|  [5.46740481439733]| [4.416750582738764]|
# |Land of Silence a...|  [5.46740481439733]| [4.416750582738764]|
# |         "Play House|  [5.46740481439733]| [4.416750582738764]|
# |  Dersu Uzala (1975)| [5.441536583883281]| [4.405156820673771]|
# |                "Now|[5.4369177354878655]| [4.403086720467973]|
# |    The Witch (2015)|  [5.37433033977239]| [4.375035966327726]|
# |              "Norte|  [5.35937346933437]|[4.3683325160472295]|
# |"Secret in Their ...| [5.358001407621149]|   [4.3677175780818]|
# |Book of Shadows: ...| [5.354853467836861]| [4.366306717573419]|
# |   Angel Baby (1995)| [5.351015314453442]| [4.364586513438384]|
# |       Gabbeh (1996)| [5.351015314453442]| [4.364586513438384]|
# |Picture Bride (Bi...| [5.351015314453442]| [4.364586513438384]|
# |King Kong vs. God...|[5.3293427074850825]|  [4.35487316839985]|
# +--------------------+--------------------+--------------------+
# only showing top 20 rows

Notice that our top-ranked movies have predicted ratings higher than 5. This makes sense: ALS predicts a rating as the dot product of a user's and a movie's latent factor vectors, so nothing caps the prediction at 5, and one can imagine that certain combinations of factors would combine to create “better than anything you’ve seen yet” ratings.
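
To make the missing ceiling concrete, here is a small sketch of how a factorization model scores a single user and movie pair. The factor values are made up for illustration and the helper `predict` is hypothetical. A real model learns these vectors from the ratings.

```python
def predict(user_factors, movie_factors):
    """Predicted rating = dot product of the two latent factor vectors."""
    return sum(u * m for u, m in zip(user_factors, movie_factors))

# Hypothetical learned factors; nothing restricts their product to the 1-5 scale.
user = [2.0, 1.5, 0.5]
movie = [2.0, 1.0, 1.0]
print(predict(user, movie))  # 6.0, above the maximum observed rating of 5
```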

Nevertheless, we rescale the predicted ratings to the range 1-5 with MinMaxScaler, which yields the scaled_rating column shown above.
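
Min-max scaling maps the smallest prediction to 1 and the largest to 5, with everything else placed linearly in between. The plain-Python sketch below shows the arithmetic. It stands in for Spark's `MinMaxScaler`, which works on vector columns (hence the bracketed values in the table). The function name `min_max_scale` and the sample values are assumptions for illustration.

```python
def min_max_scale(values, lo=1.0, hi=5.0):
    """Linearly rescale values so that min(values) -> lo and max(values) -> hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Degenerate case: all values equal, so return the midpoint of the range.
        return [(lo + hi) / 2.0 for _ in values]
    return [(v - v_min) / (v_max - v_min) * (hi - lo) + lo for v in values]

# Made-up raw predictions, not taken from the table above.
print(min_max_scale([-2.0, 2.0, 6.0]))  # [1.0, 3.0, 5.0]
```

Because the scaling depends on the minimum and maximum over all predictions, a scaled rating is only meaningful relative to the other predictions in the same batch.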

License

MIT
