Find similar subreddits using MinHash and SimRank algorithms.
Done as term project for CS425 Web-Scale Data course.
I implemented a graph visualization tool for results, using Cytoscape.js library.
You can check it out here.
Some examples:
Similar subs to r/Python
Similar subs to r/DotA2
Similar subs to r/canada
Similar subs to r/soccer
I used MinHash algorithm to find number of common users of each subreddit.
Jaccard Similarity can be calculated using MinHash and it gives accurate and fast results. Number of common users can be calculated using Jaccard Similarity and number of users of each subreddit.
This was needed to preprocess user information which was the bottleneck of SimRank algorithm (~3M users) and ran SimRank on subreddits only.
SimRank algorithm is an extension of PageRank algorithm to find similarity in a graph.
Random walk with restarts approach gives pretty accurate similarity results.
After processing the dataset using MinHash, I ran SimRank on a graph where nodes represent subreddits and weighted edges represent number of common users if exists, like below.
Ran crawler.py
on Heroku for 3 weeks and written the data in an Amazon RDS database.
Dataset contains,
- 8300 different subreddits
- 3 million unique active users
- 10 million comments
Used PRAW library, a Python Reddit API wrapper.