Reddit Analyzed

Find similar subreddits using MinHash and SimRank algorithms.

Done as term project for CS425 Web-Scale Data course.

Results

I implemented a graph visualization tool for results, using Cytoscape.js library.

You can check it out here.

Some examples:

Similar subs to r/Python

Similar subs to r/DotA2

Similar subs to r/canada

Similar subs to r/soccer

Algorithms Used

MinHash

I used MinHash algorithm to find number of common users of each subreddit.

Jaccard Similarity can be calculated using MinHash and it gives accurate and fast results. Number of common users can be calculated using Jaccard Similarity and number of users of each subreddit.

This was needed to preprocess user information which was the bottleneck of SimRank algorithm (~3M users) and ran SimRank on subreddits only.

SimRank

SimRank algorithm is an extension of PageRank algorithm to find similarity in a graph.

Random walk with restarts approach gives pretty accurate similarity results.

After processing the dataset using MinHash, I ran SimRank on a graph where nodes represent subreddits and weighted edges represent number of common users if exists, like below.

Collecting Data

Ran crawler.py on Heroku for 3 weeks and written the data in an Amazon RDS database.

Dataset contains,

8300 different subreddits
3 million unique active users
10 million comments

Used PRAW library, a Python Reddit API wrapper.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
assets/img		assets/img
visualization-tool		visualization-tool
README.md		README.md
config.py		config.py
crawler.py		crawler.py
database.py		database.py
minhash.py		minhash.py
simrank.py		simrank.py
simrank_v2.py		simrank_v2.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reddit Analyzed

Results

Algorithms Used

MinHash

SimRank

Collecting Data

About

Releases

Packages

Languages

ahmetcandiroglu/reddit-analyzed

Folders and files

Latest commit

History

Repository files navigation

Reddit Analyzed

Results

Algorithms Used

MinHash

SimRank

Collecting Data

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages