Predict Subreddit

An NLP model that predicts the subreddit of a post from its title.

Play with it on HuggingFace Space

Post on r/MachineLearning

Data Collection

The model was trained using the titles of the top 1000 posts from the top 250 subreddits scraped using PRAW.

Dataset hosted on HuggingFace

Steps to create the dataset using dataset.py script:

Install the requirements using pip install -r requirements.txt
Create a .env file containing your Reddit authentication info, like this:

ID = <YOUR_ID>
SECRET = <YOUR_SECRET>
AGENT = <YOUR_AGENT>
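As a rough illustration of how these three values might be consumed, here is a minimal sketch. It is not taken from dataset.py: the helper names `parse_env`, `make_reddit` and `top_titles` are hypothetical, and the repo's script may load the file differently (for example with a dotenv library). The PRAW calls used (`praw.Reddit` with `client_id`, `client_secret`, `user_agent`, and `subreddit(...).top(limit=...)`) are part of PRAW's public API.

```python
def parse_env(text):
    """Parse KEY = VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def make_reddit(env):
    """Build a read-only PRAW client from the parsed credentials."""
    import praw  # third-party; imported lazily so parse_env works without it

    return praw.Reddit(
        client_id=env["ID"],
        client_secret=env["SECRET"],
        user_agent=env["AGENT"],
    )


def top_titles(reddit, subreddit_name, limit=1000):
    """Yield (title, subreddit) pairs for a subreddit's top posts."""
    for post in reddit.subreddit(subreddit_name).top(limit=limit):
        yield post.title, subreddit_name
```

Only `parse_env` runs offline; `make_reddit` and `top_titles` need PRAW installed, valid credentials and network access.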

Now run the script to create the dataset, like this:

python3 dataset.py <npage> <dfilename>

npage is the number of pages of top subreddits to scrape from redditlist.com (1 page => 125 subs, so 2 pages cover the top 250), and dfilename is the name of the CSV file to save the dataset to.

After the above steps are run, a CSV file will be created under the given filename, consisting of title and subreddit pairs.
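A quick sanity check on the generated file is to count how many titles each subreddit contributed. The sketch below assumes the CSV has a header row with columns named `title` and `subreddit`; those column names are an assumption and should be adjusted to whatever the script actually writes. `label_counts` is a hypothetical helper.

```python
import csv
from collections import Counter


def label_counts(lines):
    """Count rows per subreddit from an iterable of CSV lines.

    Assumes a header row containing a 'subreddit' column (an assumption,
    not confirmed by the repo).
    """
    reader = csv.DictReader(lines)
    return Counter(row["subreddit"] for row in reader)
```

With a real file this could be called as `label_counts(open("<dfilename>", newline="", encoding="utf-8"))`; a subreddit with far fewer rows than the others would be worth a second look before training.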

Modelling

HuggingFace Transformers' DistilBERT is fine-tuned on the dataset of post titles labelled with their respective subreddits.

For the steps to build the model, check out the model notebook in the repo or open it in Colab.
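Fine-tuning a classifier of this kind needs each subreddit name mapped to an integer label. The following is a minimal sketch of that step, not the notebook's actual code: `build_label_maps` and `load_classifier` are hypothetical helpers, and the `distilbert-base-uncased` checkpoint is an assumption, since the README only says "DistilBERT".

```python
def build_label_maps(subreddits):
    """Map each distinct subreddit name to an integer id and back."""
    names = sorted(set(subreddits))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def load_classifier(label2id, id2label, checkpoint="distilbert-base-uncased"):
    """Load DistilBERT with a classification head sized to the label set.

    The checkpoint name is an assumption; the notebook may use another one.
    """
    # third-party; imported lazily so build_label_maps works without it
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
    )
```

Passing both mappings to `from_pretrained` stores them in the model config, so predictions can be reported as subreddit names rather than raw label indices.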

Model hosted on HuggingFace

Examples

Limitations and bias

Since the model is trained on the top 250 subreddits (for reference), it can only categorise posts within those subreddits.
Some subreddits have a specific format for their post titles, like r/todayilearned, where titles start with "TIL", so the model becomes biased towards "TIL" --> r/todayilearned. This bias can be reduced by cleaning the dataset of these specific terms.
In some subreddits, like r/gifs, the title of the post doesn't matter much, so the model performs poorly on them.
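The cleaning idea mentioned above could look something like the sketch below, which strips a leading "TIL" marker from a title before training. It is only an illustration and not part of the repo: `strip_til` and the exact pattern are hypothetical, and other subreddits would need their own patterns.

```python
import re

# Leading "TIL", optionally followed by punctuation and the word "that".
TIL_PREFIX = re.compile(r"^\s*TIL\b[\s:,\-]*(?:that\s+)?", re.IGNORECASE)


def strip_til(title):
    """Remove a leading 'TIL' marker so it cannot act as a label shortcut."""
    return TIL_PREFIX.sub("", title, count=1)
```

The word boundary after "TIL" keeps ordinary words such as "Tilting" intact, so only the formatting marker is removed.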

Contributing

If you want to contribute code, simply create a pull request. If you have an idea, create an issue and the developers will look into it!

