An NLP model that predicts the subreddit of a post based on its title.
Play with it on HuggingFace Space
Post on r/MachineLearning
The model was trained using the titles of the top 1000 posts from the top 250 subreddits scraped using PRAW.
Dataset hosted on HuggingFace
Steps to create the dataset using the dataset.py script:
- Make sure to install the requirements using
pip install -r requirements.txt
- Create a .env file containing your Reddit authentication info, like this
ID = <YOUR_ID>
SECRET = <YOUR_SECRET>
AGENT = <YOUR_AGENT>
- Now run the script to create the dataset like this
python3 dataset.py <npage> <dfilename>
npage is the number of pages of top subreddits to scrape from redditlist.com (1 page => 125 subs), and dfilename is the name of the CSV file to save the dataset to.
- After the above steps are run, a CSV file will be created under the given filename, consisting of title and subreddit pairs.
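The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of dataset.py: the helper names `read_env`, `fetch_titles` and `write_pairs` are hypothetical, the .env file is parsed by hand rather than with whatever loader the script really uses, and the scraping of redditlist.com for the list of top subreddits is left out (the subreddit names are passed in directly).

```python
import csv
from pathlib import Path


def read_env(path=".env"):
    """Parse KEY = VALUE lines from a .env file into a dict (hypothetical helper)."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY = VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def fetch_titles(env, subreddits, limit=1000):
    """Yield (title, subreddit) pairs for the top posts of each subreddit."""
    # Imported here so the two file helpers work without PRAW installed.
    import praw

    reddit = praw.Reddit(
        client_id=env["ID"],
        client_secret=env["SECRET"],
        user_agent=env["AGENT"],
    )
    for name in subreddits:
        for submission in reddit.subreddit(name).top(limit=limit):
            yield submission.title, name


def write_pairs(pairs, filename):
    """Write (title, subreddit) pairs to a CSV file with a header row."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "subreddit"])
        writer.writerows(pairs)
```

With valid credentials in .env, something like `write_pairs(fetch_titles(read_env(), ["todayilearned", "gifs"]), "dataset.csv")` would produce a small version of the dataset. The column names `title` and `subreddit` are an assumption based on the description above.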
HuggingFace Transformers' DistilBERT is fine-tuned on the dataset of post titles labelled with their respective subreddits.
For the steps to build the model, check out the model notebook in the repo or open it in Colab.
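One preparatory step that fine-tuning a classifier on this data generally needs is turning subreddit names into integer class labels. The sketch below shows one way to do that. The helper name `build_label_maps` and the sample rows are made up for illustration, and the notebook may handle labels differently.

```python
def build_label_maps(subreddits):
    """Map each distinct subreddit name to an integer id, and back again."""
    names = sorted(set(subreddits))  # sorted so the ids are reproducible
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


# Made-up (title, subreddit) rows standing in for the real dataset.
rows = [
    ("TIL honey never spoils", "todayilearned"),
    ("My cat learned to open doors", "aww"),
    ("TIL octopuses have three hearts", "todayilearned"),
]

label2id, id2label = build_label_maps(sub for _, sub in rows)
labels = [label2id[sub] for _, sub in rows]
```

Transformers model configurations accept `id2label` and `label2id` mappings of this shape, so predictions can be reported as subreddit names rather than bare indices.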
Model hosted on HuggingFace
- Since the model is trained on the top 250 subreddits (for reference), it can only categorise posts within those subreddits.
- Some subreddits have a specific format for their post titles, like r/todayilearned, where post titles start with "TIL", so the model becomes biased towards "TIL" --> r/todayilearned. This bias can be reduced by cleaning the dataset of these specific terms.
- In some subreddits, like r/gifs, the title of the post doesn't matter much, so the model performs poorly on them.
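The dataset cleaning mentioned above for subreddit-specific terms could look something like this sketch, which strips a leading "TIL" marker from a title before training. The regular expression and the function name `strip_format_marker` are assumptions for illustration. Other subreddits would need their own patterns, and this has not been evaluated against the actual model.

```python
import re

# Hypothetical pattern: a leading "TIL", any following separators, and an optional "that".
TIL_PREFIX = re.compile(r"^\s*TIL\b[\s:,-]*(?:that\s+)?", flags=re.IGNORECASE)


def strip_format_marker(title):
    """Remove a leading "TIL" marker; keep the title as-is if nothing else would remain."""
    cleaned = TIL_PREFIX.sub("", title, count=1)
    return cleaned if cleaned else title
```

For example, `strip_format_marker("TIL that honey never spoils")` gives `"honey never spoils"`, while a title such as `"Tilting at windmills"` is left untouched because the pattern requires "TIL" to be a whole word.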
If you want to contribute code, simply create a pull request. If you have an idea, create an issue and the developers will look into it!