Downloading the dataset:
- Download the dataset from Kaggle (Toxic Comment Classification): https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
- The root directory should contain the data folder with the dataset from Kaggle.
- Install the modules listed in requirements.txt.
- The data folder needs to be put into the source folder extracted from source.zip.
- For project.ipynb to be able to display images, the Images folder needs to be downloaded from GitLab via this link: https://csil-git1.cs.surrey.sfu.ca/krutp/nlpclass-1197-g-lexchunkers/-/archive/master/nlpclass-1197-g-lexchunkers-master.zip?path=project%2FImages
- The Images folder should be placed inside the source folder extracted from source.zip.
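The layout described above can be checked before running anything. The following is a minimal sketch, assuming it is run from inside the extracted source folder. The helper name `missing_paths` is hypothetical, and the expected entries are only the two folders the steps above place inside source.

```python
from pathlib import Path

# Folders the steps above say should sit inside the extracted source folder.
EXPECTED = ["data", "Images"]


def missing_paths(root, expected=EXPECTED):
    """Return the expected entries that are not present under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing from the source folder:", ", ".join(missing))
    else:
        print("Layout looks complete.")
```

An empty list means both folders were found. The sketch does not check which files are inside them.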
Here's how to run all three models:
For Logistic Regression
- Run python3 Log_reg/log_regression.py
For LSTM
- Download crawl-300d-2M.vec and glove.840B.300d.txt and put them in the data folder.
- Run python3 LSTM/LSTM.py
For TextCNN
- Download crawl-300d-2M.vec.zip and extract it in the data folder.
- Run python3 TextCNN/textCNN.py
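The three runs above could also be chained from one small script. This is only a sketch under assumptions: that it is run from the folder containing Log_reg, LSTM, TextCNN and data, and that extracting crawl-300d-2M.vec.zip leaves crawl-300d-2M.vec directly in data. The names `MODELS`, `missing_embeddings` and `run_all` are hypothetical and not part of the project.

```python
import subprocess
import sys
from pathlib import Path

# Script paths and embedding files as named in the steps above.
# Assumption: the TextCNN zip extracts to crawl-300d-2M.vec.
MODELS = {
    "Logistic Regression": ("Log_reg/log_regression.py", []),
    "LSTM": ("LSTM/LSTM.py", ["crawl-300d-2M.vec", "glove.840B.300d.txt"]),
    "TextCNN": ("TextCNN/textCNN.py", ["crawl-300d-2M.vec"]),
}


def missing_embeddings(name, data_dir="data"):
    """List the embedding files a model needs that are absent from data_dir."""
    _, needed = MODELS[name]
    return [f for f in needed if not (Path(data_dir) / f).is_file()]


def run_all(data_dir="data"):
    """Run each model in turn, skipping any whose embeddings are missing."""
    for name, (script, _) in MODELS.items():
        missing = missing_embeddings(name, data_dir)
        if missing:
            print(f"Skipping {name}: missing {', '.join(missing)}")
            continue
        # sys.executable reuses the current interpreter in place of python3.
        subprocess.run([sys.executable, script], check=True)
```

Calling `run_all()` starts the runs. With `check=True`, a failing script stops the sequence instead of continuing silently.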
NOTE: If something doesn't work, just clone the project directory from https://csil-git1.cs.surrey.sfu.ca/krutp/nlpclass-1197-g-lexchunkers/tree/master/project. The word embeddings would still have to be downloaded separately.
NOTE: The report file project.ipynb contains images, so the Images folder needs to be downloaded from GitLab.
Checking the output files: output.zip contains all the submission predictions generated by the three models. They should be submitted to Kaggle for evaluation.
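To see which prediction files the archive holds before uploading, something like the following sketch may help. It assumes the submissions are stored as CSV files, which the text above does not state. The function name `list_submissions` is hypothetical.

```python
import zipfile


def list_submissions(zip_path="output.zip"):
    """Return the names of the CSV files stored in the archive, sorted."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".csv")
        )
```

Running `print(list_submissions())` next to output.zip prints the file names. Each file still has to be uploaded to Kaggle separately.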