Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.
Platforms need some form of moderation before a comment goes live. With social media use rising everywhere, manual review does not scale, so an effective approach is to use Machine Learning to identify the toxicity of submitted comments.
With this thought in mind, I decided to use this dataset.
Since this is a Natural Language Processing task, a considerable amount of data preprocessing is required.
- Removing numeric and empty texts
- Cleaning unnecessary text
- Tokenizing
- Lemmatizing
- Creating padded sequences
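The steps above can be sketched in plain Python. This is a minimal, framework-free illustration, not the project's actual pipeline: the tiny `LEMMAS` table stands in for a real lemmatizer (e.g. WordNet's), and the `maxlen` value is arbitrary.

```python
import re

def clean_text(text):
    """Lowercase and strip everything except letters and spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(text.split())

# Toy lemma table standing in for a real lemmatizer (assumption for the sketch)
LEMMAS = {"idiots": "idiot", "are": "be", "comments": "comment"}

def lemmatize(word):
    return LEMMAS.get(word, word)

def texts_to_sequences(texts, maxlen=8):
    # Drop texts that are empty or purely numeric after cleaning
    cleaned = [c for c in (clean_text(t) for t in texts) if c]
    # Build a vocabulary index (0 is reserved for padding)
    vocab, tokenized = {}, []
    for c in cleaned:
        tokens = [lemmatize(w) for w in c.split()]
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
        tokenized.append(tokens)
    # Convert to fixed-length integer sequences, left-padded with zeros
    seqs = []
    for tokens in tokenized:
        ids = [vocab[t] for t in tokens][:maxlen]
        seqs.append([0] * (maxlen - len(ids)) + ids)
    return vocab, seqs

vocab, seqs = texts_to_sequences(["Idiots ARE everywhere!!!", "12345", "  "])
# The numeric and empty texts are dropped; one padded sequence remains.
```

In the real project, a library tokenizer and lemmatizer (e.g. Keras's `Tokenizer` with `pad_sequences` and NLTK's `WordNetLemmatizer`) would replace the hand-rolled versions here.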
GloVe "is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space." - (Stanford NLP)
I used GloVe's pre-trained word vectors.
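Using pre-trained vectors boils down to building an embedding matrix whose row *i* holds the GloVe vector for the word with tokenizer index *i*. A sketch under assumptions: the two-line in-memory "file" stands in for a real GloVe download (e.g. `glove.6B.100d.txt`), and the `word_index` is a made-up tokenizer output.

```python
import numpy as np

# Tiny stand-in for a GloVe file: one "word v1 v2 ..." entry per line.
glove_lines = [
    "good 0.1 0.2 0.3",
    "bad -0.1 -0.2 -0.3",
]

embeddings = {}
for line in glove_lines:
    word, *vec = line.split()
    embeddings[word] = np.asarray(vec, dtype="float32")

# Word index as produced by a tokenizer (0 reserved for padding)
word_index = {"good": 1, "bad": 2, "unseenword": 3}
dim = 3

# Rows for out-of-vocabulary words stay all-zero
matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
for word, i in word_index.items():
    if word in embeddings:
        matrix[i] = embeddings[word]
```

This matrix is then passed as the (typically frozen) weights of the network's embedding layer.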
NLP tasks require contextual understanding, which makes RNNs/GRUs/LSTMs a natural choice. Recent developments in Deep Learning have also shown that 1-D CNNs help the network extract local features.
I used a combination of GRU and CNN for this task. Here's the model architecture -
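A GRU + CNN hybrid of this kind can be sketched in Keras as follows. The layer sizes, dropout rate, and six sigmoid output units (one per toxicity label in the Kaggle dataset) are my assumptions, not the exact architecture used.

```python
from tensorflow.keras import layers, models

MAX_LEN, VOCAB, DIM = 200, 20000, 100  # assumed hyperparameters

inp = layers.Input(shape=(MAX_LEN,))
# Frozen embedding layer; the GloVe matrix would be loaded as its weights
x = layers.Embedding(VOCAB, DIM, trainable=False)(inp)
x = layers.SpatialDropout1D(0.2)(x)
# 1-D convolution extracts local n-gram-like features
x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
# Bidirectional GRU captures context in both directions
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
x = layers.GlobalMaxPooling1D()(x)
# One sigmoid unit per toxicity label (multi-label classification)
out = layers.Dense(6, activation="sigmoid")(x)

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

Sigmoid outputs with binary cross-entropy let each label be predicted independently, which fits the dataset's multi-label setup.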
Although the architecture might look simple, the model is large and expensive to train.
The model performed well on the training data, with an accuracy of 98%. Thanks to the regularization introduced, the validation accuracy is also 98%, so the model is not overfitting to the training data.
Since training this model is a lengthy job, I was not able to capture the accuracy and loss graphs. I will retrain the model and upload them.
The model has achieved an accuracy of 98.2% on the test data - which puts this model in the top 20 scores of Kaggle.
Training a model is not enough; it has to be deployed so that people can use it.
I've used Flask to build a small UI (work in progress) to deploy this model.
The model is containerized with Docker and deployed on Microsoft Azure.
The API accepts both GET and POST requests so that it is easily accessible from anywhere.
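A minimal Flask endpoint handling both methods might look like this. The route name `/predict` and the `score_text` helper are placeholders I've invented for the sketch; the real service would call the trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def score_text(text):
    """Hypothetical stand-in for running the trained model on `text`."""
    return {"toxic": 0.02, "insult": 0.01}  # dummy scores for the sketch

@app.route("/predict", methods=["GET", "POST"])
def predict():
    # Accept the comment as a query parameter (GET) or a JSON body (POST)
    if request.method == "POST":
        text = (request.get_json(silent=True) or {}).get("text", "")
    else:
        text = request.args.get("text", "")
    return jsonify(score_text(text))
```

Supporting GET makes quick browser testing easy, while POST is the better fit for longer comments and programmatic clients.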
Inference takes less than 20 ms, making the model extremely quick at scoring the toxicity factors of a text.