Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. Improvements to the current model will hopefully help online discussion become more productive and respectful.
This dataset contains a large number of Wikipedia comments which have been labeled by human raters for toxic behavior.
In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity, such as:
- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate
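Because a single comment can carry several of these labels at once, one common reading of "multi-headed" is a model with six independent binary outputs rather than one six-way choice. The sketch below illustrates that target format in plain Python. The column order in `LABELS` and the helper names `encode_labels` and `predict_labels` are assumptions made for illustration and are not part of the dataset.

```python
import math

# Label names come from the list above; the column order is an assumption.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def encode_labels(active):
    """Turn a set of label names into a 0/1 vector in LABELS order."""
    unknown = set(active) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in LABELS]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, threshold=0.5):
    """Apply an independent sigmoid to each head and threshold it.

    Each head is scored on its own, so any number of labels
    (including none) can be switched on for one comment.
    """
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    return {name: sigmoid(z) >= threshold for name, z in zip(LABELS, logits)}
```

A softmax over the six outputs would force exactly one label per comment, which does not match this kind of target; independent sigmoids avoid that constraint.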
Disclaimer: the dataset for this competition contains text that may be considered profane, vulgar, or offensive.
This is not an exhaustive list of tasks; the following points are provided to guide you:
Try various methods to preprocess the comments into tokens.
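As a starting point for comparing preprocessing methods, the sketch below contrasts two simple tokenizers using only the standard library: word-level tokens and overlapping character n-grams. Both functions (`word_tokens`, `char_ngrams`) and the regular expression are illustrative assumptions, not a recommended pipeline.

```python
import re


def word_tokens(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased text.

    Whitespace is collapsed to single spaces first. Texts shorter
    than n characters yield an empty list.
    """
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
```

For example, `word_tokens("You're SO wrong!!!")` gives `["you're", "so", "wrong"]`. Word-level tokens split an obfuscated spelling such as `st*pid` into `st` and `pid`, while character n-grams keep fragments like `t*p`, which is one reason to try both.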
Test the performance of different model architectures. Tune your model to improve its performance.
Report your results using appropriate metrics. Check whether your model performs equally well across classes. Suggest possible improvements.
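One way to check whether performance is even across classes is to compute precision, recall, and F1 separately for each label instead of a single pooled score. The sketch below does this in plain Python. The function name `per_label_scores` and the row-of-0/1 input format are assumptions for illustration, and other metrics (for example a per-label ROC AUC) may be equally appropriate.

```python
def per_label_scores(y_true, y_pred, labels):
    """Compute precision, recall, F1, and support for each label.

    y_true and y_pred are lists of rows; each row holds one 0/1
    value per label, in the same order as `labels`.
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    scores = {}
    for j, name in enumerate(labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[name] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true positives in the data
        }
    return scores
```

Reporting `support` next to each score helps when interpreting the results: if a label turns out to be rare in the data, its scores rest on few examples and a high overall accuracy can hide poor recall on it.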
Toxic Comment Classification Challenge by Jigsaw/Conversation AI https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge