The task consists of predicting whether the input comment is Toxic or Clean.
The dataset comes from Kaggle and is an extension of the Civil Comments dataset; it can be downloaded at the following link.
The text of each comment is found in the comment_text column, and each comment also has a target column that specifies the toxicity of the text. In this example, the target column is used as the label:
- target >= 0.5: Toxic Comment
- target < 0.5: Clean Comment
import pandas as pd
import numpy as np

train = pd.read_csv("train.csv")
train.shape
> (1804874, 45)
Convert the target column from continuous values into labels 0 or 1.
Y = [1 if x >= 0.5 else 0 for x in train["target"]]
Y = np.array(Y)
df = train[['id', 'comment_text']]
df_labeled = df.assign(label=Y)
df_labeled.head()
| | id | comment_text | label |
|---|---|---|---|
| 0 | 5967432 | amazing this is first time in years i actually... | 0 |
| 1 | 5869644 | i know more than the generals trust me | 0 |
| 2 | 605006 | what is this world coming too how were these t... | 1 |
| 3 | 5094159 | it does not matter who wins the leadership of ... | 1 |
| 4 | 450628 | trash sits on the south bank of the willamette... | 0 |
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_NB_WORDS = 20000000
X = df_labeled['comment_text'].values  # raw comment texts (this assignment is implicit in the original)

tokenizer = Tokenizer(lower=False, filters='', oov_token="<OOV>", num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
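pad_sequences is imported above but the padding step itself is not shown; a minimal sketch, assuming a maximum sequence length of 320 (the input length that appears in the model summary below):
MAX_SEQUENCE_LENGTH = 320  # assumed from the Embedding output shape (None, 320, 300)
# pad/truncate every tokenized comment to the same fixed length
X_pad = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)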
Load the content of the GloVe file into a dictionary with the word as key and the word embedding vector as value. Using a pre-trained word embedding allows us to use less data and consequently reduce the training time. In this case, the GloVe embeddings trained on 840B tokens, with an embedding vector dimension of 300, are used.
- glove.840B.300d.zip link
from tqdm import tqdm

embedding_vector = {}
with open('glove.840B.300d.txt') as f:
    for line in tqdm(f):
        value = line.split(' ')
        word = value[0]
        coef = np.array(value[1:], dtype='float32')
        embedding_vector[word] = coef
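A quick check of the loaded dictionary (illustrative only; it assumes the common word 'the' is present in the GloVe vocabulary):
print(len(embedding_vector))          # number of tokens loaded
print(embedding_vector['the'].shape)  # each value is a 300-dimensional vector: (300,)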
Create the embedding matrix by assigning to each distinct token found by the tokenizer the corresponding embedding vector loaded from GloVe; tokens that do not appear in GloVe keep a row of zeros.
nb_words = len(tokenizer.word_index) + 1  # vocabulary size (index 0 is reserved by Keras); this definition is not shown in the original

embedding_matrix = np.zeros((nb_words, 300))
for word, i in tqdm(tokenizer.word_index.items()):
    embedding_value = embedding_vector.get(word)
    if embedding_value is not None:
        embedding_matrix[i] = embedding_value
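A small sanity check, not part of the original code, on how many vocabulary entries actually received a pre-trained vector:
# rows that stayed all-zero correspond to tokens with no GloVe entry
covered = int(np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1)))
print(f"{covered}/{nb_words} tokens have a pre-trained vector")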
The first layer is an Embedding layer initialized with the GloVe embedding matrix and kept frozen (trainable=False). Then follow two Bidirectional LSTM layers with 256 units each and a Dense layer with 128 hidden units, ending in a single sigmoid output unit. By exploiting transfer learning through the frozen pre-trained embeddings, it is possible to reduce both the training time and the number of parameters that must be trained.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Bidirectional, Dense, Dropout
from keras.optimizers import Adam
model = Sequential()
model.add(Embedding(nb_words, 300, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(256)))
model.add(Dropout(0.2))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, 320, 300) 40373400
_________________________________________________________________
bidirectional_3 (Bidirection (None, 320, 512) 1140736
_________________________________________________________________
dropout_4 (Dropout) (None, 320, 512) 0
_________________________________________________________________
bidirectional_4 (Bidirection (None, 512) 1574912
_________________________________________________________________
dropout_5 (Dropout) (None, 512) 0
_________________________________________________________________
dense_3 (Dense) (None, 128) 65664
_________________________________________________________________
dropout_6 (Dropout) (None, 128) 0
_________________________________________________________________
dense_4 (Dense) (None, 1) 129
=================================================================
Total params: 43,154,841
Trainable params: 2,781,441
Non-trainable params: 40,373,400
_________________________________________________________________
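The compilation step is not shown in the original snippet; a minimal sketch, assuming binary cross-entropy loss with the Adam optimizer (Adam is imported above, and accuracy is the metric that appears in the training log):
# binary cross-entropy matches the single sigmoid output unit
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])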
history = model.fit(X_pad, Y, validation_split=0.2, epochs=5, batch_size=128)
Train on 275467 samples, validate on 68867 samples
Epoch 1/5
275467/275467 [==============================] - 2158s 8ms/step - loss: 0.3569 - accuracy: 0.8428 - val_loss: 0.2810 - val_accuracy: 0.8791
Epoch 2/5
275467/275467 [==============================] - 2149s 8ms/step - loss: 0.2700 - accuracy: 0.8853 - val_loss: 0.2603 - val_accuracy: 0.8892
Epoch 3/5
275467/275467 [==============================] - 2133s 8ms/step - loss: 0.2524 - accuracy: 0.8933 - val_loss: 0.2532 - val_accuracy: 0.8917
Epoch 4/5
275467/275467 [==============================] - 2164s 8ms/step - loss: 0.2392 - accuracy: 0.8992 - val_loss: 0.2535 - val_accuracy: 0.8928
Epoch 5/5
275467/275467 [==============================] - 2198s 8ms/step - loss: 0.2236 - accuracy: 0.9054 - val_loss: 0.2512 - val_accuracy: 0.8959
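To classify a new comment with the trained model, the same tokenizer and padding must be applied before calling predict; a minimal sketch (the example text is made up):
new_comment = ["you are a wonderful person"]       # hypothetical input
seq = tokenizer.texts_to_sequences(new_comment)
seq_pad = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
prob = model.predict(seq_pad)[0][0]                # sigmoid output in [0, 1]
label = "Toxic" if prob >= 0.5 else "Clean"        # same 0.5 threshold used for the target column
print(prob, label)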