In this repository we analyze article categories and their text contents, and train a model to predict the category of new articles from their content.
The data consists of about 500,000 articles from webhose.io. The distribution of articles over the categories is not balanced, and all articles are from 2015.
The data cleaning can be found in this notebook.
The features of the data are checked and unnecessary ones are deleted. The titles and texts are checked for errors, boilerplate and other content that does not belong in article texts.
The articles come from different website sources. These are also highly unbalanced, so they are balanced out as well.
Afterwards the texts are lemmatized and numbers etc. are removed. The operations can be looked up in this file.
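The number removal can be sketched with a small helper (a hypothetical function, not the code from the linked file; the actual pipeline additionally lemmatizes each token with an NLP library):

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text, strip numbers and punctuation, collapse whitespace.
    (Illustrative sketch; the real pipeline also lemmatizes.)"""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("In 2015, Webhose.io collected 500,000 articles!"))
# → "in webhose io collected articles"
```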
The data is split into train and test sets and saved to CSV files.
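The split-and-save step looks roughly like this (column names and split ratio are assumptions for illustration, not taken from the notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned article DataFrame
df = pd.DataFrame({
    "text": ["econ news", "match report", "election recap", "tech review"],
    "category": ["business", "sports", "politics", "tech"],
})

train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
print(len(train_df), len(test_df))  # → 3 1
```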
We look at the data in more detail in this notebook.
Here the different publishers are investigated and we try to balance them out to get a more varied dataset.
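One simple balancing strategy is to downsample every publisher to the size of the smallest group; this is a sketch of the idea, not necessarily the exact strategy used in the notebook:

```python
import pandas as pd

# Toy data: publisher "a" is heavily over-represented
df = pd.DataFrame({
    "publisher": ["a"] * 5 + ["b"] * 2 + ["c"] * 3,
    "text": [f"article {i}" for i in range(10)],
})

n_min = df["publisher"].value_counts().min()
balanced = df.groupby("publisher").sample(n=n_min, random_state=0)
print(sorted(balanced["publisher"].value_counts().items()))
# → [('a', 2), ('b', 2), ('c', 2)]
```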
The categories are classified by our model and the results are examined for each category. Here we will see that especially world and politics are hard to separate.
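This kind of per-category error analysis typically shows up in a confusion matrix; the labels and predictions below are made up to illustrate how a world/politics mix-up surfaces:

```python
from sklearn.metrics import confusion_matrix

y_true = ["world", "world", "politics", "politics", "sports"]
y_pred = ["world", "politics", "politics", "world", "sports"]

# Off-diagonal mass between the first two rows = world/politics confusion
cm = confusion_matrix(y_true, y_pred, labels=["world", "politics", "sports"])
print(cm)
```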
We look at the most important words in every category.
We also use word2vec from Gensim to create our own word embeddings for further modelling.
We try out different models such as SVMs, Logistic Regression and ensemble methods implemented in scikit-learn in this notebook. Especially linear SVM and Logistic Regression give quite good results, but still show an overfit that cannot be removed by regularization.
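The classical baselines boil down to a text-vectorizer-plus-linear-model pipeline; a minimal sketch (toy data, with `C` as the regularization knob mentioned above):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

texts = ["election vote parliament", "match goal team",
         "vote government policy", "goal team league"]
labels = ["politics", "sports", "politics", "sports"]

# Smaller C = stronger regularization for both models
for clf in (LinearSVC(C=1.0), LogisticRegression(C=1.0, max_iter=1000)):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(texts, labels)
    print(type(clf).__name__, pipe.predict(["parliament vote"])[0])
```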
We also try out a simple fully connected neural network and a bidirectional LSTM in this notebook. Both models overfit heavily, either because they are too complex for the problem, or because the dataset still contains inconsistencies.
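As a rough idea of the fully connected variant, here is a tiny sketch using scikit-learn's MLPClassifier as a framework-neutral stand-in (the notebook's networks were built with a deep-learning framework, and the bidirectional LSTM has no such simple equivalent):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

texts = ["election vote parliament", "match goal team",
         "vote government policy", "goal team league"]
labels = ["politics", "sports", "politics", "sports"]

nn_pipe = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
nn_pipe.fit(texts, labels)
# Training accuracy only: on data this small the net simply memorizes,
# which is exactly the overfitting problem described above.
print(nn_pipe.score(texts, labels))
```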
We also briefly tried out fastText in this notebook. fastText is a text classifier from Facebook. It gives us about 76% accuracy, so it performs worse than our models.
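fastText's supervised mode expects one article per line with the category prefixed as `__label__<category>`; preparing that format is plain string work (helper name is hypothetical):

```python
def to_fasttext_line(category: str, text: str) -> str:
    """Format one training example in fastText's supervised input format."""
    return f"__label__{category} {text}"

line = to_fasttext_line("politics", "parliament passes the budget")
print(line)  # → __label__politics parliament passes the budget
```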
In this notebook different bags of words and word embeddings are tested to find the best features for categorization. The best feature representation for our problem is the TfidfVectorizer from sklearn.
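What the TfidfVectorizer produces is a sparse document-term matrix of tf-idf weights; a toy illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["election vote", "goal match", "vote parliament"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: docs x vocabulary

print(X.shape)                   # → (3, 5)
print(sorted(vec.vocabulary_))   # → ['election', 'goal', 'match', 'parliament', 'vote']
```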
Our models are finalized in this notebook.
Below we see the final results for our models on the test dataset. Taking Naive Bayes as our base model, the fastText classifier already improves the result by about 4%. With Logistic Regression and Linear SVC we were able to improve our accuracy to about 82%. The best final model is the Linear SVC, but it needs about 900k feature words to achieve this accuracy. If we reduce the feature size to 40k, we still achieve almost 82% accuracy with Logistic Regression, which is why we use this cheaper method instead.
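The cheaper final setup described above amounts to capping the TF-IDF vocabulary and feeding it into Logistic Regression; a sketch on toy data (the accuracies quoted above come from the real notebook, not from this snippet):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["election vote parliament", "match goal team",
         "vote government policy", "goal team league"]
labels = ["politics", "sports", "politics", "sports"]

final_model = make_pipeline(
    TfidfVectorizer(max_features=40_000),  # cap the vocabulary at 40k terms
    LogisticRegression(max_iter=1000),
)
final_model.fit(texts, labels)
print(final_model.predict(["parliament vote"])[0])
```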
In the EDA notebook you can also see some examples of wrongly classified articles. There it becomes clear that the biggest problem is not the model but wrongly labeled articles, or articles that fit multiple labels. Of course some articles belong to one category while containing content from other categories. To distinguish these, a more complex model would be necessary, but the first problem to solve would be to introduce better labels, or to allow more than one label per article.
Finally, we tried to visualize how words can be represented in different categories in the text. Below is a small simulation showing word dependencies. If you want to know more about it, check it out here.