Performance Analysis of Probabilistic and Machine Learning Models
This small project was part of the Information Theory and Probabilistic Programming
class for Fall 2021 at the University of Oklahoma.
Algorithms/Models:
- Probabilistic & traditional ML algorithms team:
  - XGBoost.
  - NGBoost.
  - Naïve Bayes.
  - Logistic Regression.
  - Decision Tree.
  - Random Forest.
  - Support Vector Machines (SVMs).
- Deep learning team (a model sketch follows this list):
  - Bidirectional LSTM + Convolutional Neural Network (CNN).
  - Bidirectional GRU + Convolutional Neural Network (CNN).
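The report does not specify the exact layer configuration, so the following is only a minimal Keras sketch of the Bi-GRU + CNN variant; the layer ordering (convolution before the recurrent layer), the layer sizes, and the dropout rate are illustrative assumptions, not the original hyperparameters.

```python
# Minimal sketch of a Bi-GRU + CNN sentiment classifier.
# All hyperparameters below are assumptions, not the project's originals.
from tensorflow.keras import layers, models

VOCAB_SIZE = 30_000  # vocabulary size used in this project
EMBED_DIM = 300      # GloVe 300-dimensional vectors

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),              # GloVe weights would be loaded here
    layers.Conv1D(64, kernel_size=5, activation="relu"),  # local n-gram features
    layers.MaxPooling1D(pool_size=2),
    layers.Bidirectional(layers.GRU(64)),                 # context from both directions
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),                # positive vs. negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```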
Dataset:
- Stanford Large Movie Review Dataset
- Published by Stanford AI Lab.
- Collected from movie reviews on IMDb.
- Contains 50K movie reviews, split evenly into 25K for training and 25K for testing (a loading sketch follows this list). Although 50K reviews is on the smaller side, it is a common size for sentiment classification because larger labeled datasets are hard to find.
- Citation (credit): Maas, Andrew L.; Daly, Raymond E.; Pham, Peter T.; Huang, Dan; Ng, Andrew Y.; Potts, Christopher. "Learning Word Vectors for Sentiment Analysis." ACL 2011.
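For readers who want to reproduce the setup, here is a minimal loading sketch; it assumes the standard `aclImdb/` directory layout of the Stanford release (`train/pos`, `train/neg`, `test/pos`, `test/neg`).

```python
# Minimal loader for the Stanford Large Movie Review Dataset (aclImdb release).
# Assumes the tarball from https://ai.stanford.edu/~amaas/data/sentiment/
# has been extracted next to this script.
from pathlib import Path

def load_split(root: str, split: str):
    """Return (texts, labels) for 'train' or 'test'; label 1 = positive."""
    texts, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        for path in sorted((Path(root) / split / label_name).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels

train_texts, train_labels = load_split("aclImdb", "train")
test_texts, test_labels = load_split("aclImdb", "test")
print(len(train_texts), len(test_texts))  # 25000 25000
```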
Data Preprocessing/Wrangling:
The next two figures show the 200 most common words in the corpus (all 50K reviews).
Figure 1: The 200 most used words in the dataset.
Figure 2: The 200 most used words in the dataset (ignoring stopwords).
Vocabulary size:
Experimented with a few values (10K, 15K, 20K, 50K) and settled on 30,000 words as the vocabulary size. The extra entries in larger vocabularies were mostly words mentioned only once in the whole corpus, so capping at 30K avoids overfitting to such corpus-specific words while still generalizing well; a minimal sketch of this cut-off follows.
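The report does not name the tokenizer used, so this is only a sketch of the frequency cut-off using a plain word count; `build_vocab` and the whitespace tokenization are illustrative, not the project's actual pipeline.

```python
# Illustrative frequency cut-off: keep only the 30,000 most common words.
from collections import Counter

VOCAB_SIZE = 30_000

def build_vocab(texts, vocab_size=VOCAB_SIZE):
    """Map the vocab_size most frequent tokens to ids; 0 = padding, 1 = unknown."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {word: idx
            for idx, (word, _) in enumerate(counts.most_common(vocab_size), start=2)}

# Usage: out-of-vocabulary tokens fall back to the unknown id (1).
# vocab = build_vocab(train_texts)
# ids = [vocab.get(token, 1) for token in review.lower().split()]
```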
Text representation:
- Word Embeddings: GloVe (Global Vectors for Word Representation); the 300-dimensional vectors were used.
- Term Frequency-Inverse Document Frequency (TF-IDF): used an implementation that adds smoothing* when computing IDF.
*Smoothing here means adding 1 to the denominator when computing the IDF score for a word, which avoids dividing by zero for words that appear in no document.
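As a one-line illustration of the smoothing described above (library implementations differ slightly; scikit-learn's `smooth_idf`, for example, adds 1 to both numerator and denominator and then adds 1 to the result):

```python
# Smoothed IDF as described above: adding 1 to the denominator avoids
# division by zero for terms with a document frequency of zero.
import math

def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    return math.log(n_docs / (1 + doc_freq))

# For comparison, scikit-learn's TfidfVectorizer(smooth_idf=True) computes:
# idf = log((1 + n_docs) / (1 + doc_freq)) + 1
```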
Data preprocessing steps:
- Unpack contractions: English contractions like "I've" were replaced with the expanded version "I have". This step was initially performed for the word-embedding representation because the GloVe embeddings contain no vectors for these contracted forms, so they had to be expanded to capture their meaning. For consistency, the same step was also applied to the data represented with TF-IDF; the setup was kept as close to identical as possible so the traditional and probabilistic algorithms could be fairly compared with the neural networks.
- Remove punctuation.
- Remove non-English characters.
- Remove extra whitespace.
- For word embeddings (a sketch follows this list):
  - Tokenize reviews.
  - Create an embedding matrix.
- For TF-IDF:
  - The corpus was represented as TF-IDF vectors with a 30,000-word vocabulary.
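The sketch below pulls the word-embedding preprocessing together: contraction unpacking, text cleaning, and building the GloVe embedding matrix. The contraction table (excerpted), the cleaning regexes, and the `glove.6B.300d.txt` path are assumptions for illustration, not the project's exact code.

```python
# Sketch of the word-embedding preprocessing: unpack contractions, clean text,
# then build a GloVe embedding matrix for the 30K-word vocabulary.
# The contraction table and regexes are illustrative, not the original code.
import re
import numpy as np

CONTRACTIONS = {"i've": "i have", "don't": "do not", "can't": "cannot"}  # excerpt

def clean(text: str) -> str:
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation / non-English chars
    return re.sub(r"\s+", " ", text).strip()   # extra whitespace

def build_embedding_matrix(vocab: dict, dim: int = 300,
                           glove_path: str = "glove.6B.300d.txt") -> np.ndarray:
    """vocab maps word -> row index; words missing from GloVe stay zero vectors."""
    matrix = np.zeros((max(vocab.values()) + 1, dim), dtype="float32")
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                matrix[vocab[word]] = np.asarray(values, dtype="float32")
    return matrix
```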
Note: Regarding model accuracy, online implementations that use this dataset typically report accuracies in the 80%-85% range, so that range was used as the standard to compare against. It is important to note that the main goal is not to achieve the highest possible accuracy; the goal is to compare the accuracies of probabilistic and deep learning models and to see whether the probabilistic solutions, which usually require much less computation power than neural networks, achieve better results or not.
Results:

| Method | Validation accuracy | Test accuracy |
|---|---|---|
| Deep learning team | | |
| Bi-LSTM + Conv | 82.84% | 82.04% |
| Bi-GRU + Conv | 83.5% | 83.76% |
| Probabilistic & traditional team | | |
| XGBoost | 99.3% | 89.02% |
| NGBoost | 81.12% | 81.42% |
| Naïve Bayes | 90.76% | 86.78% |
| Decision Trees | 100% | 71.4% |
| Logistic Regression | 95.05% | 90.12% |
| Random Forests | 100% | 84.8% |
| Support Vector Machine (SVM) | 98.68% | 90.46% |
Deep learning team winner:
Bi-GRU + Conv with a test accuracy of 83.76%.
Probabilistic team and overall winner:
SVM with a test accuracy of 90.46%; a minimal sketch of a comparable pipeline follows.
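For reference, here is a minimal scikit-learn sketch of a comparable TF-IDF + SVM pipeline; `LinearSVC` and the 30,000-feature cap are assumptions, since the report does not state which SVM implementation or kernel was used.

```python
# Sketch of a TF-IDF + linear SVM pipeline comparable to the winning model.
# LinearSVC is an assumption; the report does not name the SVM variant used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pipeline = make_pipeline(
    TfidfVectorizer(max_features=30_000, smooth_idf=True),  # 30K-word vocabulary
    LinearSVC(),
)
# pipeline.fit(train_texts, train_labels)
# print(pipeline.score(test_texts, test_labels))  # test accuracy
```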
Thank you.