Sentiment Classification

Performance Analysis of Probabilistic and Machine Learning Models.

This small project was part of the Information Theory and Probabilistic Programming class for Fall 2021 at the University of Oklahoma.

Algorithms/Models:

Probabilistic & traditional ML algorithms team:
- XGBoost.
- NGBoost.
- Naïve Bayes.
- Logistic Regression.
- Decision Tree.
- Random Forest.
- Support Vector Machines (SVMs).
Deep learning team:
- Bidirectional LSTM + Convolutional Neural Network (CNN).
- Bidirectional GRU + Convolutional Neural Network (CNN).

Dataset:

Stanford Large Movie Review Dataset
Published by Stanford AI Lab.
Collected from movie reviews on IMDb.
Contains 50K labeled movie reviews, split into 25K for training and 25K for testing. Although the dataset is fairly small, this size is common for the sentiment classification task because larger labeled datasets are hard to find.
Citation (credit): Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher, Learning Word Vectors for Sentiment Analysis.

Data Preprocessing/Wrangling:

The next two figures show the 200 most common words in the corpus (all 50K reviews).

Figure 1: The 200 most used words in the dataset

Figure 2: The 200 most used words in the dataset (ignoring stopwords)

Vocabulary size:

I experimented with several vocabulary sizes (e.g., 10K, 15K, 20K, 50K) and settled on 30,000 words. With larger vocabularies, most of the additional words appeared only once in the whole corpus. To avoid overfitting to the corpus and to generalize well, I found 30K to be the best vocabulary size for my methods.
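The selection described above can be sketched with a plain frequency count. This is a minimal illustration, not the notebook's code: `build_vocabulary`, `singleton_share`, and the toy corpus are hypothetical names and data, and the reviews are assumed to be tokenized already.

```python
from collections import Counter


def build_vocabulary(tokenized_reviews, max_size):
    """Return the max_size most frequent words, most frequent first."""
    counts = Counter(word for review in tokenized_reviews for word in review)
    return [word for word, _ in counts.most_common(max_size)]


def singleton_share(tokenized_reviews):
    """Fraction of distinct words that occur exactly once in the corpus."""
    counts = Counter(word for review in tokenized_reviews for word in review)
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(counts)


# Toy corpus standing in for the 50K reviews (hypothetical data).
toy_reviews = [
    ["great", "movie", "great", "cast"],
    ["bad", "movie", "weak", "plot"],
    ["great", "plot"],
]

vocab = build_vocabulary(toy_reviews, max_size=3)
share = singleton_share(toy_reviews)
print(vocab, share)
```

On the real corpus, `max_size` would be 30,000, and a check like `singleton_share` is one way to see how many of the distinct words are one-off words that a larger vocabulary would pull in.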

Text representation:

Word embeddings: GloVe (Global Vectors for Word Representation), using the 300-dimensional vector representation.
Term Frequency-Inverse Document Frequency (TF-IDF): using an implementation that adds smoothing* when computing IDF.

*Smoothing adds 1 to the denominator term when computing the IDF score for a word, which avoids dividing by zero.
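As a rough sketch of the smoothing idea, the first function below adds 1 to the denominator exactly as the footnote describes. The second shows a common variant that adds 1 to both the numerator and the denominator and then adds 1 to the result; to my knowledge this is what scikit-learn's `TfidfVectorizer` computes with `smooth_idf=True`, but the README does not say which implementation was used, so treat that link as an assumption. All function names and the toy documents are hypothetical.

```python
import math


def document_frequency(term, documents):
    """Number of documents (sets of tokens) that contain the term."""
    return sum(1 for doc in documents if term in doc)


def idf_denominator_smoothing(term, documents):
    """IDF with 1 added to the denominator, as in the footnote above."""
    n = len(documents)
    return math.log(n / (1 + document_frequency(term, documents)))


def idf_both_smoothing(term, documents):
    """Variant: 1 added to numerator and denominator, plus 1 on the result."""
    n = len(documents)
    return math.log((1 + n) / (1 + document_frequency(term, documents))) + 1


# Toy documents, each represented as a set of tokens (hypothetical data).
docs = [{"great", "movie"}, {"bad", "movie"}, {"great", "plot"}]

# A term that occurs in no document no longer causes a division by zero.
unseen = idf_denominator_smoothing("unseen", docs)
common = idf_denominator_smoothing("movie", docs)
variant = idf_both_smoothing("movie", docs)
```

With denominator-only smoothing, a word that appears in all but one document scores exactly zero, which is one reason the second variant adds 1 to the result.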

Data preprocessing steps:

Unpack abbreviations: English contractions such as I’ve were replaced with their unpacked form, I have. This step was initially needed for the word-embedding representation, because the GloVe word embeddings do not contain vectors for these contractions, so they had to be unpacked to capture their meaning. For consistency, the same step was also applied to the data represented with TF-IDF. I tried to keep the setup nearly identical so that traditional and probabilistic machine learning algorithms could be compared fairly with neural networks.
Remove punctuation.
Remove non-English characters.
Remove extra whitespaces.
For word embeddings:
- Tokenize reviews.
- Create an embedding matrix.
For TF-IDF:
- The corpus was converted to a TF-IDF representation with a vocabulary size of 30,000.
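The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the notebook's code: the contraction table is a tiny hypothetical sample, lowercasing is an extra step added here so the table lookup works, "non-English characters" is interpreted as anything outside ASCII letters and digits, and row 0 of the embedding matrix is reserved for padding by assumption. The toy vectors are 3-dimensional instead of 300-dimensional.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table; a real one is much longer.
CONTRACTIONS = {
    "i've": "i have",
    "don't": "do not",
    "it's": "it is",
    "can't": "cannot",
}

# Map every ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)


def clean_review(text):
    """Unpack contractions, then strip punctuation, non-English characters
    and extra whitespace."""
    # Assumption: lowercase and normalize curly apostrophes first.
    text = text.lower().replace("’", "'")
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = text.translate(PUNCTUATION_TO_SPACE)
    text = re.sub(r"[^a-z0-9\s]", "", text)   # non-English characters
    return re.sub(r"\s+", " ", text).strip()  # extra whitespace


def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector of vocab[i - 1]; row 0 is kept for padding.
    Words without a pretrained vector keep an all-zero row."""
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for index, word in enumerate(vocab, start=1):
        if word in vectors:
            matrix[index] = list(vectors[word])
    return matrix


cleaned = clean_review("I’ve  seen it!!")
tokens = cleaned.split()  # simple whitespace tokenization

# Toy 3-dimensional vectors standing in for 300-dimensional GloVe vectors.
toy_vectors = {"i": [0.1, 0.2, 0.3], "have": [0.4, 0.5, 0.6]}
matrix = build_embedding_matrix(tokens, toy_vectors, dim=3)
```

Unpacking "I’ve" before removing punctuation matters: if the apostrophe were stripped first, the result would be the token "ive", which has no useful pretrained vector.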

Results:

Note: Regarding the accuracy of the models, online implementations that use this dataset appear to reach accuracies in the range of 80% to 85%, so this range was used as the baseline for comparison. The main goal of this project is not to achieve the highest possible accuracy. The goal is to compare the accuracies of probabilistic and deep learning models and to see whether the probabilistic solutions, which most of the time require much less computation power than neural networks, achieve better results.

Method	Validation	Test
Deep learning team
Bi-LSTM + Conv	82.84%	82.04%
Bi-GRU + Conv	83.5%	83.76%
Probabilistic & traditional team
XGBoost	99.3%	89.02%
NGBoost	81.12%	81.42%
Naïve Bayes	90.76%	86.78%
Logistic Regression	95.05%	90.12%
Decision Trees	100%	71.4%
Random Forests	100%	84.8%
Support Vector Machine (SVM)	98.68%	*90.46%*
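One way to read the table is to look at the difference between the validation and test columns for each method. The snippet below only does that arithmetic on the numbers reported above; the `results` dictionary and the `gaps` name are introduced here for illustration.

```python
# (validation accuracy, test accuracy) in percent, copied from the table above.
results = {
    "Bi-LSTM + Conv": (82.84, 82.04),
    "Bi-GRU + Conv": (83.5, 83.76),
    "XGBoost": (99.3, 89.02),
    "NGBoost": (81.12, 81.42),
    "Naïve Bayes": (90.76, 86.78),
    "Logistic Regression": (95.05, 90.12),
    "Decision Trees": (100.0, 71.4),
    "Random Forests": (100.0, 84.8),
    "Support Vector Machine (SVM)": (98.68, 90.46),
}

# Validation minus test accuracy, in percentage points.
gaps = {name: round(val - test, 2) for name, (val, test) in results.items()}

# Print the methods from the largest gap to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<30} {gap:+.2f}")
```

A large positive gap means a method scored much higher on the validation column than on the held-out test set, while the two deep learning models and NGBoost stay within about one percentage point.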

Deep learning team winner:
Bi-GRU + Conv with a test accuracy of 83.67%

Probabilistic team and overall winner:
SVM with a test accuracy of 90.46%

Thank you.

MohamedAliHabib/ou-it-fall21-sentiment-classification

Folders and files

Latest commit

History

Repository files navigation

Sentiment Classification

Results:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages