This project introduces one of the fundamental techniques of natural language processing: text classification. We explore, step by step, how to create new features through data analysis and data cleaning. We also explain the concepts of Term Frequency - Inverse Document Frequency (TF-IDF) and Word Embedding, and how to implement them on real data.
The article shows how to apply Naive Bayes to text data after tokenization and vectorization. We use a feature selection technique (SelectKBest with f_classif) to keep only the features that contribute to prediction performance. After that, we analyse and compare evaluation metrics, including accuracy, confusion matrix and classification report, across the baseline, TF-IDF + Naive Bayes and Word Embedding + Naive Bayes.
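As a toy illustration of the tokenization and vectorization step (the corpus below is made up, not the project's data, and `get_feature_names_out` assumes scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus for illustration only
docs = ["free prize call now", "call me later", "win a free prize"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # tokenizes, counts terms, applies IDF weighting

print(vec.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))         # one TF-IDF weighted row per message
```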
For more detail, check out: Text Classification with TF-IDF, Word Embedding and Naive Bayes
- The work answers the following questions:
- What percentage of the documents are spam?
- What is the longest message?
- What is the average length of documents (number of characters) for not spam and spam documents?
- What is the average number of digits per document for not spam and spam documents?
- What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?
- As a result, three new features are created: 'len_text', 'digits' and 'non_alpha_char'
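A minimal sketch of how these features can be derived with pandas; the file name 'spam.csv' and the column names 'text' and 'target' are assumptions, not necessarily the notebook's actual names:

```python
import pandas as pd

# Hypothetical file and column names: 'text' holds the raw message,
# 'target' the label (1 = spam, 0 = not spam)
df = pd.read_csv('spam.csv')

df['len_text'] = df['text'].str.len()               # number of characters
df['digits'] = df['text'].str.count(r'\d')          # number of digits
df['non_alpha_char'] = df['text'].str.count(r'\W')  # non-word characters

# Per-class averages answer the EDA questions above
print(df.groupby('target')[['len_text', 'digits', 'non_alpha_char']].mean())
print('Spam percentage:', 100 * df['target'].mean())
```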
NativeBayes_Tf-Idf.ipynb: Transforms the text data into Term Frequency - Inverse Document Frequency features, selects the best features with f_classif and fits the transformed data to a Naive Bayes classifier (a sketch of this pipeline follows the results below):
- The accuracy score is 91%, which is 5 percentage points better than the baseline (86%).
- From Classification Report:
- The F1-score (spam class) is 74% and F1-score (Not Spam class) is 95%
- From Confusion Matrix:
- The number of correct predictions = True Positive (TP) + True Negative (TN) = 878 + 140 = 1018
- The number of misclassifications = False Negative (FN) + False Positive (FP) = 84 + 13 = 97
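The sketch below outlines the pipeline described above; the split parameters and k are hypothetical, and `df` is the DataFrame from the EDA sketch:

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# df is the DataFrame from the EDA sketch above (hypothetical column names)
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['target'], random_state=0)

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),                # tokenize and apply TF-IDF weighting
    ('select', SelectKBest(f_classif, k=1000)),  # keep the k best features (k is a guess)
    ('nb', MultinomialNB()),                     # Naive Bayes on the selected features
])
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```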
NativeBayes_word2vec.ipynb: Transforms the text data into Word Embedding features, selects the best features with f_classif and fits the transformed data to a Naive Bayes classifier (a sketch follows the results below):
- The accuracy score is 94%, which is 8 percentage points better than the baseline (86%).
- From Classification Report:
- The F1-score (spam class) is 80% and F1-score (Not Spam class) is 96%
- From Confusion Matrix:
- The number of correct predictions = TP + TN = 896 + 147 = 1043
- The number of misclassifications = FN + FP = 72
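A sketch of the Word Embedding variant, assuming gensim 4.x for Word2Vec and reusing the train/test split from the previous sketch. Averaged word vectors can be negative, so GaussianNB is used here instead of MultinomialNB; the notebook's exact choices may differ:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB

# Simple whitespace tokenization of the raw messages from the earlier split
tokens_train = [doc.lower().split() for doc in X_train]
tokens_test = [doc.lower().split() for doc in X_test]

# Train Word2Vec on the training messages (gensim 4.x API; parameters are guesses)
w2v = Word2Vec(sentences=tokens_train, vector_size=100, min_count=1, seed=0)

def doc_vector(tokens, model):
    """Represent a document as the mean of its word vectors."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

Xtr = np.vstack([doc_vector(t, w2v) for t in tokens_train])
Xte = np.vstack([doc_vector(t, w2v) for t in tokens_test])

# Keep the k best embedding dimensions with f_classif (k is a guess)
sel = SelectKBest(f_classif, k=50)
Xtr_sel = sel.fit_transform(Xtr, y_train)
Xte_sel = sel.transform(Xte)

# GaussianNB handles real-valued (possibly negative) embedding features
gnb = GaussianNB().fit(Xtr_sel, y_train)
print(gnb.score(Xte_sel, y_test))
```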
- TF-IDF + Naive Bayes: improves classification accuracy by 5 percentage points, from 86% to 91%
- Word2Vec + Naive Bayes: improves classification accuracy by 8 percentage points, from 86% to 94%
- Python >= 3.7
- Jupyter Notebook
requirement.txt
- Check out the project: git clone https://github.com/diem-ai/text-classification.git
- Install the latest versions of the libraries listed in the requirements file and their dependencies
- Run the following commands in order:
- jupyter notebook spam_EDA.ipynb
- jupyter notebook NativeBayes_Tf-Idf.ipynb
- jupyter notebook NativeBayes_word2vec.ipynb