- Introduction
- Dataset
- Data Cleaning
- Exploratory Data Analysis (EDA)
- Data Preprocessing
- Modeling and Results
- Web App screenshots
The Clothing Review Sentiment Analysis project predicts the sentiment of customer reviews for clothing items. The goal is to classify each review as positive or negative based on its text.
The project leverages Natural Language Processing (NLP) techniques and machine learning models to achieve high accuracy in sentiment classification. This project is valuable for retailers who wish to gain insights into customer satisfaction and improve their products based on feedback.
The data comes from the Clothing Reviews dataset on Kaggle. It includes the following columns:
- Clothing ID: Unique identifier for each clothing item.
- Age: Age of the reviewer.
- Title: Title of the review.
- Review Text: The actual text of the review.
- Rating: Rating given by the reviewer (1-5 scale).
- Recommended IND: Indicator of whether the reviewer recommends the item.
- Positive Feedback Count: Number of positive feedback votes the review received.
- Division Name, Department Name, Class Name: Metadata about the clothing item.
Prior to conducting exploratory data analysis (EDA), a comprehensive data cleaning process was undertaken to ensure the dataset's quality and integrity. The following steps were applied:
- Handling Missing Values: All instances of missing data were systematically addressed. Missing entries in critical columns were either imputed with suitable values or removed, depending on the context and impact on downstream analysis.
- Class Imbalance Management: The dataset exhibited a significant imbalance between the two sentiment classes. To mitigate this, undersampling was employed on the majority class, bringing the dataset to a more balanced state. This step was crucial in ensuring that the models trained on the data were not biased towards the more prevalent class, thereby improving the robustness of the sentiment classification.
- Removal of Anomalous Entries: Certain entries were identified as biased or inconsistent, such as instances where a rating of 5 was given, yet the recommendation indicator was 0. These entries were removed to prevent any distortions in the model's learning process, ensuring that the training data accurately reflected the true sentiment of the reviewers.
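
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code. It assumes that `Recommended IND` serves as the sentiment label, that incomplete rows are dropped rather than imputed, and that each row is a dictionary keyed by the column names listed earlier. The function name `clean_reviews` and the `seed` parameter are hypothetical.

```python
import random

def clean_reviews(rows, seed=0):
    """Drop incomplete rows, remove contradictory rows, undersample the majority class.

    Assumption: "Recommended IND" (0 or 1) is used as the sentiment label.
    """
    # 1. Missing values: drop rows lacking any critical field (imputation is the alternative).
    required = ("Review Text", "Rating", "Recommended IND")
    rows = [r for r in rows if all(r.get(k) not in (None, "") for k in required)]

    # 2. Anomalous entries: a rating of 5 combined with "not recommended".
    rows = [r for r in rows if not (r["Rating"] == 5 and r["Recommended IND"] == 0)]

    # 3. Class imbalance: randomly sample every class down to the size of the smallest one.
    by_label = {}
    for r in rows:
        by_label.setdefault(r["Recommended IND"], []).append(r)
    if not by_label:
        return []
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    return balanced
```

Removing contradictory rows before undersampling is a choice made here so that the final class counts stay equal; the original ordering of these steps is not specified.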
Before diving into model building, an extensive exploratory data analysis was conducted utilizing plotly, seaborn and matplotlib. This included:
- Distribution of Ratings: Visualization of how ratings and the recommended class are distributed across the dataset.
- Word Clouds: Created word clouds to visualize the most frequent words in positive and negative reviews.
- Correlation Analysis: Checked for correlations between different features and review sentiments.
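
As a rough illustration of the numbers behind these plots, the sketch below computes the rating distribution and the most frequent words per class in plain Python. It does not reproduce the plotly, seaborn, or matplotlib figures themselves. The function names are hypothetical, and `Recommended IND` is again assumed to be the sentiment label.

```python
import re
from collections import Counter

def rating_distribution(rows):
    """Count how many reviews fall on each rating value."""
    return Counter(r["Rating"] for r in rows)

def top_words(rows, label, k=10):
    """Return the k most frequent words among reviews with the given label.

    These counts are the kind of input a word cloud is drawn from.
    """
    counts = Counter()
    for r in rows:
        if r["Recommended IND"] == label:
            counts.update(re.findall(r"[a-z']+", r["Review Text"].lower()))
    return counts.most_common(k)
```

In practice, stopwords would usually be removed before counting, otherwise words such as "the" tend to dominate both classes.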
To prepare the data for modeling, the following preprocessing steps were undertaken:
- Text Cleaning: Removed HTML tags, special characters, and numbers from the review text.
- Stopword Removal: Common stopwords were removed to reduce noise in the data.
- Lemmatization: Converted words to their base form using lemmatization to standardize the text.
- TF-IDF Vectorization: Transformed the text data into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency).
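
The pipeline above can be sketched in plain Python as follows. This is a simplified illustration under several assumptions, not the project's implementation: the stopword set is a tiny hypothetical sample, lemmatization is omitted because it needs an external library such as NLTK or spaCy, and the TF-IDF weight uses the basic form `tf * log(N / df)`. Library implementations differ in detail; scikit-learn's `TfidfVectorizer`, for example, applies a smoothed IDF and L2 normalization by default.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "i", "to", "of", "was"}

def clean_text(text):
    """Strip HTML tags, special characters and digits, then drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # remove special characters and numbers
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

def tfidf(docs):
    """Compute basic TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {term: (count / len(doc)) * math.log(n / df[term]) for term, count in tf.items()}
        )
    return vectors

docs = [
    clean_text("<p>This dress is GREAT!!! 10/10</p>"),
    clean_text("The fabric was great"),
]
vectors = tfidf(docs)
```

Here "great" appears in both documents, so its weight is zero, while "dress" and "fabric" each get a positive weight because they distinguish one review from the other.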
A variety of machine learning models were tested to classify the reviews, with logistic regression, SVM, and decision tree standing out across different metrics: