Sentiment Analysis and Customer Churn Prediction #17

Open · wants to merge 2 commits into `main`
105 changes: 105 additions & 0 deletions Customer Churn Prediction/README.md
# 📊 **Customer Churn Prediction for Telecom Industry** 📱

## Project Overview 🌟

Customer churn, the rate at which customers discontinue their subscriptions, is a critical metric in the telecom industry. By identifying high-risk customers early, telecom companies can focus their retention efforts and improve overall profitability. In this project, we explore a dataset to predict customer churn and propose strategies for improving customer retention.

## 🔍 **Problem Definition**

In the competitive telecom industry, customer churn is a significant challenge. Churn occurs when customers decide to leave a service, and the goal of this project is to predict which customers are most likely to churn using machine learning techniques. By accurately identifying churn risk, companies can focus their retention efforts on high-risk customers and improve overall customer satisfaction.

## 🧑‍💼 **Dataset Overview**

This project uses Kaggle's **Telco Customer Churn** dataset, which includes customer information, service usage, and subscription status. Key columns in the dataset include:

- **Customer ID**: Unique identifier for each customer.
- **Gender**: Gender of the customer (Male/Female).
- **Age**: Age of the customer.
- **Service Type**: The type of telecom service the customer subscribes to (e.g., phone service, internet service).
- **Churn**: Target variable indicating whether the customer left (encoded as 1 = Churn, 0 = No Churn).

You can download the dataset from Kaggle [here](https://www.kaggle.com/datasets/blastchar/telco-customer-churn).
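After downloading the CSV, the overall churn rate can be read straight off the `Churn` column. The snippet below is a minimal sketch: the hand-made DataFrame stands in for the real file, which in practice would be loaded with `pd.read_csv("telco-customer-churn.csv")`.

```python
import pandas as pd

# Illustrative stand-in mirroring a few key columns of the Telco dataset;
# in practice: df = pd.read_csv("telco-customer-churn.csv")
df = pd.DataFrame({
    "customerID": ["0001", "0002", "0003", "0004"],
    "gender": ["Female", "Male", "Female", "Male"],
    "Churn": ["Yes", "No", "Yes", "No"],
})

# Churn rate = share of customers whose subscription ended
churn_rate = (df["Churn"] == "Yes").mean()
print(f"Churn rate: {churn_rate:.0%}")  # Churn rate: 50%
```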

## 🎯 **Project Objectives**

The main objectives of this project are:

1. **Exploration & Analysis**:
- What percentage of customers churn vs. stay with the service? 📊
- Are there patterns in churn based on gender? 👨‍🦰👩‍🦱
- Are certain service types more likely to lead to churn? 📞
- Which services generate the most profit? 💸
- What features are most predictive of customer churn? 🧠

2. **Modeling & Prediction**:
- Train several machine learning models to predict customer churn 🤖
- Evaluate models using the ROC-AUC curve 📈
- Compare models like Logistic Regression, Decision Trees, Random Forest, etc.

3. **Customer Retention Strategy**:
- Suggest strategies for retaining high-risk customers 🔒
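The modeling objective above can be sketched as a train/evaluate loop over the candidate models. This is an illustrative outline only: synthetic data stands in for the preprocessed churn features, and the scores it prints are not the project's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed churn feature matrix
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # predicted churn probability
    print(f"{name}: AUC = {roc_auc_score(y_test, proba):.3f}")
```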

## ⚙️ **How to Run the Project**

1. Clone this repository to your local machine:

```bash
git clone https://github.com/your-repo/customer-churn-prediction.git
```

2. Install the necessary dependencies:

```bash
pip install -r requirements.txt
```

3. Place the dataset (`telco-customer-churn.csv`) in the project directory.

4. Run the Jupyter notebook or Python script to start the analysis:

```bash
python churn_prediction.py
```

## 📊 **Key Results from the Analysis**

- **Churn Rate**:
Approximately **30%** of customers in the dataset have churned, which highlights the importance of retention strategies. 🚨

- **Churn by Gender**:
Gender analysis revealed that **women** were more likely to churn than men. This insight can be used to target retention efforts more effectively. 💡

- **Churn by Service Type**:
Customers using **mobile data services** had the highest churn rate, indicating a potential area for service improvement. 📱

- **Model Performance**:
The models were evaluated using the **ROC-AUC curve**, which assesses the ability of the model to distinguish between churn and non-churn customers.

**Top Models** (AUC Score):

- Random Forest Classifier: **0.85** 🔥
- Logistic Regression: **0.82** 🎯
- Decision Tree Classifier: **0.80** 📉

The **Random Forest Classifier** performed the best, achieving an AUC score of **0.85**, making it the most effective model for predicting customer churn. 📈

## 📈 **Key Metrics**

- **Accuracy**: The proportion of customers whose churn status the model predicted correctly.
- **ROC-AUC Score**: Measures how well the model can distinguish between churned and retained customers. The higher the AUC, the better the model’s performance.
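To make these two metrics concrete, here is a tiny hand-worked example; the labels and scores below are made up for illustration, not taken from the project's results.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hand-made example: true churn labels and model scores (probability of churn)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a random churner is ranked above a random
# non-churner: 3 of the 4 (non-churner, churner) pairs are ordered correctly
print(roc_auc_score(y_true, y_score))  # 0.75

# Accuracy needs hard labels, e.g. thresholding the scores at 0.5
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
print(accuracy_score(y_true, y_pred))  # 0.75
```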

## 🏆 **Conclusion**

By accurately predicting which customers are at risk of churning, telecom companies can take proactive steps to retain those customers and reduce churn. The **Random Forest Classifier** emerged as the top-performing model for this task, with a high AUC score of **0.85**.

## 💡 **Recommendations**

1. **Improve Customer Service**: Focus on enhancing service quality for high-risk customers to prevent churn. 📞
2. **Personalized Offers**: Provide customized offers and promotions for customers at risk of leaving. 🎁
3. **Proactive Engagement**: Survey churned customers to understand their reasons for leaving and prevent future churn. 📝

## 🚀 **Future Improvements**

- **Feature Engineering**: Adding new features such as customer satisfaction scores, social media interactions, etc., could improve model performance. ✨
- **Hyperparameter Tuning**: Fine-tuning the models could further increase prediction accuracy. 🔧
- **Model Deployment**: Deploy the final model in a real-time environment to predict churn as new data arrives. 🌍
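The hyperparameter-tuning idea above can be sketched with scikit-learn's `GridSearchCV`. The grid, synthetic data, and scoring choice below are illustrative assumptions, not the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Small illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [100, 200], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # match the project's evaluation metric
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```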
1 change: 1 addition & 0 deletions Customer Churn Prediction/customer-churn-prediction.ipynb

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions Customer Churn Prediction/requirements.txt
numpy
pandas
missingno
matplotlib
seaborn
plotly
72 changes: 72 additions & 0 deletions Sentiment Analysis/README.md
# Sentiment Analysis with Machine Learning Models

This project performs sentiment analysis on a text dataset, training multiple machine learning models to classify text as negative, neutral, or positive sentiment. Using Python and popular NLP and machine learning libraries, this project involves preprocessing text data, vectorizing it, training models, and evaluating them based on accuracy, confusion matrices, and classification reports.

## Project Structure

- `data/`: Contains the dataset (e.g., `train.csv`) with text samples and sentiment labels.
- `notebooks/`: Includes the Jupyter Notebook with the code for preprocessing, training, evaluation, and visualization.
- `README.md`: This file, detailing the project setup and steps.
- `results/`: Contains model evaluation outputs, such as confusion matrices and comparison plots.

## Dataset

The dataset used in this project is a text dataset with sentiment labels. Each row in the dataset includes:
- `textID`: Unique identifier for each sample
- `text`: The text content (tweet, comment, or sentence)
- `selected_text`: A part of the text that may indicate sentiment
- `sentiment`: Target sentiment label (negative, neutral, or positive)

## Requirements

- Python 3.x
- Jupyter Notebook
- Required libraries: `nltk`, `pandas`, `numpy`, `scikit-learn`, `seaborn`, `matplotlib`, `wordcloud`, `textblob`

You can install the dependencies with:
```bash
pip install nltk pandas numpy scikit-learn seaborn matplotlib wordcloud textblob
```

## Project Workflow

1. **Data Loading and Preprocessing**
- Load the dataset and handle missing values.
   - Tokenize and clean the text data, removing stopwords and punctuation.
- Encode sentiment labels using ordinal encoding for machine learning compatibility.

2. **Text Vectorization**
- Transform the cleaned text data into numerical vectors using `TfidfVectorizer` for feature extraction.

3. **Model Training and Evaluation**
- Train several machine learning models for sentiment classification:
- Naive Bayes
- Logistic Regression
- Support Vector Machine (SVM)
- Random Forest
- Evaluate each model on the test set, calculating accuracy and generating a classification report.
- Plot confusion matrices to visualize each model's performance in predicting each sentiment category.

4. **Results and Visualization**
- Visualize and compare model performance using bar plots of accuracy scores.
- Display confusion matrices for each model to examine misclassifications.
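Steps 2 and 3 of the workflow can be sketched end-to-end with a scikit-learn pipeline. The tiny corpus below is a made-up stand-in for the real `train.csv`, and logistic regression stands in for the full set of models compared in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up illustrative corpus; the real project uses the text/sentiment
# columns of train.csv
texts = [
    "I love this, absolutely great",
    "This is terrible, I hate it",
    "It is okay, nothing special",
    "Fantastic experience, would recommend",
    "Awful service, very disappointed",
    "It arrived on time, as expected",
]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# TF-IDF vectorization + classifier chained into one estimator
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["what a great experience"]))
```

The same fitted pipeline can then be scored with `accuracy_score` and `classification_report` on a held-out split, as described above.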

## Running the Code

To run the code, open the Jupyter Notebook in the `notebooks/` directory and follow these steps:
1. Run each cell sequentially to load, preprocess, vectorize, and train models.
2. View evaluation metrics and model performance comparisons in the output cells.

## Results

- **Accuracy Comparison**: Displays a bar plot comparing the accuracy of each model.
![Model Accuracy Comparison](model_comparison.png)

- **Confusion Matrices**: Provides insight into model performance across each sentiment class.
![Confusion Matrix for Random Forest](ConfusionmatrixRandomforest.png)

- **Classification Reports**: Summarize precision, recall, and F1-score for each sentiment label.

## Conclusion

This project demonstrates text classification for sentiment analysis using several machine learning models. The comparison helps in understanding which models perform best on specific types of sentiment data. Future improvements could include using more advanced NLP techniques, such as word embeddings or deep learning models.
Binary file added Sentiment Analysis/model_comparison.png
8 changes: 8 additions & 0 deletions Sentiment Analysis/requirements.txt
nltk
pandas
numpy
scikit-learn
seaborn
matplotlib
wordcloud
textblob