This project predicts customer churn using a Random Forest classifier. The dataset is a typical customer churn dataset, with features covering customer demographics, account information, and service usage.
Churn prediction helps businesses identify customers who are likely to leave. With those predictions, companies can take proactive retention measures and improve customer satisfaction.
This project includes:
- Data loading and cleaning
- Feature engineering
- Data preparation
- Model training using Random Forest with hyperparameter tuning
- Model evaluation
The dataset used in this project should be in CSV format and include the following columns:

- `customerID`: Unique identifier for each customer
- `gender`, `SeniorCitizen`, `Partner`, `Dependents`, `tenure`, `PhoneService`, `MultipleLines`, `InternetService`, `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV`, `StreamingMovies`, `Contract`, `PaperlessBilling`, `PaymentMethod`, `MonthlyCharges`, `TotalCharges`, `Churn`

The target variable is `Churn`, which indicates whether the customer has churned (Yes) or not (No).
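As a quick reference, loading and sanity-checking such a file with pandas might look like the following minimal sketch. The `Customer-Churn.csv` file name comes from the setup steps below; the `TotalCharges` coercion is an assumption about how the cleaning step handles text-typed numbers, not a documented behavior of `churn.py`:

```python
import pandas as pd

# Expected schema, taken from the column list above.
EXPECTED_COLUMNS = [
    "customerID", "gender", "SeniorCitizen", "Partner", "Dependents",
    "tenure", "PhoneService", "MultipleLines", "InternetService",
    "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport",
    "StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling",
    "PaymentMethod", "MonthlyCharges", "TotalCharges", "Churn",
]

df = pd.read_csv("Customer-Churn.csv")

# Fail fast if the file does not match the expected schema.
missing = set(EXPECTED_COLUMNS) - set(df.columns)
if missing:
    raise ValueError(f"Missing columns: {sorted(missing)}")

# Assumption: TotalCharges may arrive as text in some exports,
# so coerce it to numeric before any downstream processing.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
```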
- Clone the repository:

  ```bash
  git clone https://github.com/gabrieltonyy/customer-churn-prediction.git
  cd customer-churn-prediction
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Ensure your data file is named `Customer-Churn.csv` and is placed in the root directory of the project.

- Run the script:

  ```bash
  python churn.py
  ```
The model is trained using a Random Forest classifier. The script performs the following steps:
- Load and Clean Data: Loads the CSV file and handles missing values and duplicates.
- Feature Engineering: Currently, no additional features are created.
- Prepare Data: Encodes categorical features, scales numerical features, and splits the data into training and testing sets (see the sketch after this list).
- Train Model: Uses SMOTE to handle class imbalance and GridSearchCV for hyperparameter tuning.
- Evaluate Model: Evaluates the model using accuracy, classification report, confusion matrix, and ROC AUC score.
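A minimal sketch of the data-preparation step, assuming one-hot encoding via `pandas.get_dummies` and standard scaling of the numeric columns; `churn.py` may use different encoders or column choices:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Map the target to 0/1 and drop the identifier and target from the features.
y = df["Churn"].map({"Yes": 1, "No": 0})
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]), drop_first=True)

# Stratify so both splits keep roughly the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on training data only to avoid test-set leakage.
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```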
The best hyperparameters and evaluation metrics are logged during the process.
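Continuing the sketch, the training and evaluation steps might look roughly like this. The parameter grid is illustrative, not the one in `churn.py`; `SMOTE` is placed inside an `imblearn` `Pipeline` so it resamples only the training folds during cross-validation:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, roc_auc_score,
)

# SMOTE inside an imblearn Pipeline resamples only the training folds
# during cross-validation; validation folds stay untouched.
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Illustrative grid; churn.py may tune different parameters or values.
param_grid = {
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [None, 10, 20],
    "rf__min_samples_leaf": [1, 5],
}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Evaluate the tuned model on the held-out test set.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```

Keeping the sampler inside the pipeline avoids leaking synthetic samples into the validation folds, which would otherwise inflate cross-validation scores.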
The script uses Python's built-in `logging` module to log information at various stages of the pipeline. Logs include:
- Data loading status
- Data cleaning steps
- Feature engineering steps
- Data preparation steps
- Model training status and best parameters
- Model evaluation metrics
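For reference, a typical `logging` setup for a pipeline like this (the exact format, level, and messages in `churn.py` may differ):

```python
import logging

# Configure once at program start; every logger in the script then
# inherits this format and severity threshold.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)

# Example messages mirroring the stages listed above.
logger.info("Loading data from Customer-Churn.csv")
logger.info("Preparing data: encoding, scaling, train/test split")
logger.info("Training model with SMOTE + GridSearchCV")
```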