This project predicts customer risk in a banking environment using machine learning models. The primary challenge was class imbalance: the minority class represents customers with a higher risk of default. The goal was to reach 80% recall on that class while keeping precision and accuracy at acceptable levels.
- Goal: Achieve 80% recall on high-risk customers.
- Dataset: Imbalanced banking dataset.
- Models Used: CART, Random Forest (RF), Gradient Boosting Machine (GBM), LightGBM, and BalancedRandomClassifier.
- Evaluation Metrics: Accuracy, Precision, Recall, F1-Score.
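
As a reference for how the four metrics relate, here is a minimal sketch that computes them from binary labels using only the standard library. The function name `classification_metrics` and the toy labels are made up for illustration; they are not taken from the project notebook.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall is the share of truly high-risk customers that the model catches.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (illustrative only): 1 = high risk, 0 = low risk.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Recall is the metric tied to the project goal because a missed high-risk customer (a false negative) is the costly error in this setting.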
Data Preprocessing:
- Handled missing data.
- Performed feature scaling and encoding for categorical variables.
- Addressed the class imbalance with undersampling techniques (RandomUnderSampler and TomekLinks).
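
The preprocessing steps above could look roughly like the following scikit-learn sketch. The column names, the toy values, and the imputation strategies (median, most frequent) are assumptions for illustration; the notebook may handle them differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns; the real dataset's column names are not listed here.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 28],
    "balance": [1200.0, np.nan, 560.0, 8900.0],
    "occupation": ["clerk", np.nan, "engineer", "clerk"],
})

numeric_cols = ["age", "balance"]
categorical_cols = ["occupation"]

# Numeric features: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

X = preprocessor.fit_transform(df)
print(X.shape)
```

Wrapping the steps in a `ColumnTransformer` keeps imputation and scaling statistics tied to the training data, which helps avoid leakage when the same object is later applied to a test split.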
Imbalanced Data Handling:
- Implemented RandomUnderSampler to reduce majority class size.
- Used TomekLinks to remove overlapping data points and further refine the dataset.
Modeling:
- Tried several machine learning models: CART, RF, GBM, and LightGBM.
- The most effective model for handling imbalance was BalancedRandomClassifier.
Hyperparameter Optimization:
- Applied hyperparameter tuning to the BalancedRandomClassifier using grid search to optimize performance.
Model Evaluation:
- The BalancedRandomClassifier provided the best results, reaching the 80% recall target with 72% precision, 74% accuracy, and a 76% F1-score.
The dataset used in this project can be downloaded from Kaggle:
Credit Card Approval Prediction Dataset
- Customer Demographics: Age, gender, occupation, etc.
- Financial Indicators: Credit history, balance, transaction patterns.
- Target Variable: Customer risk level.
For privacy reasons, the dataset is not included in this repository.
- Best Model: BalancedRandomClassifier
- Final Performance Metrics:
  - Accuracy: 74%
  - Precision: 72%
  - Recall: 80%
  - F1-Score: 76%
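
As a quick consistency check, the reported F1-score follows from the reported precision and recall, since F1 is their harmonic mean:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot 0.72 \cdot 0.80}{0.72 + 0.80} = \frac{1.152}{1.52} \approx 0.758 \approx 76\%
$$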
Install the necessary dependencies by running:
pip install -r requirements.txt
- Clone the repository:
git clone https://github.com/aysecnkci/banking-risk-analysis-imbalanced-data.git
- Run the Jupyter notebook to preprocess the data, train the model, and evaluate it:
jupyter notebook notebooks/risk_analysis_banking_imbalanced.ipynb
├── README.md
├── requirements.txt
├── notebooks/
│ └── risk_analysis_banking_imbalanced.ipynb
- Experiment with deep learning models to improve recall.
- Further tune hyperparameters to explore better performance.
This project is licensed under the MIT License - see the LICENSE file for details.
Thanks to the contributors and the machine learning community for resources and support.