Credit cards are widely used for online purchases and payments, making them a convenient tool for managing personal finances. However, this convenience comes with a risk: credit card fraud. Fraudulent activities can cause significant financial loss to both customers and financial institutions. This project, titled FindDefault, focuses on predicting fraudulent credit card transactions. The primary goal is to build a robust classification model that accurately distinguishes between legitimate and fraudulent transactions, thus helping credit card companies minimize losses and protect their customers.
The dataset used in this project contains credit card transactions made by European cardholders in September 2013. The dataset includes transactions from a two-day period, totaling 284,807 transactions, of which 492 are fraudulent. This results in a highly imbalanced dataset, with fraudulent transactions making up only 0.172% of all transactions.
Key Dataset Characteristics:
- Total Transactions: 284,807
- Fraudulent Transactions: 492
- Class Imbalance: Fraudulent transactions account for 0.172% of the data.
The dataset contains various features, such as transaction amount, timestamp, and anonymized numerical features, which are derived from the original data to protect sensitive information.
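As a concrete starting point, a minimal loading-and-inspection sketch is shown below; it assumes the transactions are stored in a creditcard.csv file with the fraud label in a Class column (0 = legitimate, 1 = fraudulent), which matches the public version of this dataset.

```python
import pandas as pd

# Load the transactions (file name and column names are assumptions)
df = pd.read_csv('creditcard.csv')

# Inspect shape and first rows
print(df.shape)   # expected: (284807, 31)
print(df.head())

# Quantify the class imbalance
class_counts = df['Class'].value_counts()
print(class_counts)
print(f"Fraudulent share: {class_counts[1] / len(df):.3%}")   # roughly 0.172%
```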
The project is structured into the following key phases:
Exploratory Data Analysis (EDA)
- Objective: Gain insights into the data and identify patterns, relationships, and trends (a minimal EDA sketch follows the steps below).
- Steps:
- Load the dataset and display the first few rows to understand its structure.
- Generate summary statistics to observe central tendencies, variability, and distribution of data.
- Visualize data distributions, correlations, and relationships using histograms, box plots, scatter plots, and heatmaps.
- Identify any significant patterns or anomalies, such as correlations between features and the target variable (fraud).
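A minimal EDA sketch along these lines, assuming the df DataFrame and the Class and Amount columns from the loading snippet above, could look like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics: central tendency, variability, distribution
print(df.describe())

# Transaction amount distributions, split by class
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df.loc[df['Class'] == 0, 'Amount'], bins=50, ax=axes[0])
axes[0].set_title('Legitimate transaction amounts')
sns.histplot(df.loc[df['Class'] == 1, 'Amount'], bins=50, ax=axes[1])
axes[1].set_title('Fraudulent transaction amounts')
plt.tight_layout()
plt.show()

# Correlation heatmap to spot relationships with the target
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), cmap='coolwarm', center=0)
plt.title('Feature correlation heatmap')
plt.show()
```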
Data Cleaning and Preprocessing
- Objective: Prepare the dataset for modeling by addressing issues such as missing values, outliers, and data inconsistencies (a short preprocessing sketch follows the steps below).
- Steps:
- Check for missing values and handle them if necessary (e.g., imputation or removal).
- Identify and address outliers that could distort model performance.
- Standardize or normalize numerical features to ensure all features contribute equally to the model.
- Encode categorical variables, if any, using techniques like one-hot encoding or label encoding.
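The following is a hedged preprocessing sketch; it assumes the Amount and Time columns of the public dataset are the only unscaled features and uses scikit-learn's StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

# Check for missing values across the whole dataset
print('Missing values:', df.isnull().sum().sum())

# Standardize the raw numerical columns so all features contribute on a comparable scale
scaler = StandardScaler()
df[['Amount', 'Time']] = scaler.fit_transform(df[['Amount', 'Time']])

# Separate features and target for the modeling steps that follow
X = df.drop(columns=['Class'])
y = df['Class']
```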
Handling Imbalanced Data
- Objective: Address the class imbalance so the model is not biased towards the majority class (a resampling sketch follows the approaches below).
- Approaches:
- Undersampling: Randomly reduce the number of non-fraudulent transactions to match the count of fraudulent transactions.
- Oversampling: Increase the number of fraudulent transactions to match the count of non-fraudulent transactions using techniques like Random Oversampling.
- SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic samples of the minority class to balance the dataset.
- ADASYN (Adaptive Synthetic Sampling): Similar to SMOTE, but adaptively generates more synthetic samples for minority-class instances that are harder to learn, i.e., those surrounded mostly by majority-class samples.
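A minimal resampling sketch using the imbalanced-learn package is shown below; it assumes X_train and y_train come from the train/test split described in the Model Training phase, and the same fit_resample pattern applies to RandomUnderSampler and RandomOverSampler:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE, ADASYN

# Oversample the minority (fraud) class in the training data only
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
print('After SMOTE :', Counter(y_train_sm))

# ADASYN variant: adaptively generates more samples for harder minority points
adasyn = ADASYN(random_state=42)
X_train_ada, y_train_ada = adasyn.fit_resample(X_train, y_train)
print('After ADASYN:', Counter(y_train_ada))
```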
Feature Engineering
- Objective: Enhance model performance by creating new features or transforming existing ones (an illustrative sketch follows the steps below).
- Steps:
- Create new features that capture important patterns or relationships not directly available in the original dataset.
- Apply dimensionality reduction techniques like PCA if necessary to reduce feature space and improve model efficiency.
- Transform features to reduce skewness, normalize distributions, or handle categorical data.
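As an illustration only, the sketch below shows one engineered feature (a log transform of the raw, unscaled Amount column to reduce skewness) and an optional PCA projection of the feature matrix X; the feature name and component count are assumptions, not the project's actual choices:

```python
import numpy as np
from sklearn.decomposition import PCA

# Example engineered feature: log-transform the heavy-tailed transaction amount
# (run on the raw Amount values, before standardization)
df['log_amount'] = np.log1p(df['Amount'])

# Optional dimensionality reduction to streamline the feature space
pca = PCA(n_components=10, random_state=42)
X_reduced = pca.fit_transform(X)
print('Variance retained:', pca.explained_variance_ratio_.sum())
```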
Model Selection
- Objective: Select appropriate machine learning algorithms for the classification task (a short training-and-comparison sketch follows the model descriptions).
- Considered Models:
1. Logistic Regression
Explanation: Logistic Regression is a statistical model used for binary classification. It predicts the probability that a given input belongs to a specific class (fraudulent or legitimate) by applying the logistic function to a linear combination of the input features. The output is a probability between 0 and 1, which is then thresholded to make a binary decision.
Usage: In the FindDefault project, Logistic Regression was selected due to its simplicity and interpretability. It provided clear insights into how features influence the prediction of fraudulent transactions. The model was trained using SMOTE to handle class imbalance, resulting in high performance with an ROC-AUC score of 0.99 on the training set and 0.97 on the test set. Its lower computational requirements made it suitable for real-time deployment.
Benefits:
- Interpretability: The model is highly interpretable, providing insights into the impact of each feature on the likelihood of fraud.
- Efficiency: Logistic Regression is computationally efficient and works well with large datasets, making it a good starting point for baseline comparisons.
- Probability Estimates: It outputs probabilities, enabling threshold adjustments for different risk levels.
2. XGBoost
Explanation: XGBoost (Extreme Gradient Boosting) is an ensemble learning method that builds decision trees sequentially, with each tree correcting the errors of the previous ones, resulting in a strong predictive model. XGBoost is known for its speed, accuracy, and ability to handle large, high-dimensional datasets.
Usage: XGBoost was considered in the FindDefault project as one of the potential models due to its ability to capture complex relationships in the data. It performed well during evaluation but was not chosen as the final model due to higher computational costs and the added complexity in interpretation and deployment. However, it remains a powerful option for scenarios where slight improvements in prediction accuracy are crucial, and resources are available to support its deployment.
Benefits:
- Performance: XGBoost delivers high predictive performance and is often a top choice in machine learning competitions.
- Handling Missing Data: It handles missing values automatically, simplifying the preprocessing stage.
- Customizability: The model offers extensive hyperparameter tuning options, allowing fine-grained control over its behavior.
3. Decision Tree
Explanation: A Decision Tree is a non-parametric model that splits the dataset into subsets based on feature values, forming a tree-like structure. Each node represents a feature, each branch a decision rule, and each leaf a classification. Decision Trees are easy to understand and interpret but may overfit if not properly pruned.
Usage: In the FindDefault project, Decision Trees were evaluated as a model choice due to their ability to model complex patterns in the data. They provided valuable insights during the exploratory phase but were ultimately not selected as the final model. While Decision Trees offer interpretability, they can be prone to overfitting, especially in high-dimensional datasets. Nevertheless, they served as a useful comparison against more complex models like XGBoost and simpler ones like Logistic Regression.
Benefits:
- Intuitive Visualization: The tree structure is easy to visualize and explain to non-technical stakeholders.
- Feature Importance: It provides a clear ranking of feature importance, helping with feature selection and showing which factors contribute most to fraud detection.
- Flexibility: The model can handle both numerical and categorical data, making it versatile.
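To make the comparison concrete, here is a hedged sketch that trains the three candidates and illustrates the threshold adjustment mentioned for Logistic Regression; it assumes the SMOTE-balanced training data and the held-out test split from the other phases, and the model settings are illustrative rather than the project's tuned values:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=6, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=200, eval_metric='logloss', random_state=42),
}

# Train each candidate on the balanced data and compare ROC-AUC on the untouched test set
for name, model in candidates.items():
    model.fit(X_train_sm, y_train_sm)
    probs = model.predict_proba(X_test)[:, 1]
    print(f'{name}: test ROC-AUC = {roc_auc_score(y_test, probs):.3f}')

# Logistic Regression outputs probabilities, so the decision threshold can be tuned,
# e.g. flag anything above 0.3 instead of the default 0.5 to favour recall
log_reg = candidates['Logistic Regression']
fraud_probs = log_reg.predict_proba(X_test)[:, 1]
flagged = (fraud_probs >= 0.3).astype(int)
print('Flagged at 0.3 threshold:', flagged.sum())
```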
Model Training and Validation
- Objective: Train the model on the training dataset and validate its performance on held-out data.
- Steps:
- Split the dataset into training and testing sets to evaluate the model’s performance on unseen data.
- Train the models using the training set and perform cross-validation to assess the robustness of the model.
- Tune hyperparameters using techniques like GridSearchCV to find the optimal model parameters.
Hyperparameter Tuning
To enhance model performance, hyperparameters were tuned using GridSearchCV. This technique systematically explores a specified range of hyperparameters for each model to identify the combination that maximizes the chosen performance metric. GridSearchCV evaluates each parameter configuration with cross-validation, ensuring the selected hyperparameters provide the best trade-off between model accuracy and generalization. A minimal sketch is shown below.
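For example, a minimal GridSearchCV sketch for the Logistic Regression candidate could look like the following; the parameter grid is illustrative, not the exact grid used in the project:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid; the grids actually used per model may differ
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'liblinear'],
}

grid_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring='roc_auc',   # optimize for class separation rather than raw accuracy
    cv=5,
    n_jobs=-1,
)
grid_search.fit(X_train_sm, y_train_sm)

print('Best parameters:', grid_search.best_params_)
print('Best cross-validated ROC-AUC:', grid_search.best_score_)
```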
Choosing the Best Model
After tuning hyperparameters, the performance of the candidate models was assessed, and the Logistic Regression model emerged as the top performer. This model, trained on a dataset balanced using SMOTE, demonstrated outstanding performance with an ROC-AUC score of 0.99 on the training set and 0.97 on the test set. These high scores indicate the model's excellent ability to differentiate between fraudulent and legitimate transactions.
The decision to select Logistic Regression was based on several factors:
- Simplicity: The model's straightforward nature makes it easy to interpret and understand.
- Ease of Interpretation: Logistic Regression provides clear insights into the influence of each feature on the prediction, aiding interpretability.
- Lower Computational Resource Requirements: Compared to more complex models like XGBoost, Logistic Regression is computationally efficient, making it suitable for deployment in real-time systems.

By combining hyperparameter tuning with the selected optimal model, the project ensures that the final Logistic Regression model is both highly accurate and practical for real-world application.
Model Evaluation
- Objective: Evaluate model performance using various metrics and choose the best-performing model (an evaluation sketch follows the metrics list).
- Performance Metrics:
- Accuracy: The proportion of correctly classified transactions.
- Precision: The proportion of true positive predictions among all positive predictions.
- Recall: The proportion of actual fraudulent transactions that were correctly identified.
- F1-Score: The harmonic mean of precision and recall, providing a balance between the two.
- ROC-AUC: The area under the receiver operating characteristic curve, measuring the model’s ability to distinguish between classes.
- Visualization: Use a confusion matrix to visualize the model's performance and understand the types of errors it makes.
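A hedged evaluation sketch covering these metrics, assuming the fitted log_reg model and the X_test/y_test split from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, ConfusionMatrixDisplay,
)

y_pred = log_reg.predict(X_test)
y_prob = log_reg.predict_proba(X_test)[:, 1]

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1-Score :', f1_score(y_test, y_pred))
print('ROC-AUC  :', roc_auc_score(y_test, y_prob))

# Confusion matrix shows the mix of false positives and false negatives
ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
plt.show()
```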
Model Deployment
- Objective: Deploy the best-performing model in a production environment for real-time fraud detection.
- Steps:
- Serialize the model using the pickle module to save it for future use.
- Example code snippet to save the model:
import pickle

# Serialize the trained model so it can be reloaded later
with open('load_best_model.pkl', 'wb') as file:
    pickle.dump(load_best_model, file)
- Deploy the model to a production environment where it can be used to make real-time predictions on new transactions.
Results and Conclusions
- Summary of Findings: Present key insights from the EDA phase, such as patterns or correlations that significantly impact the prediction of fraud.
- Feature Importance: Identify and discuss the most important features that influence the model’s decisions.
- Model Performance: Report the final model’s performance on the test set, including comparisons with baseline models. Highlight the model’s strengths and any potential limitations.
- Final Model: The Logistic Regression model with SMOTE balancing was selected for its excellent performance, achieving an ROC-AUC score of 0.99 on the train set and 0.97 on the test set.
Streamlit Integration
Streamlit is used to create an interactive web application for real-time fraud detection. The Streamlit app provides a user-friendly interface to input transaction data and receive predictions.
Setting Up the Streamlit App
- Create a streamlit_app.py file for the application interface.
- Implement functionality to input transaction data and display predictions.
Streamlit Application Code

import streamlit as st
import pickle
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the serialized model
with open('load_best_model.pkl', 'rb') as file:
    model = pickle.load(file)

st.title('Credit Card Fraud Detection')
st.write('Enter the transaction details below:')

# Collect transaction details from the user
transaction_amount = st.number_input('Transaction Amount', min_value=0.0, step=0.01)

data = {
    'amount': transaction_amount,
    # Add other features
}
input_df = pd.DataFrame([data])

if st.button('Predict'):
    # Standardize or preprocess the input data if needed
    prediction = model.predict(input_df)
    st.write('Prediction:', 'Fraudulent' if prediction[0] == 1 else 'Legitimate')
Running the Streamlit App
Install Streamlit if not already installed:
pip install streamlit
Run the Streamlit app:
streamlit run streamlit_app.py
Evaluating and Selecting the Best Model for Balanced Data
We balanced the data using various techniques, including Undersampling, Oversampling, SMOTE, and ADASYN, and built several models, including Logistic Regression, XGBoost, and Decision Tree.
All models showed good performance to varying extents. However, since Undersampling resulted in some loss of information, it is preferable to exclude these models from consideration.
The models trained with SMOTE and ADASYN performed well, but among them, the Logistic Regression model stood out. It achieved an impressive ROC score of 0.99 on the training set and 0.97 on the test set. This high ROC score indicates that Logistic Regression, enhanced with SMOTE, effectively distinguishes between classes.
Given its simplicity, ease of interpretation, and lower resource requirements compared to more complex models like Random Forest and XGBoost, the Logistic Regression model with SMOTE is deemed the best choice. Its efficiency and performance make it suitable for practical applications where both accuracy and resource constraints are critical.
Comprehensive Cost-Benefit Analysis
While most models performed well in terms of ROC score, Precision, and Recall, selecting the best model requires careful consideration of several factors, including infrastructure, resources, and computational power. Complex models like Random Forest, SVM, and XGBoost demand significant computational resources, leading to increased deployment costs. These models also come with disadvantages such as difficulty in interpretation and tuning, which adds to the complexity of model management and maintenance.
In contrast, simpler models like Logistic Regression are more cost-effective to build and deploy due to their lower computational requirements. Logistic Regression offers easier interpretation and implementation, which is crucial for understanding model decisions and ensuring regulatory compliance.
When evaluating the trade-offs between model complexity and performance, the financial implications of minor changes in the ROC score must be considered. If the monetary impact of a slight improvement in the ROC score is substantial, investing in a complex model might be justified despite higher costs. However, if the gains are marginal, a simpler, more cost-effective model like Logistic Regression is preferable due to its lower resource requirements and ease of use.
Business Summary
For banks with smaller average transaction values, high precision is essential: we want only genuinely suspicious transactions to be flagged as fraudulent. Each flagged transaction can be verified by calling the customer, adding a human element to the verification process. When precision is low, this task becomes burdensome because of the increased need for human intervention.
Conversely, for banks handling larger transaction values, low recall is a significant concern as it means the model fails to detect some fraudulent transactions. Missing high-value fraudulent transactions can lead to substantial financial losses.
To safeguard against high-value fraudulent transactions, focusing on high recall is crucial for detecting actual fraudulent activities.
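One way to operationalize this trade-off is to pick the decision threshold from the precision-recall curve so that a required recall level is guaranteed; the sketch below assumes y_test and the y_prob fraud probabilities from the evaluation sketch above, and the 90% recall target is purely illustrative:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Choose the strictest threshold that still achieves the target recall
target_recall = 0.90
eligible = np.where(recall[:-1] >= target_recall)[0]
best_idx = eligible[-1]   # recall decreases as the threshold rises, so take the last eligible index
print(f'Threshold {thresholds[best_idx]:.3f} -> '
      f'precision {precision[best_idx]:.3f}, recall {recall[best_idx]:.3f}')
```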
After evaluating several models, we observed that with a balanced dataset using the SMOTE technique, the Logistic Regression model performed exceptionally well, achieving a good ROC score and high recall. This model is not only effective but also easy to interpret and explain to business stakeholders.
Therefore, we recommend the Logistic Regression model with SMOTE for its excellent balance of performance and simplicity, ensuring accurate detection of high-value fraudulent transactions while keeping operational costs low.
Project Summary: The FindDefault project aimed to develop a reliable credit card fraud detection model, addressing the critical challenge of identifying fraudulent transactions among a vast number of legitimate ones. The project followed a systematic approach, ensuring that each phase contributed to building a robust and accurate predictive model.
- Exploratory Data Analysis (EDA): The project began with a thorough exploration of the dataset, uncovering key insights into the distribution and characteristics of the data. This step was crucial for identifying potential challenges, such as the significant class imbalance, and understanding the relationships between various features.
- Data Cleaning and Preprocessing: Data quality was ensured by addressing missing values, outliers, and inconsistencies in the dataset. Numerical features were standardized, and necessary transformations were applied to ensure the data was in an optimal state for modeling.
- Handling Imbalanced Data: Given the substantial imbalance between legitimate and fraudulent transactions, several techniques, including SMOTE and ADASYN, were employed to balance the dataset. This step was critical in preventing the model from becoming biased towards the majority class.
- Feature Engineering: New features were engineered to capture additional information that was not directly available in the original dataset. Dimensionality reduction techniques, such as PCA, were applied where necessary to streamline the feature space and enhance model performance.
- Model Selection and Training: Multiple machine learning models, including Logistic Regression, Decision Trees, and XGBoost, were considered and trained. Hyperparameter tuning was performed using GridSearchCV to identify the best model configurations, ensuring optimal performance.
- Model Evaluation: The models were rigorously evaluated using various metrics, including Accuracy, Precision, Recall, F1-Score, and ROC-AUC. The Logistic Regression model, balanced with SMOTE, was selected as the final model due to its superior performance, achieving a high ROC-AUC score on both the training and test datasets.
- Model Deployment: The final model was serialized using the pickle module, making it ready for deployment in a production environment. This deployment allows the model to make real-time predictions on new transactions, providing a valuable tool for detecting credit card fraud.
Documentation Benefits to the Company
The detailed documentation provided for the FindDefault project exemplifies a commitment to thoroughness and clarity, which directly benefits the company in several ways:
- Improved Collaboration: Comprehensive documentation ensures that all team members, regardless of their role, have a clear understanding of the project’s objectives, methodologies, and outcomes. This facilitates better collaboration and more efficient problem-solving.
- Streamlined Onboarding: New team members can quickly get up to speed by reviewing the documentation, reducing the time needed for onboarding and allowing them to contribute effectively from the outset.
- Enhanced Decision-Making: Clear documentation of model selection, evaluation criteria, and results allows stakeholders to make informed decisions based on the project’s findings. This transparency is crucial for building trust in the model's predictions and the overall project outcomes.
- Regulatory Compliance: In highly regulated industries like finance, having detailed documentation is essential for demonstrating compliance with industry standards and regulations. It provides a record of the methodologies used and the reasoning behind key decisions.
- Future-Proofing: Detailed documentation lays the foundation for future work, making it easier to revisit the project, iterate on the existing model, or integrate new technologies as they become available. This ensures that the project remains relevant and continues to provide value over time.
Future Work
- Integration of Real-Time Data Streams
  Description: Implementing a real-time data ingestion and processing pipeline to handle streaming credit card transaction data. This could involve integrating with tools like Apache Kafka or Spark Streaming to process transactions in real time, allowing for immediate fraud detection.
  Benefit: Enhances the system's capability to detect and respond to fraudulent transactions as they happen, reducing financial loss and increasing the system's practical value.
- Exploration of Deep Learning Models
  Description: Investigating the application of deep learning techniques such as Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks, which could better capture temporal patterns and dependencies in transaction data.
  Benefit: These models might uncover complex patterns in the data that traditional machine learning models cannot, potentially improving fraud detection accuracy.
- Incorporating Additional Data Sources
  Description: Expanding the dataset by incorporating external data sources, such as customer demographics, spending behavior patterns, or transaction metadata (e.g., device type, location data).
  Benefit: Additional contextual information could improve the model's ability to differentiate between fraudulent and legitimate transactions, especially in edge cases.
- Model Explainability and Transparency
  Description: Implementing techniques like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to make the model's decisions more interpretable to stakeholders.
  Benefit: Enhances trust and adoption of the model by providing insights into how and why certain predictions are made, which is particularly important in regulated industries.
- Deployment of an AutoML Framework
  Description: Exploring the use of Automated Machine Learning (AutoML) frameworks to automate model selection, hyperparameter tuning, and feature engineering. This could include tools like Google AutoML, H2O.ai, or Azure AutoML.
  Benefit: Reduces the time and effort required for model development, while potentially uncovering better-performing models through automated processes.
- Implementation of Anomaly Detection Techniques
  Description: Integrating unsupervised learning methods, such as Isolation Forests or Autoencoders, to detect anomalous transactions that may not have been labeled in the training data (see the sketch after this list).
  Benefit: Improves the detection of novel or rare types of fraud that the supervised models may not have been trained to recognize.
- Periodic Model Retraining and Monitoring
  Description: Setting up a system for continuous model monitoring and periodic retraining to adapt to changes in transaction patterns or emerging fraud tactics.
  Benefit: Ensures that the model remains accurate and effective over time, especially as fraudsters evolve their strategies.
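As a pointer for the anomaly-detection item above, here is a hedged sketch using scikit-learn's IsolationForest on the feature matrix X; the contamination value is an assumption, not a tuned setting:

```python
from sklearn.ensemble import IsolationForest

# Fit an unsupervised detector on the features alone (no fraud labels required)
iso = IsolationForest(contamination=0.002, random_state=42)
iso.fit(X)

# predict() returns -1 for transactions the model considers anomalous
anomaly_flags = iso.predict(X)
print('Flagged as anomalous:', (anomaly_flags == -1).sum())
```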