A machine learning solution for predicting insurance premiums using LightGBM. This project demonstrates end-to-end ML pipeline development with a focus on code quality and maintainability.
The pipeline covers feature engineering, model training, and evaluation. It features:
- Custom feature engineering for insurance data
- LightGBM model with optimized parameters
- 5-fold cross-validation
- RMSLE (Root Mean Squared Logarithmic Error) optimization
- Type-hinted implementation with comprehensive error handling
- Python: 3.11+
- Core Libraries:
  - lightgbm: Gradient boosting framework
  - pandas: Data manipulation
  - numpy: Numerical operations
  - scikit-learn: ML utilities
insurance-premium-predictor/
├── src/
│ ├── features/ # Feature engineering
│ │ └── feature_engineering.py
│ ├── models/ # Model implementations
│ │ └── models.py
│ ├── utils/ # Utility functions
│ │ └── metrics.py
│ └── train.py # Training pipeline
├── data/ # Data directory
├── submissions/ # Model predictions
└── requirements.txt # Project dependencies
- Automated categorical variable handling
- Domain-specific feature creation:
  - Income per dependent
  - Claims per year
  - Policy duration analysis
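
A minimal sketch of how these features might be derived with pandas. The column names (`annual_income`, `num_dependents`, `previous_claims`, `insurance_duration_years`, `policy_start_date`), the function name `add_features`, and the reference date are illustrative assumptions, not necessarily what `feature_engineering.py` uses.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame, reference_date: str = "2025-01-01") -> pd.DataFrame:
    """Return a copy of df with derived features (all column names are hypothetical)."""
    out = df.copy()

    # Income per dependent: add 1 so policyholders with zero dependents
    # do not cause a division by zero.
    out["income_per_dependent"] = out["annual_income"] / (out["num_dependents"] + 1)

    # Claims per year: a zero-length duration becomes NaN instead of inf.
    duration = out["insurance_duration_years"].replace(0, np.nan)
    out["claims_per_year"] = out["previous_claims"] / duration

    # Policy duration: days elapsed between the policy start and a fixed reference date.
    start = pd.to_datetime(out["policy_start_date"], errors="coerce")
    out["policy_age_days"] = (pd.Timestamp(reference_date) - start).dt.days
    out = out.drop(columns=["policy_start_date"])

    # Categorical handling: cast remaining text columns to the pandas
    # "category" dtype, which LightGBM can consume directly.
    for col in out.columns:
        is_text = pd.api.types.is_object_dtype(out[col]) or pd.api.types.is_string_dtype(out[col])
        if is_text:
            out[col] = out[col].astype("category")
    return out
```

Rows with a zero duration or an unparseable date end up with missing values here, which LightGBM handles natively; a different imputation strategy may suit other models better.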
- LightGBM with early stopping
- Optimized hyperparameters
- Cross-validation for robust evaluation
- Type hints throughout
- Comprehensive error handling
- Detailed logging
- Modular code structure
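
As an illustration of how type hints, error handling, and logging can combine in one place, here is a small validation helper. The function name `validate_training_frame` and the target column name `premium_amount` are assumptions made for this example and may not match the project's code.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def validate_training_frame(df: pd.DataFrame, target: str = "premium_amount") -> pd.DataFrame:
    """Check that df is usable for training and return a cleaned copy (hypothetical helper)."""
    if target not in df.columns:
        raise ValueError(f"Target column '{target}' not found in training data")

    # Rows without a target cannot be used for supervised training.
    n_missing = int(df[target].isna().sum())
    if n_missing:
        logger.warning("Dropping %d rows with a missing target", n_missing)
        df = df.dropna(subset=[target])

    # This check reflects the RMSLE convention of non-negative targets.
    if (df[target] < 0).any():
        raise ValueError("Target contains negative values, which RMSLE does not support")

    logger.info("Validated %d training rows", len(df))
    return df.copy()
```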
The model is evaluated using 5-fold cross-validation with RMSLE as the metric:
- Mean CV RMSLE: 1.1425
- Standard Deviation: +/- 0.0055
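
For reference, RMSLE is the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + target)`. The sketch below shows one way to compute it and to summarize per-fold scores. The function names are illustrative and may differ from those in `metrics.py`, and whether the project reports a population or sample standard deviation is not stated, so the population form used here is an assumption.

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root Mean Squared Logarithmic Error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("This sketch rejects negative values")
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def summarize(fold_scores) -> tuple[float, float]:
    """Return (mean, population standard deviation) of per-fold scores."""
    scores = np.asarray(fold_scores, dtype=float)
    return float(scores.mean()), float(scores.std())
```

Because the error is measured on a log scale, RMSLE penalizes relative rather than absolute differences, which suits a right-skewed target such as premium amounts.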
This project is licensed under the MIT License - see the LICENSE file for details.
- GitHub: @psukh28
- LinkedIn: surya-praanv-sukumaran
- Data source: Playground Series S4-E12
- Inspiration: Insurance premium prediction challenge
- Libraries: LightGBM, scikit-learn, pandas