This is the README file of the SDSC2102 Group Project
Name: Du Junye SID: 56641800
Name: Yang Wentao SID: 56643528
Name: Wu Jianrui SID: 56641885
Name: Zhou Xin SID: 56644501
Python version and package versions:
Note: Please make sure that the packages are installed correctly so that the program runs normally.
Python version and CUDA configuration:
- Python 3.9.7
- CUDA 11.0
Missing value handling and processing:
- missingno 0.5.1
- ast.literal_eval
Scientific calculation package:
- Numpy 1.21.5
- Pandas 1.3.4
- Scipy 1.8.0
Machine Learning tools:
- scikit-learn 1.0.2
- xgboost 1.5.2
- LightGBM 3.3.2
- catboost 1.0.5
- torch 1.10.1+cu113
- torchaudio 0.10.1+cu113
- torchvision 0.11.2+cu113
Visualization tools:
- Matplotlib 3.4.3
- Seaborn 0.11.2
- Plotly 5.5.0
- Cufflinks 0.17.3
Interpretation tools:
- SHAP, eli5
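To confirm that the installed packages match the versions listed above, a small check along the following lines may help. It is only a sketch: the distribution names (for example "lightgbm" and "scikit-learn") are assumed to be the names used by pip, only part of the list is shown, and the helper name check_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above. The keys are assumed to be the
# distribution names as installed by pip; extend the dict as needed.
EXPECTED = {
    "missingno": "0.5.1",
    "numpy": "1.21.5",
    "pandas": "1.3.4",
    "scipy": "1.8.0",
    "scikit-learn": "1.0.2",
    "xgboost": "1.5.2",
    "lightgbm": "3.3.2",
    "catboost": "1.0.5",
}


def check_versions(expected):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

If nothing is printed, every package in the dict is installed at the listed version. A different version does not necessarily break the program, but it is the first thing to check if it fails.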
Outline:
I. Data Exploration and Data Cleaning
Handling missing data
Filling missing values with supplementary data, or with the mean of the column
Filtering and extracting useful information from the data
Formatting the unstandardized dates
II. Data Visualization
Distribution of numeric features
Distribution of categorical features
III. Prediction
XGBoost model
LightGBM model
CatBoost model
Feature importance and model interpretation
IV. Summary
Findings made during the process
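The cleaning steps named in part I of the outline (filling missing values with the column mean, extracting information with ast.literal_eval, and formatting unstandardized dates) could look roughly like the sketch below. The toy data, the column names (amount, tags, date), the list of date formats, and the helper to_iso are all hypothetical and are not taken from the project itself.

```python
import ast
from datetime import datetime

import pandas as pd

# Hypothetical toy data; the real project columns will differ.
df = pd.DataFrame({
    "amount": [100.0, None, 300.0],
    "tags": ["[{'id': 1, 'name': 'a'}]", "[]", "[{'id': 2, 'name': 'b'}]"],
    "date": ["2/20/2015", "2015-03-01", "01 Apr 2015"],
})

# Fill missing numeric values with the mean of the column.
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Turn stringified Python literals into real objects, then keep one field.
df["tags"] = df["tags"].apply(ast.literal_eval)
df["tag_names"] = df["tags"].apply(lambda items: [d["name"] for d in items])

# Candidate input formats (an assumption); each is tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def to_iso(text):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


df["date"] = df["date"].apply(to_iso)
print(df[["amount", "tag_names", "date"]])
```

Trying an explicit list of formats keeps the result the same across pandas versions, at the cost of having to list every format that appears in the data.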
Device Information and running time:
CPU: Intel(R) Xeon(R) Gold 5216R CPU @ 2.10GHz
GPU: Tesla V100S x 2
Expected running time: 5-10 minutes
References:
- Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron (O'Reilly). Copyright 2017 Aurélien Géron
- SDSC2102 lecture notes