This project predicts the probability that an online transaction is fraudulent, as denoted by the binary target `isFraud`.
The dataset is from Kaggle and contains 590,540 rows and 433 features.
The full dataset spans 365 days: 182 days for the training set and 183 days for the test set. Because the test set comes later in time than the training set, this project uses time-based validation, training on the first 5 months of the training set and predicting the last month.
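A time-based split like the one described above can be sketched with pandas. This is a minimal illustration, not the project's actual code: it assumes `TransactionDT` is an offset in seconds and approximates "the last month" as 30 days. The helper name `time_based_split` and the toy data are hypothetical.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60


def time_based_split(df, time_col="TransactionDT", holdout_days=30):
    """Split rows into an earlier training part and a later validation part."""
    # Assumption: time_col holds an offset in seconds, so the holdout window
    # starts holdout_days * SECONDS_PER_DAY before the latest timestamp.
    cutoff = df[time_col].max() - holdout_days * SECONDS_PER_DAY
    train_part = df[df[time_col] <= cutoff]
    valid_part = df[df[time_col] > cutoff]
    return train_part, valid_part


# Toy data (hypothetical): one transaction per day for 182 days.
toy = pd.DataFrame({
    "TransactionDT": [day * SECONDS_PER_DAY for day in range(182)],
    "isFraud": [int(day % 25 == 0) for day in range(182)],
})

train_part, valid_part = time_based_split(toy)
print(len(train_part), len(valid_part))
```

Splitting on time rather than at random keeps every validation row later than every training row, which mirrors the relationship between the real training and test sets.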
- Apply PCA to highly correlated and redundant V1-V339 features
- Perform adversarial validation to find the features that are important in differentiating cards, since the training set and the test set contain different sets of cards
- Combining features to generate new features: Feature A and Feature B by themselves may not correlate with the target variable, but Feature A+B may.
- Frequency encoding for categorical features: replace categorical values with their corresponding frequency.
- Group statistics: for example, group by `card1` and get the mean or std of `TransactionAmt` for each group. This lets the model know whether a row has an abnormal `TransactionAmt` for its group.
- Normalize the D columns to prevent them from increasing over time
- Convert `TransactionDT` into datetime by providing a reference datetime
- Parameter Tuning with Hyperopt
- XGBoost (ROC_AUC: 0.9280)
- CatBoost (ROC_AUC: 0.9146)
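Several of the feature-engineering steps listed above (feature combination, frequency encoding, group statistics, and the `TransactionDT` conversion) can be sketched with pandas. This is an illustrative sketch under assumptions, not the project's actual code: the toy values are made up, `addr1` is used only as an example second feature, combining by string concatenation is one possible reading of "Feature A+B", frequency is taken to mean row counts, `TransactionDT` is assumed to be in seconds, and the reference date is an arbitrary placeholder.

```python
import pandas as pd

# Toy rows with hypothetical values; column names follow the dataset.
df = pd.DataFrame({
    "TransactionDT": [86400, 90000, 172800, 180000, 260000],
    "card1": [1001, 1001, 1001, 2002, 2002],
    "addr1": [10, 10, 20, 10, 10],
    "TransactionAmt": [20.0, 30.0, 400.0, 50.0, 70.0],
})

# Feature combination: join two columns into one categorical key.
df["card1_addr1"] = df["card1"].astype(str) + "_" + df["addr1"].astype(str)

# Frequency encoding: replace each category with the number of rows it has.
counts = df["card1"].value_counts()
df["card1_freq"] = df["card1"].map(counts)

# Group statistics: per-card mean and std of the transaction amount,
# broadcast back to every row of the group.
grouped = df.groupby("card1")["TransactionAmt"]
df["amt_mean_card1"] = grouped.transform("mean")
df["amt_std_card1"] = grouped.transform("std")

# Ratio to the group mean: values far from 1 flag unusual amounts for a card.
df["amt_to_mean_card1"] = df["TransactionAmt"] / df["amt_mean_card1"]

# Datetime conversion: add the offset (assumed seconds) to a reference date.
REFERENCE_START = pd.Timestamp("2017-12-01")  # arbitrary placeholder
df["TransactionDatetime"] = REFERENCE_START + pd.to_timedelta(
    df["TransactionDT"], unit="s"
)

print(df[["card1", "card1_freq", "amt_mean_card1", "amt_to_mean_card1"]])
```

In the toy data, the 400.0 transaction on card 1001 is about 2.67 times that card's mean amount, which is the kind of within-group signal the group statistics are meant to expose.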