AutoX is an efficient AutoML tool aimed mainly at data mining competitions on tabular data. Its features include:
- SOTA: AutoX outperforms other solutions on many competition datasets (see Evaluation).
- Easy to use: The interface design is similar to sklearn.
- Generic & Universal: Supports tabular data for binary classification, multi-class classification, and regression problems.
- Auto: Fully automated pipeline without human intervention.
- Out of the box: Provides flexible modules that can be used on their own.
- Summary of magics: Collects and publishes the "magics" (key tricks) of competitions.
1. git clone https://github.com/4paradigm/autox.git
2. cd autox
3. python setup.py install
├── autox
│   ├── ensemble
│   ├── feature_engineer
│   ├── feature_selection
│   ├── file_io
│   ├── join_tables
│   ├── metrics
│   ├── models
│   ├── process_data
│   ├── util.py
│   ├── CONST.py
│   └── autox.py
├── run_oneclick.py
├── demo
├── test
├── setup.py
└── README.md
- Full-AutoML
from autox import AutoX
path = data_dir  # directory that contains train.csv and test.csv
autox = AutoX(target='loss', train_name='train.csv', test_name='test.csv',
              id=['id'], path=path)
sub = autox.get_submit()  # get the submission table
sub.to_csv("submission.csv", index=False)
- Semi-AutoML: run_demo.ipynb
index | data_type | dataset | metric | AutoX | AutoGluon | H2O |
---|---|---|---|---|---|---|
1 | regression | zhidemai | mse | 1.1231 | 1.9466 | 1.1927 |
2 | regression | Tabular Playground Series - Aug 2021 | rmse | 7.87731 | 10.3944 | 7.8895 |
3 | regression | House Prices | rmse | 0.13043 | 0.13104 | 0.13161 |
4 | binary classification | Titanic | accuracy | 0.77751 | 0.78229 | 0.79186 |
- cat: Categorical, a categorical variable without order.
- ord: Ordinal, a categorical variable with order.
- num: Numeric, a numeric variable.
- datetime: A time variable in Datetime format.
- timestamp: A time variable in Timestamp format.
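As a rough illustration of how columns might be mapped to these types, the sketch below uses simple pandas heuristics. The function name `infer_feature_types`, the datetime format string, and the epoch-second range used to detect timestamps are assumptions made for this example, not AutoX's actual rules. `ord` is left out because an ordering generally cannot be inferred from the values alone.

```python
import pandas as pd

def infer_feature_types(df, datetime_format="%Y-%m-%d %H:%M:%S"):
    """Heuristically label each column as 'cat', 'num', 'datetime', or 'timestamp'."""
    types = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_datetime64_any_dtype(s):
            types[col] = "datetime"
        elif (pd.api.types.is_integer_dtype(s)
              and s.between(1_000_000_000, 2_000_000_000).all()):
            # Assumption: integers in this range are Unix epoch seconds (2001 to 2033).
            types[col] = "timestamp"
        elif pd.api.types.is_numeric_dtype(s):
            types[col] = "num"
        else:
            # Strings that all parse with the given format are treated as datetimes.
            parsed = pd.to_datetime(s, format=datetime_format, errors="coerce")
            types[col] = "datetime" if parsed.notna().all() else "cat"
    return types

# Hypothetical example table.
example = pd.DataFrame({
    "city": ["beijing", "shanghai", "beijing"],
    "price": [1.5, 2.0, 3.25],
    "created": ["2021-08-01 10:00:00", "2021-08-02 11:30:00", "2021-08-03 12:00:00"],
    "ts": [1627812000, 1627898400, 1627984800],
})
print(infer_feature_types(example))
```

Heuristics like these can misclassify columns (for example, an integer ID that happens to fall in the epoch range), which is one reason a review-and-modify step for the detected types is useful.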
- 1.Initialize AutoX
1.1 Read data
1.2 Concat train and test
1.3 Identify column types in the data
1.4 Preprocess data
- 2.Feature engineering
Every feature engineering class includes the following functions:
1. automatically select the columns that the current operation will be applied to
2. review the selected columns
3. modify the selected columns
4. execute the operation and return features whose sample count and order are consistent with the original table.
- 3.Feature combination
Combine the raw features and derived features, and return a wide table.
- 4.train_test_split
Split the wide table into train and test.
- 5.Features filter
Filter the features according to the distribution of train and test.
- 6.Model training
The inputs to the models are the filtered features.
Each model class includes the following functions:
1. get the default parameters
2. train the model
3. tune the parameters
4. get the feature importances
5. predict
- 7.Prediction
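Step 5 above filters features by comparing their train and test distributions, but the exact criterion is not stated here. One possible approach, shown purely as an assumption, is to drop numeric features whose two-sample Kolmogorov-Smirnov statistic between train and test exceeds a threshold. The names `ks_statistic` and `filter_shifted_features` and the threshold value are hypothetical.

```python
import numpy as np
import pandas as pd

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def filter_shifted_features(train, test, features, threshold=0.3):
    """Keep features whose train/test KS statistic is at most `threshold`."""
    kept = []
    for col in features:
        # Missing values are ignored; both sides must have at least one value.
        stat = ks_statistic(train[col].dropna(), test[col].dropna())
        if stat <= threshold:
            kept.append(col)
    return kept

# Hypothetical data: "drifted" has a completely different range in test.
train = pd.DataFrame({"stable": [1, 2, 3, 4, 5, 6], "drifted": [1, 2, 3, 4, 5, 6]})
test = pd.DataFrame({"stable": [1, 2, 3, 4, 5, 6], "drifted": [101, 102, 103, 104, 105, 106]})
print(filter_shifted_features(train, test, ["stable", "drifted"]))
```

A feature that is distributed very differently in test than in train tends to hurt generalization, which is the motivation for filtering before model training.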
- info_['id']: List, unique keys to identify the sample.
- info_['target']: String, label column.
- info_['shape_of_train']: Int, the number of samples in the train set.
- info_['shape_of_test']: Int, the number of samples in the test set.
- info_['feature_type']: Dict of Dict, data type of the features.
- info_['train_name']: String, the table name of main table of train.
- info_['test_name']: String, the table name of main table of test.
- dfs_['train_test']: The combined data of train data and test data.
- dfs_['FE_feature_name']: Derived tables by feature engineering, such as FE_count, FE_groupby.
- dfs_['FE_all']: The merged table which contains raw tables and derived tables.
- concat_train_test: concat the train and test data.
- split_train_test: split train and test data.
- get_submit: get the submission.
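To make the relationship between `concat_train_test`, `split_train_test`, and the `shape_of_train` / `shape_of_test` entries concrete, here is a minimal standalone sketch in pandas. The function names mirror the list above, but the signatures and the plain `info` dict are assumptions for illustration and may differ from AutoX's own methods.

```python
import pandas as pd

def concat_train_test(train, test):
    """Stack train on top of test and remember how many rows each one had."""
    info = {"shape_of_train": len(train), "shape_of_test": len(test)}
    combined = pd.concat([train, test], axis=0, ignore_index=True)
    return combined, info

def split_train_test(combined, info):
    """Undo concat_train_test using the stored row counts."""
    n_train = info["shape_of_train"]
    train = combined.iloc[:n_train].reset_index(drop=True)
    test = combined.iloc[n_train:].reset_index(drop=True)
    return train, test

# Hypothetical tables: the test table has no label column.
train = pd.DataFrame({"id": [1, 2, 3], "x": [0.1, 0.2, 0.3], "loss": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"id": [4, 5], "x": [0.4, 0.5]})

combined, info = concat_train_test(train, test)
print(combined)  # test rows have NaN in the "loss" column
train_again, test_again = split_train_test(combined, info)
```

Working on the combined table lets feature engineering see the same category values and counts for both sets, and the stored row counts are enough to separate them again because row order is preserved.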
- extract year, month, day, hour, and weekday information from time columns
- delete invalid features (nunique equal to 1)
- delete invalid samples (label is NaN)
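The three preprocessing steps above could look roughly like this in pandas. This is a sketch under assumptions: the helper names, the `exclude` parameter, and the derived column naming (`<col>_year`, and so on) are made up for the example and are not taken from AutoX.

```python
import pandas as pd

def add_time_features(df, time_cols):
    """Add year, month, day, hour, and weekday columns for each time column."""
    out = df.copy()
    for col in time_cols:
        t = pd.to_datetime(out[col])
        out[f"{col}_year"] = t.dt.year
        out[f"{col}_month"] = t.dt.month
        out[f"{col}_day"] = t.dt.day
        out[f"{col}_hour"] = t.dt.hour
        out[f"{col}_weekday"] = t.dt.weekday  # Monday=0 ... Sunday=6
    return out

def drop_constant_features(df, exclude=()):
    """Drop columns with exactly one distinct non-null value."""
    constant = [c for c in df.columns
                if c not in exclude and df[c].nunique() == 1]
    return df.drop(columns=constant)

def drop_unlabeled_samples(train, target):
    """Drop training rows whose label is NaN."""
    return train.dropna(subset=[target]).reset_index(drop=True)

# Hypothetical training table.
train = pd.DataFrame({
    "t": ["2021-08-01 10:00:00", "2021-08-02 23:30:00"],
    "const": [7, 7],
    "x": [1, 2],
    "loss": [0.5, None],
})
with_time = add_time_features(train, ["t"])
print(with_time[["t_hour", "t_weekday"]])
print(drop_constant_features(train, exclude=["loss"]).columns.tolist())
print(len(drop_unlabeled_samples(train, "loss")))
```

Note that the label filter only makes sense on training rows: in a combined train and test table, every test row has a missing label by construction.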
- count feature
- target encoding feature
- shift feature
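For readers unfamiliar with these three feature types, the following sketch shows one common way to compute each with pandas. It is a generic illustration, not AutoX's implementation: the function names, the round-robin fold assignment, and the fallback to the global mean are all assumptions. Each function returns a Series aligned with the input rows, so sample count and order match the original table.

```python
import numpy as np
import pandas as pd

def count_feature(df, col):
    """Number of rows that share each row's value in `col`."""
    return df[col].map(df[col].value_counts())

def target_encoding_feature(df, col, target, n_splits=2):
    """Out-of-fold mean of `target` per category, to limit label leakage."""
    folds = np.arange(len(df)) % n_splits  # assumption: simple round-robin folds
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for k in range(n_splits):
        in_fold = folds == k
        means = df[~in_fold].groupby(col)[target].mean()
        encoded[in_fold] = df.loc[in_fold, col].map(means).to_numpy()
    # Categories unseen in the other folds fall back to the global mean.
    return encoded.fillna(df[target].mean())

def shift_feature(df, key, time_col, col, periods=1):
    """Previous value of `col` within each `key` group, ordered by `time_col`."""
    ordered = df.sort_values(time_col)
    shifted = ordered.groupby(key)[col].shift(periods)
    return shifted.reindex(df.index)  # restore the original row order

# Hypothetical table.
df = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u1"],
    "t": [2, 1, 1, 3],
    "v": [20, 10, 100, 30],
})
print(count_feature(df, "user").tolist())
print(target_encoding_feature(df, "user", "v").tolist())
print(shift_feature(df, "user", "t", "v").tolist())
```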
AutoX supports the following models:
1. Lightgbm
2. Xgboost
3. Tabnet
AutoX supports two ensemble methods (Bagging is used by default):
1. Stacking
2. Bagging
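The difference between the two ensemble methods can be sketched in their generic textbook form. The example below uses plain NumPy with least-squares linear models as stand-ins for the real base models. All function names are hypothetical, and AutoX's own ensemble code may combine models differently.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model with an intercept; returns the coefficients."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def bagging_predict(X_train, y_train, X_test, n_models=10, seed=0):
    """Bagging: fit one model per bootstrap resample, then average predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))
        coef = fit_linear(X_train[idx], y_train[idx])
        preds.append(predict_linear(coef, X_test))
    return np.mean(preds, axis=0)

def stacking_predict(X_train, y_train, X_test, feature_sets, n_splits=2):
    """Stacking: a meta-model is fit on out-of-fold predictions of base models.

    Each base model here is a linear model on one subset of the columns.
    """
    n = len(X_train)
    folds = np.arange(n) % n_splits
    oof = np.zeros((n, len(feature_sets)))
    test_meta = np.zeros((len(X_test), len(feature_sets)))
    for j, cols in enumerate(feature_sets):
        for k in range(n_splits):
            tr, va = folds != k, folds == k
            coef = fit_linear(X_train[tr][:, cols], y_train[tr])
            oof[va, j] = predict_linear(coef, X_train[va][:, cols])
        # Refit on all training rows to produce the test-time meta-features.
        full = fit_linear(X_train[:, cols], y_train)
        test_meta[:, j] = predict_linear(full, X_test[:, cols])
    meta = fit_linear(oof, y_train)
    return predict_linear(meta, test_meta)

# Hypothetical noise-free data: y = 2*x0 + 3*x1 + 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1
X_new = rng.normal(size=(5, 2))

print(bagging_predict(X, y, X_new))
print(stacking_predict(X, y, X_new, feature_sets=[[0, 1], [0]]))
```

Bagging needs no extra fitting step beyond the base models, which makes it a cheap and robust default. Stacking adds a second training stage and relies on out-of-fold predictions to avoid leaking labels into the meta-model.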
competition | magics |
---|---|
kaggle criteo | |
zhidemai | |
Log | Solution |
---|---|