
I built a template for ML projects. Using this template, you can build your pipeline more easily. It can also help you extract descriptions of your dataset and plot your data.


bank-marketing

A bank marketing campaign is a method for increasing the number of clients. In this project, I analyse data related to the direct marketing campaigns of a Portuguese banking institution; the campaigns were based on phone calls. The analysis considers the impact of different classifier models and PCA (principal component analysis) on this dataset, and every model is evaluated with the F1 score. As a result, I obtained a score of 0.90 from the Random Forest classifier.

How to use this template

Template structure:

.
├── build
├── configs
│   ├── defaults.py - specific configs for the models and the dataset
│   └── __init__.py
├── data
│   ├── bank.py - Bank is a subclass of BasedDataset; it contains all configs to load the dataset from files
│   ├── based
│   │   ├── based_dataset.py - a superclass for all datasets; it contains data-manipulation methods
│   │   ├── encoder_enum.py - constants for the encoders
│   │   ├── file_types.py - constants for file types, e.g. CSV, TSV, TXT, ...
│   │   ├── __init__.py
│   │   ├── sampling_types.py - constants for sampling types, e.g. SMOTE, ...
│   │   ├── scale_types.py - constants for scalers, e.g. MinMax scaler
│   │   └── transformers_enums.py - types of data transformers, e.g. log transformer
│   ├── dataset - all dataset files
│   │   ├── bank-additional
│   │   │   ├── bank-additional.csv
│   │   │   ├── bank-additional-full.csv
│   │   │   └── bank-additional-names.txt
│   │   ├── bank.csv
│   │   ├── bank-full.csv
│   │   ├── bank-names.txt
│   │   └── description.txt
│   ├── __init__.py
│   ├── loader.py - loads the dataset that we need
│   └── preprocessing
│       ├── encoders.py - Encoder encodes the data and has all the code needed for this purpose
│       ├── __init__.py
│       ├── pca.py - PCA applies principal component analysis and draws plots
│       └── scalers.py - Scaler changes the scale of the values in a dataset
├── eda - files and folders needed for analysing and retrieving information about the dataset, features, categories, ...
│   ├── bank_analyser.py - code for retrieving information from a dataset
│   ├── bank_plot.py - code for drawing plots from a dataset
│   ├── based
│   │   ├── based_analyzer.py - BasedAnalyzer is the superclass of BankAnalyzer; it helps to analyse features
│   │   ├── based_plots.py - BasedPlot is the superclass of BankPlots; it helps to plot feature values
│   │   └── __init__.py
│   └── __init__.py
├── engine
│   ├── analyser.py - runs BankAnalyzer
│   ├── __init__.py
│   └── trainer.py - training, hyperparameter tuning and cross-validation
├── LICENSE
├── model
│   ├── based
│   │   ├── based_model.py - a superclass for all models; it contains methods for training, evaluation, plotting, ...
│   │   ├── __init__.py
│   │   ├── metric_types.py - constants for the metrics
│   │   ├── model_selection.py - the names of the models that you can run on your dataset
│   │   ├── task_mode.py - selects the task you are going to do, e.g. classification or regression
│   │   └── tuning_mode.py - types of tuning algorithms, e.g. GridSearch
│   ├── decision_tree.py - all configuration needed to run the Decision Tree algorithm
│   ├── __init__.py
│   ├── logistic_regression.py - all configuration needed to run the Logistic Regression algorithm
│   ├── random_forest.py - all configuration needed to run the Random Forest algorithm
│   └── svm.py - all configuration needed to run the Support Vector Machine algorithms
├── notebooks - all notebooks needed to run this project from Jupyter Notebook
│   ├── data_exploration.ipynb
│   └── trainer.ipynb
├── README.md
├── report
│   └── report.pdf
├── test
│   └── test.py
├── tools
│   ├── analysis.py - contains main() for running analyser.py
│   └── train.py - contains main() for running trainer.py
└── utils
    ├── __init__.py
    └── runtime_mode.py - all running modes for this project, e.g. training, tuning or cross-validation

Just add your dataset file to data/dataset and put all of your configuration in configs/defaults.py. The configuration settings are explained below.
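All the snippets below build on a CN() config object. Assuming the template uses a yacs-style CfgNode (which the CN() calls suggest; this is a sketch, not the file's verbatim contents), a defaults.py usually begins like this:

from yacs.config import CfgNode as CN  # assumption: CN comes from yacs

_C = CN()        # the root config node
_C.BASIC = CN()  # a nested section, populated below
_C.BASIC.SEED = 2021

def get_cfg_defaults():
    # return a copy so callers can modify it without touching the defaults
    return _C.clone()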

BASIC CONFIGS

Initialize the BASIC configs for our project:

_C.BASIC = CN()

Set the random SEED number to use as random_state, ...

_C.BASIC.SEED = 2021

Setting PCA = True will apply principal component analysis to the data.

_C.BASIC.PCA = False 

Select the training model, e.g. SVM, RandomForest, ...

_C.BASIC.MODEL = Model.SVM 

The runtime mode determines how you are going to run this code: use RuntimeMode.TRAIN to train a model, RuntimeMode.CROSS_VAL to do cross-validation, and RuntimeMode.TUNING to do hyperparameter tuning.

_C.BASIC.RUNTIME_MODE = RuntimeMode.TRAIN  

The task mode specifies whether this is a classification task or a regression task.

_C.BASIC.TASK_MODE = TaskMode.CLASSIFICATION  

Data resampling is a technique for handling imbalanced datasets. If you set SAMPLING_STRATEGY to None, no resampling is used. If you set it to a tuple of samplers, they will be applied to the train set of your dataset. Note that the order is important: if you want to over-sample first and then under-sample, write (Sampling.SMOTE, Sampling.RANDOM_UNDER_SAMPLING); SMOTE is applied first, then RANDOM_UNDER_SAMPLING (see the sketch after the snippet below).

_C.BASIC.SAMPLING_STRATEGY = None  # don't use any sampling strategy
# _C.BASIC.SAMPLING_STRATEGY = (Sampling.SMOTE,)  # use only SMOTE (note the trailing comma: a one-element tuple)
# _C.BASIC.SAMPLING_STRATEGY = (Sampling.SMOTE, Sampling.RANDOM_UNDER_SAMPLING)  # first SMOTE, then RANDOM_UNDER_SAMPLING
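A minimal sketch of what the (Sampling.SMOTE, Sampling.RANDOM_UNDER_SAMPLING) order means in practice, using imbalanced-learn directly (an illustration on synthetic data, not the template's own loader code):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# synthetic imbalanced data standing in for the bank train set
X_train, y_train = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=2021)

# order matters: first over-sample the minority class with SMOTE...
X_res, y_res = SMOTE(random_state=2021).fit_resample(X_train, y_train)
# ...then under-sample the majority class
X_res, y_res = RandomUnderSampler(random_state=2021).fit_resample(X_res, y_res)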

MODEL CONFIGS

Initialize the MODEL configs for our project:

_C.MODEL = CN()

The number of target classes for the classification task. In our case, this is a binary classification problem.

_C.MODEL.NUM_CLASSES = 2

If you set _C.BASIC.RUNTIME_MODE = RuntimeMode.CROSS_VAL, the value of K_FOLD will be used for k-fold cross-validation.

_C.MODEL.K_FOLD = 5  

Set this to True if you want to shuffle the dataset, otherwise False.

_C.MODEL.SHUFFLE = True  
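As a sketch of how K_FOLD and SHUFFLE map onto scikit-learn (an illustration on synthetic data, not the template's trainer code):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=2021)

# K_FOLD -> n_splits, SHUFFLE -> shuffle; the seed keeps the folds reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=2021)
scores = cross_val_score(RandomForestClassifier(random_state=2021), X, y, cv=cv, scoring='f1')
print(scores.mean())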

DATASET

Initialize the DATASET configs for our project:

_C.DATASET = CN()

Write the path of the dataset file:

_C.DATASET.DATASET_ADDRESS = '../data/dataset/bank.csv'  

If you have a brief description file for your dataset, write its path here:

_C.DATASET.DATASET_BRIEF_DESCRIPTION = '../data/dataset/description.txt'  

The name of the target column of our dataset:

_C.DATASET.TARGET = 'y' 

If you have categorical targets, set this to True, otherwise False.

_C.DATASET.HAS_CATEGORICAL_TARGETS = True

If you want to drop specific columns from the dataframe, write their names in DROP_COLS (a sketch of what this does follows the tuple below):

_C.DATASET.DROP_COLS = (
    # 'duration',  # commented out: will not be dropped
    'day',  # will be dropped from the dataframe
    'balance',
    'month',
    'job',
    'previous',
    'campaign',
    'education',
    # 'poutcome',
    # 'age',
    'pdays',
    'marital',
    'contact',
    'housing',
    'loan',
    # 'default'
)
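For illustration, dropping columns from a pandas dataframe usually comes down to a single call (a sketch with a toy dataframe, not the template's loader code):

import pandas as pd

df = pd.DataFrame({'day': [5], 'age': [42], 'y': ['yes']})  # toy stand-in for the bank data
drop_cols = ('day',)                   # what _C.DATASET.DROP_COLS would supply
df = df.drop(columns=list(drop_cols))  # 'age' and 'y' survive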

METRIC

Initialize the EVALUATION configs for our project:

_C.EVALUATION = CN()

Select the metric for your model. Supported metrics are 'accuracy', 'balanced_accuracy', 'top_k_accuracy', 'average_precision', 'neg_brier_score', 'f1', 'f1_micro', 'f1_macro', 'f1_weighted', 'f1_samples', 'neg_log_loss', 'precision', 'recall', 'jaccard', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted' and 'roc_auc_ovo_weighted'.

_C.EVALUATION.METRIC = MetricTypes.F1_SCORE_MICRO  

Set this to True if you need to plot the confusion matrix at the end of the evaluation process (see the sketch after this snippet).

_C.EVALUATION.CONFUSION_MATRIX = False  
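A sketch of what F1_SCORE_MICRO and the confusion-matrix plot correspond to in scikit-learn (the labels here are hypothetical, not the template's evaluation code):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, f1_score

y_true = [0, 1, 1, 0, 1]  # hypothetical ground truth
y_pred = [0, 1, 0, 0, 1]  # hypothetical predictions

print(f1_score(y_true, y_pred, average='micro'))         # what MetricTypes.F1_SCORE_MICRO computes
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)  # what CONFUSION_MATRIX = True would plot
plt.show()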

CATEGORICAL FEATURES ENCODER CONFIG

Initialize the ENCODER configs for our project:

_C.ENCODER = CN()

If you have a categorical column, write its name in CAPITAL letters and follow this structure:

_C.ENCODER.{COLUMN NAME} = TYPE OF ENCODER

_C.ENCODER.JOB = EncoderTypes.BINARY
_C.ENCODER.MARITAL = EncoderTypes.BINARY
_C.ENCODER.EDUCATION = EncoderTypes.ORDINAL
_C.ENCODER.DEFAULT = EncoderTypes.BINARY
_C.ENCODER.HOUSING = EncoderTypes.BINARY
_C.ENCODER.LOAN = EncoderTypes.BINARY
_C.ENCODER.CONTACT = EncoderTypes.BINARY
_C.ENCODER.MONTH = EncoderTypes.ORDINAL
_C.ENCODER.POUTCOME = EncoderTypes.BINARY
_C.ENCODER.Y = EncoderTypes.LABEL  # if your target is categorical

If you want to use a custom encoder, set, for example:

_C.ENCODER.Y = EncoderTypes.CUSTOM

Then you should write a method inside your dataset class and call it before the encoding phase. CUSTOM means you don't want to use any of the encoders from EncoderTypes.
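A sketch of what such a custom method might look like; the class and method names here are hypothetical, not the template's actual API:

import pandas as pd

class Bank:
    # hypothetical dataset class with a hand-written target encoder,
    # called before the generic encoding phase when EncoderTypes.CUSTOM is set
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def encode_target(self):
        self.df['y'] = self.df['y'].map({'no': 0, 'yes': 1})

bank = Bank(pd.DataFrame({'y': ['yes', 'no']}))
bank.encode_target()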

SCALER

Initialize the SCALER configs for our project:

_C.SCALER = CN()

Select the type of scaler (STANDARD scaler, MINMAX scaler, ...) that you want to apply to your data:

_C.SCALER = ScaleTypes.STANDARD 
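A sketch of what ScaleTypes.STANDARD corresponds to in scikit-learn (hypothetical values, not the template's scalers.py); the key point is fitting the scaler on the train split only:

from sklearn.preprocessing import StandardScaler

X_train = [[1.0], [2.0], [3.0]]  # hypothetical feature column
X_test = [[2.5]]

scaler = StandardScaler()                # what ScaleTypes.STANDARD maps to
X_train = scaler.fit_transform(X_train)  # fit on the train set only...
X_test = scaler.transform(X_test)        # ...then apply the same scaling to the test set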

TRANSFORMATION

Initialize the TRANSFORMATION configs for our project:

_C.TRANSFORMATION = CN()

If you need to transform a specific column, write the transformation type here:

_C.TRANSFORMATION.AGE = TransformersType.BOX_COX
_C.TRANSFORMATION.BALANCE = TransformersType.BOX_COX
_C.TRANSFORMATION.DAY = TransformersType.NONE
_C.TRANSFORMATION.DURATION = TransformersType.BOX_COX
_C.TRANSFORMATION.CAMPAIGN = TransformersType.BOX_COX
_C.TRANSFORMATION.PDAYS = TransformersType.LOG
_C.TRANSFORMATION.PREVIOUS = TransformersType.BOX_COX
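For illustration, the Box-Cox and log transformers likely correspond to the usual scipy/numpy calls (a sketch on hypothetical values, not the template's transformer code; note that Box-Cox requires strictly positive inputs):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 5.0, 10.0])  # hypothetical positive-valued column

x_boxcox, lam = stats.boxcox(x)  # what TransformersType.BOX_COX typically does (x must be > 0)
x_log = np.log1p(x)              # one common reading of TransformersType.LOG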

DECOMPOSITION

If you set _C.BASIC.PCA = True, the PCA configuration will be loaded from this section.

Initialize the PCA configs for our project:

_C.PCA = CN()

Number of components; scikit-learn's PCA treats a float in (0, 1) as a target fraction of explained variance, so 0.7 keeps enough components to explain 70% of the variance.

_C.PCA.N_COMPONENTS = 0.7  

Set this to True if you want to plot the PCA components.

_C.PCA.PLOT = False   
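A sketch of the scikit-learn semantics behind a fractional N_COMPONENTS (synthetic data, not the template's pca.py):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=200, n_features=10, random_state=2021)

# with a float in (0, 1), scikit-learn keeps the smallest number of components
# whose explained variance ratios sum to at least that value
pca = PCA(n_components=0.7)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())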

SAMPLING

All the configuration we need for data resampling is in this section. For example,
if you set _C.BASIC.SAMPLING_STRATEGY = (Sampling.SMOTE,), the SMOTE configs will be loaded from this part.

_C.RANDOM_UNDER_SAMPLER = CN()
_C.RANDOM_UNDER_SAMPLER.SAMPLING_STRATEGY = 'auto'  # float, str, dict, callable, default='auto'
_C.RANDOM_UNDER_SAMPLER.RANDOM_STATE = 2021  # int, RandomState instance, default=None
_C.RANDOM_UNDER_SAMPLER.REPLACEMENT = False  # bool, default=False

_C.RANDOM_OVER_SAMPLER = CN()
_C.RANDOM_OVER_SAMPLER.SAMPLING_STRATEGY = 'minority'  # float, str, dict or callable, default='auto'
_C.RANDOM_OVER_SAMPLER.RANDOM_STATE = 2021  # int, RandomState instance, default=None
# _C.RANDOM_OVER_SAMPLER.SHRINKAGE = 0  # float or dict, default=None

_C.SMOTE = CN()
_C.SMOTE.SAMPLING_STRATEGY = 'auto'  # float, str, dict or callable, default='auto'
_C.SMOTE.RANDOM_STATE = 2021  # int, RandomState instance, default=None
_C.SMOTE.K_NEIGHBORS = 5  # int or object, default=5
_C.SMOTE.N_JOBS = -1  # int, default=None

_C.SMOTENC = CN()
_C.SMOTENC.CATEGORICAL_FEATURES = ('job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
                                   'month')  # ndarray of shape (n_cat_features,) or (n_features,)
_C.SMOTENC.SAMPLING_STRATEGY = 'minority'  # float, str, dict or callable, default='auto'
_C.SMOTENC.RANDOM_STATE = 2021  # int, RandomState instance, default=None
_C.SMOTENC.K_NEIGHBORS = 5  # int or object, default=5
_C.SMOTENC.N_JOBS = -1  # int, default=None

_C.SVMSMOTE = CN()
_C.SVMSMOTE.SAMPLING_STRATEGY = 'auto'  # float, str, dict or callable, default='auto'
_C.SVMSMOTE.RANDOM_STATE = 2021  # int, RandomState instance, default=None
_C.SVMSMOTE.K_NEIGHBORS = 3  # int or object, default=5
_C.SVMSMOTE.N_JOBS = -1  # int, default=None
_C.SVMSMOTE.M_NEIGHBORS = 10  # int or object, default=10
# _C.SVMSMOTE.SVM_ESTIMATOR = 5  # estimator object, default=SVC()
_C.SVMSMOTE.OUT_STEP = 0.5  # float, default=0.5

MODELS

When you select _C.BASIC.MODEL = Model.SVM, the configuration of the SVM model will be loaded from here:

Support vector machine for the classification task:

_C.SVM = CN()
_C.SVM.NAME = 'SVM'

_C.SVM.C = 10  # float, default=1.0
_C.SVM.KERNEL = 'rbf'  # {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
_C.SVM.DEGREE = 1  # int, default=3
_C.SVM.GAMMA = 'scale'  # {'scale', 'auto'} or float, default='scale'
_C.SVM.COEF0 = 0.0  # float, default=0.0
_C.SVM.SHRINKING = True  # bool, default=True
_C.SVM.PROBABILITY = False  # bool, default=False
_C.SVM.TOL = 1e-3  # float, default=1e-3
_C.SVM.CACHE_SIZE = 200  # float, default=200
_C.SVM.CLASS_WEIGHT = None  # dict or 'balanced', default=None
_C.SVM.VERBOSE = True  # bool, default=False
_C.SVM.MAX_ITER = -1  # int, default=-1
_C.SVM.DECISION_FUNCTION_SHAPE = 'ovr'  # {'ovo', 'ovr'}, default='ovr'
_C.SVM.BREAK_TIES = False  # bool, default=False
_C.SVM.RANDOM_STATE = _C.BASIC.SEED  # int or RandomState instance, default=None
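These fields line up one-to-one with scikit-learn's SVC constructor. A sketch of the estimator this block would configure, assuming the template passes the values straight through:

from sklearn.svm import SVC

clf = SVC(
    C=10, kernel='rbf', degree=1, gamma='scale', coef0=0.0,
    shrinking=True, probability=False, tol=1e-3, cache_size=200,
    class_weight=None, verbose=True, max_iter=-1,
    decision_function_shape='ovr', break_ties=False, random_state=2021,
)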

Support vector machine for the regression task:

_C.SVR = CN()
_C.SVR.NAME = 'SVR'

_C.SVR.KERNEL = 'rbf'  # {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
_C.SVR.DEGREE = 3  # int, default=3
_C.SVR.GAMMA = 'scale'  # {'scale', 'auto'} or float, default='scale'
_C.SVR.COEF0 = 0.0  # float, default=0.0
_C.SVR.TOL = 1e-3  # float, default=1e-3
_C.SVR.C = 1.0  # float, default=1.0
_C.SVR.EPSILON = 0.1  # float, default=0.1
_C.SVR.SHRINKING = True  # bool, default=True
_C.SVR.CACHE_SIZE = 200  # float, default=200
_C.SVR.VERBOSE = True  # bool, default=False
_C.SVR.MAX_ITER = -1  # int, default=-1

When you set _C.BASIC.RUNTIME_MODE = RuntimeMode.TUNING, the configuration for hyperparameter tuning of the selected model (e.g. SVM) will be loaded from here (a grid-search sketch follows the block below):

_C.SVM.HYPER_PARAM_TUNING = CN()
_C.SVM.HYPER_PARAM_TUNING.KERNEL = ('linear', 'poly', 'rbf')
_C.SVM.HYPER_PARAM_TUNING.C = (0.1, 1, 10)
_C.SVM.HYPER_PARAM_TUNING.DEGREE = (1, 2, 3)
_C.SVM.HYPER_PARAM_TUNING.GAMMA = ('scale', 'auto', 1, 0.1, 0.01, 0.001, 0.0001)
_C.SVM.HYPER_PARAM_TUNING.COEF0 = None
_C.SVM.HYPER_PARAM_TUNING.SHRINKING = None
_C.SVM.HYPER_PARAM_TUNING.PROBABILITY = None
_C.SVM.HYPER_PARAM_TUNING.TOL = None
_C.SVM.HYPER_PARAM_TUNING.CACHE_SIZE = None
_C.SVM.HYPER_PARAM_TUNING.CLASS_WEIGHT = None
_C.SVM.HYPER_PARAM_TUNING.MAX_ITER = None
_C.SVM.HYPER_PARAM_TUNING.DECISION_FUNCTION_SHAPE = None
_C.SVM.HYPER_PARAM_TUNING.BREAK_TIES = None
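A sketch of how such a block could be turned into a grid search, skipping the None entries (synthetic data; the translation logic here is hypothetical, not the template's trainer code):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=2021)

# None entries mean "don't tune this parameter", so they are dropped from the grid
raw = {
    'kernel': ('linear', 'poly', 'rbf'),
    'C': (0.1, 1, 10),
    'degree': (1, 2, 3),
    'gamma': ('scale', 'auto', 1, 0.1, 0.01, 0.001, 0.0001),
    'coef0': None,
}
param_grid = {k: list(v) for k, v in raw.items() if v is not None}

search = GridSearchCV(SVC(), param_grid, scoring='f1_micro', cv=3)
search.fit(X, y)
print(search.best_params_)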
