This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.
This dataset contains direct marketing campaign data for a bank. The marketing campaigns were based on phone calls. The dataset contains attributes describing the client being called, the last contact with the client, campaign attributes, and a few socio-economic indicators. We seek to predict whether, as a result of the phone call, the client signed up for the banking product (a term deposit). This is therefore a classification problem, with the target label 'y' being 'yes' when the client signed up and 'no' when they did not.
The best-performing model was found using Azure AutoML: a VotingEnsemble classifier that achieved an accuracy of 91.6%.
The pipeline architecture consists of:
- A TabularDataset that loads the data from a CSV file
- A data-cleaning step that, for example, converts categorical features into one-hot encoded features
- A hyperparameter tuning (HyperDrive) step that performed random parameter sampling to tune two hyperparameters of the classifier: C (inverse of regularization strength) and max_iter (maximum number of iterations for the solver to converge). The tuning used an early stopping policy
- The classification algorithm used was scikit-learn's Logistic Regression (a.k.a. logit) classifier; a minimal training-script sketch is shown below.
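For illustration, the training step can be sketched as follows. This is a hedged approximation of the project's train.py: the CSV URL is a placeholder, and the real clean_data routine performs more involved encoding than the pandas.get_dummies stand-in used here.

```python
import argparse

import pandas as pd
from azureml.core import Run
from azureml.data.dataset_factory import TabularDatasetFactory
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hyperparameters passed in by HyperDrive
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0, help="Inverse of regularization strength")
parser.add_argument("--max_iter", type=int, default=100, help="Maximum solver iterations")
args = parser.parse_args()

# Load the marketing data as a TabularDataset (URL is a placeholder)
ds = TabularDatasetFactory.from_delimited_files(path="<bankmarketing_csv_url>")
df = ds.to_pandas_dataframe()

# Simplified cleaning: map the target to 0/1 and one-hot encode categorical features
y = df.pop("y").map({"yes": 1, "no": 0})
x = pd.get_dummies(df)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
accuracy = model.score(x_test, y_test)

# Log the primary metric so HyperDrive can rank runs
Run.get_context().log("Accuracy", float(accuracy))
```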
What are the benefits of the parameter sampler you chose? For the parameter C (inverse of regularization strength), the default value is 1.0, so a uniform distribution from 0 to 2.5 was used. This gives a fairly wide space from which to draw random values for C. The max_iter parameter requires an integer value, so random integer sampling over the range 1 to 250 was used, which likewise gives a wide space of candidate values for max_iter.
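A minimal sketch of that sampler, assuming the azureml-train HyperDrive SDK (the exact expression used for max_iter in the project, e.g. choice over a range versus quniform, may differ):

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform

# Random sampling over a wide space around the scikit-learn defaults
param_sampling = RandomParameterSampling({
    "--C": uniform(0.0, 2.5),                   # continuous range around the default C=1.0
    "--max_iter": choice(list(range(1, 251))),  # integer values from 1 to 250
})
```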
The benefit of random sampling over Bayesian sampling is that Bayesian sampling starts with some random values, but if some of those values perform well, it subsequently concentrates on the neighborhood of the well-performing values. So while Bayesian sampling might theoretically be more efficient, there is a risk of it getting stuck around a local optimum and losing sight of the global optimum. To avoid this, I used a random sampling strategy to increase the chances of discovering globally optimal values.
What are the benefits of the early stopping policy you chose? A Bandit policy was used, which terminates any run whose primary metric is not within the specified slack factor of the best run so far, thereby conserving compute resources.
The advantage of the Bandit policy over alternatives such as the Median Stopping policy or the Truncation Selection policy is that the Bandit policy is more aggressive in terminating low-performing runs, which frees up compute resources for additional runs with different hyperparameter values. A configuration sketch is shown below.
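A sketch of how the policy plugs into the HyperDrive configuration, assuming the azureml-train SDK; the slack factor, evaluation intervals, run counts, and compute/script names are illustrative placeholders, not necessarily the values used in the project:

```python
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal

# Terminate any run whose accuracy falls outside a 10% slack of the best run so far,
# checking every 2 reporting intervals after an initial delay of 5 intervals
policy = BanditPolicy(slack_factor=0.1, evaluation_interval=2, delay_evaluation=5)

src = ScriptRunConfig(source_directory=".", script="train.py", compute_target="<cluster-name>")

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,   # sampler defined earlier
    policy=policy,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)
```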
In 1-2 sentences, describe the model and hyperparameters generated by AutoML. The best model generated by AutoML is a VotingEnsemble classifier, which predicts based on the weighted average of the predicted class probabilities of its constituent models. AutoML tries various algorithms such as XGBoost, LightGBM, ExtremeRandomTrees and RandomForest (wrapped with scalers such as StandardScalerWrapper), and can combine several of these models into a single ensemble.
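For reference, the AutoML run can be configured roughly as follows; this is a hedged sketch in which the timeout, cross-validation count, and dataset/compute names are assumptions rather than the project's exact settings:

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=train_ds,          # cleaned TabularDataset (placeholder name)
    label_column_name="y",
    n_cross_validations=5,           # illustrative value
    experiment_timeout_minutes=30,   # illustrative value
    compute_target="<cluster-name>",
)
```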
For the best fitted VotingEnsemble model generated by AutoML, the following hyperparameter values were used. These were obtained by first retrieving the best fitted model with "best_run, fitted_model = automl_run.get_output()" and then looping through each of the steps of the fitted model in a print_model routine. The parameter values used by AutoML for each step of the best fitted model are:
datatransformer
{'enable_dnn': None,
'enable_feature_sweeping': None,
'feature_sweeping_config': None,
'feature_sweeping_timeout': None,
'featurization_config': None,
'force_text_dnn': None,
'is_cross_validation': None,
'is_onnx_compatible': None,
'logger': None,
'observer': None,
'task': None,
'working_dir': None}
prefittedsoftvotingclassifier
{'estimators': ['0', '31', '23', '19', '32', '30'],
'weights': [0.4,
0.3333333333333333,
0.06666666666666667,
0.06666666666666667,
0.06666666666666667,
0.06666666666666667]}
0 - maxabsscaler
{'copy': True}
0 - lightgbmclassifier
{'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'importance_type': 'split',
'learning_rate': 0.1,
'max_depth': -1,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_split_gain': 0.0,
'n_estimators': 100,
'n_jobs': 1,
'num_leaves': 31,
'objective': None,
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'silent': True,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0,
'verbose': -10}
31 - maxabsscaler
{'copy': True}
31 - lightgbmclassifier
{'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 0.2977777777777778,
'importance_type': 'split',
'learning_rate': 0.0842121052631579,
'max_bin': 50,
'max_depth': -1,
'min_child_samples': 114,
'min_child_weight': 8,
'min_split_gain': 0.8421052631578947,
'n_estimators': 400,
'n_jobs': 1,
'num_leaves': 65,
'objective': None,
'random_state': None,
'reg_alpha': 0.7894736842105263,
'reg_lambda': 0.7368421052631579,
'silent': True,
'subsample': 0.7426315789473684,
'subsample_for_bin': 200000,
'subsample_freq': 0,
'verbose': -10}
23 - standardscalerwrapper
{'class_name': 'StandardScaler',
'copy': True,
'module_name': 'sklearn.preprocessing._data',
'with_mean': False,
'with_std': False}
23 - xgboostclassifier
{'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 1,
'colsample_bynode': 1,
'colsample_bytree': 1,
'eta': 0.1,
'gamma': 0,
'learning_rate': 0.1,
'max_delta_step': 0,
'max_depth': 5,
'max_leaves': 31,
'min_child_weight': 1,
'missing': nan,
'n_estimators': 100,
'n_jobs': 1,
'nthread': None,
'objective': 'reg:logistic',
'random_state': 0,
'reg_alpha': 1.25,
'reg_lambda': 2.0833333333333335,
'scale_pos_weight': 1,
'seed': None,
'silent': None,
'subsample': 0.8,
'tree_method': 'auto',
'verbose': -10,
'verbosity': 0}
19 - standardscalerwrapper
{'class_name': 'StandardScaler',
'copy': True,
'module_name': 'sklearn.preprocessing._data',
'with_mean': False,
'with_std': False}
19 - xgboostclassifier
{'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 1,
'colsample_bynode': 1,
'colsample_bytree': 1,
'eta': 0.3,
'gamma': 0.1,
'grow_policy': 'lossguide',
'learning_rate': 0.1,
'max_bin': 255,
'max_delta_step': 0,
'max_depth': 8,
'max_leaves': 7,
'min_child_weight': 1,
'missing': nan,
'n_estimators': 400,
'n_jobs': 1,
'nthread': None,
'objective': 'reg:logistic',
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 0.8333333333333334,
'scale_pos_weight': 1,
'seed': None,
'silent': None,
'subsample': 0.7,
'tree_method': 'hist',
'verbose': -10,
'verbosity': 0}
32 - standardscalerwrapper
{'class_name': 'StandardScaler',
'copy': True,
'module_name': 'sklearn.preprocessing._data',
'with_mean': False,
'with_std': True}
32 - lightgbmclassifier
{'boosting_type': 'goss',
'class_weight': None,
'colsample_bytree': 0.8911111111111111,
'importance_type': 'split',
'learning_rate': 0.1,
'max_bin': 180,
'max_depth': 10,
'min_child_samples': 455,
'min_child_weight': 9,
'min_split_gain': 0.15789473684210525,
'n_estimators': 200,
'n_jobs': 1,
'num_leaves': 53,
'objective': None,
'random_state': None,
'reg_alpha': 0.42105263157894735,
'reg_lambda': 0.15789473684210525,
'silent': True,
'subsample': 1,
'subsample_for_bin': 200000,
'subsample_freq': 0,
'verbose': -10}
30 - truncatedsvdwrapper
{'n_components': 0.45526315789473687, 'random_state': None}
30 - xgboostclassifier
{'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 1,
'colsample_bynode': 1,
'colsample_bytree': 0.8,
'eta': 0.5,
'gamma': 0,
'learning_rate': 0.1,
'max_delta_step': 0,
'max_depth': 6,
'max_leaves': 63,
'min_child_weight': 1,
'missing': nan,
'n_estimators': 10,
'n_jobs': 1,
'nthread': None,
'objective': 'reg:logistic',
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 2.3958333333333335,
'scale_pos_weight': 1,
'seed': None,
'silent': None,
'subsample': 0.7,
'tree_method': 'auto',
'verbose': -10,
'verbosity': 0}
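The retrieval and the print_model loop referred to above can be sketched roughly as follows; this is a hedged approximation of the notebook helper, and the exact formatting logic may differ:

```python
from pprint import pprint

# Retrieve the best AutoML child run and its fitted scikit-learn pipeline
best_run, fitted_model = automl_run.get_output()

def print_model(model, prefix=""):
    """Print the name and parameters of every step in a fitted pipeline,
    recursing into the constituent estimators of an ensemble step."""
    for name, step in model.steps:
        print(prefix + name)
        if hasattr(step, "estimators") and hasattr(step, "weights"):
            # Ensemble step: show its members and weights, then recurse into each member
            pprint({"estimators": [e[0] for e in step.estimators], "weights": step.weights})
            for est_name, est in step.estimators:
                print_model(est, est_name + " - ")
        else:
            pprint(step.get_params())

print_model(fitted_model)
```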
Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one? Model comparison: In hyperparameter tuning, a LogisticRegression classifier was used. The best hyperparameter values found were C = 1.671915880 and max_iter = 56, and the best model using these hyperparameters achieved an accuracy of 0.9074355.
In AutoML, the best model was a VotingEnsemble and achieved an accuracy of 0.91572.
The model generated by AutoML was about 0.8 percentage points (roughly 0.9%) better in accuracy than the model from hyperparameter tuning.
Difference in architecture: The model used for hyperparameter tuning was a Logistic Regression model, which predicts the probability of the target variable using a sigmoid (logistic) function. The best model identified by AutoML is a VotingEnsemble classifier, which predicts based on the weighted average of the predicted class probabilities of its constituent models.
Reasons for differing performance:
- AutoML attempts to train a large number of models using different algorithms, with each iteration using a different set of hyperparameters. Since it is not constrained to a single model architecture, AutoML is generally expected to outperform any single model.
- AutoML applies a variety of featurization techniques, such as data normalization and dimensionality reduction, which contributes to its superior performance.
What are some areas of improvement for future experiments? Why might these improvements help the model? Some areas of improvement are:
- Increasing the training time for both the hyperparameter tuning and AutoML experiments, which could help identify even better-performing models.
- For hyperparameter tuning, using more powerful algorithms such as ensemble models (StackingClassifier, VotingClassifier) may help produce models that match the performance of the AutoML models; a sketch is shown after this list.
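For instance, the training script could swap the single LogisticRegression for a soft-voting ensemble along these lines (a hedged sketch reusing x_train/x_test from the training sketch above; the constituent estimators and their hyperparameters are illustrative choices, not results from this project):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Soft voting averages the predicted class probabilities of a few diverse base estimators
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=1.67, max_iter=100)),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("gb", GradientBoostingClassifier(n_estimators=200)),
    ],
    voting="soft",
)
ensemble.fit(x_train, y_train)
print("Ensemble accuracy:", ensemble.score(x_test, y_test))
```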
The compute cluster is cleaned up in code using the cluster.delete() method, as sketched below.
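Assuming the cluster handle is an AmlCompute object named cpu_cluster (a placeholder name), the cleanup is a one-liner:

```python
# Delete the AmlCompute cluster once all runs have finished, to stop incurring cost
cpu_cluster.delete()
```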