In this project, I created two classification models to predict whether a patient is likely to have a stroke or not.
The first classification model uses Azure AutoML to test various classifications models and selects the best one based on overall Accuracy. The second classification model uses logistic regression via sklearn and Hyperdrive to test various combinations of hyperparameters, - C
(Inverse of regularization strength) and max_iter
(Maximum number of iterations taken to converge), in an attempt to find the optimal model. Once the best AutoML model was found, it was deployed as a webservice and test patient data was sent to the deployed model to generate a response prediction, i.e. True or False.
The following steps were used to create the project in AzureML.
- Load the external dataset into AzureML
- Use the Python SDK to create a compute instance.
- Configure and submit an AutoML run using the SDK.
- Once the Azure AutoML run is finished, find, save and register the best model with the Azure ML service.
- Using the SDK, configure the deployment settings and the inference entry script (scoring.py) that is used to pass input to the deployed web endpoint.
- Test the deployed model endpoint using a python file (endpoint.py) as well as a JSON request sent to the endpoint through the SDK.
- Delete the compute cluster and the deployed endpoint service
The dataset contains characteristics of 5,110 patients who ultimately did, or did not, suffer a stroke. The patient data points include the following attributes:
-
id: unique identifier
-
gender: "Male", "Female" or "Other"
-
age: age of the patient
-
hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
-
heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
-
ever_married: "No" or "Yes"
-
work_type: "children", "Govt_jov", "Never_worked", "Private" or "Self-employed"
-
Residence_type: "Rural" or "Urban"
-
avg_glucose_level: average glucose level in blood
-
bmi: body mass index
-
smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
-
stroke: 1 if the patient had a stroke or 0 if not
The dataset can be downloaded from this link on Kaggle.
The author of the dataset is fedesoriano
The model will give a binary prediction whether a patient is likely to have a stroke or not, based on various demographic (gender, age, BMI) and behavioral attributes (type of employment, smoking status) of the patient.
The dataset was downloaded from Kaggle.com and uploaded to my AzureML workspace as a registered dataset.
TODO: Give an overview of the automl
settings and configuration you used for this experiment
These settings were used to configure the automl experiment.
The primary metric used to evaluate the various automl models used was Accuracy.
The experiment was configured to run for a maximum of 30 minutes, with 4 models tested concurrently at any given time.
The experiment was a classification task with the goal of predicting the value in column "stroke". Azure AutoML was used to automatically generate the various features as well as testing the dataset for imbalance, missing values, etc.
TODO: What are the results you got with your automated ML model? What were the parameters of the model? How could you have improved it?
TODO Remeber to provide screenshots of the RunDetails
widget as well as a screenshot of the best model trained with it's parameters.
The best automl model was a Voting Ensemble with an accuracy score of 95.245% that consisted of 10 different models of varying weights with different parameters. The primary models used within the ensemble were XGBoostClassifiers, LightGBM and RandomForest models with various scaler wrappers.
The final model's accuracy could have been improved by allowing more training time to search for optimal weighted models. Additionally, the class imbalance issue could be addressed to improve the real-world accuracy of the model.
In terms of the most important factors that predict whether or not a person will have a stroke, Age was found to be the most important factor, followed by Average Glucose Levels.
This VotingEnsemble model was registered as a Model in Azure ML and eventually deployed.
Since this task involved binary classification, a Logistic Regression model from the sklearn library was chosen. This model is quick to train and tune and easy to explain while still providing a high level of prediction accuracy.
Two hyperparameters were selected for tuning with Azure HyperDrive:
- C (inverse of regularization strength): This parameters determines the degree of regularization and helps to prevent the model from overfitting. Smaller values lead to stronger regularization. The values tried were 0.001,0.01,0.1,1.0,10.0 and 50.0.
- Maximum iterations: It was set to choose from 10 and 25. This parameter defines the maximum number of iterations allowed for the algorithm's solver to converge on a solution.
The RandomParameterSampling method was used to search the hyperparameter grid space. A Bandit early stopping policy was also used to prevent compute resources from being wasted on unnecessary optimization attempts.
The best hyperdrive model had an accuracy of 95.82% and used a C value of 50 (indicating little regularization) and a Max Iterations value of 25. The accuracy of the model could possibly have been improved by increasing the number of Max Iterations and allowing Hyperdrive to test more than 24 models.
The best model was deployed as a webservice with a REST endpoint. The REST endpoint for the model is:
http://1df50e91-db34-42db-ac68-8307731487f8.eastus2.azurecontainer.io/score
The model was deployed using the Python SDK via the following steps:
- Register the best model and provide a name to use in the deployed service.
- Create a Deployment Configuration and Inference Configuration
- Deploy the model as a web service using ACI (Azure Container Instances)
- Send a REST call with JSON-formatted input data to the REST API's scoring URI. Authentication is enabled so the key has to be known in order to use the REST endpoint.
- The deployed model service will respond with a prediction.
A prediction can be generated by passing input as a JSON file to the endpoint. One way to do so is via the endpoint.py file.
The JSON-formatted API request input data should have the following format:
"data": [
{
"gender": "Male",
"age": 67,
"hypertension": "False",
"heart_disease": "True",
"ever_married": "True",
"work_type": "Private",
"Residence_type": "Urban",
"avg_glucose_level": 228.69,
"bmi": 36.6,
"smoking_status": "formerly smoked"
}]
As a final cleanup step, the Python SDK was also used to delete the compute cluster and the deployed model (i.e. service) via these commands:
service.delete()
compute_target.delete()
A screencast walk-through of the model and the deployed service endpoint can be found at this link
In the screencast, you can see that the model has been registered with Azure ML and application insights enabled. Data is sent to the deployed model in two ways. First, by using the endpoint.py script and second by using the Notebook and Python SDK to pass the request to the endpoint. The model responds with Result: False False, indicating that the sample patients' (two of them) whose data was submitted are unlikely to suffer from a stroke.