This project aims to predict the risk of heart disease in patients using a variety of machine learning models. The dataset includes several features related to health and lifestyle, and the target variable is whether a patient develops heart disease over ten years.
The dataset contains the following columns:
- male: Gender of the patient (1 = Male, 0 = Female)
- age: Age of the patient
- currentSmoker: Whether the patient is a current smoker (1 = Yes, 0 = No)
- cigsPerDay: Number of cigarettes smoked per day
- BPMeds: Whether the patient is on blood pressure medication (1 = Yes, 0 = No)
- prevalentStroke: History of stroke (1 = Yes, 0 = No)
- prevalentHyp: History of hypertension (1 = Yes, 0 = No)
- diabetes: Whether the patient has diabetes (1 = Yes, 0 = No)
- totChol: Total cholesterol level
- sysBP: Systolic blood pressure
- diaBP: Diastolic blood pressure
- BMI: Body Mass Index
- heartRate: Heart rate
- glucose: Glucose level
- TenYearCHD: Whether the patient developed heart disease over ten years (1 = Yes, 0 = No)
- Handling Missing Values:
  - Dropped the education column due to a significant number of missing values.
  - Filled missing values in categorical columns with the mode.
  - Filled missing values in continuous columns with the median.
- Resampling:
  - The dataset was imbalanced, so resampling was performed to balance the classes.
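The preprocessing steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code. The toy rows, the column subsets, and the helper names `impute` and `oversample` are assumptions. The source does not say which resampling method was used, so random oversampling of the minority class is assumed here.

```python
import random
from collections import Counter
from statistics import median, mode

CATEGORICAL = ["BPMeds"]             # assumed subset of the categorical columns
CONTINUOUS = ["totChol", "glucose"]  # assumed subset of the continuous columns
TARGET = "TenYearCHD"


def impute(rows):
    """Drop 'education', then fill None with the mode (categorical) or median (continuous)."""
    rows = [{k: v for k, v in r.items() if k != "education"} for r in rows]
    for col in CATEGORICAL + CONTINUOUS:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = mode(observed) if col in CATEGORICAL else median(observed)
        for r in rows:
            if r[col] is None:
                r[col] = fill
    return rows


def oversample(rows, seed=0):
    """Randomly duplicate rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(r[TARGET], []).append(r)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(largest - len(group)))
    return balanced


# Toy rows invented for illustration only (not taken from the real dataset)
rows = [
    {"education": 1, "BPMeds": 0, "totChol": 200.0, "glucose": 80.0, "TenYearCHD": 0},
    {"education": None, "BPMeds": None, "totChol": 240.0, "glucose": None, "TenYearCHD": 0},
    {"education": 2, "BPMeds": 0, "totChol": None, "glucose": 90.0, "TenYearCHD": 0},
    {"education": None, "BPMeds": 1, "totChol": 260.0, "glucose": 100.0, "TenYearCHD": 1},
]

clean = impute(rows)
balanced = oversample(clean)
print(Counter(r[TARGET] for r in balanced))
```

If oversampling is the method used, it is usually applied to the training split only, so that duplicated rows do not also appear in the test set.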
Various machine learning models were trained and evaluated:
- Random Forest Classifier
- AdaBoost Classifier
- Gradient Boosting Classifier
- Logistic Regression
- Support Vector Classifier (SVC)
- K-Nearest Neighbors (KNN)
- Decision Tree Classifier
- Gaussian Naive Bayes
- XGBoost Classifier
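A comparison like the one listed above might look roughly as follows with scikit-learn. This is a sketch under stated assumptions: the data is synthetic (generated with `make_classification` as a stand-in for the 14 feature columns), the 80/20 split and all hyperparameters are library defaults or guesses, and none of it reproduces the project's reported results. XGBoost is left out so that the example depends only on scikit-learn. `XGBClassifier` from the `xgboost` package follows the same `fit`/`predict` interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data (NOT the real dataset)
X, y = make_classification(n_samples=400, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are assumptions; the source does not state them
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Fit every model on the same training split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```

Keeping the models in a dictionary means every candidate is trained and scored on an identical split, which makes the accuracies directly comparable.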
Each model was evaluated using the following metrics:
- Accuracy
- Classification Report (Precision, Recall, F1-Score)
- Confusion Matrix
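To show how these metrics relate to one another, the sketch below derives accuracy, precision, recall, and F1-score from the four cells of a binary confusion matrix. The labels are made up for illustration, and the function names are hypothetical. In practice, scikit-learn's `accuracy_score`, `classification_report`, and `confusion_matrix` in `sklearn.metrics` compute the same quantities.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def binary_report(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = developed heart disease, 0 = did not
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
print(binary_report(y_true, y_pred))
```

On an imbalanced target such as TenYearCHD, recall on the positive class is worth reading alongside accuracy, because a model can score high accuracy while missing most positive cases.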
The Random Forest Classifier achieved the highest accuracy of the models evaluated, 97.36%, and also performed well on precision, recall, and F1-score.
A predictive system was developed to classify whether a new patient is at risk of developing heart disease based on their health parameters. This system uses the trained Random Forest model.
To use the predictive system, provide the following patient information:
- Gender
- Age
- Smoking status
- Number of cigarettes per day
- Blood pressure medication status
- Stroke history
- Hypertension history
- Diabetes status
- Total cholesterol level
- Systolic blood pressure
- Diastolic blood pressure
- Body Mass Index (BMI)
- Heart rate
- Glucose level
The system will output whether the patient is at risk of heart disease.
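One way such a predictive system could be wired up is sketched below. The function `predict_risk`, the `FEATURES` ordering, and the output strings are assumptions rather than the project's actual interface. `StubModel` is a purely hypothetical stand-in so that the example runs without training anything. Its threshold rule is arbitrary and has no clinical meaning. A fitted scikit-learn classifier could be passed in its place, since it exposes the same `predict` method.

```python
# Column order assumed to match the order used when the model was trained
FEATURES = [
    "male", "age", "currentSmoker", "cigsPerDay", "BPMeds",
    "prevalentStroke", "prevalentHyp", "diabetes", "totChol",
    "sysBP", "diaBP", "BMI", "heartRate", "glucose",
]


def predict_risk(model, patient):
    """Build one feature row in training-column order and return a readable label."""
    missing = [name for name in FEATURES if name not in patient]
    if missing:
        raise ValueError(f"Missing patient fields: {missing}")
    row = [float(patient[name]) for name in FEATURES]
    prediction = model.predict([row])[0]
    if prediction == 1:
        return "At risk of heart disease"
    return "Not at risk of heart disease"


class StubModel:
    """Hypothetical placeholder, NOT the trained Random Forest."""

    def predict(self, rows):
        # Arbitrary illustrative rule: flag rows with systolic BP of 140 or more
        sys_bp = FEATURES.index("sysBP")
        return [1 if row[sys_bp] >= 140 else 0 for row in rows]


# Example patient with invented values
patient = {
    "male": 1, "age": 55, "currentSmoker": 1, "cigsPerDay": 20, "BPMeds": 0,
    "prevalentStroke": 0, "prevalentHyp": 1, "diabetes": 0, "totChol": 250,
    "sysBP": 150, "diaBP": 95, "BMI": 28.5, "heartRate": 80, "glucose": 90,
}

print(predict_risk(StubModel(), patient))
```

Validating that every field is present before calling the model guards against silently feeding values into the wrong feature positions.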
This project demonstrates the use of machine learning for predicting ten-year heart disease risk. The Random Forest Classifier was the most accurate of the models evaluated. A model of this kind could assist healthcare professionals in early diagnosis and intervention for patients at risk of heart disease.