This repository implements four machine learning models that predict the impact of mutations from mutation data and protein sequence data. The models are:
- Random Forest
- Support Vector Machine (SVM)
- Convolutional Neural Network (CNN)
- Gated Recurrent Unit (GRU)
Two main datasets are used in this project:
- Mutation Data: Contains information about mutations in the BRCA1 and BRCA2 genes.
- Protein Data: Contains protein sequences related to the BRCA1 and BRCA2 genes.
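This README does not describe how the protein sequences are turned into model inputs. As an illustration only, one common approach is one-hot encoding over the 20 standard amino acids. The function name `one_hot_encode`, the `max_len` parameter, and the example peptide below are assumptions made for this sketch and are not taken from the repository's code.

```python
import numpy as np

# The 20 standard amino acids, in a fixed (arbitrary) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Hypothetical helper: encode a protein sequence as a (max_len, 20) matrix.

    Sequences longer than max_len are truncated, shorter ones are zero-padded,
    and residues outside the 20 standard letters are left as all-zero rows.
    """
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence[:max_len].upper()):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded


# Arbitrary short peptide used purely for illustration.
example = one_hot_encode("MDLSALRVEE", max_len=12)
print(example.shape)
```

A fixed `max_len` keeps every encoded sequence the same shape, which is what batch-based models such as a CNN or GRU typically expect.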
The Random Forest model was trained using the mutation data. The model achieved the following accuracy across 5-fold cross-validation:
- Average Accuracy: 0.95
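As a rough sketch of how an average accuracy over 5-fold cross-validation can be computed with scikit-learn, the example below uses synthetic data from `make_classification` in place of the mutation data. The hyperparameters and the stratified, shuffled split are assumptions; the actual settings in `random_forest.py` may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the mutation feature matrix (NOT the project's data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Assumed hyperparameters, chosen only for illustration.
model = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold; their mean is the reported average accuracy.
fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
average_accuracy = fold_scores.mean()

print(f"Fold accuracies: {fold_scores}")
print(f"Average accuracy: {average_accuracy:.2f}")
```

The same pattern would apply to the SVM by swapping in `sklearn.svm.SVC` as the estimator.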
The SVM model was trained using the mutation data. The model achieved the following accuracy across 5-fold cross-validation:
- Average Accuracy: 0.94
The CNN model was trained using the mutation data. The model achieved the following accuracy across 5-fold cross-validation:
- Average Accuracy: 0.97
The GRU model was trained using the mutation data. The model achieved the following accuracy across 5-fold cross-validation:
- Average Accuracy: 0.97
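For models that are not scikit-learn estimators, such as a CNN or GRU written in PyTorch, the 5-fold average is usually computed with an explicit loop over the folds. The sketch below shows that loop under stated assumptions: `average_cv_accuracy` and `majority_baseline` are hypothetical names, the data is synthetic, and the baseline stands in for whatever function trains a network on one fold and returns its test accuracy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def majority_baseline(X_train, y_train, X_test, y_test):
    """Hypothetical placeholder for a real training routine.

    Predicts the most common training label and returns the test accuracy.
    """
    majority = np.bincount(y_train).argmax()
    predictions = np.full(len(y_test), majority)
    return float((predictions == y_test).mean())


def average_cv_accuracy(X, y, train_and_score, n_splits=5, seed=0):
    """Run stratified k-fold CV and return (mean accuracy, per-fold accuracies)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [
        train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx])
        for train_idx, test_idx in cv.split(X, y)
    ]
    return float(np.mean(scores)), scores


# Synthetic stand-in data (NOT the project's data): 70 negatives, 30 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = np.array([0] * 70 + [1] * 30)

mean_acc, scores = average_cv_accuracy(X, y, majority_baseline)
print(f"Per-fold accuracy: {scores}")
print(f"Average accuracy: {mean_acc:.2f}")
```

Comparing a model against a majority-class baseline like this one is a quick check that a high accuracy is not just a reflection of class imbalance.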
- Clone the repository:
git clone https://github.com/yonas650/ML-Models-for-Mutation-Impact-analysis.git
- Navigate to the project directory:
cd ML-Models-for-Mutation-Impact-analysis
- Install the required dependencies:
pip install -r requirements.txt
- Run the models:
python random_forest.py
python svm.py
python cnn.py
python gru.py
- Python 3.x
- pandas
- numpy
- scikit-learn
- imbalanced-learn
- xgboost
- torch (for CNN and GRU models)
- matplotlib
- seaborn
This project is licensed under the MIT License.