This Malware Detection machine learning model predicts the likelihood of a Windows machine being infected by various families of malware. Using telemetry data from Windows Defender, which includes machine configurations, operating system details, and antivirus product states, the project takes a stacked-model approach to improve detection performance, reaching a ROC-AUC score of 0.6698 (the highest score in the Kaggle competition was 0.7114).
The project combines the following machine learning models:
- Random Forest: aggregates predictions from multiple decision trees for improved accuracy and reduced overfitting.
- XGBoost: uses gradient boosting, handles outliers well, and runs efficiently on large datasets.
- LightGBM: uses leaf-wise tree growth to improve performance on large-scale data.
To run the project and use the provided notebooks:
- Navigate to the repository directory:
cd Malware-Detection
- Start Jupyter Notebook:
jupyter notebook
- Download the dataset: You can obtain the dataset from the Microsoft Malware Prediction competition on Kaggle.
- Alternatively, you can use this direct link to download the dataset: [link to the dataset file]
- Open the notebooks: In Jupyter Notebook, open the provided .ipynb files and execute the cells to run the code. Note that preprocessing or dataset formatting might be required before running the notebooks.
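One common formatting step for this dataset is loading the CSV with explicit reduced dtypes, since the full Kaggle file is several gigabytes. This is a minimal sketch using an in-memory stand-in for the real file; the column names (MachineIdentifier, SmartScreen, HasDetections) follow the Kaggle dataset, but the dtype choices are illustrative, not the project's exact mapping.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for data/train.csv (the real file is far larger).
csv_text = (
    "MachineIdentifier,SmartScreen,HasDetections\n"
    "a1,RequireAdmin,0\n"
    "b2,Off,1\n"
)

# Reduced dtypes: categoricals for high-cardinality strings, int8 for the
# binary target, to keep memory usage manageable on the full dataset.
dtypes = {"SmartScreen": "category", "HasDetections": "int8"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
```

With the real dataset, the same `dtype` mapping would be passed to `pd.read_csv("data/train.csv", ...)`.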
Malware-Detection/
├── data/ # Dataset files.
├── models/ # Saved trained models.
├── stacked-model-on-all-rows.ipynb # Main notebook.
└── README.md # Project documentation.
The project uses a stacked model approach with the following components:
Random Forest: achieves a ROC-AUC score of 0.6681.
XGBoost: handles outliers effectively and achieves a ROC-AUC score of 0.6352.
LightGBM: achieves a ROC-AUC score of 0.6692.
Stacking model: a logistic regression model takes the outputs of all three base models as input, improving the ROC-AUC score to 0.6698.
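The stacking step can be sketched with scikit-learn's `StackingClassifier`: base learners feed their predicted probabilities into a logistic regression meta-model. The sklearn ensembles below are self-contained stand-ins for the project's XGBoost/LightGBM models; the wiring is the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),   # meta-model over base outputs
    stack_method="predict_proba",           # meta-model sees probabilities
    cv=3,                                   # out-of-fold predictions avoid leakage
)
stack.fit(X, y)
proba = stack.predict_proba(X)[:, 1]
```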
Abdulrahman tried different classification models, including:
SVC, which took 4 hours to run on a 100k-row sample of the data.
KNN, where the best k was determined by GridSearchCV (best k was 30), reaching a ROC-AUC score of 0.62.
Logistic regression, which reached a ROC-AUC score of 0.6153.
Random Forest hyperparameter tuning was performed with MLflow (the scores above are before hyperparameter tuning).
XGBoost hyperparameter tuning was also performed with MLflow.
LightGBM hyperparameters were determined by GridSearchCV run on a 100k-row sample of the data.
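The GridSearchCV selection described above looks like the following sketch, shown here for the KNN k search on synthetic data; the grid values and sample size are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 100k-row sample used in the project.
X, y = make_classification(n_samples=300, random_state=1)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 10, 30]},  # illustrative grid
    scoring="roc_auc",                        # same metric as the competition
    cv=3,
)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```

The same pattern applies to the LightGBM search, with `param_grid` holding LightGBM hyperparameters instead.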
ROC-AUC is used as the evaluation metric to assess model performance; it is particularly suitable for imbalanced datasets and is the metric required for submission to the Kaggle competition.
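On a toy example, the metric depends only on how well predicted probabilities rank positives above negatives, which is why it remains informative on imbalanced data:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly -> AUC = 0.75.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```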
To set up the project locally, follow these steps:
git clone https://github.com/yourusername/Malware-Detection.git
Create a Conda environment and install the required dependencies:
conda create --name malware_detection python=3.11.8
conda activate malware_detection
pip install pandas numpy scikit-learn lightgbm xgboost category-encoders statsmodels matplotlib seaborn
Abdulrahman welcomes contributions. To contribute, please fork the repository and submit a pull request with your proposed changes.
This project is based on the Microsoft Malware Prediction Competition on Kaggle. Special thanks to the following tools and libraries used:
Pandas.
NumPy.
Scikit-learn.
LightGBM.
XGBoost.
Category Encoders.
Statsmodels.
Matplotlib.
Seaborn.