abdohisham12/malware-detection

🛡️ Malware Detection Project

Overview

This project trains a machine learning model that predicts the likelihood of a Windows machine being infected by various families of malware. It uses telemetry data from Windows Defender, which includes machine configurations, operating system details, and antivirus product states. A stacked model approach is used to improve detection, reaching a ROC-AUC score of 0.6698; the highest score reported on Kaggle was 0.7114.

Stacked Model Approach

The project combines the following machine learning models:

  • Random Forest: Aggregates predictions from multiple decision trees for improved accuracy and reduced overfitting.

  • XGBoost: Gradient-boosted trees that handle outliers and run efficiently on large datasets.

  • LightGBM: Uses leaf-wise tree growth to improve performance on large-scale data.

[Figure: malware detection overview]

Usage

To run the project and use the provided notebooks:

  1. Navigate to the repository directory: cd Malware-Detection
  2. Start Jupyter Notebook: jupyter notebook
  3. Download the dataset: You can obtain the dataset from the Microsoft Malware Prediction Competition page on Kaggle.
  4. Open the notebooks: In Jupyter Notebook, open the provided .ipynb files and execute the cells to run the code. Note that preprocessing or dataset formatting might be required before running the notebooks.

Project Structure

Malware-Detection/

├── data/ # Dataset files.
├── models/ # Saved trained models.
├── stacked-model-on-all-rows.ipynb # Main notebook.
└── README.md # Project documentation.

Model Details

The project uses a stacked model approach with the following components:

[Figure: models]

Random Forest: ROC-AUC score of 0.6681.
XGBoost: handles outliers effectively; ROC-AUC score of 0.6352.
LightGBM: ROC-AUC score of 0.6692.
Stacking model: a logistic regression model takes the outputs of all three models as its inputs and combines them into a single prediction, improving the ROC-AUC score to 0.6698.

Abdulrahman also tried other classification models:
SVC: took 4 hours to run on a 100k-row sample of the data.
KNN: the best k was determined by GridSearchCV (best k was 30) and gave a ROC-AUC score of 0.62.
Logistic regression: gave a ROC-AUC score of 0.6153.
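
A search for the best k with GridSearchCV might look like the sketch below. The candidate values, the scaling step, the fold count, and the synthetic data are all assumptions made for illustration; the README does not state the grid that was actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a sample of the real dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# KNN is distance-based, so features are scaled first (an assumed step).
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical candidate values for k.
candidate_k = [5, 10, 20, 30, 40]
param_grid = {"kneighborsclassifier__n_neighbors": candidate_k}

# Score each candidate by cross-validated ROC-AUC.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(f"best k: {best_k}, mean CV ROC-AUC: {search.best_score_:.4f}")
```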

Hyperparameter tuning

Random Forest hyperparameter tuning was done with MLflow.
Before hyperparameter tuning:
[Figure: rf roc score]

After hyperparameter tuning:
[Figure: random forest]

XGBoost hyperparameter tuning was also done with MLflow.
[Figure: xgboost]

LightGBM hyperparameters were determined by GridSearchCV, run on a 100k-row sample of the data.
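
The README does not say how the 100k-row sample was drawn. One option is a stratified sample that preserves the class balance, sketched below. The helper draw_sample is hypothetical, the demo frame is synthetic, and the target column name HasDetections is taken from the Kaggle competition rather than from this repository.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def draw_sample(df, n_rows, target, seed=0):
    """Return a stratified random sample of n_rows rows (hypothetical helper)."""
    if n_rows >= len(df):
        return df.copy()
    # train_test_split keeps the class proportions of `target` in the sample.
    sample, _ = train_test_split(
        df, train_size=n_rows, stratify=df[target], random_state=seed
    )
    return sample


# Small synthetic frame standing in for the full training data (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "feature": rng.normal(size=1000),
        "HasDetections": rng.integers(0, 2, size=1000),
    }
)

sample = draw_sample(demo, n_rows=100, target="HasDetections")
print(len(sample), sample["HasDetections"].mean())
```

On the real data the call would use n_rows=100_000, and the resulting frame would be what GridSearchCV is fitted on.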

Evaluation Metric

ROC-AUC is used as the evaluation metric to assess model performance. It does not depend on a classification threshold, is commonly used for imbalanced datasets, and is the metric required for submissions to the Kaggle competition.
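
To make the metric concrete: ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The pure-Python function below is an illustrative sketch of that definition on toy data; in practice sklearn.metrics.roc_auc_score computes the same quantity more efficiently.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # ties count as half
    return correct / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the scores reported above (0.62 to 0.67) in context.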

Installation

To set up the project locally, follow these steps:

Step 1: Clone the repository

git clone https://github.com/yourusername/Malware-Detection.git

Step 2: Set up the environment

Create a Conda environment and install the required dependencies:

conda create --name malware_detection python=3.11.8     
conda activate malware_detection   
pip install pandas numpy scikit-learn lightgbm xgboost category-encoders statsmodels matplotlib seaborn   

Contributing

Abdulrahman welcomes contributions. To contribute, please fork the repository and submit a pull request with your proposed changes.

Acknowledgments

This project is based on the Microsoft Malware Prediction Competition on Kaggle. Special thanks to the following tools and libraries used:

Pandas.
NumPy.
Scikit-learn.
LightGBM.
XGBoost.
Category Encoders.
Statsmodels.
Matplotlib.
Seaborn.
