This Malware Detection machine learning model predicts the likelihood of a Windows machine being infected by various families of malware. Using telemetry data from Windows Defender, which includes machine configurations, operating system details, and antivirus product states, the project takes a stacked-model approach to improve detection performance, reaching a ROC-AUC score of 0.6698 (the highest score in the Kaggle competition was 0.7114).
The project combines the following machine learning models:
- Random Forest: aggregates predictions from multiple decision trees for improved accuracy and reduced overfitting.
- XGBoost: uses gradient boosting, handles outliers well, and runs efficiently on large datasets.
- LightGBM: uses leaf-wise tree growth to improve performance on large-scale data.
To run the project and use the provided notebooks:
- Navigate to the repository directory:
cd Malware-Detection
- Start Jupyter Notebook:
jupyter notebook
- Download the dataset: You can obtain the dataset from the Microsoft Malware Prediction competition on Kaggle.
- Alternatively, you can use this direct link to download the dataset: [link to the dataset file]
- Open the notebooks: In Jupyter Notebook, open the provided .ipynb files and execute the cells to run the code. Note that preprocessing or dataset formatting might be required before running the notebooks.
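One common formatting step for this dataset is loading the CSV with explicit reduced dtypes, since the full Kaggle file is several gigabytes. This is a minimal sketch using an in-memory stand-in for the real file; the column names (MachineIdentifier, SmartScreen, HasDetections) follow the Kaggle dataset, but the dtype choices are illustrative, not the project's exact mapping.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for data/train.csv (the real file is far larger).
csv_text = (
    "MachineIdentifier,SmartScreen,HasDetections\n"
    "a1,RequireAdmin,0\n"
    "b2,Off,1\n"
)

# Reduced dtypes: categoricals for high-cardinality strings, int8 for the
# binary target, to keep memory usage manageable on the full dataset.
dtypes = {"SmartScreen": "category", "HasDetections": "int8"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
```

With the real dataset, the same `dtype` mapping would be passed to `pd.read_csv("data/train.csv", ...)`.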
Malware-Detection/
├── data/ # Dataset files.
├── models/ # Saved trained models.
├── stacked-model-on-all-rows.ipynb # Main notebook.
└── README.md # Project documentation.
The project uses a stacked model approach with the following components:
Random Forest: achieves a ROC-AUC score of 0.6681.
XGBoost: handles outliers effectively and achieves a ROC-AUC score of 0.6352.
LightGBM: achieves a ROC-AUC score of 0.6692.
Stacking model: a logistic regression model takes the outputs of all three base models as input, improving the ROC-AUC score to 0.6698.
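The stacking step can be sketched with scikit-learn's `StackingClassifier`: base learners feed their predicted probabilities into a logistic regression meta-model. The sklearn ensembles below are self-contained stand-ins for the project's XGBoost/LightGBM models; the wiring is the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),   # meta-model over base outputs
    stack_method="predict_proba",           # meta-model sees probabilities
    cv=3,                                   # out-of-fold predictions avoid leakage
)
stack.fit(X, y)
proba = stack.predict_proba(X)[:, 1]
```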
Abdulrahman tried different classification models, including:
SVC, which took 4 hours to run on a 100k-row sample of the data.
KNN, where the best k was determined by GridSearchCV (best k was 30), reaching a ROC-AUC score of 0.62.
Logistic regression, which reached a ROC-AUC score of 0.6153.
Random Forest hyperparameter tuning was performed with MLflow (the scores above are before hyperparameter tuning).
XGBoost hyperparameter tuning was also performed with MLflow.
LightGBM hyperparameters were determined by GridSearchCV run on a 100k-row sample of the data.
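The GridSearchCV selection described above looks like the following sketch, shown here for the KNN k search on synthetic data; the grid values and sample size are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 100k-row sample used in the project.
X, y = make_classification(n_samples=300, random_state=1)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 10, 30]},  # illustrative grid
    scoring="roc_auc",                        # same metric as the competition
    cv=3,
)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```

The same pattern applies to the LightGBM search, with `param_grid` holding LightGBM hyperparameters instead.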
ROC-AUC is used as the evaluation metric to assess model performance; it is particularly suitable for imbalanced datasets and is the metric required for submission to the Kaggle competition.
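On a toy example, the metric depends only on how well predicted probabilities rank positives above negatives, which is why it remains informative on imbalanced data:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly -> AUC = 0.75.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```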
To set up the project locally, follow these steps:
git clone https://github.com/yourusername/Malware-Detection.git
Create a Conda environment and install the required dependencies:
conda create --name malware_detection python=3.11.8
conda activate malware_detection
pip install pandas numpy scikit-learn lightgbm xgboost category-encoders statsmodels matplotlib seaborn
Abdulrahman welcomes contributions. To contribute, please fork the repository and submit a pull request with your proposed changes.
This project is based on the Microsoft Malware Prediction Competition on Kaggle. Special thanks to the following tools and libraries used:
Pandas.
NumPy.
Scikit-learn.
LightGBM.
XGBoost.
Category Encoders.
Statsmodels.
Matplotlib.
Seaborn.