ipleiria-ciic/data-augmentation-iiot

GPT and Interpolation-based Data Augmentation for Multiclass Intrusion Detection in IIoT

Description

The Industrial Internet of Things (IIoT) leverages interconnected devices for data collection, monitoring, and analysis in industrial processes. Despite its benefits, IIoT introduces cybersecurity vulnerabilities due to inadequate security protocols. This work focuses on intrusion detection in IIoT networks, addressing challenges of limited and imbalanced datasets.

Prior works have proposed Machine Learning (ML) for intrusion detection in IIoT, but ML models rely on diverse and representative training data. Limited datasets and class imbalance hinder model generalization, which motivates the use of data augmentation.

Figure 1: Workflow with alternative scenarios for IIoT traffic data augmentation and classification evaluation.

Data augmentation involves creating artificial data to address imbalances. In tabular data, methods like SMOTE generate synthetic samples. Recent works, such as REaLTabFormer and GReaT, explore GPT-based models for generating realistic tabular data.
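As an illustration of the interpolation idea behind SMOTE, the following is a minimal sketch in plain Python, not the code used in this repository. It assumes purely numeric features; the function name `smote_like_samples`, its parameters, and the toy rows are hypothetical. Each synthetic row is placed on the line segment between a minority-class row and one of its nearest minority-class neighbours.

```python
import random

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other rows by squared Euclidean distance to the base row.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority-class rows (hypothetical values, two numeric features).
minority_rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 1.5], [1.5, 0.5]]
new_rows = smote_like_samples(minority_rows, n_new=5)
```

Because every synthetic row lies between two existing rows, its feature values stay within the range already observed for the minority class. Categorical features would need separate handling, which this sketch does not attempt.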

This work evaluates the impact of data augmentation on intrusion detection in IIoT. Using a dataset of IIoT traffic, we compare classifier performance when the training data is augmented with GPT-based methods, SMOTE, and interpolation-based methods.

TL;DR

The evaluation uses IIoT traffic data from the Edge-IIoTset dataset. The dataset contains up to 2 million records, representing 15 different classes of network attacks.

Results reveal varied impacts on different algorithms. XGBoost responds consistently whether or not data augmentation is applied. Random Forest benefits from augmentation, TabNet shows no clear trend, and MLP improves with SMOTE augmentation. Further analysis indicates that GPT-based methods may generate out-of-distribution data, which influences classification performance.
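One simple way to look for out-of-distribution synthetic rows is a per-feature range check against the real data. The sketch below is only an illustration under that assumption; the README does not state how the analysis was actually done, and the function name `out_of_range_fraction` and the toy values are hypothetical.

```python
def out_of_range_fraction(real_rows, synthetic_rows):
    """Fraction of synthetic rows with at least one feature value outside
    the [min, max] range observed for that feature in the real data."""
    n_features = len(real_rows[0])
    lows = [min(row[j] for row in real_rows) for j in range(n_features)]
    highs = [max(row[j] for row in real_rows) for j in range(n_features)]
    flagged = sum(
        any(not (lows[j] <= row[j] <= highs[j]) for j in range(n_features))
        for row in synthetic_rows
    )
    return flagged / len(synthetic_rows)

# Toy data (hypothetical): the second and third synthetic rows leave the real range.
real = [[0.0, 10.0], [1.0, 20.0], [2.0, 15.0]]
synthetic = [[0.5, 12.0], [3.0, 11.0], [1.0, 25.0], [1.5, 18.0]]
fraction = out_of_range_fraction(real, synthetic)
```

A range check like this only catches values outside the observed bounds. Rows that fall inside the bounds but in low-density regions would need a density or distance-based test instead.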

Figure 2: Comparative Results of Multiclass Classification Performance Using Macro Average (%).
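Macro averaging gives every class the same weight, which matters on an imbalanced dataset because rare attack classes count as much as frequent ones. The sketch below shows macro-averaged F1 in plain Python. It assumes F1 as the per-class metric (the caption names only the macro average), and the class labels and predictions are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy imbalanced example (hypothetical labels): one of two "ddos" rows is missed.
y_true = ["normal"] * 8 + ["ddos", "ddos"]
y_pred = ["normal"] * 8 + ["normal", "ddos"]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the accuracy is 0.9 while the macro F1 is about 0.80, because the missed minority-class row lowers that class's F1 to 2/3 and the two classes are averaged equally.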

This work underscores the nuanced impact of data augmentation on intrusion detection in IIoT. GPT-based methods introduce challenges, emphasizing the importance of systematic evaluation. Notably, XGBoost, a top-performing algorithm in this task, shows limited improvement with data augmentation.

Repository structure

dataAugmentationTests/ 📁                  
├── notebooks/ 📓
│   ├── 1_data_analysis_<augmentation_method>.ipynb     📊: Data analysis
│   ├── 2_<augmentation_method>_augmentation.ipynb      🔄: Data augmentation
│   ├── 3_<augmentation_method>_evaluation.ipynb        📈: Evaluation
│   └── ...                       
├── src/ 📜
│   ├── utils.py           🛠️: Utility functions
│   └── ...                
├── results/📋
│   ├── metrics/           📝: Evaluation metrics CSV files
│   └── conf_matrix/       📉: Confusion matrix CSV files             
├── data/ 📂
├── old_repo/              🗄️: Previous repository backup
├── assets/                🖼️: Figures and logos
│
├── .gitignore 🚫
├── README.md              📖: Project README file
└── requirements.txt       📄: Dependencies

To-do list

  • Update requirements.txt

Citation (to be updated)

@article{melicias2023gpt,
    author = {Melícias, Francisco S. and Ribeiro, Tiago F. R. and Rabadão, Carlos and Santos, Leonel and Costa, Rogério Luís de C.},
    title = {GPT and Interpolation-based Data Augmentation for Multiclass Intrusion Detection in IIoT},
    journal = {IEEE Access},
    year={2024},
    doi={10.1109/ACCESS.2024.3360879},
    corresponding_author = {Rogério Luís de C. Costa (e-mail: rogerio.l.costa@ipleiria.pt)}
}

Acknowledgements

This work is partially funded by FCT - Fundação para a Ciência e a Tecnologia, I.P., through projects UIDB/04524/2020, and under the Scientific Employment Stimulus - Institutional Call - CEECINST/00051/2018, and by ANI - Agência Nacional de Inovação, S.A., through project POCI-01-0247-FEDER-046083.