A Structural Analysis of Intrusion Detection System Datasets and Their Practical Implications

Intrusion Detection Systems (IDSs) rely heavily on the quality and representativeness of the datasets used for training and evaluation. However, many publicly available IDS datasets present significant challenges, such as extreme class imbalance, redundant records, and inconsistencies in feature representation. This paper presents an exploratory and comparative analysis of multiple IDS datasets, focusing on data quality, feature characterization, class distribution, and inherent limitations that may affect machine-learning-based detection approaches. Our findings highlight critical issues that must be considered before these datasets are applied in real-world IDS studies.

Caio Bruno Bezerra de Souza
Centro de Informática - Universidade Federal de Pernambuco (CIn-UFPE)
Brazil

Luiz Henrique Brito Almeida da Silva
Centro de Informática - Universidade Federal de Pernambuco (CIn-UFPE)
Brazil

Jose Ronaldo de Souza Silva
Centro de Informática - Universidade Federal de Pernambuco (CIn-UFPE)
Brazil

Marcos Rocha de Moraes Falcão
Centro de Informática - Universidade Federal de Pernambuco (CIn-UFPE)
Brazil

Andson Marreiros Balieiro
Centro de Informática - Universidade Federal de Pernambuco (CIn-UFPE)
Brazil