The cybersecurity of Industrial Control Systems that manage critical infrastructure such as Water Distribution Systems (WDS) has become increasingly important as digital connectivity expands. The BATADAL benchmark dataset is a useful resource for testing intrusion detection techniques, but it poses several challenges, including class imbalance, multivariate temporal dependence, and stealthy attacks. We propose a hybrid ensemble learning model that improves the detection of cyber-attacks in WDS by exploiting the complementary strengths of machine learning and deep learning models. Three base learners, namely Random Forest, eXtreme Gradient Boosting, and a Long Short-Term Memory network, are rigorously compared, along with seven ensemble variants built with simple averaging and with stacked learning using a logistic regression meta-learner. The top predictors identified by Random Forest analysis were transformed into temporal and statistical features, and the Synthetic Minority Oversampling Technique (SMOTE) was applied to address class imbalance. The analysis indicates that the standalone Long Short-Term Memory network performs poorly (F1 = 0.000, AUC = 0.4460), whereas tree-based models, especially eXtreme Gradient Boosting, perform well (F1 = 0.7470, AUC = 0.9684). The hybrid stacked ensemble of Random Forest, eXtreme Gradient Boosting, and the Long Short-Term Memory network achieved the best result, with an attack-class F1-score of 0.7205 and an AUC of 0.9826, indicating that heterogeneous stacking can balance model precision and generalization. The proposed framework establishes a robust and scalable solution for cyber-attack detection in time-dependent industrial systems, integrating temporal learning and ensemble diversity to support the secure operation of critical infrastructure.
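The two ensemble strategies named in the abstract, simple averaging and stacking with a logistic regression meta-learner, can be illustrated with a minimal sketch. It is not the authors' implementation. The base-learner scores are synthetic stand-ins rather than BATADAL outputs, the function names (`average_ensemble`, `fit_logistic_meta`, `stacked_predict`) are hypothetical, and the meta-learner is fitted with plain gradient descent to keep the example dependency-free. Two columns imitate informative learners and the third an uninformative one.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def average_ensemble(prob_rows):
    """Simple averaging: mean attack probability across base learners."""
    return [sum(row) / len(row) for row in prob_rows]


def fit_logistic_meta(prob_rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-learner with batch gradient descent."""
    n, k = len(prob_rows), len(prob_rows[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for row, y in zip(prob_rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - y
            for j in range(k):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def stacked_predict(prob_rows, w, b):
    """Combine base-learner probabilities with the fitted meta-learner."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) for row in prob_rows]


def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


# Assumption: synthetic scores for illustration only (40 attacks, 160 normal).
random.seed(0)
labels = [1] * 40 + [0] * 160


def noisy_score(y):
    """Hypothetical informative learner: score centred near the true label."""
    return min(1.0, max(0.0, 0.15 + 0.7 * y + random.gauss(0.0, 0.15)))


# Columns 0 and 1 are informative; column 2 is pure noise.
base_probs = [[noisy_score(y), noisy_score(y), random.random()] for y in labels]

averaged = average_ensemble(base_probs)
w, b = fit_logistic_meta(base_probs, labels)
stacked = stacked_predict(base_probs, w, b)

print("averaging accuracy:", accuracy(averaged, labels))
print("stacking accuracy: ", accuracy(stacked, labels))
print("meta-learner weights:", [round(wi, 2) for wi in w])
```

Averaging gives every base learner the same influence, while the meta-learner fits one weight per learner and can therefore discount a weak one. The sketch scores the meta-learner on the same data used to fit it, which is optimistic. A stacking pipeline would normally train the meta-learner on out-of-fold base predictions and report attack-class F1 and AUC rather than accuracy.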