In dynamic Windows malware detection, deep learning models are widely deployed to analyze API sequences, and API sequence-based methods play a crucial role in malware prevention. However, continuous API updates and changes in API call sequences drive the constant evolution of malware variants, so the detection capability of API sequence-based malware detection models diminishes significantly over time. We observe that the API sequences of malware samples before and after evolution usually have similar malicious semantics. Specifically, evolved malware samples often reuse the API sequences of their pre-evolution counterparts to achieve similar malicious behaviors: for instance, they access similar sensitive system resources and extend the original functionality with new malicious functions. In this paper, we propose MME (Mitigating the impact of Malware Evolution), a framework that enhances existing API sequence-based malware detectors and mitigates the adverse effects of malware evolution. To help detection models capture the similar semantics of post-evolution API sequences, our framework represents API sequences using API knowledge graphs and system resource encodings, and applies contrastive learning to enhance the model's encoder. Results indicate that, compared to a regular Text-CNN, our framework reduces the false positive rate by 13.10% and improves the F1-Score by 8.47% on five years of data, achieving the best results in our experiments. Additionally, evaluations show that our framework reduces the human cost of model maintenance: with only 1% of the budget per month, it reduces the false positive rate by 11.16% and improves the F1-Score by 6.44%.
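To make the contrastive-learning idea concrete, the following is a minimal sketch of a generic supervised contrastive loss over encoder outputs. It is not the loss defined by MME, whose exact formulation is not given in this abstract. The function names, the `temperature` value, and the toy two-dimensional embeddings are all assumptions chosen for illustration. The sketch shows the intended effect: samples with the same label (for example, a malware sample and its evolved variant) are pulled together in embedding space, and samples with different labels are pushed apart.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not MME's own).

    embeddings: list of vectors produced by some encoder (hypothetical here).
    labels:     list of class labels, e.g. 1 = malware, 0 = benign.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives are the other samples that share the anchor's label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarity of the anchor to every other sample.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy embeddings: the first two vectors are close, as are the last two.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Labels that agree with the geometry versus labels that contradict it.
loss_aligned = supervised_contrastive_loss(toy_embeddings, [1, 1, 0, 0])
loss_misaligned = supervised_contrastive_loss(toy_embeddings, [1, 0, 1, 0])
print(loss_aligned < loss_misaligned)
```

The loss is small when same-label samples already sit close together and large when they do not, so minimizing it would push an encoder toward representations in which pre- and post-evolution samples with similar malicious semantics stay near each other.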