Binary program analysis has traditionally relied on manually defined rules and heuristics, which are tedious and time-consuming for human analysts to write. To improve automation and scalability, we propose an alternative direction based on distributed representations of binary programs that apply to a number of downstream tasks. We introduce Bin2vec, a new approach that uses Graph Convolutional Networks (GCN) together with computational program graphs to learn a high-dimensional representation of binary executable programs. We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks: functional algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as to published results, and demonstrate improvement over state-of-the-art methods on both tasks. We evaluate Bin2vec on 49,191 binaries for the functional algorithm classification task, and on 30 different CWE-IDs, each including at least 100 CVE entries, for the vulnerability discovery task. We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach, while working on binary code. For almost every vulnerability class in our dataset, our prediction accuracy is over 80% (and over 90% in multiple classes).
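To make the idea of learning a graph-level representation with a GCN concrete, the following is a minimal sketch of one generic graph-convolution step followed by mean pooling. It is not the Bin2vec architecture: the undirected adjacency matrix, the symmetric normalisation, the mean-pooling readout, and the names `gcn_layer`, `matmul`, and `graph_embedding` are all assumptions chosen for illustration, and the toy graph and weights are made up.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, feats, weight):
    """One generic GCN step: ReLU(D^-1/2 (A + I) D^-1/2 . H . W).

    Assumes `adj` is a symmetric 0/1 adjacency matrix (illustrative only).
    """
    n = len(adj)
    # Add self-loops so every node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalisation of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in out]


def graph_embedding(node_feats):
    """Mean-pool node vectors into a single graph-level vector (assumed readout)."""
    n = len(node_feats)
    return [sum(col) / n for col in zip(*node_feats)]


# Hypothetical 3-node path graph standing in for a tiny program graph.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up node features
weight = [[0.5, -0.2], [0.3, 0.8]]             # made-up layer weights

hidden = gcn_layer(adj, feats, weight)
embedding = graph_embedding(hidden)
print(embedding)
```

In a trained model the weights would be learned and several such layers stacked; the pooled vector would then feed a classifier for a downstream task such as algorithm classification or vulnerability discovery.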