Malicious packages in public registries pose a serious threat to software supply chain security. Current software composition analysis (SCA) tools rely on databases such as OSV and Snyk to detect these threats, but these databases suffer from delayed updates and incomplete coverage. In particular, they miss intelligence from unstructured sources such as social media and developer forums, where new threats are often first reported. This delay extends the lifecycle of malicious packages and increases the risk to downstream users. To address this, we developed a novel and comprehensive approach and built a platform, IntelliRadar, that collects disclosed malicious package names from unstructured web content. Specifically, IntelliRadar exhaustively searches and snowballs the public sources of malicious package names, and applies large language models (LLMs) with domain-specialized Least-to-Most prompts, to collect both historical and current disclosed malicious package names from diverse unstructured sources. As a result, we constructed a malicious package database containing 34,313 malicious NPM and PyPI package names. Our evaluation shows that IntelliRadar achieves 97.91% precision on malicious package intelligence extraction. Compared to existing databases, IntelliRadar identifies 7,542 more malicious package names than OSV and 12,684 more than Snyk. Furthermore, 76.6% of NPM components and 70.3% of PyPI components in IntelliRadar were collected earlier than in Snyk's database. IntelliRadar is also cost-efficient, at $0.003 per piece of malicious package intelligence and only $7 per month for continuous monitoring. Finally, using IntelliRadar, we identified 1,981 malicious packages in downstream package manager mirror registries and received confirmation for them.
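The abstract mentions Least-to-Most prompting without giving the prompts themselves. The following is a minimal sketch of the general technique only, not IntelliRadar's implementation: the sub-questions in `SUBQUESTIONS`, the function name `least_to_most_extract`, and the `ask` callable (a stand-in for any LLM client) are all assumptions introduced for illustration.

```python
from typing import Callable, List

# Hypothetical sub-questions, ordered from easiest to hardest.
# The actual prompts used by IntelliRadar are not given in the abstract.
SUBQUESTIONS = [
    "Does the text report one or more malicious software packages? Answer yes or no.",
    "Which registry does each reported package belong to (npm or PyPI)?",
    "List every malicious package name, one per line, with no extra words.",
]


def least_to_most_extract(text: str, ask: Callable[[str], str]) -> List[str]:
    """Ask the sub-questions in order, feeding earlier answers into later prompts.

    `ask` is an assumed interface: it takes a prompt string and returns the
    model's reply as a string.
    """
    context = f"Text:\n{text}\n"
    answer = ""
    for i, question in enumerate(SUBQUESTIONS):
        prompt = f"{context}\nQuestion: {question}\nAnswer:"
        answer = ask(prompt).strip()
        # Stop early when the first stage says nothing malicious is reported.
        if i == 0 and not answer.lower().startswith("yes"):
            return []
        # Carry the solved sub-problem forward so the next stage can build on it.
        context += f"\nQuestion: {question}\nAnswer: {answer}\n"
    # The final answer is expected to hold one package name per line.
    names = [line.strip() for line in answer.splitlines() if line.strip()]
    return sorted(set(names))
```

Passing the model call in as a plain callable keeps the sketch independent of any particular LLM provider and lets the staging logic be exercised with a stub in place of a real model.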