Malicious packages in public registries pose a serious threat to software supply chain security. Current software composition analysis (SCA) tools rely on databases such as OSV and Snyk to detect these threats, but these databases suffer from delayed updates and incomplete coverage. In particular, they miss intelligence from unstructured sources such as social media and developer forums, where new threats are often first reported. This delay extends the lifecycle of malicious packages and increases the risk to downstream users. To address this, we developed a comprehensive approach and built IntelliRadar, a platform that collects disclosed malicious package names from unstructured web content. Specifically, by exhaustively searching and snowballing the public sources of malicious package names, and by combining large language models (LLMs) with domain-specialized Least-to-Most prompts, IntelliRadar collects both historical and newly disclosed malicious package names from diverse unstructured sources. As a result, we constructed a malicious package database containing 34,313 malicious NPM and PyPI package names. Our evaluation shows that IntelliRadar achieves 97.91% precision on malicious package intelligence extraction. Compared with existing databases, IntelliRadar identifies 7,542 more malicious package names than OSV and 12,684 more than Snyk. Furthermore, 76.6% of NPM components and 70.3% of PyPI components in IntelliRadar were collected earlier than in Snyk's database. IntelliRadar is also cost-efficient, at $0.003 per piece of malicious package intelligence and only $7 per month for continuous monitoring. Finally, using IntelliRadar, we identified and received confirmation for 1,981 malicious packages in downstream package manager mirror registries.
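The extraction step described above, in which an LLM is guided by Least-to-Most prompts to pull package names out of unstructured text, can be illustrated with a minimal Python sketch. It is not IntelliRadar's implementation: the three sub-questions, the `ask` callable (a stand-in for whatever LLM client is used), and the helper names `normalize` and `extract_packages` are all assumptions made for illustration. The only grounded detail is the PyPI name normalization, which follows the PEP 503 rule.

```python
import json
import re
from typing import Callable, Set, Tuple

# Hypothetical sub-questions, ordered from easiest to hardest. The actual
# prompts used by IntelliRadar are not given in the abstract.
SUBQUESTIONS = [
    "Does the report disclose malicious packages? Answer yes or no.",
    "Which registry does it concern? Answer npm or pypi.",
    "List the malicious package names as a JSON array of strings.",
]


def normalize(name: str, registry: str) -> str:
    """Lowercase the name; for PyPI also apply PEP 503 normalization."""
    name = name.strip().lower()
    if registry == "pypi":
        name = re.sub(r"[-_.]+", "-", name)
    return name


def extract_packages(text: str, ask: Callable[[str], str]) -> Set[Tuple[str, str]]:
    """Run the sub-questions in order, feeding earlier answers into later prompts.

    `ask` is an assumed interface: it takes a prompt string and returns the
    model's reply as a string.
    """
    history = ""
    answers = []
    for question in SUBQUESTIONS:
        prompt = f"Report:\n{text}\n\n{history}Q: {question}\nA:"
        answer = ask(prompt).strip()
        answers.append(answer)
        history += f"Q: {question}\nA: {answer}\n"

    is_report, registry, names_json = answers
    if not is_report.lower().startswith("yes"):
        return set()
    registry = registry.lower()
    try:
        names = json.loads(names_json)
    except json.JSONDecodeError:
        return set()
    if not isinstance(names, list):
        return set()
    return {(registry, normalize(n, registry)) for n in names if isinstance(n, str)}
```

Because the result is a set of `(registry, name)` pairs, names missing from an existing database can be found with a plain set difference, for example `extracted - known_names`, where `known_names` is a hypothetical set built from that database.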