Context: Open-source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs to support the reliable identification and reuse of models for SE. Objective: To address this gap, we derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF). Method: Our repository mining study followed a five-phase pipeline: (i) identification of SE tasks from the literature; (ii) collection of PTM data from the HF API, including model card descriptions and metadata, as well as the abstracts of the associated arXiv papers; (iii) text processing to ensure consistency; (iv) a two-phase validation of SE relevance combining human judgment and LLM assistance, supported by five pilot studies with human annotators and a generalization test; and (v) data analysis. This process yielded a curated catalogue of 2,205 SE PTMs. Results: We find that most SE PTMs target code generation and coding, emphasizing implementation over early or late development stages. In terms of ML tasks, text generation dominates among SE PTMs. Notably, the number of SE PTMs has increased markedly since 2023 Q2, while evaluation remains limited: only 9.6% report benchmark results, mostly scoring below 50%. Conclusions: Our catalogue reveals documentation and transparency gaps, highlights imbalances across SDLC phases, and provides a foundation for automated SE scenarios, such as the sampling and selection of suitable PTMs.
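Step (iii) of the pipeline, text processing to ensure consistency, can be sketched minimally as follows. The exact normalization rules used in the study are not stated; the lowercasing and whitespace-collapsing below are illustrative assumptions only.

```python
import re

def normalize(text: str) -> str:
    """Illustrative normalization of model card or abstract text:
    trim, collapse whitespace runs to single spaces, and lowercase.
    (Assumed rules; the study's actual preprocessing may differ.)"""
    return re.sub(r"\s+", " ", text.strip()).lower()

# Example: two differently formatted mentions map to the same token string.
print(normalize("  Code\n Generation "))   # code generation
print(normalize("CODE   GENERATION"))      # code generation
```

Such normalization makes downstream matching of model card descriptions against the SE task taxonomy insensitive to casing and layout differences.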