The rapid growth of scientific publishing makes it increasingly difficult for researchers to discover, contextualize, and interpret relevant literature. Traditional keyword-based search systems offer limited semantic understanding, while existing AI-driven tools typically address isolated tasks such as retrieval, clustering, or bibliometric visualization. This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction. The system builds its corpus by merging full-text data from arXiv with structured metadata from OpenAlex. A hybrid retrieval architecture fuses BM25 lexical search with embedding-based semantic search using Reciprocal Rank Fusion (RRF). Topic modeling is performed on the retrieved results using either BERTopic or non-negative matrix factorization (NMF), depending on the available computational resources. A knowledge graph unifies papers, authors, institutions, countries, and extracted topics into a single interpretable structure. Together, these components provide a multi-layered exploration environment that reveals not only the relevant publications but also the conceptual and relational landscape surrounding a query. Evaluation across multiple queries shows improvements in retrieval relevance, topic coherence, and interpretability. The proposed framework provides an extensible foundation for AI-assisted scientific discovery.
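To make the fusion step concrete, the following is a minimal sketch of Reciprocal Rank Fusion over two ranked lists, one from a lexical retriever and one from an embedding-based retriever. It is an illustration under stated assumptions, not the system's actual implementation: the function name `reciprocal_rank_fusion`, the smoothing constant `k = 60` (a commonly used default), and the toy document IDs are all hypothetical. Each document receives a score of 1 / (k + rank) from every list in which it appears, and the scores are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    rankings: iterable of lists, each ordered from best to worst.
    k: smoothing constant; 60 is a common default and is an assumption here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings (hypothetical IDs) standing in for BM25 and dense retrieval.
bm25_ranking = ["p3", "p1", "p7"]
dense_ranking = ["p1", "p9", "p3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because the fusion uses only rank positions, it does not require BM25 scores and embedding similarities to be on a comparable scale, which is one reason it is a convenient choice for combining lexical and semantic retrievers.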