It's LIT! Reliability-Optimized LLMs with Inspectable Tools

From arXiv. Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on Multi-Turn Interactions in Large Language Models.

Large language models (LLMs) have exhibited remarkable capabilities across various domains. The ability to call external tools further expands their capability to handle real-world tasks. However, LLMs often follow an opaque reasoning process, which limits their usefulness in high-stakes domains where solutions need to be trustworthy to end users. LLMs can choose solutions that are unreliable and difficult to troubleshoot, even if better options are available. We address this issue by forcing LLMs to use external -- more reliable -- tools to solve problems when possible. We present a framework built on the tool-calling capabilities of existing LLMs to enable them to select the most reliable and easy-to-troubleshoot solution path, which may involve multiple sequential tool calls. We refer to this framework as LIT (LLMs with Inspectable Tools). In order to support LIT, we introduce a new and challenging benchmark dataset of 1,300 questions and a customizable set of reliability cost functions associated with a collection of specialized tools. These cost functions summarize how reliable each tool is and how easy it is to troubleshoot. For instance, a calculator is reliable across domains, whereas a linear prediction model is not reliable if there is distribution shift, but it is easy to troubleshoot. A tool that constructs a random forest is neither reliable nor easy to troubleshoot. These tools interact with the Harvard USPTO Patent Dataset and a new dataset of NeurIPS 2023 papers to solve mathematical, coding, and modeling problems of varying difficulty levels. We demonstrate that LLMs can achieve more reliable and informed problem-solving while maintaining task performance using our framework.
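
To make the idea of reliability cost functions concrete, here is a minimal sketch of how a solution path could be scored and selected. It is not the paper's implementation: the `Tool` class, the two cost fields, the numeric values, and the additive scoring rule are all hypothetical, chosen only to mirror the ordering described in the abstract (calculator is cheapest, the linear model is less reliable but easy to troubleshoot, the random forest is costly on both counts).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A tool with two hypothetical cost components (higher is worse)."""
    name: str
    reliability_cost: float    # assumed scale: how unreliable the tool is
    troubleshoot_cost: float   # assumed scale: how hard it is to debug

    @property
    def cost(self) -> float:
        # Assumption: the two components are simply added together.
        return self.reliability_cost + self.troubleshoot_cost


# Illustrative values only; they are not taken from the paper.
TOOLS = {
    "calculator": Tool("calculator", 1.0, 1.0),
    "linear_model": Tool("linear_model", 3.0, 1.0),
    "random_forest": Tool("random_forest", 4.0, 4.0),
}


def path_cost(path):
    """Total cost of a sequence of tool calls, given as a list of tool names."""
    return sum(TOOLS[name].cost for name in path)


def most_reliable_path(candidate_paths):
    """Return the candidate path with the lowest total cost."""
    return min(candidate_paths, key=path_cost)


# Two hypothetical ways to answer the same modeling question.
candidates = [
    ["random_forest"],
    ["linear_model", "calculator"],
]
best = most_reliable_path(candidates)
print(best, path_cost(best))
```

Under these assumed costs, the two-step path through the linear model and the calculator is preferred over a single random forest call, even though it needs more tool calls. Because the abstract describes the cost functions as customizable, a different weighting could reverse that choice.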
