Large language models (LLMs) often generate unreliable answers, and heuristic uncertainty methods fail to fully distinguish correct from incorrect predictions, so users end up accepting erroneous answers without any statistical guarantee. We address this issue through the lens of false discovery rate (FDR) control, ensuring that among all accepted predictions, the proportion of errors does not exceed a target risk level. To achieve this in a principled way, we propose LEC, which reinterprets selective prediction as a constrained decision problem by enforcing a Linear Expectation Constraint over selection and error indicators. We then establish a finite-sample sufficient condition, relying only on a held-out set of exchangeable calibration samples, for computing an FDR-constrained, coverage-maximizing threshold. Furthermore, we extend LEC to a two-model routing mechanism: given a prompt, if the current model's uncertainty exceeds its calibrated threshold, we delegate the prompt to a stronger model, while maintaining a unified FDR guarantee. Evaluations on closed-ended and open-ended question-answering (QA) datasets show that LEC achieves tighter FDR control and substantially improves sample retention over prior methods. Moreover, the two-model routing mechanism achieves lower risk levels while accepting more correct samples than either individual model.
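To make the calibration step concrete, the following is a minimal sketch of how an FDR-constrained, coverage-maximizing threshold could be computed from held-out calibration data. The names (`calibrate_threshold`, `scores`, `correct`) and the plug-in FDR estimate with a +1 finite-sample correction are illustrative assumptions, not the paper's exact sufficient condition.

```python
import numpy as np

def calibrate_threshold(scores, correct, alpha):
    """Pick the largest acceptance threshold (maximizing coverage) whose
    plug-in FDR estimate stays below the target level alpha.

    scores  : uncertainty scores on calibration prompts (lower = more confident)
    correct : booleans, True if the model's calibration answer was correct
    alpha   : target false discovery rate

    NOTE: illustrative sketch only. The +1 correction is a common
    finite-sample adjustment for exchangeable calibration data; the
    sufficient condition derived in the paper may differ.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    best_tau = -np.inf  # default: accept nothing
    for tau in np.sort(scores):  # candidate thresholds, ascending
        accepted = scores <= tau
        n_acc = int(accepted.sum())
        n_err = int((accepted & ~correct).sum())
        fdr_hat = (n_err + 1) / max(n_acc, 1)  # conservative plug-in estimate
        if fdr_hat <= alpha:
            best_tau = tau  # largest feasible threshold seen so far
    return best_tau

# Toy usage: accept a test answer only when its uncertainty score is <= tau.
cal_scores = [0.02, 0.10, 0.35, 0.50, 0.80]
cal_correct = [True, True, True, False, False]
tau = calibrate_threshold(cal_scores, cal_correct, alpha=0.4)  # -> 0.35
```

If no threshold satisfies the constraint, the sketch returns negative infinity, i.e., the selector abstains on every prompt rather than violate the risk budget.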
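The two-model routing mechanism can likewise be sketched as a simple cascade, assuming each model exposes an uncertainty score alongside its answer. The helper names below are hypothetical, and the thresholds `tau_weak` and `tau_strong` would need to be calibrated so that the combined system meets the unified FDR guarantee; this sketch shows only the decision rule.

```python
def route(prompt, weak_model, strong_model, tau_weak, tau_strong):
    """Cascade: answer with the weaker model if it is confident enough,
    otherwise delegate to the stronger model; abstain if both are uncertain.

    weak_model / strong_model : callables returning (answer, uncertainty)
    tau_weak / tau_strong     : each model's calibrated threshold
    """
    answer, u = weak_model(prompt)
    if u <= tau_weak:
        return answer   # accept the cheaper model's answer
    answer, u = strong_model(prompt)
    if u <= tau_strong:
        return answer   # accept the stronger model's answer
    return None         # abstain: neither answer meets the risk budget
```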