We introduce \emph{Metric-Fair Prompting}, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In multiple-choice medical question answering, each (question, option) pair is treated as a binary instance labeled $+1$ (correct) or $-1$ (incorrect). To promote \emph{individual fairness}, i.e., treating similar instances similarly, we compute question similarity from NLP embeddings and answer items in \emph{joint pairs of similar questions} rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each $(\text{question}, \text{option})$ pair to a confidence score $f(x)$, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the MedQA (US) benchmark, Metric-Fair Prompting improves over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
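As a minimal formal sketch of the Lipschitz-style constraint (the embedding map $\phi$, the distance $d$ on embeddings, and the Lipschitz constant $L$ are illustrative symbols not fixed in this abstract), the consistency requirement on paired instances $x_i = (q_i, o_i)$ and $x_j = (q_j, o_j)$ with similar questions could be written as
\[
  \left| f(x_i) - f(x_j) \right| \;\le\; L \cdot d\big(\phi(q_i), \phi(q_j)\big),
\]
so that when $d\big(\phi(q_i), \phi(q_j)\big)$ is small, the confidence scores, and hence the predicted labels, are forced to agree.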