Structured output from large language models (LLMs) has improved the efficiency of processing generated information and is increasingly adopted in industrial applications. Prior studies have investigated the impact of structured output on LLMs' generation quality, often reaching one-sided conclusions: some suggest that structured formats enhance completeness and factual accuracy, while others argue that they restrict the reasoning capacity of LLMs and degrade standard evaluation metrics. These assessments are potentially limited by restricted testing scenarios, weakly controlled comparative settings, and reliance on coarse metrics. In this work, we present a refined analysis using causal inference. Based on one assumed and two guaranteed constraints, we derive five potential causal structures characterizing the influence of structured output on LLMs' generation: (1) collider without m-bias, (2) collider with m-bias, (3) single cause from instruction, (4) single cause from output format, and (5) independence. Across seven public reasoning tasks and one task we developed, we find that coarse metrics report positive, negative, or neutral effects of structured output on GPT-4o's generation. However, causal inference reveals no causal impact in 43 of the 48 scenarios; of the remaining five, three involve multifaceted causal structures shaped by the concrete instructions. Further experiments show that OpenAI-o3 is more resilient to output formats than the general-purpose GPT-4o and GPT-4.1, highlighting a previously unrecognized advantage of reasoning models.