With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths, designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into a single organized table. Unlike conventional text-to-table tasks, which rely on fixed schemas and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. In experiments, we evaluate both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggle significantly. The benchmark is available at https://anonymous.4open.science/r/AOE-Benchmark/.