Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-versus-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.