While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to the underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables, 88% of which preserve their original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models such as Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness to real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.
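For illustration, a single TABLET example pairing a table image with its HTML serialization, metadata, and provenance might be represented as in the following minimal sketch; the field names and values here are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of one TABLET example record; field names are
# illustrative assumptions, not the released dataset's actual schema.
@dataclass
class TabletExample:
    task: str                     # one of the 20 VTU tasks, e.g. table QA
    instruction: str              # instruction posed over the table image
    answer: str                   # reference output for the instruction
    image_path: str               # rendered or original table visualization
    html: str                     # paired serialized (HTML) table representation
    is_original_render: bool      # True for tables that keep their original visuals
    source_dataset: str           # provenance: upstream dataset the table came from
    source_id: Optional[str] = None          # identifier linking back to the source example
    metadata: dict = field(default_factory=dict)  # additional table/example metadata

# Toy instance showing how the paired image-HTML representation and
# provenance fields might be populated.
example = TabletExample(
    task="table_qa",
    instruction="What is the value in the 'Total' row?",
    answer="42",
    image_path="tables/000001.png",
    html="<table><tr><td>Total</td><td>42</td></tr></table>",
    is_original_render=True,
    source_dataset="example_source",
    source_id="example_source-000001",
    metadata={"num_rows": 1, "num_cols": 2},
)
```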