We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks that require understanding document structure. ThaiOCRBench addresses this gap with a diverse, human-annotated dataset of 2,808 samples spanning 13 task categories. We evaluate a wide range of state-of-the-art VLMs, both proprietary and open-source, in a zero-shot setting. Results reveal a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench offers a standardized framework for assessing VLMs in low-resource, script-complex settings and provides actionable insights for improving Thai-language document understanding.