Artificial intelligence has made rapid progress in recent years, particularly in the development of Vision-Language Models (VLMs) that understand both visual and textual data. However, these advances remain largely limited to English, reducing their accessibility for non-English speakers. Extending these capabilities to a broader range of languages is therefore essential. This paper examines the challenges of adapting an English-trained VLM to other languages. To this end, we compare several methods in terms of performance and computational cost: a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we combine standard multimodal benchmarks translated into the target language with manual assessments by native-speaker experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of both training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.
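To make the LoRA finetuning approach concrete, the sketch below shows how low-rank adapters could be attached to a VLM's attention projections using the HuggingFace `peft` library. This is a minimal illustration under assumed settings: the base checkpoint name, rank, and target modules are hypothetical placeholders, not the configuration used in this paper.

```python
# Minimal sketch of LoRA finetuning for language adaptation of a VLM.
# The checkpoint name and hyperparameters below are illustrative assumptions.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_name = "llava-hf/llava-1.5-7b-hf"  # hypothetical base checkpoint
model = AutoModelForVision2Seq.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

# LoRA injects trainable low-rank update matrices into selected layers,
# so only a small fraction of the model's parameters is updated.
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank updates
    lora_alpha=32,                         # scaling factor for adapter output
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

The frozen base weights keep the computational cost low relative to full finetuning, which is one reason LoRA is a natural baseline for the comparison above.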