Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
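The parameter-efficient LoRA method mentioned above freezes the pretrained weights and learns only a low-rank additive update. A minimal sketch of the core update rule, in pure Python with illustrative names (the actual fine-tuning in this work uses full VLM weight matrices, not these toy shapes):

```python
# Minimal sketch of the LoRA forward pass (illustrative, not the paper's code).
# LoRA keeps the pretrained weight W frozen and learns a low-rank update
# delta_W = (alpha / r) * B @ A, where A is (r x d_in) and B is (d_out x r),
# so only r * (d_in + d_out) parameters are trained instead of d_in * d_out.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def lora_forward(W, A, B, alpha, r, x):
    """Compute (W + (alpha/r) * B @ A) @ x without materializing B @ A."""
    base = matvec(W, x)              # frozen pretrained path: W @ x
    low = matvec(B, matvec(A, x))    # down-project to rank r, then up-project
    scale = alpha / r                # standard LoRA scaling factor
    return [b + scale * l for b, l in zip(base, low)]

# Toy example: 2x2 identity base weight, rank-1 adapter, alpha = 2.
W = [[1, 0], [0, 1]]
A = [[1, 1]]          # r=1, d_in=2
B = [[1], [1]]        # d_out=2, r=1
y = lora_forward(W, A, B, alpha=2, r=1, x=[1, 2])  # -> [7.0, 8.0]
```

In practice the adapter matrices A and B are injected into the attention projections of the VLM (e.g. via a LoRA library), and only they receive gradients, which is what makes fine-tuning seven-billion-parameter models tractable.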