This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures: a baseline dual-encoder system trained from scratch with contrastive and orthogonal-projection losses, and a foundation-model approach leveraging ImageBind with LoRA fine-tuning. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates strong cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieves an EER of 24.73% on the evaluation set (English and German), securing 2nd place in the competition.