In this work, we introduce TurkEmbed4Retrieval, a retrieval-specialized variant of the TurkEmbed model, which was originally designed for Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. By fine-tuning the base model on the MS MARCO TR dataset with advanced training techniques, including Matryoshka representation learning and a tailored multiple negatives ranking loss, we achieve state-of-the-art performance on Turkish retrieval tasks. Extensive experiments demonstrate that our model outperforms Turkish ColBERT by 19.26% on key retrieval metrics for the SciFact TR dataset, thereby establishing a new benchmark for Turkish information retrieval.
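The abstract names two training components: a multiple negatives ranking loss and Matryoshka representation learning. The sketch below shows, under stated assumptions, how such a combination is typically set up with the sentence-transformers library; the checkpoint name, toy data, and Matryoshka dimensions are illustrative and not taken from the paper's released training script.

```python
# Minimal sketch: fine-tuning a sentence-embedding model with MatryoshkaLoss
# wrapped around MultipleNegativesRankingLoss (sentence-transformers >= 3.0).
# Model path, example pair, and dimensions are assumptions for illustration.
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss

# Hypothetical checkpoint; the paper fine-tunes its own TurkEmbed base model.
model = SentenceTransformer("path/to/turkembed-base")

# Toy (query, positive passage) pair standing in for MS MARCO TR; with this loss,
# the positives of other in-batch pairs serve as negatives for each query.
train_dataset = Dataset.from_dict({
    "anchor": ["İstanbul'un nüfusu kaçtır?"],
    "positive": ["İstanbul, yaklaşık 15 milyon nüfusuyla Türkiye'nin en kalabalık şehridir."],
})

base_loss = MultipleNegativesRankingLoss(model)
# Matryoshka representation learning: also supervise truncated prefixes of the
# embedding so shorter vectors (e.g. 256 or 128 dims) remain usable for retrieval.
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128])

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```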