Translating a morphology-rich, low-resource language like Latin poses significant challenges. This paper introduces a reproducible draft-based refinement pipeline that elevates open-source Large Language Models (LLMs) to a performance level statistically comparable to top-tier proprietary systems. Our method first uses a fine-tuned NLLB-1.3B model to generate a high-quality, structurally faithful draft. A zero-shot LLM (Llama-3.3 or Qwen3) then polishes this draft, a process that can be further enhanced by augmenting the prompt with retrieved in-context examples via retrieval-augmented generation (RAG). We demonstrate the robustness of this approach on two distinct benchmarks: a standard in-domain test set (Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of 12th-century Latin letters (2025). Our central finding is that this open-source RAG system achieves performance statistically comparable to the GPT-5 baseline, without any task-specific LLM fine-tuning. We release the pipeline, the Chartres OOD set, and the evaluation scripts and models to facilitate replicability and further research.
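The draft-then-refine step described above can be sketched as a prompt-assembly routine: retrieved translation examples (when RAG is enabled) are prepended to the source sentence and its NLLB draft before the zero-shot LLM is invoked. The function name and prompt wording below are illustrative assumptions, not the paper's exact template.

```python
def build_refinement_prompt(source, draft, examples=()):
    """Assemble a zero-shot refinement prompt for the polishing LLM.

    source   -- the Latin sentence to translate
    draft    -- the structurally faithful draft from the fine-tuned NLLB-1.3B model
    examples -- optional (latin, english) pairs retrieved for RAG augmentation

    Note: this is a hypothetical sketch of the pipeline's prompt construction;
    the actual template used in the paper may differ.
    """
    parts = []
    # Retrieved examples come first, giving the LLM in-context translation evidence.
    for latin, english in examples:
        parts.append(f"Latin: {latin}\nEnglish: {english}\n")
    # The source and its draft follow, with an instruction to refine the draft.
    parts.append(
        f"Latin: {source}\n"
        f"Draft translation: {draft}\n"
        "Improve the draft into fluent, faithful English:"
    )
    return "\n".join(parts)
```

In the non-RAG configuration, `examples` is simply left empty and the LLM refines the draft from the source sentence alone.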