Historical documents in the Sinosphere are known to share common formats and practices, particularly the veritable records compiled by court historians. This shared linguistic heritage has led researchers to use Classical Chinese resources for cross-lingual transfer when processing historical documents from Korea and Japan, which remain relatively low-resource. In this paper, we question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun, the ancient written languages of Korea and Japan, respectively. Our experiments on machine translation, named entity recognition, and punctuation restoration show that Classical Chinese datasets have minimal impact on language model performance for ancient Korean documents written in Hanja: performance differences stay within $\pm 0.0068$ F1 for the sequence labeling tasks and reach at most $+0.84$ BLEU for translation. These limitations persist across model sizes, architectures, and domain-specific datasets. Our analysis reveals that, for Hanja, the benefits of Classical Chinese resources diminish rapidly as local language data increases, and that substantial improvements appear only in extremely low-resource scenarios for both Korean and Japanese historical documents. These findings emphasize the need for careful empirical validation rather than assuming benefits from indiscriminate cross-lingual transfer.