Word meaning, representation, and interpretation play fundamental roles in natural language understanding (NLU), natural language processing (NLP), and natural language generation (NLG) tasks. Many of the inherent difficulties in these tasks stem from multi-word expressions (MWEs), which introduce ambiguity, idiomatic expressions, infrequent usage, and a wide range of variations. Substantial progress has been made in addressing MWEs in Western languages, particularly English, owing in part to well-established research communities and abundant computational resources. The same is not true of Chinese and closely related Asian languages, which continue to lag behind in this regard. Sub-word modelling, through techniques such as byte-pair encoding (BPE), has been applied successfully to many Western languages to handle rare words, improve phrase comprehension, and enhance machine translation (MT), but it cannot be applied directly to ideographic scripts such as Chinese. In this work, we conduct a systematic study of Chinese character decomposition in the context of MWE-aware neural machine translation (NMT). We also report experiments examining how character decomposition contributes to representing the original meanings of Chinese words and characters, and how it can help address the challenges of translating MWEs.