The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system (udātta, anudātta, svarita) whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented and unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5. Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers a strong efficiency-accuracy trade-off, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration (Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme errors from accent errors) and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. The results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.
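To make the idea of an accent-only metric concrete, the following is a minimal Python sketch of one possible DER computation; it is not the definition used in the study, which the abstract does not spell out. It assumes the text is in Devanagari, that accents are encoded as the combining marks U+0951 and U+0952, and that reference and hypothesis share the same base characters. The function names `strip_accents`, `accent_slots`, and `diacritic_error_rate` are hypothetical.

```python
import unicodedata

# Assumption: accents are the Devanagari combining marks U+0951 (stroke above)
# and U+0952 (stroke below). Other encodings would need a different set.
ACCENT_MARKS = {"\u0951", "\u0952"}


def strip_accents(text):
    """Build the unaccented side of a parallel pair by dropping accent marks."""
    nfc = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in nfc if ch not in ACCENT_MARKS)


def accent_slots(text):
    """Return (code point, set of accent marks attached to it) pairs."""
    slots = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in ACCENT_MARKS:
            if slots:  # an accent with no preceding code point is ignored
                base, marks = slots[-1]
                slots[-1] = (base, marks | {ch})
        else:
            slots.append((ch, frozenset()))
    return slots


def diacritic_error_rate(reference, hypothesis):
    """Fraction of non-accent code points whose attached accents differ.

    Assumption: both strings have identical base text, so grapheme errors
    are excluded by construction and only accent edits are counted.
    """
    ref = accent_slots(reference)
    hyp = accent_slots(hypothesis)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("base characters differ; this sketch needs aligned text")
    if not ref:
        return 0.0
    errors = sum(r != h for (_, r), (_, h) in zip(ref, hyp))
    return errors / len(ref)


# Opening word of Rigveda 1.1.1 with two accent marks, written with escapes.
accented = "\u0905\u0952\u0917\u094d\u0928\u093f\u092e\u0940\u0951\u0933\u0947"
plain = strip_accents(accented)
print(diacritic_error_rate(accented, plain))  # 2 of 9 positions differ
```

Normalizing before comparison keeps the mark order stable, and raising an error on mismatched base text keeps grapheme errors out of the accent score; a fuller implementation would first align the two strings, for example by edit distance.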