In this work, we develop language models for Sanskrit, namely Bidirectional Encoder Representations from Transformers (BERT) and its variants A Lite BERT (ALBERT) and Robustly Optimized BERT (RoBERTa), using a Devanagari Sanskrit text corpus. We then extract features for a given text from these models and apply dimensionality reduction and clustering techniques to those features to generate an extractive summary of a given Sanskrit document. Alongside the extractive text summarization techniques, we have also created and publicly released a Sanskrit Devanagari text corpus.
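The reduce-then-cluster step can be sketched as follows. This is a minimal illustration, not the paper's exact method: it assumes sentence embeddings (stand-ins for BERT features), reduces them with PCA, clusters with k-means, and picks the sentence closest to each cluster centroid as the summary. All function and variable names here are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def extractive_summary(embeddings, sentences, n_clusters=2, n_components=2):
    """Reduce sentence embeddings, cluster them, and return one
    representative sentence (nearest to each centroid) per cluster."""
    reduced = PCA(n_components=n_components).fit_transform(embeddings)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(reduced)
    summary_idx = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Distance of each member to its cluster centroid
        dists = np.linalg.norm(reduced[members] - km.cluster_centers_[c], axis=1)
        summary_idx.append(members[np.argmin(dists)])
    summary_idx.sort()  # keep original sentence order in the summary
    return [sentences[i] for i in summary_idx]

# Toy stand-in for BERT features: 6 sentences, 8-dim embeddings,
# forming two well-separated groups.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.1, (3, 8)), rng.normal(5.0, 0.1, (3, 8))])
sents = [f"sentence {i}" for i in range(6)]
summary = extractive_summary(emb, sents, n_clusters=2)
```

In a real pipeline, `emb` would come from the last hidden states of the trained BERT/ALBERT/RoBERTa model rather than random vectors.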