LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource language settings by a lack of high-quality training data. We introduce LuxIT, a novel monolingual instruction tuning dataset for Luxembourgish, developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts using DeepSeek-R1-0528, chosen for its demonstrated proficiency in Luxembourgish. After generation, we apply a quality assurance process based on an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Benchmarking the fine-tuned models against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
