Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties from raw text, such as syntactic structure, phonetic cues, and metrical patterns, remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural-language tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public-domain literary works. The dataset comprises thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each task with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, underscoring the importance of incorporating richer linguistic signals during model training.
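The setup above pairs raw sentences with optional explicit feature annotations for each binary task. A minimal sketch of how such an augmented input might be assembled for an LLM classifier is shown below; the prompt format and the feature names (`parse_depth`, `metaphor_count`, `rhyme_density`) are illustrative assumptions, not the paper's exact pipeline.

```python
# Hypothetical sketch: attaching explicit linguistic features to a raw
# sentence before it is passed to an LLM-based binary genre classifier.
# The feature keys and prompt wording here are assumptions for illustration.

def build_prompt(sentence, task=("poetry", "novel"), features=None):
    """Compose a binary genre-classification prompt.

    `features` is an optional dict of explicit linguistic signals,
    e.g. {"parse_depth": 7, "metaphor_count": 2, "rhyme_density": 0.4}.
    When omitted, the model sees only the raw text.
    """
    lines = [
        f"Classify the sentence as {task[0]} or {task[1]}.",
        f"Sentence: {sentence}",
    ]
    if features:
        # Render features deterministically so prompts are reproducible.
        rendered = ", ".join(f"{k}={v}" for k, v in sorted(features.items()))
        lines.append(f"Linguistic features: {rendered}")
    lines.append("Label:")
    return "\n".join(lines)
```

Comparing classifier accuracy with and without the `features` argument, per task and per language, is one way to measure how unevenly each feature set contributes.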