Large Language Models (LLMs) have revolutionized both general natural language processing and domain-specific applications such as code synthesis, legal reasoning, and finance. However, while prior studies have explored individual model capabilities, a systematic cross-domain comparison that unifies linguistic, reasoning, and code-understanding abilities remains underexplored. In this work, we present a comprehensive evaluation of five general-purpose and three code-specific state-of-the-art LLMs across six diverse benchmarks encompassing linguistic competence, mathematical reasoning, and trustworthiness. Additionally, we analyze model behavior on the CoNaLa dataset for code explanation, comparing natural-language and code-specialized LLMs. Our findings reveal that models optimized for code (e.g., the CodeLLaMA variants) exhibit strong reasoning ability and syntactic precision, yielding measurable performance gains even on non-coding tasks, in contrast to general-purpose models such as Mistral-7B and Llama-3-8B.