LegalEval-Q: A New Benchmark for the Quality Evaluation of LLM-Generated Legal Text

As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks focus mainly on factual accuracy and largely neglect linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we take three steps. First, we develop a regression model that scores the quality of legal texts on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis yields three key findings. First, model quality levels off at 14 billion parameters, with only a marginal improvement of 2.7% at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance values above 0.016. Third, reasoning models consistently outperform base architectures. We also release a ranking list and a Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work both establishes standardized evaluation protocols for legal LLMs and uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.

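The Pareto analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each model is summarized by one cost number (lower is better) and one quality score (higher is better); the abstract does not say how cost is measured, so that choice is an assumption here. The names `ModelResult` and `pareto_front` are hypothetical, and the model names and numbers are placeholders for illustration, not results from the paper.

```python
from typing import NamedTuple


class ModelResult(NamedTuple):
    name: str
    cost: float     # hypothetical cost measure, lower is better
    quality: float  # hypothetical quality score, higher is better


def pareto_front(results):
    """Return the results that no other result dominates, sorted by cost.

    A result is dominated if another one is at least as cheap and at least
    as good, and strictly better on one of the two axes.
    """
    front = []
    for r in results:
        dominated = any(
            o.cost <= r.cost
            and o.quality >= r.quality
            and (o.cost < r.cost or o.quality > r.quality)
            for o in results
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.cost)


# Placeholder data for illustration only (not taken from the paper).
results = [
    ModelResult("model-a", cost=1.0, quality=60.0),
    ModelResult("model-b", cost=2.0, quality=75.0),
    ModelResult("model-c", cost=3.0, quality=70.0),  # dominated by model-b
    ModelResult("model-d", cost=8.0, quality=77.0),
]

front_names = [r.name for r in pareto_front(results)]
print(front_names)
```

In this toy data, `model-c` drops out because `model-b` is both cheaper and better. A model on the front is only "optimal" in the sense that no alternative beats it on both axes at once; picking a single best tradeoff still requires deciding how much extra cost a quality gain is worth.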