Legal general intelligence (GI) refers to artificial intelligence (AI) that encompasses legal understanding, reasoning, and decision-making, simulating the expertise of legal experts across domains. However, existing benchmarks are result-oriented and fail to systematically evaluate the legal intelligence of large language models (LLMs), hindering the development of legal GI. To address this, we propose LexGenius, an expert-level Chinese legal benchmark for evaluating legal GI in LLMs. It follows a Dimension-Task-Ability framework, covering seven dimensions, eleven tasks, and twenty abilities. We construct multiple-choice questions from recent legal cases and exam questions, combining manual and LLM reviews to reduce data leakage risks and ensuring accuracy and reliability through multiple rounds of checks. We evaluate 12 state-of-the-art LLMs using LexGenius and conduct an in-depth analysis. We find that LLMs exhibit significant disparities across legal intelligence abilities, with even the best LLMs lagging behind human legal professionals. We believe LexGenius can assess the legal intelligence abilities of LLMs and enhance legal GI development. Our project is available at https://github.com/QwenQKing/LexGenius.