This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practice. The benchmark was created by legal professionals under the supervision of an attorney and contains 100 samples that require long-form, structured outputs, which we evaluated against multiple practical criteria. We conducted both human and automated evaluations of leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1. Our human evaluation revealed that abstract instructions prompted unnecessary modifications, highlighting weaknesses in document-level editing that conventional short-text tasks miss. Furthermore, our analysis shows that automated evaluation aligns well with human judgment on criteria with clear linguistic grounding, whereas assessing structural consistency remains a challenge. These results demonstrate the utility of automated evaluation as a screening tool when expert availability is limited. We propose a dataset and evaluation framework to promote more practice-oriented research in the legal domain.