Previous research has reported that large language models (LLMs) perform poorly on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams comprising 980 questions across three Level I exams, two Level II exams, and three Level III exams. Applying the same pass/fail criteria as prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest multiple-choice score at 86.4%, while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.