Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out ``easy'' problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output-length distribution upward, yielding a \textbf{model that conflates ``thinking longer'' with ``thinking better''}. In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer: exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is \textbf{\emph{emergent brevity for free}}: the model learns to solve harder problems without inflating its output length, \textbf{despite the absence of any explicit length penalty}. RLVR experiments with this approach on \textit{Qwen3-4B-Thinking-2507} (with a 16k-token limit) match baseline pass@1 accuracy on AIME25 while generating solutions that are, on average, nearly twice as short. The code is available on \href{https://github.com/MBZUAI-Paris/Frugal-AI}{GitHub}, with datasets and models on \href{https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc}{Hugging Face}.
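As a rough illustration of the data-curation idea (a minimal sketch under assumed conventions, not the paper's exact pipeline), the snippet below shows how per-problem sampling weights might keep and modestly up-weight moderately easy problems instead of filtering them out. The function name \texttt{sampling\_weights}, the pass-rate band, and the boost factor are hypothetical choices for illustration.
\begin{verbatim}
import random

def sampling_weights(pass_rates, easy_band=(0.5, 0.9), easy_boost=2.0):
    """Assign per-problem sampling weights for RLVR prompt selection.

    Hypothetical sketch: near-trivial problems are still dropped, but
    moderately easy ones (pass rate inside `easy_band`) are kept and
    modestly up-weighted so short, solvable reasoning chains remain in
    the training mix and act as an implicit length regularizer.
    """
    weights = []
    for p in pass_rates:
        if p >= 0.99:                        # near-trivial: filtered out
            weights.append(0.0)
        elif easy_band[0] <= p <= easy_band[1]:
            weights.append(easy_boost)       # moderately easy: up-weight
        else:
            weights.append(1.0)              # hard problems: default weight
    return weights

# Hypothetical usage: pass rates estimated from a few rollouts per problem.
pass_rates = [0.05, 0.30, 0.70, 0.95, 1.00]
batch = random.choices(range(len(pass_rates)),
                       weights=sampling_weights(pass_rates), k=3)
print(batch)
\end{verbatim}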