We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four model families on addition and subtraction problems and find that subtraction accuracy lags behind addition by a wide margin. Errors on $a-b$ concentrate in cases where $a<b$: the models frequently produce the correct magnitude but omit the negative sign. Probing analyses show that LLMs internally encode whether the result should be negative, yet this information often fails to surface in the generated output. We further test whether standard techniques, namely few-shot prompting and instruction tuning, can close this gap. Few-shot prompting yields modest gains, whereas instruction-tuned models achieve near-perfect accuracy in producing the negative sign. Together, these findings provide a clearer characterization of both the limitations and the recoverability of LLMs' arithmetic capabilities in subtraction.