In this study, we investigate the ability of LLMs to perform temporal reasoning. Using trivia questions from a Norwegian book published in 1940, we prompt the LLMs to answer as if it were 1940. We pose the questions in both English and Norwegian. Correct answers are often given as full sentences, so grading is done with an LLM-as-judge, with sampled checks by a native speaker. Unexpectedly, prompting in English consistently gave better results than prompting in Norwegian. Less surprisingly, using larger LLMs improved results. We tested the DeepSeek-R1, Gemma3, Qwen3, and Llama3.1 model families, as well as the largest available LLM built specifically for Norwegian.