The GPT-4 technical report suggests that downstream performance can be predicted from pre-training signals, but offers little methodological detail on how to quantify this. This work addresses this gap by modeling knowledge retention, the capacity of a pre-trained language model to memorize factual information from its corpus, and introduces a principled method to estimate it prior to training. We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. We validate SMI through large-scale document retrieval over the disclosed pre-training corpora of 21 public and 3 custom models, combined with a robust multi-template QA evaluation. Experiments show that SMI significantly outperforms repetition-based baselines, achieving $R^2 > 0.7$ in predicting the QA accuracy of models above 1B parameters without any additional training. The analysis further reveals diminishing returns from scaling data and model size, and provides evidence for an intrinsic upper bound on the knowledge retention achievable by pre-training alone, motivating retrieval and other augmentation strategies.
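As a rough illustration only (not the paper's actual estimator), a predictor combining the three ingredients named above might take the following shape; the functional form, the `alpha` and `beta` parameters, and the logarithmic/saturating terms are all assumptions made for exposition, not definitions from this work.

```python
import math

def smi_score(frequency: int, specificity: float, model_params_b: float,
              alpha: float = 0.5, beta: float = 1.0) -> float:
    """Hypothetical SMI-style score (illustrative sketch only).

    Combines how often a fact occurs in the pre-training corpus
    (frequency), how unambiguously its surface form identifies it
    (specificity, in [0, 1]), and model size (parameters, in billions)
    into a single score intended to correlate with closed-book QA
    accuracy.
    """
    # More corpus repetitions help, with diminishing (logarithmic) returns.
    freq_term = math.log1p(frequency)
    # Larger models retain more per exposure; a saturating size factor
    # reflects the diminishing returns from scaling noted in the abstract.
    size_term = model_params_b ** alpha / (1.0 + model_params_b ** alpha)
    # Highly specific (low-ambiguity) facts are easier to retain.
    return beta * specificity * freq_term * size_term

# Example: a fact seen 120 times, specificity 0.8, in a 7B-parameter model.
print(smi_score(120, 0.8, 7.0))
```

In such a sketch, the score would still need a calibration step (e.g. a logistic fit) to map it onto QA accuracy on a held-out fact set.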