Computer Adaptive Testing (CAT) aims to accurately estimate an individual's ability using only a subset of the items in an Item Response Theory (IRT) instrument. Many applications of CAT also require diverse item exposure across testing sessions, so that no single item is over- or underutilized. In CAT, items are selected sequentially based on a running estimate of the respondent's ability. Prior methods almost universally treat item selection as an optimization problem, which motivates greedy item selection procedures. While efficient, these deterministic methods tend to produce poor item exposure. Existing stochastic methods for item selection are ad hoc: their item sampling weights lack theoretical justification. In this manuscript, we formulate stochastic CAT as a Bayesian model averaging problem. We seek item sampling probabilities, interpreted in the long-run frequentist sense, that perform optimal model averaging for the ability estimate in the Bayesian sense. In doing so, we derive a cross-entropy information criterion that yields optimal stochastic mixing. We tested our new method on the eight independent IRT models that comprise the Work Disability Functional Assessment Battery, comparing it to prior art. We found that our stochastic methodology achieved superior item exposure without compromising test accuracy or efficiency.
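To make the contrast between greedy and stochastic item selection concrete, the following is a minimal simulation sketch, not the method proposed in this manuscript. It assumes a two-parameter logistic (2PL) IRT model, a grid-based posterior over ability with a standard normal prior, and Fisher information as the selection score. The stochastic rule shown here is a softmax over Fisher information with a hypothetical `temperature` parameter, which is exactly the kind of ad hoc weighting the abstract criticizes; it stands in only because the cross-entropy criterion is not specified in this text. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

GRID = np.linspace(-4.0, 4.0, 161)  # ability grid for the posterior (assumed range)

def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed response model)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information of each 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

def select_greedy(theta_hat, a, b, available):
    """Deterministic rule: pick the most informative unused item."""
    info = np.where(available, fisher_information(theta_hat, a, b), -np.inf)
    return int(np.argmax(info))

def sampling_weights(theta_hat, a, b, available, temperature=0.1):
    """Ad hoc softmax weights over unused items (placeholder, not the paper's criterion)."""
    info = fisher_information(theta_hat, a, b)
    logits = np.where(available, info / temperature, -np.inf)
    logits = logits - logits[available].max()  # stabilize the exponentials
    weights = np.exp(logits)                   # used items get weight exp(-inf) = 0
    return weights / weights.sum()

def select_stochastic(theta_hat, a, b, available, rng, temperature=0.1):
    """Stochastic rule: sample an unused item according to the softmax weights."""
    probs = sampling_weights(theta_hat, a, b, available, temperature)
    return int(rng.choice(len(a), p=probs))

def posterior_mean(log_post):
    """Posterior mean of ability on the grid from unnormalized log posterior."""
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return float(post @ GRID)

def run_session(theta_true, a, b, n_items, rng, stochastic):
    """Simulate one adaptive test; return administered items and final estimate."""
    log_post = -0.5 * GRID**2  # standard normal prior, up to a constant
    available = np.ones(len(a), dtype=bool)
    administered = []
    for _ in range(n_items):
        theta_hat = posterior_mean(log_post)  # running ability estimate
        if stochastic:
            j = select_stochastic(theta_hat, a, b, available, rng)
        else:
            j = select_greedy(theta_hat, a, b, available)
        available[j] = False
        administered.append(j)
        # Simulate the response, then update the posterior on the grid.
        correct = rng.random() < p_correct(theta_true, a[j], b[j])
        p_grid = p_correct(GRID, a[j], b[j])
        log_post = log_post + np.log(p_grid if correct else 1.0 - p_grid)
    return administered, posterior_mean(log_post)

def exposure_rates(a, b, n_sessions, n_items, rng, stochastic):
    """Fraction of simulated sessions in which each item was administered."""
    counts = np.zeros(len(a))
    for theta_true in rng.normal(size=n_sessions):
        items, _ = run_session(theta_true, a, b, n_items, rng, stochastic)
        counts[items] += 1
    return counts / n_sessions

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.uniform(0.5, 2.0, size=30)  # hypothetical discriminations
    b = rng.normal(size=30)             # hypothetical difficulties
    for stochastic in (False, True):
        rates = exposure_rates(a, b, 200, 10, rng, stochastic)
        label = "stochastic" if stochastic else "greedy"
        print(f"{label}: max exposure {rates.max():.2f}, unused items {(rates == 0).sum()}")
```

Under the greedy rule, every session starts from the same prior and therefore administers the same first item, so that item's exposure rate is 1 by construction. Sampling spreads administration across items at some cost in per-item information; how to choose the sampling probabilities in a principled way is the question the manuscript addresses.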