Offline Behavior Distillation (OBD), which condenses massive offline RL data into a compact synthetic behavioral dataset, offers a promising approach to efficient policy training and can be applied across various downstream RL tasks. In this paper, we uncover a misalignment between original and distilled datasets: a high-quality original dataset does not necessarily yield a superior synthetic dataset. Through an empirical analysis of policy performance under varying levels of training loss, we show that datasets with greater state diversity outperform those with higher state quality when the training loss is substantial, as is often the case in OBD, whereas the relationship reverses under minimal loss; this asymmetry explains the misalignment. By associating state quality and state diversity with reducing pivotal error and surrounding error, respectively, our theoretical analysis establishes that surrounding error plays a more crucial role in policy performance when pivotal error is large, thereby highlighting the importance of state diversity in the OBD setting. Furthermore, we propose a novel yet simple algorithm, state density weighted (SDW) OBD, which emphasizes state diversity by weighting the distillation objective with the reciprocal of state density, thereby distilling more diverse state information into the synthetic data. Extensive experiments across multiple D4RL datasets confirm that SDW significantly enhances OBD performance when the original dataset exhibits limited state diversity.
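To make the weighting scheme concrete, the following is a minimal sketch of how inverse-density weights could be attached to a per-state distillation loss. It assumes the objective decomposes into a loss per original-dataset state and that density is approximated with a k-nearest-neighbor estimator; the function names, the choice of k, and the normalization are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of state-density-weighted (SDW) loss weighting.
# Assumptions (not taken from the paper text): the distillation objective can be
# written as a per-state loss over original-dataset states, and state density is
# approximated with a k-NN estimator. Names and hyperparameters are illustrative.
import torch


def knn_inverse_density_weights(states: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Approximate weights proportional to 1/density via k-NN distances.

    states: (N, d) tensor of original-dataset states.
    A larger distance to the k-th neighbor indicates a sparser region,
    so that state receives a larger weight.
    """
    dists = torch.cdist(states, states)                         # (N, N) pairwise distances
    knn_dist = dists.topk(k + 1, largest=False).values[:, -1]   # k-th neighbor (index 0 is self)
    weights = knn_dist.pow(states.shape[1])                     # k-NN estimate: 1/density ~ r^d
    return weights / weights.mean()                             # normalize to mean 1


def sdw_weighted_loss(per_state_loss: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Reweight the per-state distillation loss by inverse state density."""
    return (weights * per_state_loss).mean()
```

In this sketch the weights are computed once from the original dataset and then multiply whatever per-state distillation loss the OBD procedure already uses, so low-density (diverse) states contribute more to the synthetic dataset's optimization signal.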