AlphaZero-like Monte Carlo Tree Search systems, originally introduced for two-player games, dynamically balance exploration and exploitation using neural network guidance. This combination also makes them suitable for classical search problems. However, the original method of training the network on simulation results is limited in sparse-reward settings, especially in the early stages, when the network cannot yet provide guidance. Hindsight Experience Replay (HER) addresses this issue by relabeling unsuccessful trajectories from the search tree as supervised learning signals. We introduce Adaptable HER (\ours{}), a flexible framework that integrates HER with AlphaZero and allows easy adjustment of HER properties such as relabeled goals, policy targets, and trajectory selection. Our experiments, including equation discovery, show that the ability to modify HER is beneficial and that \ours{} surpasses pure supervised and reinforcement learning.
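As a rough illustration of the relabeling idea sketched above, an unsuccessful trajectory can be turned into supervised targets by treating the state it actually reached as the goal. This is a minimal sketch, not the implementation used in \ours{}; the names \texttt{Transition}, \texttt{relabel\_trajectory}, and \texttt{achieved\_goal} are hypothetical.

\begin{verbatim}
# Illustrative hindsight relabeling sketch (hypothetical names, not the
# paper's code): an unsuccessful trajectory becomes supervised
# (state, goal, action) targets by taking the final achieved state as goal.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Transition:
    state: str          # e.g. a partial expression in equation discovery
    action: int         # index of the action taken at this node
    achieved_goal: str  # goal corresponding to the state actually reached

def relabel_trajectory(trajectory: List[Transition]) -> List[Tuple[str, str, int]]:
    # Pretend the final achieved state was the intended goal,
    # so every step along the way yields a valid training example.
    hindsight_goal = trajectory[-1].achieved_goal
    return [(t.state, hindsight_goal, t.action) for t in trajectory]
\end{verbatim}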