Guiding Pretraining in Reinforcement Learning with Large Language Models

Reinforcement learning algorithms typically struggle in the absence of a dense, well-shaped reward function. Intrinsically motivated exploration methods address this limitation by rewarding agents for visiting novel states or transitions, but these methods offer limited benefits in large environments where most discovered novelty is irrelevant for downstream tasks. We describe a method that uses background knowledge from text corpora to shape exploration. This method, called ELLM (Exploring with LLMs), rewards an agent for achieving goals suggested by a language model prompted with a description of the agent's current state. By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop. We evaluate ELLM in the Crafter game environment and the Housekeep robotic simulator, showing that ELLM-trained agents have better coverage of common-sense behaviors during pretraining and usually match or improve performance on a range of downstream tasks.
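As a rough illustration of the reward scheme described above, the following minimal sketch rewards an agent when the goal it just achieved is among those suggested for the current state description. Everything here is an assumption made for illustration: `suggest_goals_stub` is a hypothetical keyword-based stand-in for a prompted language model, the goal strings are invented, and the exact-match reward of 1.0 is a simplification rather than the reward used in the work summarized here.

```python
from typing import Callable, List


def suggest_goals_stub(state_description: str) -> List[str]:
    """Hypothetical stand-in for a language model prompted with the state.

    A real system would send the description to an LLM and parse the
    suggested goals from its reply; here simple keyword rules are used.
    """
    goals = []
    if "tree" in state_description:
        goals.append("collect wood")
    if "water" in state_description:
        goals.append("drink water")
    return goals


def intrinsic_reward(
    state_description: str,
    achieved_goal: str,
    suggest_goals: Callable[[str], List[str]],
) -> float:
    """Return 1.0 if the achieved goal was suggested for this state, else 0.0.

    Exact string matching is an illustrative assumption.
    """
    suggested = suggest_goals(state_description)
    return 1.0 if achieved_goal in suggested else 0.0


if __name__ == "__main__":
    description = "You see a tree and water."
    print(intrinsic_reward(description, "collect wood", suggest_goals_stub))
    print(intrinsic_reward("You see a tree.", "drink water", suggest_goals_stub))
```

In this sketch the reward depends only on whether the behavior matches a suggestion for the current state, so novelty that no suggestion covers earns nothing, which is the contrast with novelty-based exploration drawn above.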
