Evolutionary Policy Optimization

On-policy reinforcement learning (RL) algorithms are widely used for their strong asymptotic performance and training stability, but they struggle to scale with larger batch sizes, as additional parallel environments yield redundant data due to limited policy-induced diversity. In contrast, Evolutionary Algorithms (EAs) scale naturally and encourage exploration via randomized population-based search, but are often sample-inefficient. We propose Evolutionary Policy Optimization (EPO), a hybrid algorithm that combines the scalability and diversity of EAs with the performance and stability of policy gradients. EPO maintains a population of agents conditioned on latent variables, shares actor-critic network parameters for coherence and memory efficiency, and aggregates diverse experiences into a master agent. Across tasks in dexterous manipulation, legged locomotion, and classic control, EPO outperforms state-of-the-art baselines in sample efficiency, asymptotic performance, and scalability.
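The structure described above (one set of actor-critic weights shared by a population of agents, each distinguished only by a latent vector) can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the layer sizes, the names `actor_critic` and `evolve`, the elite-plus-Gaussian-mutation rule, and the random placeholder fitness values. The policy-gradient update and the exact way experience is aggregated into the master agent are not specified in the abstract and are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
OBS_DIM, LATENT_DIM, ACT_DIM, HIDDEN, POP_SIZE = 4, 3, 2, 16, 8

# One set of actor-critic weights shared by every agent in the population.
shared = {
    "w1": rng.normal(0.0, 0.1, (OBS_DIM + LATENT_DIM, HIDDEN)),
    "w_actor": rng.normal(0.0, 0.1, (HIDDEN, ACT_DIM)),
    "w_critic": rng.normal(0.0, 0.1, (HIDDEN, 1)),
}

# Each agent is identified only by its latent vector (one row per agent).
latents = rng.normal(0.0, 1.0, (POP_SIZE, LATENT_DIM))


def actor_critic(params, obs, z):
    """Return (action logits, state values) for a batch of observations and one latent z."""
    z_batch = np.broadcast_to(z, (obs.shape[0], z.shape[-1]))
    x = np.concatenate([obs, z_batch], axis=-1)
    h = np.tanh(x @ params["w1"])
    return h @ params["w_actor"], (h @ params["w_critic"])[:, 0]


def evolve(latents, fitness, rng, elite_frac=0.5, sigma=0.1):
    """Keep the fittest latents and refill the rest with mutated copies (assumed rule)."""
    order = np.argsort(fitness)[::-1]  # indices sorted from best to worst
    n_elite = max(1, int(len(latents) * elite_frac))
    elites = latents[order[:n_elite]]
    parents = elites[rng.integers(0, n_elite, size=len(latents) - n_elite)]
    children = parents + sigma * rng.normal(size=parents.shape)
    return np.concatenate([elites, children], axis=0)


# Every agent acts through the same weights; stacking the per-agent outputs
# is a stand-in for collecting the population's experience in one batch.
obs = rng.normal(size=(5, OBS_DIM))
pooled_logits = np.concatenate([actor_critic(shared, obs, z)[0] for z in latents])

# Placeholder fitness; in practice this would be each agent's episodic return.
fitness = rng.normal(size=POP_SIZE)
latents_next = evolve(latents, fitness, rng)
```

Because agents differ only in a small latent vector, adding an agent in this sketch costs `LATENT_DIM` extra parameters instead of a full network copy, which is one way to read the abstract's memory-efficiency claim.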
