Optimal Actor-Critic Policy with Optimized Training Datasets

Actor-critic (AC) algorithms are known for their efficacy and high performance in solving reinforcement learning problems, but they suffer from low sampling efficiency. An AC-based policy optimization process is iterative: it must frequently access the agent-environment system to evaluate and update the policy by rolling out the policy, collecting rewards and states (i.e., samples), and learning from them. It ultimately requires a huge number of samples to learn an optimal policy. To improve sampling efficiency, we propose a strategy to optimize the training dataset so that it contains significantly fewer samples collected from the AC process. The dataset optimization consists of a best-episode-only operation, a policy parameter-fitness model, and a genetic algorithm module. The optimal policy network trained on the optimized training dataset exhibits superior performance compared to many contemporary AC algorithms in controlling autonomous dynamical systems. Evaluations on standard benchmarks show that the method improves sampling efficiency, ensures faster convergence to optima, and is more data-efficient than its counterparts.
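
The abstract names three components without giving their details, so the following is only a minimal sketch of how such a pipeline might fit together, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the toy point-mass environment in `rollout`, the randomly sampled candidate parameters standing in for policies produced by the AC process, the quadratic least-squares surrogate used as the parameter-fitness model, and the particular selection, crossover, and mutation operators.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, steps=50):
    """Hypothetical environment: a 1-D point mass steered toward the
    origin by a linear policy u = theta[0]*x + theta[1]*v."""
    x, v, total, samples = 1.0, 0.0, 0.0, []
    for _ in range(steps):
        u = float(np.clip(theta[0] * x + theta[1] * v, -2.0, 2.0))
        samples.append((x, v, u))          # state-action sample
        v += 0.1 * u
        x += 0.1 * v
        total -= x * x + 0.01 * u * u      # reward: stay near origin, cheaply
    return total, samples

def collect(thetas):
    """Best-episode-only step: roll out every candidate, keep the samples
    of the single highest-return episode and discard the rest."""
    episodes = [rollout(t) for t in thetas]
    returns = np.array([ret for ret, _ in episodes])
    best = int(np.argmax(returns))
    return returns, best, episodes[best][1]

def features(thetas):
    """Quadratic features of the policy parameters (assumed model form)."""
    t = np.atleast_2d(thetas)
    return np.column_stack([np.ones(len(t)), t, t**2, t[:, :1] * t[:, 1:2]])

def fit_fitness_model(thetas, returns):
    """Parameter-fitness model: least-squares map from parameters to return."""
    w, *_ = np.linalg.lstsq(features(thetas), returns, rcond=None)
    return lambda t: features(t) @ w

def genetic_search(fitness, pop, generations=30, sigma=0.1):
    """Genetic algorithm on the surrogate: no environment access needed."""
    for _ in range(generations):
        elite = pop[np.argsort(fitness(pop))[-len(pop) // 2:]]   # selection
        a = elite[rng.integers(len(elite), size=len(pop))]
        b = elite[rng.integers(len(elite), size=len(pop))]
        mask = rng.random(pop.shape) < 0.5                        # uniform crossover
        pop = np.where(mask, a, b) + sigma * rng.normal(size=pop.shape)  # mutation
        pop = np.clip(pop, -3.0, 3.0)      # stay inside the sampled region
    return pop[np.argmax(fitness(pop))]

# Candidate policies; random here, whereas the paper draws them from the AC process.
thetas = rng.uniform(-3.0, 3.0, size=(40, 2))
returns, best, dataset = collect(thetas)
fitness = fit_fitness_model(thetas, returns)
proposal = genetic_search(fitness, thetas.copy())

# Check the surrogate's proposal with one real rollout; keep it only if it helps.
proposal_return, proposal_samples = rollout(proposal)
if proposal_return > returns[best]:
    final_theta, final_return, dataset = proposal, proposal_return, proposal_samples
else:
    final_theta, final_return = thetas[best], returns[best]

print("best sampled return:", returns[best])
print("final return:", final_return, "with", len(dataset), "training samples")
```

In this sketch the genetic search queries only the fitted model, so the extra environment cost of the optimization stage is a single verification rollout, and the retained dataset is one episode rather than all forty. Whether the paper's method budgets samples in the same way is not stated in the abstract.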
