Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). While powerful, this learning framework is mathematically distinct from supervised learning, which has been the main workhorse of recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility in more realistic settings. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called ''Prospective Learning with Control (PL+C)'', we prove that under fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control, foraging, which is a canonical task for any mobile agent, be it natural or artificial. We illustrate that modern RL algorithms fail to learn in these non-stationary, reset-free environments, and even with modifications they are orders of magnitude less efficient than our prospective foraging agents.
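For context, one way to read the ERM claim is as minimizing a time-averaged loss over the observed stream; the display below is only an illustrative sketch, and the notation ($\mathcal{H}$, $\ell$, the stream $(x_t, y_t)$, and the time-indexed hypotheses $h_t$) is assumed here rather than taken from the paper's formal setup:
\[
\hat{h}_T \in \operatorname*{arg\,min}_{h \in \mathcal{H}} \; \frac{1}{T} \sum_{t=1}^{T} \ell\bigl(h_t(x_t),\, y_t\bigr),
\]
where $h = (h_1, h_2, \dots)$ is a sequence of decision rules, $(x_t, y_t)$ is the (possibly non-stationary) data stream, and $\ell$ is a bounded loss. The asymptotic statement is then that the prospective risk of $\hat{h}_T$ approaches that of the Bayes optimal policy as $T \to \infty$, under the assumptions stated in the paper.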