Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the main workhorse behind the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility. Here, we extend supervised learning to address learning to \textit{control} in non-stationary, reset-free environments. Using this framework, called ``Prospective Learning with Control'' (PL+C), we prove that under fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control: foraging, a canonical task for any mobile agent, be it natural or artificial. We show that modern RL algorithms fail to learn in these non-stationary, reset-free environments, and that, even with modifications, they are orders of magnitude less efficient than our prospective foraging agents.