We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across cameras and timesteps. By design, our approach is geometry-agnostic: it learns a compact scene representation directly from data without relying on the explicit 3D inductive biases common in prior work, such as Bird's-Eye-View (BEV), occupancy, or tri-plane representations. This holistic encoding strategy aggressively compresses the visual input fed to the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, Flex achieves 2.2x higher inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient, and effective path for future autonomous driving systems.
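To make the joint encoding idea concrete, the following is a minimal sketch, not the authors' implementation: a small set of learnable query tokens cross-attends to the flattened image tokens from all cameras and timesteps, producing a compact scene representation with no BEV, occupancy, or tri-plane prior. All names and hyperparameters here (SceneTokenEncoder, num_scene_tokens=64, dim=512, the use of TransformerDecoderLayer) are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a scene-token encoder in the spirit described above.
import torch
import torch.nn as nn

class SceneTokenEncoder(nn.Module):
    def __init__(self, num_scene_tokens=64, dim=512, num_heads=8, num_layers=2):
        super().__init__()
        # A small set of learnable queries shared across all inputs.
        self.scene_tokens = nn.Parameter(torch.randn(num_scene_tokens, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                       batch_first=True, norm_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, image_tokens):
        # image_tokens: (B, N_cams * N_timesteps * N_patches, dim) --
        # all image tokens flattened into one joint sequence (no 3D prior).
        b = image_tokens.shape[0]
        queries = self.scene_tokens.unsqueeze(0).expand(b, -1, -1)
        for layer in self.layers:
            # Queries self-attend and cross-attend to the full set of image tokens.
            queries = layer(tgt=queries, memory=image_tokens)
        return queries  # (B, num_scene_tokens, dim): compact tokens for the policy model

# Example: 6 cameras x 4 timesteps x 196 patches compressed to 64 scene tokens.
tokens = torch.randn(2, 6 * 4 * 196, 512)
encoder = SceneTokenEncoder()
scene = encoder(tokens)
print(scene.shape)  # torch.Size([2, 64, 512])
```

Under these assumptions, the compression ratio is set entirely by the number of query tokens, which is what allows the downstream LLM-based policy to see a fixed, small input regardless of how many cameras or timesteps are observed.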