Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves LIBERO viewpoint success from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale fine-tuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
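To make the two adapters concrete, below is a minimal PyTorch sketch, assuming FTM is a per-channel scale-and-shift applied to all visual tokens and FLA is a LoRA-style low-rank residual on a frozen linear layer of the ViT encoder. The class names, token width of 768, rank of 16, and identity initialization are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class FeatureTokenModulation(nn.Module):
    """Sketch of FTM: one learnable affine transform (per-channel
    scale and shift) shared across every visual token."""

    def __init__(self, dim: int):
        super().__init__()
        # Initialized to the identity so adaptation starts from the
        # pretrained visual representation.
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) visual tokens from the ViT encoder
        return tokens * self.scale + self.shift


class FeatureLinearAdaptation(nn.Module):
    """Sketch of FLA: a low-rank residual update added to a frozen
    linear layer inside the ViT encoder."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an exact identity of the base layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))


if __name__ == "__main__":
    tokens = torch.randn(2, 256, 768)      # e.g. 256 patch tokens of width 768
    ftm = FeatureTokenModulation(dim=768)   # a few thousand parameters at this width
    fla = FeatureLinearAdaptation(nn.Linear(768, 768), rank=16)
    print(ftm(tokens).shape, fla(tokens).shape)
```

In this reading, only the small adapter parameters are trained during one-shot adaptation, which is what keeps FTM in the thousands of parameters and FLA in the millions rather than requiring full fine-tuning of the backbone.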