VLA模型比你想象的更具泛化性：重新审视物理与空间建模 (VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling) - 专知论文

会员服务 ·

0

泛化 · 令牌 · 泛化性 · 扰动 · 调制 ·

VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling

翻译：VLA模型比你想象的更具泛化性：重新审视物理与空间建模

Weiqi Li,Quande Zhang,Ruifeng Zhai,Liang Lin,Guangrun Wang

Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.

翻译：视觉-语言-动作（VLA）模型在分布内表现出色，但在新相机视角和视觉扰动下性能急剧下降。我们发现这种脆弱性主要源于空间建模中的错位，而非物理建模。为解决此问题，我们提出了一种单次适应框架，通过轻量级可学习更新重新校准视觉表征。我们的第一种方法——特征令牌调制（FTM），对视觉令牌应用全局仿射变换，仅用4K参数就将Libero视角准确率从48.5%提升至87.1%。在此基础上，特征线性适应（FLA）为ViT编码器引入低秩更新，以470万参数实现90.8%的成功率——以远低于LoRA规模微调的成本达到同等效果。这些结果共同揭示了预训练VLA模型中尚未开发的强大鲁棒性，并证明有针对性的最小视觉适应足以恢复视角泛化能力。

0

相关内容

DeepSeek模型综述：V1 V2 V3 R1-Zero

DeepSeek模型综述：V1 V2 V3 R1-Zero

专知会员服务

116+阅读 · 2月11日

【超越消息传递:图神经网络的物理启发范式】Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks

【超越消息传递:图神经网络的物理启发范式】Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks

专知会员服务

17+阅读 · 2022年5月10日

强化学习的对比无监督表示，CURL: Contrastive Unsupervised Representations for Reinforcement Learning

强化学习的对比无监督表示，CURL: Contrastive Unsupervised Representations for Reinforcement Learning

专知会员服务

41+阅读 · 2020年4月11日

GeoffreyHinton-ICML2020投稿论文-偏转对抗攻击 Deflecting Adversarial Attacks

GeoffreyHinton-ICML2020投稿论文-偏转对抗攻击 Deflecting Adversarial Attacks

专知会员服务

24+阅读 · 2020年2月22日

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

专知会员服务

11+阅读 · 2019年11月2日

【ICML2020】多视角对比图表示学习，Contrastive Multi-View GRL

【ICML2020】多视角对比图表示学习，Contrastive Multi-View GRL

专知

37+阅读 · 2020年6月11日

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

专知

10+阅读 · 2020年3月31日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

CosFace: Large Margin Cosine Loss for Deep Face Recognition论文笔记

CosFace: Large Margin Cosine Loss for Deep Face Recognition论文笔记

统计学习与视觉计算组

44+阅读 · 2018年4月25日

论文浅尝 | Know-Evolve: Deep Temporal Reasoning for Dynamic KG

论文浅尝 | Know-Evolve: Deep Temporal Reasoning for Dynamic KG

开放知识图谱

36+阅读 · 2018年3月30日

2D/3D视觉信息融合仿生SLAM关键问题研究

国家自然科学基金

3+阅读 · 2015年12月31日

SDN数据平面中大规模流表的高性能查找方法研究

国家自然科学基金

4+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

基于分层稀疏表示的微动目标ISAR三维层析成像技术

国家自然科学基金

1+阅读 · 2015年12月31日

基于决策模型和预备电位的运动想象BCI研究

国家自然科学基金

3+阅读 · 2015年12月31日

M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

Arxiv

0+阅读 · 12月15日

3D-LATTE: Latent Space 3D Editing from Textual Instructions

Arxiv

0+阅读 · 12月12日

LAMP: Language-Assisted Motion Planning for Controllable Video Generation

Arxiv

0+阅读 · 12月8日

GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers

Arxiv

0+阅读 · 12月3日

CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling

Arxiv

0+阅读 · 11月24日

VIP会员

文章信息

相关主题

相关VIP内容

DeepSeek模型综述：V1 V2 V3 R1-Zero

DeepSeek模型综述：V1 V2 V3 R1-Zero

专知会员服务

116+阅读 · 2月11日

【超越消息传递:图神经网络的物理启发范式】Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks

【超越消息传递:图神经网络的物理启发范式】Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks

专知会员服务

17+阅读 · 2022年5月10日

强化学习的对比无监督表示，CURL: Contrastive Unsupervised Representations for Reinforcement Learning

强化学习的对比无监督表示，CURL: Contrastive Unsupervised Representations for Reinforcement Learning

专知会员服务

41+阅读 · 2020年4月11日

GeoffreyHinton-ICML2020投稿论文-偏转对抗攻击 Deflecting Adversarial Attacks

GeoffreyHinton-ICML2020投稿论文-偏转对抗攻击 Deflecting Adversarial Attacks

专知会员服务

24+阅读 · 2020年2月22日

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

【Facebook AI】对抗性NLI:自然语言理解的新基准，Adversarial NLI: A New Benchmark for Natural Language Understanding

专知会员服务

11+阅读 · 2019年11月2日

热门VIP内容

开通专知VIP会员享更多权益服务

《利用人工智能对军事行动进行建模》

《利用人工智能学习、优化与推演美国海军作战部队的战略布局与分散（续文）》

机器人、无人机与实时影像：应对城市爆炸威胁的三大技术方案

《指挥官意图消息中关键概念自动提取》最新47页

相关资讯

【ICML2020】多视角对比图表示学习，Contrastive Multi-View GRL

【ICML2020】多视角对比图表示学习，Contrastive Multi-View GRL

专知

37+阅读 · 2020年6月11日

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

【CVPR2020-牛津-谷歌】语音到动作:动作识别的跨模态监督，Cross-modal Supervision

专知

10+阅读 · 2020年3月31日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

CosFace: Large Margin Cosine Loss for Deep Face Recognition论文笔记

CosFace: Large Margin Cosine Loss for Deep Face Recognition论文笔记

统计学习与视觉计算组

44+阅读 · 2018年4月25日

论文浅尝 | Know-Evolve: Deep Temporal Reasoning for Dynamic KG

论文浅尝 | Know-Evolve: Deep Temporal Reasoning for Dynamic KG

开放知识图谱

36+阅读 · 2018年3月30日

相关论文

M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

Arxiv

0+阅读 · 12月15日

3D-LATTE: Latent Space 3D Editing from Textual Instructions

Arxiv

0+阅读 · 12月12日

LAMP: Language-Assisted Motion Planning for Controllable Video Generation

Arxiv

0+阅读 · 12月8日

GRASP: GRouped Activation Shared Parameterization for Parameter-Efficient Fine-Tuning and Robust Inference of Transformers

Arxiv

0+阅读 · 12月3日

CUPID: Generative 3D Reconstruction via Joint Object and Pose Modeling

Arxiv

0+阅读 · 11月24日

相关基金

2D/3D视觉信息融合仿生SLAM关键问题研究

国家自然科学基金

3+阅读 · 2015年12月31日

SDN数据平面中大规模流表的高性能查找方法研究

国家自然科学基金

4+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

基于分层稀疏表示的微动目标ISAR三维层析成像技术

国家自然科学基金

1+阅读 · 2015年12月31日

基于决策模型和预备电位的运动想象BCI研究

国家自然科学基金

3+阅读 · 2015年12月31日

微信扫码咨询专知VIP会员