以变换器为基础的跨模式融合模型,为 VQA 2021挑战提供反向培训 (A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021) - 专知论文

会员服务 ·

0

视觉问答 · MoDELS · Better · 稳健性 · 情景 ·

2021 年 6 月 24 日

A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021

翻译：以变换器为基础的跨模式融合模型,为 VQA 2021挑战提供反向培训

Ke-Han Lu,Bo-Han Fang,Kuan-Yu Chen

from arxiv, CVPR 2021 Workshop: Visual Question Answering (VQA) Challenge

In this paper, inspired by the successes of visionlanguage pre-trained models and the benefits from training with adversarial attacks, we present a novel transformerbased cross-modal fusion modeling by incorporating the both notions for VQA challenge 2021. Specifically, the proposed model is on top of the architecture of VinVL model [19], and the adversarial training strategy [4] is applied to make the model robust and generalized. Moreover, two implementation tricks are also used in our system to obtain better results. The experiments demonstrate that the novel framework can achieve 76.72% on VQAv2 test-std set.

翻译：在本文中,我们借鉴了先入为主的愿景语言培训模式的成功以及对抗性攻击培训的好处,提出了一个新的基于变压器的跨模式融合模型,将 VQA 挑战 2021 的两种概念都纳入其中。具体地说,拟议模式位于VinVL 模式[19] 结构的顶端,而对抗性培训战略[4] 被用于使该模式稳健和普遍化。此外,我们系统还运用了两种实施技巧来取得更好的结果。实验表明,新框架可以在VQAv2测试集上达到76.72%。

1

相关内容

视觉问答

视觉问答（Visual Question Answering，VQA），是一种涉及计算机视觉和自然语言处理的学习任务。这一任务的定义如下： A VQA system takes as input an image and a free-form, open-ended, natural-language question about the image and produces a natural-language answer as the output[1]。翻译为中文：一个VQA系统以一张图片和一个关于这张图片形式自由、开放式的自然语言问题作为输入，以生成一条自然语言答案作为输出。简单来说，VQA就是给定的图片进行问答。

知识荟萃

精品入门和进阶教程、论文和代码整理等

更多

查看相关VIP内容、论文、资讯等

知识驱动的视觉知识学习，以VQA视觉问答为例，31页ppt

知识驱动的视觉知识学习，以VQA视觉问答为例，31页ppt

专知会员服务

36+阅读 · 2020年9月25日

【KDD2020-清华大学】自适应图编码器，Adaptive Graph Encoder for Attributed Graph Embedding

【KDD2020-清华大学】自适应图编码器，Adaptive Graph Encoder for Attributed Graph Embedding

专知会员服务

99+阅读 · 2020年7月6日

【CVPR2020】语义增强的场景文本识别的编码-解码器框架，SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition

【CVPR2020】语义增强的场景文本识别的编码-解码器框架，SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition

专知会员服务

25+阅读 · 2020年5月22日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

由浅及深，细致解读图像问答 VQA 2018 Challenge 冠军模型 Pythia

由浅及深，细致解读图像问答 VQA 2018 Challenge 冠军模型 Pythia

极市平台

8+阅读 · 2019年3月13日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

Transformer Tracking

Arxiv

17+阅读 · 2021年3月29日

Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking

Arxiv

7+阅读 · 2021年3月22日

Contrastive Triple Extraction with Generative Transformer

Arxiv

4+阅读 · 2020年12月15日

Adversarial Representation Learning for Text-to-Image Matching

Adversarial Representation Learning for Text-to-Image Matching

Arxiv

6+阅读 · 2019年8月28日

Joint Image Captioning and Question Answering

Arxiv

6+阅读 · 2018年5月22日

VIP会员

文章信息

相关主题

相关VIP内容

知识驱动的视觉知识学习，以VQA视觉问答为例，31页ppt

知识驱动的视觉知识学习，以VQA视觉问答为例，31页ppt

专知会员服务

36+阅读 · 2020年9月25日

【KDD2020-清华大学】自适应图编码器，Adaptive Graph Encoder for Attributed Graph Embedding

【KDD2020-清华大学】自适应图编码器，Adaptive Graph Encoder for Attributed Graph Embedding

专知会员服务

99+阅读 · 2020年7月6日

【CVPR2020】语义增强的场景文本识别的编码-解码器框架，SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition

【CVPR2020】语义增强的场景文本识别的编码-解码器框架，SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition

专知会员服务

25+阅读 · 2020年5月22日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

96+阅读 · 2020年3月12日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

前沿人工智能趋势报告（Frontier AI Trends Report）

【AAAI2026】善始则事半功倍：基于前缀优化的大语言模型推理强化学习

Andrej Karpathy：2025 年 LLM 年度回顾（2025 LLM Year in Review）

音退化问题：基于输入操控的鲁棒语音转换综述

相关资讯

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

由浅及深，细致解读图像问答 VQA 2018 Challenge 冠军模型 Pythia

由浅及深，细致解读图像问答 VQA 2018 Challenge 冠军模型 Pythia

极市平台

8+阅读 · 2019年3月13日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

相关论文

Transformer Tracking

Arxiv

17+阅读 · 2021年3月29日

Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking

Arxiv

7+阅读 · 2021年3月22日

Contrastive Triple Extraction with Generative Transformer

Arxiv

4+阅读 · 2020年12月15日

Adversarial Representation Learning for Text-to-Image Matching

Adversarial Representation Learning for Text-to-Image Matching

Arxiv

6+阅读 · 2019年8月28日

Joint Image Captioning and Question Answering

Arxiv

6+阅读 · 2018年5月22日

微信扫码咨询专知VIP会员