We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (MoE; 30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL rests on three core pillars: (i) markedly stronger pure-text understanding, in several cases surpassing comparable text-only backbones; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and on visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatiotemporal modeling across images and video; (ii) DeepStack integration, which leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and MoE architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
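To make the first architectural upgrade concrete, the sketch below illustrates one plausible reading of interleaved-MRoPE. Standard MRoPE assigns contiguous blocks of rotary channels to the temporal, height, and width axes; an interleaved layout instead cycles the three axes across the frequency bands, so each axis spans the full frequency spectrum from high to low. The function name `interleaved_mrope_angles` and all shapes are illustrative assumptions; the abstract does not specify the implementation.

```python
# A minimal sketch of the interleaved-MRoPE idea (hypothetical helper;
# the report does not give implementation details here).
import torch

def interleaved_mrope_angles(pos_t, pos_h, pos_w, head_dim=128, base=10000.0):
    """Return rotary angles of shape (num_tokens, head_dim // 2).

    pos_t, pos_h, pos_w: LongTensors of shape (num_tokens,) holding each
    token's 3-D position (text tokens use pos_t == pos_h == pos_w).
    """
    half = head_dim // 2
    # Standard RoPE inverse frequencies, one per rotary channel pair.
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    pos = torch.stack([pos_t, pos_h, pos_w], dim=-1).float()  # (N, 3)
    # Interleave: frequency band i is driven by axis i % 3, so each of
    # t/h/w touches high-, mid-, and low-frequency channels alike,
    # rather than owning one contiguous chunk of the spectrum.
    axis_of_band = torch.arange(half) % 3                      # (half,)
    pos_per_band = pos[:, axis_of_band]                        # (N, half)
    return pos_per_band * inv_freq                             # (N, half)
```

Note that when pos_t == pos_h == pos_w, every band sees the same position and the sketch degenerates to ordinary 1-D RoPE, which is the behavior one would want for pure-text tokens.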
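Likewise, a minimal sketch of the second upgrade, DeepStack integration, under the assumption (not stated in the abstract) that features from several ViT levels, projected to the LLM width, are added onto the visual-token positions of the earliest decoder layers' hidden states. All names and shapes below are hypothetical.

```python
# A minimal sketch of DeepStack-style multi-level feature injection
# (hypothetical names and shapes; the abstract only says that multi-level
# ViT features are used to tighten vision-language alignment).
import torch

def inject_deepstack(hidden, vit_feats, vis_pos, layer_idx):
    """Add this decoder layer's ViT feature level onto visual-token slots.

    hidden:    (batch, seq_len, d_model) decoder hidden states.
    vit_feats: list of (batch, num_vis, d_model) tensors, one per ViT level,
               already projected into the LLM embedding space.
    vis_pos:   (num_vis,) LongTensor of visual-token indices in the sequence.
    layer_idx: index of the current decoder layer.
    """
    if layer_idx < len(vit_feats):  # only the earliest decoder layers
        hidden[:, vis_pos, :] = hidden[:, vis_pos, :] + vit_feats[layer_idx]
    return hidden
```

The design intuition is that shallow ViT levels carry fine-grained spatial detail that a single final-layer projection would discard; feeding one level per early decoder layer lets the LLM consume that detail without lengthening the visual token sequence.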