Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration: they suffer from incomplete modality coverage, interaction restricted to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation of understanding, generation, and reasoning. FysicsWorld comprises 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We further propose the Cross-Modal Complementarity Screening (CMCS) strategy, integrated into a systematic data construction framework that produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes the performance disparities and limitations of current models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.