Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences, and therefore lack the semantic grounding and temporal structure needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations derived from the ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit coefficients, enabling high-fidelity expressive animation. Together, KeyframeFace and our LLM-based framework establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at https://github.com/wjc12345123/KeyframeFace.
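To make the keyframe-level supervision concrete, the sketch below shows one plausible way a single KeyframeFace sample could be represented in Python. The field names (`script`, `arkit_coeffs`, `keyframe_indices`, etc.) and the helper function are illustrative assumptions, not the dataset's actual schema; the real format is defined in the repository linked above.

```python
# Hypothetical sketch of a single KeyframeFace sample and its keyframe-level
# supervision signal. All field names are assumptions for illustration only.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KeyframeFaceSample:
    script: str                      # expressive natural-language script
    video_path: str                  # path to the paired monocular video clip
    arkit_coeffs: List[List[float]]  # per-frame ARKit blendshape coefficients (52 values per frame)
    keyframe_indices: List[int]      # manually defined keyframe positions within the clip
    context: str                     # contextual background for the performance
    emotion: str                     # complex emotion label
    annotations: Dict[str, str]      # multi-perspective LLM/MLLM annotations


def keyframe_coeffs(sample: KeyframeFaceSample) -> List[List[float]]:
    """Return the ARKit coefficient vectors at the annotated keyframes,
    i.e. the keyframe-level supervision described in the abstract."""
    return [sample.arkit_coeffs[i] for i in sample.keyframe_indices]
```

In such a layout, a text-to-animation model would condition on `script`, `context`, and `emotion`, and be supervised against the per-frame coefficients, with the keyframe subset providing the semantically aligned anchor poses.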