Progressive Facial Granularity Aggregation with Bilateral Attribute-based Enhancement for Face-to-Speech Synthesis

For individuals who have experienced traumatic events such as strokes, speech may no longer be a viable means of communication. Text-to-speech (TTS) can serve as a communication aid because it generates synthetic speech, but it fails to preserve the user's own voice. Face-to-voice (FTV) synthesis, which derives a corresponding voice from a facial image, therefore provides a promising alternative. However, existing methods rely on pre-trained visual encoders and fine-tune them to align with speech embeddings, which strips fine-grained information such as gender or ethnicity from the facial inputs, despite the known correlation of these attributes with vocal traits. Moreover, these pipelines are multi-stage and require separate training of multiple components, which makes training inefficient. To address these limitations, we model fine-grained facial attributes by decomposing facial images into non-overlapping segments and progressively integrating them into a multi-granular representation. This representation is further refined through multi-task learning of speaker attributes such as gender and ethnicity in both the visual and acoustic domains. In addition, to improve alignment robustness, we adopt a multi-view training strategy that pairs multiple views of a speaker, captured at different angles and under different lighting conditions, with identical speech recordings. Extensive subjective and objective evaluations confirm that our approach substantially enhances face-voice congruence and synthesis stability.
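
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of decomposing an image into non-overlapping segments and progressively aggregating them into a multi-granular representation. Everything in it is an assumption made for illustration: the function names (`split_patches`, `merge_2x2`, `multi_granular`), the patch size, the use of raw pixels as segment features, and the 2x2 average merge are hypothetical and are not taken from the paper, which presumably uses learned encoders.

```python
import numpy as np


def split_patches(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch segments.

    Returns a grid of flattened segments with shape
    (H // patch, W // patch, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (gh, gw, patch, patch, c)
    return x.reshape(gh, gw, patch * patch * c)


def merge_2x2(grid):
    """Average each 2x2 neighbourhood of segment features into one coarser cell."""
    gh, gw, d = grid.shape
    assert gh % 2 == 0 and gw % 2 == 0, "grid must have even side lengths"
    return grid.reshape(gh // 2, 2, gw // 2, 2, d).mean(axis=(1, 3))


def multi_granular(image, patch=8):
    """Build a fine-to-coarse list of segment grids and one concatenated vector.

    Assumption: each granularity level is summarised by mean pooling, and the
    summaries are concatenated. A learned model would replace both steps.
    """
    grid = split_patches(image, patch)
    levels = [grid]
    # Keep merging while the grid can still be halved in both directions.
    while min(grid.shape[:2]) > 1 and grid.shape[0] % 2 == 0 and grid.shape[1] % 2 == 0:
        grid = merge_2x2(grid)
        levels.append(grid)
    pooled = [level.mean(axis=(0, 1)) for level in levels]
    return levels, np.concatenate(pooled)


if __name__ == "__main__":
    face = np.random.default_rng(0).random((32, 32, 3))  # stand-in for a face crop
    levels, feature = multi_granular(face, patch=8)
    print([level.shape for level in levels], feature.shape)
```

With a 32x32 input and 8x8 segments, the sketch yields three granularity levels (a 4x4, a 2x2, and a 1x1 grid). In a real system each level would feed a trainable aggregation module rather than a fixed average.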

