
# EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis

3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images at real-time inference speed. However, because it is typically trained on a single short video that lacks diversity in facial emotions, the resulting talking heads struggle to express a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model, which can manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal) while keeping lip movements synchronized with the input audio. Furthermore, to achieve accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We evaluate EmoTalkingGaussian on publicly available videos and obtain better results than state-of-the-art methods in terms of image quality (measured by PSNR, SSIM, LPIPS), emotion expression (measured by V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured by LMD, Sync-E, Sync-C).
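The abstract does not describe the conditioning mechanism in detail. As a rough illustration of what conditioning on continuous emotion values might look like, below is a minimal PyTorch sketch that embeds a (valence, arousal) pair into a feature vector that could modulate per-Gaussian appearance attributes. The `EmotionConditioner` module, its layer sizes, and the way its output would be consumed by the renderer are assumptions for illustration only, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class EmotionConditioner(nn.Module):
    """Hypothetical sketch: map continuous (valence, arousal) values,
    each typically in [-1, 1], to a feature vector that a downstream
    head could use to modulate Gaussian appearance attributes."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, 128),
            nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, valence: torch.Tensor, arousal: torch.Tensor) -> torch.Tensor:
        # Stack the two scalars per sample into a (B, 2) conditioning input.
        va = torch.stack([valence, arousal], dim=-1)
        return self.mlp(va)  # (B, feat_dim)

# Usage: mildly positive valence, low arousal (e.g., a calm, content expression).
cond = EmotionConditioner()
emotion_feat = cond(torch.tensor([0.6]), torch.tensor([-0.2]))
print(emotion_feat.shape)  # torch.Size([1, 64])
```

Because valence and arousal are continuous rather than categorical, a conditioning module of this kind can in principle interpolate smoothly between expressions, which is the property the paper's continuous-emotion formulation targets.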