Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Animating virtual avatars with co-speech gestures can be used in human-machine interaction.
- Existing methods rely on GANs, which have issues with mode collapse and unstable training.
- A novel diffusion-based framework, DiffGesture, is proposed to capture audio-to-gesture associations and preserve temporal coherence.
Paper Content
Introduction
- Making co-speech gestures is an innate human behavior that helps people express and understand thoughts
- Animating virtual avatars to gesticulate co-speech movements is important in embodied AI
- Recent research focuses on audio-driven co-speech gesture generation
- Early attempts treated this task as a searching-and-connecting problem
- Deep neural networks have been used to learn the mapping from speech audio to human skeletons
- GAN-based methods have been used to guarantee realism
- Diffusion probabilistic models provide a new perspective for realistic generation
- Difficult to adapt existing diffusion models for co-speech gesture generation
- Proposed DiffGesture framework to capture audio-gesture associations while maintaining temporal coherence
- Diffusion Audio-Gesture Transformer to model audio-gesture long-term temporal dependency
- Diffusion Gesture Stabilizer to eliminate temporal inconsistency
- Results outperform state-of-the-arts with superior performance
Related work
- Co-speech gesture generation is important for various applications.
- Rule-based pipelines and neural networks are used to map speech to gesture.
- Input modality, speaking style, and speaker identity are studied to improve the model’s capacity.
- GANs are used to guarantee realistic results, but mode collapse and unstable training are issues.
Our approach
Problem formulation
- Leverage speaking videos with clear co-speech upper body movements for model learning
- Extract speech audio sequence and use OpenPose to annotate per-frame human skeletons
- Pre-process skeletal representation into concatenation of unit direction vectors
- Optimize diffusion model’s reverse denoising process to synthesize human skeleton sequence conditioned on speech audio sequence and initial poses of first M frames
Gesture space forward and reverse process
- Goal is to learn a model distribution that approximates a real data distribution
- Denoising diffusion probabilistic models (DDPMs) define the latent variable models
- Forward diffusion process approximates the posterior distribution
- Variance schedule β 1 , . . . , β T is a constant hyperparameter
- Reverse process estimates the joint distribution
- Gaussian transition is used to formulate the reverse process
- Audio and initial poses are used as context information
- Variational lower bound on negative loglikelihood is optimized
- Training objective is simplified to an ensemble of MSE losses
Diffusion audio-gesture transformer
- Naive conditional generation scheme has a critical problem in co-speech gesture generation.
- Temporal dependency between target sequence and context information makes it more complex than time-invariant tasks.
- Transformer’s strong capacity in sequential data modeling is proposed to guarantee temporally coherent results.
- Skeleton and context condition of each frame serve as individual tokens, which captures long-term dependency with self-attention mechanism.
Diffusion gesture stabilizer
- DDPMs introduce independent random variables to improve task performance, but can have a negative effect on temporal consistency.
- A novel Diffusion Gesture Stabilizer is proposed to achieve a trade-off between diversity and temporal consistency.
- Hard thresholding and smooth sampling are used to restrict temporal variation and avoid inconsistency.
Implicit classifier-free guidance
- Co-speech gesture literature has difficulty utilizing explicit classifier guidance.
- A “unconditional” diffusion model is trained to implicitly guide the generation.
- A mix-up training trick is used to parameterize both the conditional and unconditional models.
- TED Expressive dataset contains 3D coordinates of 43 keypoints.
Experimental settings
- Attention Seq2Seq uses attention mechanism to generate pose sequences from speech text
- Speech2Gesture uses spectrums of speech audio segments as input to generate speech gestures
- Joint Embedding maps text and motion to same embedding space to generate outputs
- Trimodal uses text, audio, and speaker identity to generate gestures
- HA2G uses hierarchical audio learner to capture information across different semantic granularities
Evaluation metrics
- Evaluation uses three metrics: Fréchet Gesture Distance (FGD), Beat Consistency Score (BC) and Diversity
- FGD measures the distance between the synthesized gesture distribution and the real data distribution
- BC measures motion-audio beat correlation
- Diversity evaluates the variations among generated gestures
- Five methods are compared: GT S2S, S2G, Joint, Tri and HA2G
Evaluation results
- We compare our method with baselines on two datasets
- We report FGD, BC and Diversity metrics
- Our method outperforms existing methods by a large margin
- Comparison methods produce slow and invariant poses
- We conduct a user study with 18 participants
- We present ablation studies on key modules
- We compare GRU and Transformer for diffusion-based backbone