Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Animating virtual avatars with co-speech gestures can be used in human-machine interaction.
  • Existing methods rely on GANs, which have issues with mode collapse and unstable training.
  • A novel diffusion-based framework, DiffGesture, is proposed to capture audio-to-gesture associations and preserve temporal coherence.

Paper Content


  • Making co-speech gestures is an innate human behavior that helps people express and understand thoughts
  • Animating virtual avatars to gesticulate co-speech movements is important in embodied AI
  • Recent research focuses on audio-driven co-speech gesture generation
  • Early attempts treated this task as a searching-and-connecting problem
  • Deep neural networks have been used to learn the mapping from speech audio to human skeletons
  • GAN-based methods have been used to guarantee realism
  • Diffusion probabilistic models provide a new perspective for realistic generation
  • Difficult to adapt existing diffusion models for co-speech gesture generation
  • Proposed DiffGesture framework to capture audio-gesture associations while maintaining temporal coherence
  • Diffusion Audio-Gesture Transformer to model audio-gesture long-term temporal dependency
  • Diffusion Gesture Stabilizer to eliminate temporal inconsistency
  • Results outperform state-of-the-art methods
  • Co-speech gesture generation is important for various applications.
  • Rule-based pipelines and neural networks are used to map speech to gesture.
  • Input modality, speaking style, and speaker identity are studied to improve the model’s capacity.
  • GANs are used to guarantee realistic results, but mode collapse and unstable training are issues.

Our approach

Problem formulation

  • Leverage speaking videos with clear co-speech upper body movements for model learning
  • Extract speech audio sequence and use OpenPose to annotate per-frame human skeletons
  • Pre-process skeletal representation into concatenation of unit direction vectors
  • Optimize diffusion model’s reverse denoising process to synthesize human skeleton sequence conditioned on speech audio sequence and initial poses of first M frames
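
The pre-processing step above can be illustrated with a minimal sketch. It assumes a hypothetical four-joint chain and a hypothetical helper name, `to_unit_directions`; the actual skeleton topology and keypoint order come from the pose annotations and are not given in this summary.

```python
import numpy as np

# Hypothetical 4-joint chain (neck -> shoulder -> elbow -> wrist).
# The real bone list depends on the annotated skeleton and is an assumption here.
BONES = [(0, 1), (1, 2), (2, 3)]  # (parent, child) joint indices


def to_unit_directions(joints, bones=BONES, eps=1e-8):
    """Map joint coordinates of shape (frames, joints, dims) to a flat
    per-frame vector of unit bone directions, shape (frames, bones * dims)."""
    joints = np.asarray(joints, dtype=np.float64)
    parents = [p for p, _ in bones]
    children = [c for _, c in bones]
    vecs = joints[:, children] - joints[:, parents]       # (frames, bones, dims)
    norms = np.linalg.norm(vecs, axis=-1, keepdims=True)  # bone lengths
    units = vecs / np.maximum(norms, eps)                 # guard against zero-length bones
    return units.reshape(len(joints), -1)


# One frame with bones of length 1, 2 and 3 along the x, y and z axes.
frames = np.array([[[0, 0, 0], [1, 0, 0], [1, 2, 0], [1, 2, 3]]], dtype=float)
directions = to_unit_directions(frames)
```

Dividing by the bone length removes the speaker's body proportions from the representation, so only the pose itself remains.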

Gesture space forward and reverse process

  • Goal is to learn a model distribution that approximates a real data distribution
  • Denoising diffusion probabilistic models (DDPMs) define the latent variable models
  • Forward diffusion process approximates the posterior distribution
  • Variance schedule $\beta_1, \dots, \beta_T$ is fixed as constant hyperparameters
  • Reverse process estimates the joint distribution
  • Gaussian transition is used to formulate the reverse process
  • Audio and initial poses are used as context information
  • Variational bound on the negative log-likelihood is optimized
  • Training objective is simplified to an ensemble of MSE losses
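
As a rough illustration of the forward process and the simplified MSE objective, the sketch below uses the standard DDPM closed form x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε. The linear schedule, its endpoints, T = 500, the tensor sizes and the `predict_noise` callable are all assumptions made for the example, not values taken from the paper.

```python
import numpy as np


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Constant (non-learned) variance schedule; the endpoints are assumed values."""
    return np.linspace(beta_start, beta_end, T)


def q_sample(x0, t, alphas_cumprod, noise):
    """Draw x_t from the forward process in closed form (t is a 0-based index)."""
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise


def simple_loss(predict_noise, x0, context, t, alphas_cumprod, rng):
    """MSE between the injected noise and the model's noise prediction."""
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alphas_cumprod, noise)
    return float(np.mean((noise - predict_noise(x_t, context, t)) ** 2))


T = 500                                   # assumed number of diffusion steps
betas = linear_beta_schedule(T)
alphas_cumprod = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((34, 27))        # hypothetical: 34 frames, 27-dim pose vectors
context = None                            # stands in for audio and initial-pose features


def dummy_model(x_t, context, t):
    """Placeholder for the denoising network; always predicts zero noise."""
    return np.zeros_like(x_t)


loss = simple_loss(dummy_model, x0, context, 100, alphas_cumprod, rng)
```

In training, `t` would be drawn at random for each sample, which is why the objective reads as an ensemble of MSE losses over time steps.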

Diffusion audio-gesture transformer

  • Naive conditional generation scheme has a critical problem in co-speech gesture generation.
  • Temporal dependency between target sequence and context information makes it more complex than time-invariant tasks.
  • Transformer’s strong capacity in sequential data modeling is proposed to guarantee temporally coherent results.
  • Skeleton and context condition of each frame serve as individual tokens, which captures long-term dependency with self-attention mechanism.
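
The per-frame token idea can be sketched as follows. The example assumes that the noisy skeleton and the context features of each frame are concatenated along the feature axis, and it uses a single attention head with random weights; the feature sizes and function names are hypothetical, and the real model is a full Transformer rather than this single layer.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frame_tokens(noisy_gesture, audio_feat, init_pose_feat):
    """Build one token per frame by concatenating features: (frames, d_token)."""
    return np.concatenate([noisy_gesture, audio_feat, init_pose_feat], axis=-1)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the frame axis."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (frames, frames)
    weights = softmax(scores, axis=-1)       # each frame attends to every frame
    return weights @ v, weights


rng = np.random.default_rng(0)
n_frames, d_pose, d_audio, d_model = 34, 27, 32, 16  # assumed sizes
tokens = frame_tokens(
    rng.standard_normal((n_frames, d_pose)),   # noisy gesture x_t
    rng.standard_normal((n_frames, d_audio)),  # per-frame audio features
    rng.standard_normal((n_frames, d_pose)),   # initial-pose features, one row per frame
)
w_q, w_k, w_v = (rng.standard_normal((tokens.shape[1], d_model)) * 0.1 for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)
```

Because the attention matrix is frames by frames, every output frame can draw on audio and pose information from any other frame, which is how long-term dependency is captured.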

Diffusion gesture stabilizer

  • DDPMs introduce independent random variables to improve task performance, but can have a negative effect on temporal consistency.
  • A novel Diffusion Gesture Stabilizer is proposed to achieve a trade-off between diversity and temporal consistency.
  • Hard thresholding and smooth sampling are used to restrict temporal variation and avoid inconsistency.
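
One plausible reading of the two stabilizer variants is sketched below; it is not the paper's exact rule. The sketch assumes that temporal variation is restricted by sharing part of the sampling noise across frames: hard thresholding switches to fully shared noise once the time step falls to a threshold `t0` or below, while smooth sampling blends shared and independent noise in proportion to `t / T`. The function name, `t0`, `T` and the linear blend are all hypothetical.

```python
import numpy as np


def stabilized_noise(num_frames, dim, t, rng, mode="hard", t0=20, T=500):
    """Sample per-frame noise whose frame-to-frame variation shrinks as t decreases.

    rho is the fraction of variance that is independent per frame; the remainder
    comes from one noise vector shared by all frames. Each entry keeps unit variance.
    """
    independent = rng.standard_normal((num_frames, dim))
    shared = np.broadcast_to(rng.standard_normal((1, dim)), (num_frames, dim))
    if mode == "hard":
        rho = 1.0 if t > t0 else 0.0  # assumed rule: share noise only near the end
    elif mode == "smooth":
        rho = t / T                   # assumed rule: linear decay of per-frame variance
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.sqrt(rho) * independent + np.sqrt(1.0 - rho) * shared


rng = np.random.default_rng(0)
late_noise = stabilized_noise(8, 4, t=5, rng=rng, mode="hard")     # identical across frames
early_noise = stabilized_noise(8, 4, t=400, rng=rng, mode="hard")  # independent per frame
blend_noise = stabilized_noise(8, 4, t=250, rng=rng, mode="smooth")
```

Early steps keep the full randomness that drives diversity, while the final steps, which determine fine detail, no longer inject frame-wise jitter.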

Implicit classifier-free guidance

  • Co-speech gesture literature has difficulty utilizing explicit classifier guidance.
  • An “unconditional” diffusion model is trained to implicitly guide the generation.
  • A mix-up training trick is used to parameterize both the conditional and unconditional models.
  • TED Expressive dataset contains 3D coordinates of 43 keypoints.
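
The guidance bullets above follow the general classifier-free guidance pattern, which can be sketched as below. The sketch assumes the common recipe of randomly replacing the condition with a null placeholder during training and combining the two noise predictions at sampling time; the drop probability, guidance scale and function names are hypothetical, and the paper's own mix-up trick may differ in detail.

```python
import numpy as np


def drop_condition(context, null_context, p_uncond, rng):
    """Training-time trick: with probability p_uncond, hide the condition so the
    same network also learns the unconditional model."""
    return null_context if rng.random() < p_uncond else context


def guided_noise(eps_cond, eps_uncond, scale):
    """Sampling-time combination: move from the unconditional prediction toward
    (scale = 1) or beyond (scale > 1) the conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
audio = np.ones((34, 32))         # hypothetical per-frame audio features
null_audio = np.zeros_like(audio)  # hypothetical placeholder for "no condition"
train_context = drop_condition(audio, null_audio, p_uncond=0.1, rng=rng)

eps_c = np.full((34, 27), 2.0)    # stand-in conditional noise prediction
eps_u = np.full((34, 27), 1.0)    # stand-in unconditional noise prediction
eps = guided_noise(eps_c, eps_u, scale=1.5)
```

No separate classifier is needed: the difference between the two predictions plays the role that a classifier gradient would play in explicit guidance.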

Experimental settings

  • Attention Seq2Seq uses attention mechanism to generate pose sequences from speech text
  • Speech2Gesture uses spectrums of speech audio segments as input to generate speech gestures
  • Joint Embedding maps text and motion to same embedding space to generate outputs
  • Trimodal uses text, audio, and speaker identity to generate gestures
  • HA2G uses hierarchical audio learner to capture information across different semantic granularities

Evaluation metrics

  • Evaluation uses three metrics: Fréchet Gesture Distance (FGD), Beat Consistency Score (BC) and Diversity
  • FGD measures the distance between the synthesized gesture distribution and the real data distribution
  • BC measures motion-audio beat correlation
  • Diversity evaluates the variations among generated gestures
  • Five methods are compared alongside the ground truth (GT): S2S, S2G, Joint, Tri and HA2G
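
To make the FGD and Diversity bullets concrete, here is a minimal sketch. It assumes gesture features have already been extracted by some encoder (not shown), fits a Gaussian to each feature set and computes the Fréchet distance between the two Gaussians. The Diversity function uses the mean L1 distance over randomly drawn feature pairs, which is an assumed definition; the number of pairs is also hypothetical.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (samples, dims) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrt(cov_a @ cov_b)) equals the sum of square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def diversity(feats, num_pairs, rng):
    """Mean L1 distance between randomly drawn pairs of feature vectors (assumed form)."""
    i = rng.integers(0, len(feats), size=num_pairs)
    j = rng.integers(0, len(feats), size=num_pairs)
    return float(np.abs(feats[i] - feats[j]).sum(axis=-1).mean())


rng = np.random.default_rng(0)
real = rng.standard_normal((200, 4))  # stand-in features of real gestures
shifted = real + 3.0                  # same spread, mean moved by 3 in each of 4 dims
fgd_same = frechet_distance(real, real)
fgd_shifted = frechet_distance(real, shifted)
div = diversity(real, num_pairs=500, rng=rng)
```

A lower FGD means the generated distribution sits closer to the real one, while a higher Diversity value means the generated gestures vary more from one another.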

Evaluation results

  • We compare our method with baselines on two datasets
  • We report FGD, BC and Diversity metrics
  • Our method outperforms existing methods by a large margin
  • Comparison methods produce slow and invariant poses
  • We conduct a user study with 18 participants
  • We present ablation studies on key modules
  • We compare GRU and Transformer for diffusion-based backbone