Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Animating virtual avatars with co-speech gestures can be used in human-machine interaction.
Existing methods rely on GANs, which have issues with mode collapse and unstable training.
A novel diffusion-based framework, DiffGesture, is proposed to capture audio-to-gesture associations and preserve temporal coherence.

Paper Content

Introduction

Making co-speech gestures is an innate human behavior that helps people express and understand thoughts
Animating virtual avatars to gesticulate co-speech movements is important in embodied AI
Recent research focuses on audio-driven co-speech gesture generation
Early attempts treated this task as a searching-and-connecting problem
Deep neural networks have been used to learn the mapping from speech audio to human skeletons
GAN-based methods have been used to guarantee realism
Diffusion probabilistic models provide a new perspective for realistic generation
Difficult to adapt existing diffusion models for co-speech gesture generation
Proposed DiffGesture framework to capture audio-gesture associations while maintaining temporal coherence
Diffusion Audio-Gesture Transformer to model audio-gesture long-term temporal dependency
Diffusion Gesture Stabilizer to eliminate temporal inconsistency
Results outperform state-of-the-arts with superior performance

Co-speech gesture generation is important for various applications.
Rule-based pipelines and neural networks are used to map speech to gesture.
Input modality, speaking style, and speaker identity are studied to improve the model’s capacity.
GANs are used to guarantee realistic results, but mode collapse and unstable training are issues.

Our approach

Problem formulation

Leverage speaking videos with clear co-speech upper body movements for model learning
Extract speech audio sequence and use OpenPose to annotate per-frame human skeletons
Pre-process skeletal representation into concatenation of unit direction vectors
Optimize diffusion model’s reverse denoising process to synthesize human skeleton sequence conditioned on speech audio sequence and initial poses of first M frames

Gesture space forward and reverse process

Goal is to learn a model distribution that approximates a real data distribution
Denoising diffusion probabilistic models (DDPMs) define the latent variable models
Forward diffusion process approximates the posterior distribution
Variance schedule β 1 , . . . , β T is a constant hyperparameter
Reverse process estimates the joint distribution
Gaussian transition is used to formulate the reverse process
Audio and initial poses are used as context information
Variational lower bound on negative loglikelihood is optimized
Training objective is simplified to an ensemble of MSE losses

Diffusion audio-gesture transformer

Naive conditional generation scheme has a critical problem in co-speech gesture generation.
Temporal dependency between target sequence and context information makes it more complex than time-invariant tasks.
Transformer’s strong capacity in sequential data modeling is proposed to guarantee temporally coherent results.
Skeleton and context condition of each frame serve as individual tokens, which captures long-term dependency with self-attention mechanism.

Diffusion gesture stabilizer

DDPMs introduce independent random variables to improve task performance, but can have a negative effect on temporal consistency.
A novel Diffusion Gesture Stabilizer is proposed to achieve a trade-off between diversity and temporal consistency.
Hard thresholding and smooth sampling are used to restrict temporal variation and avoid inconsistency.

Implicit classifier-free guidance

Co-speech gesture literature has difficulty utilizing explicit classifier guidance.
A “unconditional” diffusion model is trained to implicitly guide the generation.
A mix-up training trick is used to parameterize both the conditional and unconditional models.
TED Expressive dataset contains 3D coordinates of 43 keypoints.

Experimental settings

Attention Seq2Seq uses attention mechanism to generate pose sequences from speech text
Speech2Gesture uses spectrums of speech audio segments as input to generate speech gestures
Joint Embedding maps text and motion to same embedding space to generate outputs
Trimodal uses text, audio, and speaker identity to generate gestures
HA2G uses hierarchical audio learner to capture information across different semantic granularities

Evaluation metrics

Evaluation uses three metrics: Fréchet Gesture Distance (FGD), Beat Consistency Score (BC) and Diversity
FGD measures the distance between the synthesized gesture distribution and the real data distribution
BC measures motion-audio beat correlation
Diversity evaluates the variations among generated gestures
Five methods are compared: GT S2S, S2G, Joint, Tri and HA2G

Evaluation results

We compare our method with baselines on two datasets
We report FGD, BC and Diversity metrics
Our method outperforms existing methods by a large margin
Comparison methods produce slow and invariant poses
We conduct a user study with 18 participants
We present ablation studies on key modules
We compare GRU and Transformer for diffusion-based backbone

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Our approach#

Problem formulation#

Gesture space forward and reverse process#

Diffusion audio-gesture transformer#

Diffusion gesture stabilizer#

Implicit classifier-free guidance#

Experimental settings#

Evaluation metrics#

Evaluation results#

Link to paper

Abstract

Paper Content

Introduction

Related work

Our approach

Problem formulation

Gesture space forward and reverse process

Diffusion audio-gesture transformer

Diffusion gesture stabilizer

Implicit classifier-free guidance

Experimental settings

Evaluation metrics

Evaluation results