Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Building a dataset, BEAT, with 76 hours of multi-modal data from 30 speakers
32 million frame-level emotion and semantic relevance annotations
Correlation of conversational gestures with facial expressions, emotions, and semantics
Proposing a baseline model, Cascaded Motion Network (CaMN)
Introducing a metric, Semantic Relevance Gesture Recall (SRGR)
BEAT is the largest motion capture dataset for investigating human gestures

Paper Content

Related work

Mo-cap and pseudo-label conversational gesture datasets exist
The most commonly used mo-cap dataset is the 4-hour Trinity dataset
Datasets for talking-face generation exist, but cannot be used for gesture synthesis
Semantic or emotion-aware motion synthesis studied in action recognition and sign-language analysis/synthesis
Baseline models for conversational gesture synthesis exist
Efforts to improve performance of baseline models by input/output representation selection, adversarial training, and generative modeling techniques
Probabilistic gesture generation enables generating diversity based on noise

BEAT: Body-Expression-Audio-Text dataset

Dataset acquisition process described
Text, emotion, and semantic relevance information annotation introduced
Correlation between conversational gestures and emotions analyzed using BEAT
Distribution of semantic relevance shown
Motion capture system based on 16 synchronized cameras recording motion at 120 Hz
Facial capture system uses ARKit with a depth camera on iPhone 12 Pro
Audio recorded in 48 kHz stereo

Data acquisition

BEAT is divided into conversation and self-talk sessions
Speaker’s gestures are divided into four categories
Topics are selected from 20 predefined topics
Self-talk sessions consist of 120 1-minute recordings
8 emotions are covered in the dataset
Proportion of languages and accents is strictly controlled
Mainly English data, with some Chinese, Spanish and Japanese
30 speakers from different ethnicities
Speakers asked to read answers proficiently and show natural, personal, daily style of conversational gestures
Professional speaker instructs them to elicit corresponding emotion correctly

Data annotation

Used an Automatic Speech Recognizer (ASR) to obtain initial text for conversation session
Used Montreal Forced Aligner (MFA) for temporal alignment of text with audio
Confirmed 8-class emotion label of self-talk
Annotators watched video with audio and gestures to perform frame-level annotation
600 annotators from Amazon Mechanical Turk (AMT) scored semantic relevance on a scale of 0-10
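
The word-level alignment described above can be turned into frame-level text labels. The following is a minimal sketch, not the authors' pipeline: the function name `words_to_frames`, the `(word, start, end)` tuple layout, and the frame rate passed in are all assumptions made for illustration.

```python
def words_to_frames(intervals, num_frames, fps):
    """Assign each frame the word whose time interval contains the frame timestamp.

    intervals: list of (word, start_sec, end_sec) tuples, for example parsed
               from forced-alignment output (hypothetical layout).
    Frames that fall into silence keep an empty string.
    """
    labels = [""] * num_frames
    for frame in range(num_frames):
        timestamp = frame / fps  # time of this frame in seconds
        for word, start, end in intervals:
            if start <= timestamp < end:
                labels[frame] = word
                break
    return labels


# Usage: two aligned words sampled at a hypothetical 4 frames per second.
frame_labels = words_to_frames(
    [("hello", 0.0, 0.5), ("world", 0.75, 1.25)], num_frames=6, fps=4
)
```

Using half-open intervals means a frame that lands exactly on a word's end time is not assigned to that word, which avoids double assignment when two words are adjacent.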

Data analysis

BEAT collection and annotation enables analysis of correlations between conversational gestures and other modalities.
Facial expressions and emotions are strongly correlated.
Visualization of gestures with t-SNE shows different characteristics for different emotions.
The semantic relevance between gestures and text shows large randomness.

Multi-modal conditioned gestures synthesis baseline

Proposed baseline Cascaded Motion Network (CaMN) encodes text, emotion condition, speaker identity, audio and facial blendshape weights to synthesize body and hand gestures
Network selection for the text, audio and speaker ID encoders follows [53], customized for better performance
All input data have same time resolution as output gestures
Gesture and facial blendshape weights downsampled to 15 FPS
Text encoder converts words to a set of word embeddings, which are fine-tuned by a customized encoder
Audio encoder takes the raw waveform representation of the audio, downsampled to 16 kHz
Facial expression encoder takes initial representation of facial expression and extracts facial latent feature
Body and hands decoders implemented in a separate, cascaded structure
Final supervision based on gesture reconstruction and adversarial loss
Loss function adjusted with weights of L1 loss and adversarial loss using semantic-relevancy label
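
The loss weighting in the last two points can be sketched as follows. This is only an illustration under stated assumptions: the summary does not say exactly which term the semantic-relevancy label scales, so the sketch assumes it scales the L1 reconstruction term. The names `weighted_gesture_loss`, `beta_rec` and `beta_adv` are hypothetical, and the adversarial loss is passed in as a precomputed number.

```python
def weighted_gesture_loss(predicted, target, adversarial_loss,
                          semantic_weight, beta_rec=1.0, beta_adv=1.0):
    """Combine an L1 reconstruction loss with an adversarial loss.

    predicted, target: flat sequences of joint values for one clip.
    semantic_weight:   semantic-relevancy label for the clip (assumed to
                       scale the reconstruction term only).
    beta_rec, beta_adv: hypothetical balancing weights for the two terms.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("predicted and target must be non-empty and equal in length")
    l1 = sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)
    return semantic_weight * beta_rec * l1 + beta_adv * adversarial_loss


# Usage: a clip with higher semantic relevance is penalized more for the same error.
low = weighted_gesture_loss([0.0, 1.0], [1.0, 1.0], 0.3, semantic_weight=1.0)
high = weighted_gesture_loss([0.0, 1.0], [1.0, 1.0], 0.3, semantic_weight=2.0)
```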

Metric for semantic relevancy

Proposed a metric called Semantic-Relevant Gesture Recall (SRGR) to evaluate the semantic relevancy of gestures.
SRGR uses semantic scores as a weight for the Probability of Correct Keypoint (PCK) between the generated and ground truth gestures.
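
A minimal sketch of this idea follows. It assumes that joints are 3D positions, that one semantic score is available per frame, and that the result is averaged over frames; the threshold value, the score scaling, and the function name `srgr` are assumptions for illustration rather than the paper's exact definition.

```python
import math


def srgr(generated, ground_truth, semantic_scores, threshold):
    """Semantic-score-weighted PCK between generated and ground-truth gestures.

    generated, ground_truth: nested lists shaped [frames][joints][3].
    semantic_scores:         one weight per frame (assumed already scaled).
    threshold:               distance below which a joint counts as correct.
    """
    if not (len(generated) == len(ground_truth) == len(semantic_scores)):
        raise ValueError("all inputs must cover the same number of frames")
    total = 0.0
    for gen_frame, gt_frame, score in zip(generated, ground_truth, semantic_scores):
        # Fraction of joints within the threshold for this frame (plain PCK).
        correct = sum(
            1 for p, q in zip(gen_frame, gt_frame) if math.dist(p, q) < threshold
        )
        total += score * correct / len(gt_frame)
    return total / len(ground_truth)


# Usage: two frames, two joints each; the second frame has a lower semantic score.
gt = [[(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 0)]]
gen = [[(0, 0, 0.01), (1, 0, 0.5)], [(0, 0, 0), (1, 0, 0)]]
score = srgr(gen, gt, semantic_scores=[1.0, 0.5], threshold=0.1)
```

Because each frame's PCK is multiplied by its semantic score, errors in frames annotated as semantically relevant lower the metric more than errors in frames with low scores.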

Experiments

Evaluated validity of SRGR metric
Demonstrated data quality of dataset through subjective experiments
Demonstrated validity of baseline model through subjective and objective experiments
Discussed contribution of each modality through ablation experiments
Conducted user study to evaluate validity of SRGR
Found large variance in L1 diversity
Humans evaluate diversity on range of motion and other implicit features

Data quality

Comparing proposed dataset with Trinity and S2G-3D
Trinity dataset has 23 sequences, 10 minutes each
Data split into 19:2:2 for train/valid/test
Used S2G and A2G to cover GAN and VAE models
120 participants compared clips from Trinity and proposed dataset
Evaluated correctness, diversity, and synchrony
Results show proposed dataset received higher user preference
Especially for hand movements, outperformed Trinity by large margin
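
The 19:2:2 split of the 23 Trinity sequences mentioned above can be sketched as follows, assuming the ratio counts whole sequences. Which sequences land in which subset is not stated in this summary, so the in-order assignment and the name `split_sequences` are assumptions for illustration.

```python
def split_sequences(sequence_ids, counts=(19, 2, 2)):
    """Split a list of sequence IDs into train/valid/test by whole sequences."""
    n_train, n_valid, n_test = counts
    if len(sequence_ids) != n_train + n_valid + n_test:
        raise ValueError("counts must add up to the number of sequences")
    train = sequence_ids[:n_train]
    valid = sequence_ids[n_train:n_train + n_valid]
    test = sequence_ids[n_train + n_valid:]
    return train, valid, test


# Usage: 23 placeholder sequence IDs.
train_ids, valid_ids, test_ids = split_sequences(list(range(23)))
```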

Evaluation of the baseline model

Training environment: NVIDIA V100
Evaluation metrics: FGD, SRGR, Beat-Align
Optimizer: Adam
Learning rate: 2e-4
Pretrained network: LSTM-based autoencoder
Results compared to: S2G, A2G, Seq2Seq, MultiContext
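
Of the metrics listed above, FGD is a Fréchet distance between Gaussian fits of latent gesture features. The sketch below shows that distance only, assuming the features have already been extracted (for instance by a pretrained autoencoder such as the one listed above) into arrays of shape `(num_clips, feature_dim)` with `feature_dim >= 2`. The function name and the use of NumPy are choices made for illustration, not the authors' code.

```python
import numpy as np


def frechet_distance(real_features, generated_features):
    """Fréchet distance between Gaussians fitted to two feature sets."""
    mu_r = real_features.mean(axis=0)
    mu_g = generated_features.mean(axis=0)
    cov_r = np.cov(real_features, rowvar=False)
    cov_g = np.cov(generated_features, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative in
    # exact arithmetic; clip small negative values caused by rounding.
    eigenvalues = np.linalg.eigvals(cov_r @ cov_g).real
    trace_sqrt = np.sqrt(np.clip(eigenvalues, 0.0, None)).sum()
    mean_term = float(np.sum((mu_r - mu_g) ** 2))
    cov_term = float(np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
    return mean_term + cov_term


# Usage: shifting every feature vector by a constant changes only the mean term.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))
shifted = real + np.array([1.0, 2.0, 0.0])
distance = frechet_distance(real, shifted)  # close to 1^2 + 2^2 = 5
```

A lower value means the generated feature distribution is closer to the real one.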

Ablation study

Cascaded connection can achieve better performance than end-to-end approach
Removing audio reduces synchrony, but some synchrony remains
Removing weighted semantic loss improves synchrony
Emotion is related to synchrony, whereas speaker ID has little effect on it
Removing audio, emotion, and facial expression does not significantly affect Semantic-Relevant Gesture Recall (SRGR)
Data from each modality contributes to improving FGD
Combining audio and facial expressions improves FGD significantly
Removing emotion and speaker ID also impacts FGD scores
Classifier trained and tested on speaker-4’s ground truth data

Limitation

The impact of acting on the captured data is inevitable but controlled.
Data was filtered out due to inconsistencies in style.
SRGR is calculated based on semantic annotation.

Conclusion

Built a large-scale, high-quality, multi-modal, semantic and emotional annotated dataset
Proposed a cascade-based baseline model for gesture synthesis based on six modalities
Achieved SoTA performance
Introduced SRGR for evaluating semantic relevancy
Annotation interface adapted from VGG Image Annotator
Inter-rater agreement rate of 96% for emotion annotations
Distribution of vowels and consonants consistent with 3000 words
10 topics for debate and introduction
Data collected from speakers of various countries, gender, ages and ethnicity
Modelled style differences with explicit controls
Filtered out 21 hours of data and six speakers due to inconsistencies in their styles
30 effective speakers with 76 hours of recordings
34 and 26 hours of recordings from native and fluent English speakers
Facial expressions represented with FACS-based blendshapes
Motion retargeting on body bone animation
Fréchet Gesture Distance (FGD) evaluates how close generated gestures are to real ones, and BeatAlign evaluates audio-gesture synchrony
6% higher score than GT for a random sample of 300 gesture clips
83% average precision for 100 gesture clips
Single directional evaluation has higher precision than bi-directional and non-exponential
Released data file format includes motion capture, audio, facial blendshape weights, facial mesh, text-audio alignment, semantics and emotion annotations
