Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Building a dataset, BEAT, with 76 hours of multi-modal data from 30 speakers
  • 32 million frame-level emotion and semantic relevance annotations
  • Correlation of conversational gestures with facial expressions, emotions, and semantics
  • Proposing a baseline model, Cascaded Motion Network (CaMN)
  • Introducing a metric, Semantic Relevance Gesture Recall (SRGR)
  • BEAT is the largest motion capture dataset for investigating human gestures

Paper Content

  • Mo-cap and pseudo-label conversational gesture datasets exist
  • Most common mo-cap dataset is 4-hour Trinity dataset
  • Datasets for talking-face generation exist, but cannot be used for gesture synthesis
  • Semantic or emotion-aware motion synthesis studied in action recognition and sign-language analysis/synthesis
  • Baseline models for conversational gesture synthesis exist
  • Efforts to improve performance of baseline models by input/output representation selection, adversarial training, and generative modeling techniques
  • Probabilistic gesture generation enables generating diversity based on noise

BEAT: Body-Expression-Audio-Text dataset

  • Dataset acquisition process described
  • Text, emotion, and semantic relevance information annotation introduced
  • Correlation between conversational gestures and emotions analyzed using BEAT
  • Distribution of semantic relevance shown
  • Motion capture system based on 16 synchronized cameras recording motion at 120 Hz
  • Facial capture system uses ARKit with a depth camera on iPhone 12 Pro
  • Audio recorded in 48 kHz stereo
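
To make the capture rates above concrete, the following sketch computes how many audio samples fall within one motion capture frame. The function names are hypothetical, and the sketch assumes both streams start at the same instant, which this summary does not state.

```python
MOCAP_HZ = 120      # motion capture frame rate, from the list above
AUDIO_HZ = 48_000   # audio sample rate, from the list above


def audio_samples_per_frame(audio_hz: int = AUDIO_HZ, mocap_hz: int = MOCAP_HZ) -> int:
    """Number of audio samples covered by one motion capture frame."""
    if audio_hz % mocap_hz != 0:
        raise ValueError("audio rate is not an integer multiple of the motion rate")
    return audio_hz // mocap_hz


def audio_span_for_frame(frame_index: int) -> range:
    """Audio sample indices covered by motion frame `frame_index`.

    Assumes both streams start at t = 0 (an assumption of this sketch).
    """
    n = audio_samples_per_frame()
    return range(frame_index * n, (frame_index + 1) * n)
```

With these rates, each motion frame spans 400 audio samples per channel.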

Data acquisition

  • BEAT is divided into conversation and self-talk sessions
  • Speaker’s gestures are divided into four categories
  • Topics are selected from 20 predefined topics
  • Self-talk sessions consist of 120 1-minute recordings
  • 8 emotions are covered in the dataset
  • Proportion of languages and accents is strictly controlled
  • Mainly English data, with some Chinese, Spanish and Japanese
  • 30 speakers from different ethnicities
  • Speakers asked to read answers proficiently and show natural, personal, daily style of conversational gestures
  • Professional speaker instructs them to elicit corresponding emotion correctly

Data annotation

  • Used an Automatic Speech Recognizer (ASR) to obtain initial text for conversation session
  • Used Montreal Forced Aligner (MFA) for temporal alignment of text with audio
  • Confirmed 8-class emotion label of self-talk
  • Annotators watched video with audio and gestures to perform frame-level annotation
  • 600 annotators from Amazon Mechanical Turk (AMT) scored semantic relevance on a scale of 0-10
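
As a rough illustration of what text-audio alignment makes possible, the sketch below maps word intervals to per-frame word labels. The `(word, start, end)` tuple is a simplified, hypothetical stand-in for the aligner output, and the function name is not from the paper.

```python
from typing import List, Optional, Tuple

# (word, start_seconds, end_seconds): hypothetical simplified alignment record
WordInterval = Tuple[str, float, float]


def words_per_frame(
    alignment: List[WordInterval], fps: float, n_frames: int
) -> List[Optional[str]]:
    """Return the word spoken at each frame, or None where no word is active.

    Frame i is sampled at time i / fps; an interval covers start <= t < end.
    """
    labels: List[Optional[str]] = []
    for i in range(n_frames):
        t = i / fps
        current = None
        for word, start, end in alignment:
            if start <= t < end:
                current = word
                break
        labels.append(current)
    return labels
```

Frames that fall between words receive `None`, which a downstream encoder could treat as a silence or padding token.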

Data analysis

  • BEAT collection and annotation enables analysis of correlations between conversational gestures and other modalities.
  • Facial expressions and emotions are strongly correlated.
  • t-SNE visualization of gestures shows different characteristics for different emotions.
  • Semantic relevance between gestures and text shows large randomness.

Multi-modal conditioned gestures synthesis baseline

  • Proposed baseline Cascaded Motion Network (CaMN) encodes text, emotion condition, speaker identity, audio and facial blendshape weights to synthesize body and hand gestures
  • Network choices for the text, audio and speaker ID encoders follow [53], customized for better performance
  • All input data have same time resolution as output gestures
  • Gesture and facial blendshape weights downsampled to 15 FPS
  • Text encoder converts words to a set of word embeddings, which are fine-tuned by a customized encoder
  • Audio encoder takes the raw waveform of the audio, downsampled to 16 kHz
  • Facial expression encoder takes the initial representation of the facial expression and extracts a facial latent feature
  • Body and hand decoders implemented in a separate, cascaded structure
  • Final supervision based on gesture reconstruction and adversarial loss
  • Loss function adjusted with weights of L1 loss and adversarial loss using semantic-relevancy label
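
The last two points can be sketched as a single weighted objective. This summary does not give the exact formula, so the form below (the semantic-relevancy score scaling the L1 term, with fixed weights `beta_rec` and `beta_adv`) is an assumption, and all names are hypothetical.

```python
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equally long sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(
    pred: Sequence[float],
    target: Sequence[float],
    adv_loss: float,
    semantic_score: float,
    beta_rec: float = 1.0,
    beta_adv: float = 1.0,
) -> float:
    """Semantic-weighted reconstruction loss plus adversarial loss.

    Assumed form: semantic_score * beta_rec * L1 + beta_adv * adv_loss.
    """
    return semantic_score * beta_rec * l1_loss(pred, target) + beta_adv * adv_loss
```

Under this form, clips annotated as more semantically relevant contribute more strongly to the reconstruction term.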

Metric for semantic relevancy

  • Proposed a metric called Semantic-Relevant Gesture Recall (SRGR) to evaluate the semantic relevancy of gestures.
  • SRGR uses semantic scores as a weight for the Probability of Correct Keypoint (PCK) between the generated and ground truth gestures.
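
A minimal sketch of the idea, assuming per-frame semantic scores and a fixed distance threshold: each frame's PCK (the fraction of joints within the threshold of the ground truth) is multiplied by that frame's semantic score and averaged over frames. The normalization, the threshold value, and all names are assumptions rather than the paper's exact definition.

```python
import math
from typing import Sequence

Joint = Sequence[float]   # joint position, e.g. (x, y, z)
Frame = Sequence[Joint]   # all joints in one frame


def srgr(
    pred: Sequence[Frame],
    target: Sequence[Frame],
    semantic_scores: Sequence[float],
    threshold: float,
) -> float:
    """Semantic-score-weighted PCK, averaged over frames."""
    if not (len(pred) == len(target) == len(semantic_scores)) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and have one entry per frame")
    total = 0.0
    for frame_pred, frame_gt, score in zip(pred, target, semantic_scores):
        # A joint counts as correct when it lies within `threshold` of the ground truth.
        correct = sum(
            1 for p, g in zip(frame_pred, frame_gt) if math.dist(p, g) < threshold
        )
        total += score * correct / len(frame_gt)
    return total / len(pred)
```

Frames with a higher semantic score therefore count more toward the final value than frames with little semantic content.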

Experiments

  • Evaluated validity of SRGR metric
  • Demonstrated data quality of dataset through subjective experiments
  • Demonstrated validity of baseline model through subjective and objective experiments
  • Discussed contribution of each modality through ablation experiments
  • Conducted user study to evaluate validity of SRGR
  • Found large variance in L1 diversity
  • Humans evaluate diversity on range of motion and other implicit features
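
For reference, one common way to compute an L1 diversity score is the mean pairwise L1 distance between generated clips. The sketch below follows that reading; this summary does not give the definition used in the paper, so the formula and names are assumptions.

```python
from itertools import combinations
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two flattened clips of equal length."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("clips must be non-empty and equally long")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def l1_diversity(clips: Sequence[Sequence[float]]) -> float:
    """Average L1 distance over all unordered pairs of clips."""
    if len(clips) < 2:
        raise ValueError("need at least two clips")
    pairs = list(combinations(clips, 2))
    return sum(l1_distance(a, b) for a, b in pairs) / len(pairs)
```

A score of this kind depends only on numeric spread between clips, which is consistent with the observation that human judgments of diversity also rely on other, implicit features.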

Data quality

  • Comparing proposed dataset with Trinity and S2G-3D
  • Trinity dataset has 23 sequences, 10 minutes each
  • Data split into 19:2:2 for train/valid/test
  • Used S2G and A2G to cover GAN and VAE models
  • 120 participants compared clips from Trinity and proposed dataset
  • Evaluated correctness, diversity, and synchrony
  • Results show proposed dataset received higher user preference
  • Especially for hand movements, outperformed Trinity by large margin
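
The 19:2:2 split of the 23 Trinity sequences can be sketched as follows. Whether the original split was random or fixed is not stated here, so the shuffle, the seed, and the sequence names are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_sequences(
    names: Sequence[str],
    counts: Tuple[int, int, int] = (19, 2, 2),
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Split sequence names into train/valid/test lists of the given sizes."""
    if sum(counts) != len(names):
        raise ValueError("counts must sum to the number of sequences")
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    n_train, n_valid, _ = counts
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Splitting at the sequence level keeps frames from one recording out of more than one subset.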

Evaluation of the baseline model

  • Training environment: NVIDIA V100
  • Evaluation metrics: FGD, SRGR, BeatAlign
  • Optimizer: Adam
  • Learning rate: 2e-4
  • Pretrained network: LSTM-based autoencoder
  • Results compared to: S2G, A2G, Seq2Seq, MultiContext

Ablation study

  • Cascaded connection can achieve better performance than end-to-end approach
  • Removing audio reduces synchrony, but some synchrony remains
  • Removing weighted semantic loss improves synchrony
  • Emotion is related to synchrony, whereas speaker ID has little effect
  • Removing audio, emotion, and facial expression does not significantly affect SRGR
  • Data from each modality contributes to improving FGD
  • Combining audio and facial expressions improves FGD significantly
  • Removing emotion and speaker ID also impacts FGD scores
  • Classifier trained and tested on speaker-4’s ground truth data

Limitation

  • Impact of acting is inevitable and controlled.
  • Data was filtered out due to inconsistencies in style.
  • SRGR is calculated based on semantic annotation.

Conclusion

  • Built a large-scale, high-quality, multi-modal dataset with semantic and emotion annotations
  • Proposed a cascade-based baseline model for gesture synthesis based on six modalities
  • Achieved SoTA performance
  • Introduced SRGR for evaluating semantic relevancy
  • Annotation interface adapted from VGG Image Annotator
  • Inter-rater agreement rate of 96% for emotion annotations
  • Distribution of vowels and consonants consistent with 3000 words
  • 10 topics for debate and introduction
  • Data collected from speakers of various countries, genders, ages and ethnicities
  • Modelled style differences with explicit controls
  • Filtered out 21 hours of data and six speakers due to inconsistencies in their styles
  • 30 effective speakers with 76 hours of recordings
  • 34 and 26 hours of recordings from native and fluent English speakers
  • Facial expressions represented with FACS-based blendshapes
  • Motion retargeting on body bone animation
  • Fréchet Gesture Distance (FGD) used to evaluate gesture realism and BeatAlign used to evaluate audio-gesture synchrony
  • 6% higher score than GT for random sample of 300 gesture clips
  • 83% average precision for 100 gesture clips
  • Single directional evaluation has higher precision than bi-directional and non-exponential
  • Released data file format includes motion capture, audio, facial blendshape weights, facial mesh, text-audio alignment, semantics and emotion annotations
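
The single-directional, exponential beat alignment mentioned in the list above can be sketched as follows: for each gesture beat, take the distance to the nearest audio beat and pass it through a Gaussian-style kernel. The direction of the comparison, the value of `sigma`, and the function name are assumptions, and beat detection itself is outside this sketch.

```python
import math
from typing import Sequence


def beat_align(
    gesture_beats: Sequence[float],
    audio_beats: Sequence[float],
    sigma: float = 0.1,
) -> float:
    """Mean of exp(-d^2 / (2 * sigma^2)) over gesture beats.

    d is the time in seconds from a gesture beat to its nearest audio beat.
    """
    if len(gesture_beats) == 0 or len(audio_beats) == 0:
        raise ValueError("both beat lists must be non-empty")
    total = 0.0
    for g in gesture_beats:
        nearest = min(abs(g - a) for a in audio_beats)
        total += math.exp(-(nearest ** 2) / (2 * sigma ** 2))
    return total / len(gesture_beats)
```

The score is 1.0 when every gesture beat coincides with an audio beat and approaches 0 as the beats drift apart.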