Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Building a dataset, BEAT, with 76 hours of multi-modal data from 30 speakers
- 32 million frame-level emotion and semantic relevance annotations
- Correlation of conversational gestures with facial expressions, emotions, and semantics
- Proposing a baseline model, Cascaded Motion Network (CaMN)
- Introducing a metric, Semantic Relevance Gesture Recall (SRGR)
- BEAT is the largest motion capture dataset for investigating human gestures
Paper Content
Related work
- Mo-cap and pseudo-label conversational gesture datasets exist
- Most common mo-cap dataset is 4-hour Trinity dataset
- Datasets for talking-face generation exist, but cannot be used for gesture synthesis
- Semantic or emotion-aware motion synthesis studied in action recognition and sign-language analysis/synthesis
- Baseline models for conversational gesture synthesis exist
- Efforts to improve performance of baseline models by input/output representation selection, adversarial training, and generative modeling techniques
- Probabilistic gesture generation enables generating diversity based on noise
Beat: body-expression-audio-text dataset
- Dataset acquisition process described
- Text, emotion, and semantic relevance information annotation introduced
- Correlation between conversational gestures and emotions analyzed using BEAT
- Distribution of semantic relevance shown
- Motion capture system based on 16 synchronized cameras recording motion at 120 Hz
- Facial capture system uses ARKit with a depth camera on iPhone 12 Pro
- Audio recorded in 48KHz stereo
Data acquisition
- BEAT is divided into conversation and self-talk sessions
- Speaker’s gestures are divided into four categories
- Topics are selected from 20 predefined topics
- Self-talk sessions consist of 120 1-minute recordings
- 8 emotions are covered in the dataset
- Proportion of languages and accents is strictly controlled
- Mainly English data, with some Chinese, Spanish and Japanese
- 30 speakers from different ethnicities
- Speakers asked to read answers proficiently and show natural, personal, daily style of conversational gestures
- Professional speaker instructs them to elicit corresponding emotion correctly
Data annotation
- Used an Automatic Speech Recognizer (ASR) to obtain initial text for conversation session
- Used Montreal Forced Aligner (MFA) for temporal alignment of text with audio
- Confirmed 8-class emotion label of self-talk
- Annotators watched video with audio and gestures to perform frame-level annotation
- 600 annotators from Amazon Mechanical Turk (AMT) scored semantic relevance on a scale of 0-10
Data analysis
- BEAT collection and annotation enables analysis of correlations between conversational gestures and other modalities.
- Facial expressions and emotions are strongly correlated.
- Visualization of gestures in T-SNE shows different characteristics in different emotions.
- Large randomness for semantic relevance between gestures and texts.
Multi-modal conditioned gestures synthesis baseline
- Proposed baseline Cascaded Motion Network (CaMN) encodes text, emotion condition, speaker identity, audio and facial blendshape weights to synthesize body and hands gestures
- Text, audio and speaker ID encoders network selection referred to [53] and customized for better performance
- All input data have same time resolution as output gestures
- Gesture and facial blendshape weights downsampled to 15 FPS
- Text encoder converts words to word embedding set and fine-tuned by customized encoder
- Audio encoder adopts raw wave representation of audio and downsampled to 16KHZ
- Facial expression encoder takes initial representation of facial expression and extracts facial latent feature
- Body and hands decoders implemented in a separated, cascaded structure
- Final supervision based on gesture reconstruction and adversarial loss
- Loss function adjusted with weights of L1 loss and adversarial loss using semantic-relevancy label
Metric for semantic relevancy
- Proposed a metric called Semantic-Relevant Gesture Recall (SRGR) to evaluate the semantic relevancy of gestures.
- SRGR uses semantic scores as a weight for the Probability of Correct Keypoint (PCK) between the generated and ground truth gestures.
Experiments
- Evaluated validity of SRGR metric
- Demonstrated data quality of dataset through subjective experiments
- Demonstrated validity of baseline model through subjective and objective experiments
- Discussed contribution of each modality through ablation experiments
- Conducted user study to evaluate validity of SRGR
- Found large variance in L1 diversity
- Humans evaluate diversity on range of motion and other implicit features
Data quality
- Comparing proposed dataset with Trinity and S2G-3D
- Trinity dataset has 23 sequences, 10 minutes each
- Data split into 19:2:2 for train/valid/test
- Used S2G and A2G to cover GAN and VAE models
- 120 participants compared clips from Trinity and proposed dataset
- Evaluated correctness, diversity, and synchrony
- Results show proposed dataset received higher user preference
- Especially for hand movements, outperformed Trinity by large margin
Evaluation of the baseline model
- Training environment: NVIDIA V100
- Evaluation metrics: FGD, SRGR, Beat-Align
- Optimizer: Adam
- Learning rate: 2e-4
- Pretrained network: LSTM-based autoencoder
- Results compared to: S2G, A2G, Seq2Seq, MultiContext
Ablation study.
- Cascaded connection can achieve better performance than end-to-end approach
- Removing audio reduces synchrony, but some synchrony remains
- Removing weighted semantic loss improves synchrony
- Relationship between emotion and synchrony, but little effect from speaker ID
- Removing audio, emotion, and facial expression does not significantly affect semantic relevant gesture recall
- Data from each modality contributes to improving FGD
- Unities of audio and facial expressions improve FGD significantly
- Removing emotion and speaker ID also impacts FGD scores
- Classifier trained and tested on speaker-4’s ground truth data
Limitation
- Impact of acting is inevitable and controlled.
- Data was filtered out due to inconsistencies in style.
- SRGR is calculated based on semantic annotation.
Conclusion
- Built a large-scale, high-quality, multi-modal, semantic and emotional annotated dataset
- Proposed a cascade-based baseline model for gesture synthesis based on six modalities
- Achieved SoTA performance
- Introduced SRGR for evaluating semantic relevancy
- Annotation interface adapted from VGG Image Annotator
- Inter-rater agreement rate of 96% for emotion annotations
- Distribution of vowels and consonants consistent with 3000 words
- 10 topics for debate and introduction
- Data collected from speakers of various countries, gender, ages and ethnicity
- Modelled style differences with explicit controls
- Filtered out 21 hours of data and six speakers due to inconsistencies in their styles
- 30 effective speakers with 76 hours of recordings
- 34 and 26 hours of recordings from native and fluent English speakers
- Facial expressions represented with FACs based blendshapes
- Motion retargeting on body bone animation
- Frechet Gesture Distance and BeatAlign used to evaluate audio-gesture synchrony
- 6% higher score than GT for random sample of 300 gestures clips
- 83% average precision for 100 gestures clips
- Single directional evaluation has higher precision than bi-directional and non-exponential
- Released data file format includes motion capture, audio, facial blendshape weights, facial mesh, text-audio alignment, semantics and emotion annotations