Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Automatic dubbing is the task of translating original speech in a video into a target language.
- The target language speech should match the timing of the original video, including mouth movements, pauses, and hand gestures.
- This paper proposes a model that optimizes both the translation and the speech duration of the generated translations.
- The system generates speech that better matches the timing of the original speech compared to prior work.
Paper Content
Introduction
- Automatic Dubbing (AD) is a form of speech-to-speech translation
- Dubs should have a high isochrony, meaning speech should match the movement of the speaker’s mouth
- Little dubbing data is available due to its commercial nature
- Research has focused on ways to produce isometric translations without relying on dubbing data
- Two approaches used to achieve high isochrony: generating text translations and modifying generated speech
- Jointly optimize translation and timing of translated speech
- New dubbing test set released with CC-BY-4.0 license
Related work
- Automatic dubbing is based on a pipeline of ASR, MT, PA and TTS systems
- Previous works have suggested training an MT system to generate translations with the same length as the source transcript
- Other works have applied re-scoring of n-best MT outputs based on syllables or length compliance
- Recent work did not find a strong correlation between isometry and isochrony
- Neural Dubber uses a multi-modal setup to constrain voice generation
- Prosodic alignment with relaxation has been used to fix naturalness issues
- A single but shared duration model between PA and TTS has been proposed
Method
- Propose to jointly model translation and timing of translated speech
- Main challenge is lack of obvious training data
- Human generated dubbing data is scarce and not in public domain
Training data
- Propose to derive training features from speech translation data
- Use Montreal Forced Aligner to compute alignment between speech and text
- Output includes source text, target phonemes, target durations, and locations of pauses in the target
- Train MT model to take source text and desired speech durations as input, and predict phonemes and their durations in the target
Binning of speech durations
- Speech durations are passed into the model using tokens.
- Speech durations are grouped into bins with equal number of samples.
- 100 bins are used.
Model architecture
- Train a standard encoder-decoder Transformer model with source text and desired speech durations as bins
- Target sequence is phonemes and their durations in an interleaved way
- Cross-entropy loss used in training, with phonemes and durations weighted equally
Speech synthesis
- Used FastSpeech 2 TTS model to generate speech
- Provided phonemes and phoneme durations as input to the model, overriding native duration prediction model
Noise to address training/inference mismatch
- Training uses force alignment to determine phonemes and their durations
- Inference uses Voice Activity Detector (VAD) to get speech segments and their durations
- To address mismatch between training and inference, noise is added to source durations
New dubbing test set
- Dataset created to facilitate dubbing research
- Dataset created from En→De test set of COVOST-2
- Two subsets created: test91 and test101
- Volunteers instructed to read sentences and pause when indicated
Experiments
Datasets and processing
- Used English-German portion of COVOST-2 training set
- 12.5% of training dataset has pauses
- 98.6% of samples in training dataset consist of single sentence
- Evaluated approach and baselines on COVOST-2 test set and two dubbing test sets
- Used BPE to split source and target side of training data
- Used Fast-Speech 2 for speech generation
Model and training configuration
- Transformer model used with 6 encoder and decoder layers, self-attention dimension 512 and 8 heads, and feed-forward sublayers of dimension 2048
- Model optimized with Adam, initial learning rate of 5 x 10-4, dropout set to 0.3
- Best model from each configuration selected based on lowest validation BLEU
- Model trained for 200 epochs, best checkpoint used for evaluation
- Beam size of 5 used for decoding, BLEU scores evaluated using SacreBLEU
- Fairseq library used for all experiments
- FastSpeech 2 used to generate speech, pretrained TTS English model available
Baselines & models
- Standard NMT model (StdMT) trained to translate text into text
- Isometric NMT model (IsoMT) trained to generate translations that match the input length in terms of number of characters
- Third baseline (Txt2Phn) trained with source text (without durations) as input, and target phonemes and phoneme durations as output
- Proposed approach (Txtd2PhnD) with source text and durations (obtained from the target speech) as input, and target phonemes and phoneme durations as output
Evaluation metrics
- Evaluation of phonemes instead of words is challenging
- Mapping phoneme sequences back to words was chosen to avoid penalties
- A seq2seq “MT” system was trained using English COVOST-2 training data
- Translation quality and speech overlap were evaluated
- A trade-off between synchronicity and translation quality was noticed
- A metric was used to measure synchronicity between source and dubbed speech
Results
Automatic evaluation results
- Translation quality and speech overlap of baseline models (StdMT, IsoMT, and Txt2Phn) is low (≈0.5)
- Proposed models with access to phrase level source speech durations produce significantly more synchronized target speech with minimal trade off on translation quality
- Adding Gaussian noise to source speech durations increases translation quality but decreases speech overlap
- Our model provides translations with higher overlap but lower BLEU scores
- Isometric MT is no more isometric than standard MT when both are fed into the same TTS model
Conclusion
- A novel method is proposed to optimize a machine translation model for translation quality and speech duration overlap.
- The model generates translations at the phoneme level, resulting in a simpler automatic dubbing pipeline.
- Empirical studies show a 55% relative gain in speech overlap with a trade-off of 2.7% and 9% in translation quality.
- A dubbing test set is released for use by researchers in the dubbing community.