Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Automatic dubbing is the task of translating original speech in a video into a target language.
The target language speech should match the timing of the original video, including mouth movements, pauses, and hand gestures.
This paper proposes a model that optimizes both the translation and the speech duration of the generated translations.
The system generates speech that better matches the timing of the original speech compared to prior work.

Paper Content

Introduction

Automatic Dubbing (AD) is a form of speech-to-speech translation
Dubs should have a high isochrony, meaning speech should match the movement of the speaker’s mouth
Little dubbing data is available due to its commercial nature
Research has focused on ways to produce isometric translations without relying on dubbing data
Two approaches used to achieve high isochrony: generating text translations and modifying generated speech
Jointly optimize translation and timing of translated speech
New dubbing test set released with CC-BY-4.0 license

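Related work

Automatic dubbing is based on a pipeline of automatic speech recognition (ASR), machine translation (MT), prosodic alignment (PA), and text-to-speech (TTS) systems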
Previous works have suggested training an MT system to generate translations with the same length as the source transcript
Other works have applied re-scoring of n-best MT outputs based on syllables or length compliance
Recent work did not find a strong correlation between isometry and isochrony
Neural Dubber uses a multi-modal setup to constrain voice generation
Prosodic alignment with relaxation has been used to fix naturalness issues
A single but shared duration model between PA and TTS has been proposed

Method

Propose to jointly model translation and timing of translated speech
Main challenge is lack of obvious training data
Human generated dubbing data is scarce and not in public domain

Training data

Propose to derive training features from speech translation data
Use Montreal Forced Aligner to compute alignment between speech and text
Output includes source text, target phonemes, target durations, and locations of pauses in the target
Train MT model to take source text and desired speech durations as input, and predict phonemes and their durations in the target
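
As a rough illustration of the feature extraction described above, the sketch below turns forced-alignment output into phonemes, durations, and pause positions. The `(phoneme, start, end)` input format, the function name `alignment_to_targets`, and the `min_pause` threshold are assumptions made for this example; they are not the Montreal Forced Aligner API or values from the paper.

```python
def alignment_to_targets(intervals, min_pause=0.3):
    """Turn (phoneme, start, end) triples into phonemes, durations, and pause positions.

    min_pause is an assumed threshold in seconds; the source does not give one.
    """
    phonemes, durations, pause_after = [], [], []
    prev_end = None
    for phoneme, start, end in intervals:
        # A silent gap of at least min_pause before this phoneme counts as a pause.
        if prev_end is not None and start - prev_end >= min_pause:
            pause_after.append(len(phonemes) - 1)  # index of the phoneme before the pause
        phonemes.append(phoneme)
        durations.append(round(end - start, 3))
        prev_end = end
    return phonemes, durations, pause_after


# Toy alignment with a 0.5 s gap between the second and third phoneme.
intervals = [("h", 0.00, 0.08), ("a", 0.08, 0.20), ("l", 0.70, 0.78), ("o", 0.78, 0.95)]
phonemes, durations, pause_after = alignment_to_targets(intervals)
```

Here `pause_after` lists the indices of phonemes that are followed by a pause, which is one simple way to record pause locations alongside the phoneme sequence.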

Binning of speech durations

Speech durations are passed into the model using tokens.
Speech durations are grouped into bins with an equal number of samples (equal-frequency binning).
100 bins are used.
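
Equal-frequency binning can be sketched as follows. This is a minimal illustration using only the standard library; the helper names and the `<binK>` token format are hypothetical, and the paper's exact tokenization may differ.

```python
from bisect import bisect_right


def equal_frequency_edges(durations, num_bins=100):
    """Return num_bins - 1 cut points so each bin holds roughly the same number of samples."""
    ordered = sorted(durations)
    n = len(ordered)
    # Cut point k sits at the k/num_bins quantile of the sorted data.
    return [ordered[(k * n) // num_bins] for k in range(1, num_bins)]


def duration_to_token(duration, edges):
    """Map a duration in seconds to a hypothetical bin token such as '<bin17>'."""
    return f"<bin{bisect_right(edges, duration)}>"


# Toy data: 1000 durations between 0.01 s and 10.00 s.
durations = [i / 100 for i in range(1, 1001)]
edges = equal_frequency_edges(durations, num_bins=100)
```

Because the cut points are quantiles rather than evenly spaced values, short durations (which are typically far more frequent) get finer resolution than long ones.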

Model architecture

Train a standard encoder-decoder Transformer model with source text and desired speech durations as bins
Target sequence is phonemes and their durations in an interleaved way
Cross-entropy loss used in training, with phonemes and durations weighted equally
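
The interleaved target format can be pictured with a small sketch. The `<durN>` token format and the use of integer duration units are assumptions for illustration only; the source does not specify how durations are written in the target sequence.

```python
def interleave(phonemes, durations):
    """Build a target sequence: phoneme, duration, phoneme, duration, ..."""
    if len(phonemes) != len(durations):
        raise ValueError("each phoneme needs exactly one duration")
    target = []
    for phoneme, duration in zip(phonemes, durations):
        target.append(phoneme)
        target.append(f"<dur{duration}>")  # hypothetical duration token format
    return target


def deinterleave(target):
    """Split an interleaved sequence back into phonemes and integer durations."""
    phonemes = target[0::2]
    durations = [int(tok[len("<dur"):-1]) for tok in target[1::2]]
    return phonemes, durations


target = interleave(["h", "a", "l", "o"], [8, 12, 8, 17])
```

Since phoneme and duration tokens share one output vocabulary in this layout, a single cross-entropy loss over the sequence treats both token types the same way, which matches the equal weighting noted above.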

Speech synthesis

Used FastSpeech 2 TTS model to generate speech
Provided phonemes and phoneme durations as input to the model, overriding native duration prediction model

Noise to address training/inference mismatch

Training uses forced alignment to determine phonemes and their durations
Inference uses Voice Activity Detector (VAD) to get speech segments and their durations
To address mismatch between training and inference, noise is added to source durations
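
A minimal sketch of duration noising is shown below, assuming zero-mean Gaussian noise scaled by each duration. The function name, the relative standard deviation `rel_std`, and the `min_duration` floor are illustrative assumptions, not values from the paper.

```python
import random


def add_duration_noise(durations, rel_std=0.1, min_duration=0.05, rng=None):
    """Perturb each source duration with zero-mean Gaussian noise.

    rel_std and min_duration are assumed values chosen for this example.
    """
    if rng is None:
        rng = random.Random()
    noisy = []
    for d in durations:
        jittered = d + rng.gauss(0.0, rel_std * d)
        # Clamp so a large negative draw cannot yield a non-positive duration.
        noisy.append(max(min_duration, jittered))
    return noisy


# A seeded generator makes the perturbation reproducible.
noisy = add_duration_noise([1.2, 0.8, 2.5], rng=random.Random(0))
```

The idea is that a model trained on slightly perturbed durations should be less sensitive to the differences between forced-alignment durations seen in training and VAD segment durations seen at inference.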

New dubbing test set

Dataset created to facilitate dubbing research
Dataset created from En→De test set of COVOST-2
Two subsets created: test91 and test101
Volunteers instructed to read sentences and pause when indicated

Experiments

Datasets and processing

Used English-German portion of COVOST-2 training set
12.5% of training dataset has pauses
98.6% of samples in training dataset consist of a single sentence
Evaluated approach and baselines on COVOST-2 test set and two dubbing test sets
Used BPE to split source and target side of training data
Used FastSpeech 2 for speech generation

Model and training configuration

Transformer model used with 6 encoder and decoder layers, self-attention dimension 512 and 8 heads, and feed-forward sublayers of dimension 2048
Model optimized with Adam, initial learning rate of 5 × 10^-4, dropout set to 0.3
Best model from each configuration selected based on highest validation BLEU
Model trained for 200 epochs, best checkpoint used for evaluation
Beam size of 5 used for decoding, BLEU scores evaluated using SacreBLEU
Fairseq library used for all experiments
FastSpeech 2 used to generate speech, pretrained TTS English model available
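
For reference, the hyperparameters listed above can be collected in one place. This is only a plain-Python summary with hypothetical field names; it is not a Fairseq configuration object and does not reproduce Fairseq's option names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    # Values reported in the text above; field names are hypothetical.
    encoder_layers: int = 6
    decoder_layers: int = 6
    embed_dim: int = 512
    attention_heads: int = 8
    ffn_dim: int = 2048
    dropout: float = 0.3
    learning_rate: float = 5e-4
    max_epochs: int = 200
    beam_size: int = 5

    @property
    def head_dim(self) -> int:
        """Per-head dimension; embed_dim must divide evenly across heads."""
        if self.embed_dim % self.attention_heads != 0:
            raise ValueError("embed_dim must be divisible by attention_heads")
        return self.embed_dim // self.attention_heads


config = TransformerConfig()
```

With 8 heads over a 512-dimensional model, each attention head works on a 64-dimensional slice.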

Baselines & models

Standard NMT model (StdMT) trained to translate text into text
Isometric NMT model (IsoMT) trained to generate translations that match the input length in terms of number of characters
Third baseline (Txt2Phn) trained with source text (without durations) as input, and target phonemes and phoneme durations as output
Proposed approach (Txtd2PhnD) with source text and durations (obtained from the target speech) as input, and target phonemes and phoneme durations as output

Evaluation metrics

Evaluation of phonemes instead of words is challenging
Mapping phoneme sequences back to words was chosen to avoid penalties
A seq2seq “MT” system was trained using English COVOST-2 training data
Translation quality and speech overlap were evaluated
A trade-off between synchronicity and translation quality was noticed
A metric was used to measure synchronicity between source and dubbed speech
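
The summary above does not define the synchronicity metric. One plausible formulation, shown here purely as an assumption, is the ratio of time where source and dubbed speech are both active to the time where either is active (intersection over union of speech segments). The function names and the segment format are hypothetical.

```python
def total_length(segments):
    """Sum of segment lengths for a list of (start, end) pairs in seconds."""
    return sum(end - start for start, end in segments)


def intersection_length(a, b):
    """Total time covered by both segment lists (each sorted and non-overlapping)."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if end > start:
            total += end - start
        # Advance whichever segment finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def speech_overlap(source, dubbed):
    """Assumed overlap score in [0, 1]: shared speech time over combined speech time."""
    inter = intersection_length(source, dubbed)
    union = total_length(source) + total_length(dubbed) - inter
    return inter / union if union > 0 else 1.0


source = [(0.0, 2.0), (3.0, 4.0)]
dubbed = [(0.0, 1.0), (3.0, 4.0)]
score = speech_overlap(source, dubbed)
```

Under this assumed definition, a score of 1 means the dubbed speech is active exactly when the source speech is, and a score of 0 means the two never coincide.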

Results

Automatic evaluation results

Speech overlap of the baseline models (StdMT, IsoMT, and Txt2Phn) is low (≈0.5)
Proposed models with access to phrase level source speech durations produce significantly more synchronized target speech with minimal trade off on translation quality
Adding Gaussian noise to source speech durations increases translation quality but decreases speech overlap
Our model provides translations with higher overlap but lower BLEU scores
Isometric MT is no more isochronous than standard MT when both are fed into the same TTS model

Conclusion

A novel method is proposed to optimize a machine translation model for translation quality and speech duration overlap.
The model generates translations at the phoneme level, resulting in a simpler automatic dubbing pipeline.
Empirical studies show a 55% relative gain in speech overlap with a trade-off of 2.7% and 9% in translation quality.
A dubbing test set is released for use by researchers in the dubbing community.
