Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Automatic dubbing is the task of translating original speech in a video into a target language.
  • The target language speech should match the timing of the original video, including mouth movements, pauses, and hand gestures.
  • This paper proposes a model that optimizes both the translation and the speech duration of the generated translations.
  • The system generates speech that better matches the timing of the original speech compared to prior work.

Paper Content

Introduction

  • Automatic Dubbing (AD) is a form of speech-to-speech translation
  • Dubs should have a high isochrony, meaning speech should match the movement of the speaker’s mouth
  • Little dubbing data is available due to its commercial nature
  • Research has focused on ways to produce isometric translations without relying on dubbing data
  • Two approaches used to achieve high isochrony: generating text translations and modifying generated speech
  • Jointly optimize translation and timing of translated speech
  • New dubbing test set released with CC-BY-4.0 license
  • Automatic dubbing is based on a pipeline of ASR, MT, PA and TTS systems
  • Previous works have suggested training an MT system to generate translations with the same length as the source transcript
  • Other works have applied re-scoring of n-best MT outputs based on syllables or length compliance
  • Recent work did not find a strong correlation between isometry and isochrony
  • Neural Dubber uses a multi-modal setup to constrain voice generation
  • Prosodic alignment with relaxation has been used to fix naturalness issues
  • A single but shared duration model between PA and TTS has been proposed

Method

  • Propose to jointly model translation and timing of translated speech
  • Main challenge is lack of obvious training data
  • Human generated dubbing data is scarce and not in public domain

Training data

  • Propose to derive training features from speech translation data
  • Use Montreal Forced Aligner to compute alignment between speech and text
  • Output includes source text, target phonemes, target durations, and locations of pauses in the target
  • Train MT model to take source text and desired speech durations as input, and predict phonemes and their durations in the target

Binning of speech durations

  • Speech durations are passed into the model using tokens.
  • Speech durations are grouped into bins with equal number of samples.
  • 100 bins are used.

Model architecture

  • Train a standard encoder-decoder Transformer model with source text and desired speech durations as bins
  • Target sequence is phonemes and their durations in an interleaved way
  • Cross-entropy loss used in training, with phonemes and durations weighted equally

Speech synthesis

  • Used FastSpeech 2 TTS model to generate speech
  • Provided phonemes and phoneme durations as input to the model, overriding native duration prediction model

Noise to address training/inference mismatch

  • Training uses force alignment to determine phonemes and their durations
  • Inference uses Voice Activity Detector (VAD) to get speech segments and their durations
  • To address mismatch between training and inference, noise is added to source durations

New dubbing test set

  • Dataset created to facilitate dubbing research
  • Dataset created from En→De test set of COVOST-2
  • Two subsets created: test91 and test101
  • Volunteers instructed to read sentences and pause when indicated

Experiments

Datasets and processing

  • Used English-German portion of COVOST-2 training set
  • 12.5% of training dataset has pauses
  • 98.6% of samples in training dataset consist of single sentence
  • Evaluated approach and baselines on COVOST-2 test set and two dubbing test sets
  • Used BPE to split source and target side of training data
  • Used Fast-Speech 2 for speech generation

Model and training configuration

  • Transformer model used with 6 encoder and decoder layers, self-attention dimension 512 and 8 heads, and feed-forward sublayers of dimension 2048
  • Model optimized with Adam, initial learning rate of 5 x 10-4, dropout set to 0.3
  • Best model from each configuration selected based on lowest validation BLEU
  • Model trained for 200 epochs, best checkpoint used for evaluation
  • Beam size of 5 used for decoding, BLEU scores evaluated using SacreBLEU
  • Fairseq library used for all experiments
  • FastSpeech 2 used to generate speech, pretrained TTS English model available

Baselines & models

  • Standard NMT model (StdMT) trained to translate text into text
  • Isometric NMT model (IsoMT) trained to generate translations that match the input length in terms of number of characters
  • Third baseline (Txt2Phn) trained with source text (without durations) as input, and target phonemes and phoneme durations as output
  • Proposed approach (Txtd2PhnD) with source text and durations (obtained from the target speech) as input, and target phonemes and phoneme durations as output

Evaluation metrics

  • Evaluation of phonemes instead of words is challenging
  • Mapping phoneme sequences back to words was chosen to avoid penalties
  • A seq2seq “MT” system was trained using English COVOST-2 training data
  • Translation quality and speech overlap were evaluated
  • A trade-off between synchronicity and translation quality was noticed
  • A metric was used to measure synchronicity between source and dubbed speech

Results

Automatic evaluation results

  • Translation quality and speech overlap of baseline models (StdMT, IsoMT, and Txt2Phn) is low (≈0.5)
  • Proposed models with access to phrase level source speech durations produce significantly more synchronized target speech with minimal trade off on translation quality
  • Adding Gaussian noise to source speech durations increases translation quality but decreases speech overlap
  • Our model provides translations with higher overlap but lower BLEU scores
  • Isometric MT is no more isometric than standard MT when both are fed into the same TTS model

Conclusion

  • A novel method is proposed to optimize a machine translation model for translation quality and speech duration overlap.
  • The model generates translations at the phoneme level, resulting in a simpler automatic dubbing pipeline.
  • Empirical studies show a 55% relative gain in speech overlap with a trade-off of 2.7% and 9% in translation quality.
  • A dubbing test set is released for use by researchers in the dubbing community.