Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Investigated how humans dub video content from one language to another
  • Leveraged a novel corpus of 319.57 hours of video from 54 professionally produced titles
  • Challenged assumptions made in qualitative and machine-learning literature on dubbing
  • Argued for the importance of vocal naturalness and translation quality over isometric and lip-sync constraints
  • Found that source-side audio influences human dubs beyond the words of the translation

Paper Content

Introduction

  • Considerable attention has been paid to the dubbing of video content from one language to another
  • Human dubbing has been studied from a qualitative perspective
  • Machine-learning practitioners have taken up the task of building multimodal systems for automatic dubbing
  • Human dubbing involves a sequence of contributors with control over different aspects of the process
  • A data-driven examination of the way humans actually perform this task is missing
  • Human dubbing is a “constrained translation”
  • Questions about isochrony, isometry, speech tempo, lip sync, translation quality, and source influence are explored
  • Insights are provided on research directions to address weaknesses in current automatic dubbing approaches

Qualitative

  • Dubbing is a type of constrained translation
  • Dubs need to match the original video track
  • Dubs need to maintain isochronic, phonetic, and kinesic synchrony
  • Dubs need to be intelligible to the target language and culture
  • Dubs should sound natural
  • Dubs should preserve the semantic meaning of the source
  • Dubbing is a form of non-literal translation called “transcreation”
  • Scholars have investigated the role of power, ideology, identity, and similar considerations in dubbing

Automatic dubbing

  • Automatic dub generation has been explored with a variety of constraints
  • Lip sync constraints have been integrated into dub generation
  • Adjusting mouth movements in the original video to match a dubbed audio track has been explored
  • Isometric machine translation has been used to produce a translation with similar length to the input
  • Controlling speaking rate in automatic dubbing systems to achieve prosodic alignment has been studied
  • Time-boundary relaxation has been used to control speaking rate and speech fluency
  • Integrating pause constraints directly into MT has been examined
  • End-to-end dubbing has been explored

Empirical studies

  • Studies have attempted to examine human dubbing through a quantitative lens
  • Di Giovanni and Romero-Fresco found that audiences may not be as sensitive to lip sync as traditionally believed
  • Karakanta et al. concluded that on-screen human dubs have lower translation quality than off-screen dubs

Corpus description & preprocessing

  • Dataset consists of 674 episodes of 54 shows
  • Dataset contains 319.57 hours of content from 9,215 distinct speakers
  • Subsets of data used for analysis are Drama, Kids, Comedy, and Suspense
  • Dataset includes audio and video for English originals and audio tracks for Spanish and German dubs
  • 35.68 hours of content with both Spanish and German dubs
  • Quality filtering performed prior to analysis

Segmentation and forced alignment

  • Script timecodes are used to segment audio tracks
  • Each dialogue line is associated with the audio between its start time and the start time of the next line
  • Lines are roughly the same as speaker turns
  • 234,322 dialogue lines for English, 29,210 for German, and 28,720 for Spanish
  • Montreal Forced Aligner is used to force align each dialogue line with its corresponding audio
  • 87.37% of English lines, 89.35% of German lines, and 80.81% of Spanish lines successfully aligned
  • Speaker fundamental frequency and energy are extracted and averaged on a per-phone basis
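
The timecode-based segmentation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `DialogueLine` class, the `segment_spans` function, and the `track_end` parameter are hypothetical names, and timecodes are assumed to be already converted to seconds.

```python
from dataclasses import dataclass


@dataclass
class DialogueLine:
    text: str
    start: float  # script timecode in seconds (assumed unit)


def segment_spans(lines, track_end):
    """Return (text, start, end) spans, one per dialogue line.

    Each span runs from the line's start time to the start time of the
    next line; the last line runs to the end of the audio track.
    """
    ordered = sorted(lines, key=lambda line: line.start)
    spans = []
    for i, line in enumerate(ordered):
        end = ordered[i + 1].start if i + 1 < len(ordered) else track_end
        spans.append((line.text, line.start, end))
    return spans


lines = [
    DialogueLine("Hello.", 1.0),
    DialogueLine("How are you?", 3.5),
    DialogueLine("Fine.", 6.0),
]
spans = segment_spans(lines, track_end=8.0)
for text, start, end in spans:
    print(f"{start:5.1f} to {end:5.1f}  {text}")
```

Each span may include trailing silence, which is one reason a forced aligner is then needed to locate the actual speech inside it.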

Filtering

  • Filtering out foreign-language text and audio
  • Detecting and excluding overlapping speech
  • Excluding dialogue lines with incorrect alignments
  • 355.36 hours of source and target content
  • Manual inspection suggests high alignment quality

Cross-lingual alignment

  • Sets of force-aligned and filtered content in each language need to be aligned across languages to create a single corpus
  • Offset finding is used to identify segments from audio tracks
  • Sentence alignment is done using Vecalign algorithm on multilingual LASER embeddings
  • Final dataset contains 42,850 aligned dialogue line pairs from 49 episodes and 11 shows
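
To illustrate the idea of aligning sentences across languages by embedding similarity, here is a deliberately simplified sketch. It is not Vecalign and does not use LASER: it is a plain one-to-one monotonic dynamic program over toy 2-D vectors, and `monotonic_align` and `skip_penalty` are hypothetical names chosen for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def monotonic_align(src_vecs, tgt_vecs, skip_penalty=0.3):
    """One-to-one monotonic alignment maximizing total cosine similarity.

    Unmatched sentences on either side cost `skip_penalty` (assumed value).
    Returns a list of (source_index, target_index) pairs.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:
                sim = cosine(src_vecs[i - 1], tgt_vecs[j - 1])
                options.append((score[i - 1][j - 1] + sim, "match"))
            if i > 0:
                options.append((score[i - 1][j] - skip_penalty, "skip_src"))
            if j > 0:
                options.append((score[i][j - 1] - skip_penalty, "skip_tgt"))
            score[i][j], back[i][j] = max(options)
    # Trace back from the final cell to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy embeddings: the second source sentence has no good counterpart.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[1.0, 0.1], [1.0, 0.9]]
pairs = monotonic_align(src, tgt)
print(pairs)
```

Vecalign additionally handles many-to-many alignments and scales to long documents, which this sketch does not attempt.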

Gender annotations

  • Extract gender information from dramatis personae lists
  • Not all characters are listed in the scripts, so some gender information is lost

On-screen annotations

  • Used annotations to identify on-screen and off-screen speech in 49 episodes
  • 9.68% of aligned pairs have on-screen/off-screen annotations
  • Tested for differences in duration and complexity of on-screen and off-screen lines
  • No significant differences found

Data release considerations

  • Content licensing restrictions prevent data from being released.
  • German and Spanish are target language subsets.
  • Quality filtering and cross-lingual alignment are described in sections 3.1, 3.2 and 3.3.
  • Manual on-screen/off-screen annotations are described in section 3.5.
  • Prior works have released smaller datasets relying on a permissive interpretation of “fair use”.

Analysis

Isochrony

  • Dubbed speech should line up in time with the original speech.
  • Automatic dubbing work has explored integrating isochronic constraints.
  • Source duration is a strong predictor of dub duration.
  • Overlap fraction measures how much time both the original and dubbed speech occur at the same time.
  • On-screen dubs are more isochronic than off-screen, but to a small degree.
  • The gap in on- vs. off-screen isochrony may be partially explained by annotations.
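
One way to make the overlap fraction concrete is sketched below. The exact definition is an assumption here: the time during which both tracks contain speech, divided by the time during which either does. The function names are hypothetical, and speech is represented as lists of `(start, end)` intervals in seconds.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total(intervals):
    """Total duration covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def overlap_fraction(src_speech, tgt_speech):
    """Time with speech in both tracks divided by time with speech in either.

    This intersection-over-union form is an assumed definition.
    """
    src, tgt = merge(src_speech), merge(tgt_speech)
    both = 0.0
    for s1, e1 in src:
        for s2, e2 in tgt:
            both += max(0.0, min(e1, e2) - max(s1, s2))
    either = total(src) + total(tgt) - both
    return both / either if either > 0 else 0.0


# Source speech from 0.0 s to 2.0 s, dubbed speech from 0.5 s to 2.5 s.
fraction = overlap_fraction([(0.0, 2.0)], [(0.5, 2.5)])
print(round(fraction, 3))
```

A perfectly isochronic dub would score 1.0 under this definition, and a dub that never coincides with the source speech would score 0.0.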

Isometry

  • Past work has examined similarity of text length as a way to constrain translation for automatic dubbing.
  • This practice is called “isometric machine translation”.
  • Aim is to test how good a proxy isometry is for isochrony in human dubs and how much human dubbers preserve character length.
  • Results show isometry is a weak-to-moderate proxy for isochrony.
  • Human dubs are largely nonisometric in both Spanish and German.
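
The isometry test can be illustrated with a character-length ratio and a simple correlation against the duration ratio. This is a sketch under assumptions: the 10% tolerance is an illustrative threshold rather than a value stated in this summary, and the function names and toy numbers are invented for the example.

```python
import math


def length_ratio(src_text, tgt_text):
    """Character length of the dub divided by that of the source."""
    if not src_text:
        raise ValueError("source text must be non-empty")
    return len(tgt_text) / len(src_text)


def is_isometric(src_text, tgt_text, tolerance=0.10):
    """True if the dub's length is within `tolerance` of the source's (assumed threshold)."""
    return abs(length_ratio(src_text, tgt_text) - 1.0) <= tolerance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


print(is_isometric("Good morning.", "Guten Morgen."))  # equal length
print(is_isometric("I don't know.", "No lo sé."))      # much shorter dub

# Toy per-line ratios (made up): how well does text length track duration?
char_ratios = [0.7, 0.9, 1.0, 1.2, 1.4]
duration_ratios = [0.95, 1.0, 0.9, 1.1, 1.05]
print(round(pearson(char_ratios, duration_ratios), 3))
```

A correlation near 1 would mean text length is a reliable stand-in for timing; the finding above is that in human dubs it is only a weak-to-moderate one.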

Speaking rate

  • Previous literature has found that dubbed speech often sounds “artificial and contrived”
  • Isometric MT literature suggests that TTS models require isometric input to produce natural sounding isochronic output
  • The paper focuses on examining speaking rates
  • The duration ratio was found to be more closely related to the relative length of the content than to the dub speaking rate
  • Standard deviation of the dubbing voice actors' speaking rate is lower than that of the source speech
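
The relationship between duration, content length, and speaking rate can be made explicit with a small decomposition. The sketch below assumes speaking rate is measured in phones per second (the unit is an assumption) and uses hypothetical function names; since duration equals length divided by rate, the duration ratio equals the length ratio divided by the rate ratio.

```python
def speaking_rate(n_phones, duration_s):
    """Phones per second (assumed unit of speaking rate)."""
    return n_phones / duration_s


def decompose(src_phones, src_dur, tgt_phones, tgt_dur):
    """Return (duration_ratio, length_ratio, rate_ratio), each as target over source.

    By construction, duration_ratio == length_ratio / rate_ratio.
    """
    duration_ratio = tgt_dur / src_dur
    length_ratio = tgt_phones / src_phones
    rate_ratio = speaking_rate(tgt_phones, tgt_dur) / speaking_rate(src_phones, src_dur)
    return duration_ratio, length_ratio, rate_ratio


# Toy line: source has 20 phones in 2.0 s, dub has 30 phones in 2.5 s.
duration_ratio, length_ratio, rate_ratio = decompose(20, 2.0, 30, 2.5)
print(duration_ratio, length_ratio, rate_ratio)
```

In this toy line the dub is longer mainly because it contains more content (length ratio 1.5), only partly offset by faster speech, which mirrors the kind of comparison the bullets above describe.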

Lip sync

  • Both qualitative and technical work has considered “lip sync” constraints in human and automatic dubbing
  • Failing to match mouth movements may reduce dub quality
  • Recent empirical studies have found that this constraint may not be as binding as previously assumed
  • Use notion of “viseme” to capture alignment between source and human dub mouth movements
  • Average within-viseme cooccurrence rate is 1.575, with an average across-viseme rate of 0.981
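
The co-occurrence rates above read like a ratio of observed to expected co-occurrence, where 1.0 would mean source and dub visemes are independent. The sketch below computes such a ratio from time-aligned viseme pairs; this interpretation, the function names, and the viseme class labels are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_rates(pairs):
    """Observed over expected co-occurrence for each (source, dub) viseme pair.

    `pairs` holds one (source_viseme, dub_viseme) tuple per time frame.
    Expected counts assume the two tracks are statistically independent.
    """
    n = len(pairs)
    joint = Counter(pairs)
    src_counts = Counter(s for s, _ in pairs)
    tgt_counts = Counter(t for _, t in pairs)
    rates = {}
    for s in src_counts:
        for t in tgt_counts:
            expected = src_counts[s] * tgt_counts[t] / n
            rates[(s, t)] = joint[(s, t)] / expected
    return rates


def within_and_across(rates):
    """Mean rate for matching viseme pairs and for mismatched pairs."""
    within = [r for (s, t), r in rates.items() if s == t]
    across = [r for (s, t), r in rates.items() if s != t]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(across)


# Toy frames with two hypothetical viseme classes.
frames = (
    [("bilabial", "bilabial")] * 3
    + [("bilabial", "open")] * 1
    + [("open", "open")] * 3
    + [("open", "bilabial")] * 1
)
within, across = within_and_across(cooccurrence_rates(frames))
print(within, across)
```

Under this reading, a within-viseme rate above 1 indicates that matching mouth shapes co-occur more often than chance would predict.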

Translation quality

  • Human dubbing process is complicated
  • Translation is modified to satisfy isochrony, lip-sync, and other constraints
  • Question is how faithful the resulting translation is to the source material
  • Automatic MT metrics used to measure quality of human dubs
  • No substantial differences between on-screen and off-screen speech for either metric

Non-text transfer

  • Source audio properties explain a substantial fraction of target variance.
  • Source speaking rate correlates with target speaking rate.
  • Line-level mean pitch is strongly related to source audio.
  • Standard deviation of pitch is less related to source audio.
  • Mean and standard deviation of energy are weakly related to source audio.
  • Gender of source character is a weak predictor of line-level mean pitch.
  • Speaker identity is a good predictor of target audio characteristics.
  • Dialogue line-level variables increase predictive power.
  • Human dubbers imitate properties of source audio at a semantic level.
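
The phrase "fraction of target variance explained" can be illustrated with the coefficient of determination of a one-variable linear regression. This is a much simpler model than one that also includes speaker identity or line-level variables, and the function name and pitch values below are invented for the example.

```python
def r_squared(xs, ys):
    """R^2 of an ordinary least-squares fit of ys on xs (single predictor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - residual / syy


# Toy line-level mean pitch in Hz (made-up values): source vs. dub.
src_pitch = [110.0, 125.0, 180.0, 210.0, 230.0]
dub_pitch = [118.0, 120.0, 190.0, 200.0, 240.0]
print(round(r_squared(src_pitch, dub_pitch), 3))
```

An R^2 close to 1 would mean the source feature alone predicts the dub feature almost perfectly; comparing such values across pitch, energy, and speaking rate is the kind of contrast the bullets above summarize.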

Insights for automatic dubbing

  • Translation quality and speech naturalness are important for automatic dubbing
  • Input to the dubbing process is mostly dialogue, which raises gender and number agreement issues in translation
  • Little literature on automatic translation of dialogues
  • Naturalness for TTS systems is challenging
  • Need for mechanism to encode emotion/emphasis
  • Isochrony is important for automatic dubbing
  • Isometry is not a good proxy for isochrony
  • Lip sync is marginally useful for automatic dubbing

Future work

  • Analyzed two language pairs: English-German and English-Spanish
  • Hoping to analyze more distant language pairs and non-English source material
  • Isometry is a poor proxy for isochrony in human dubs
  • This work relied on automatic metrics; the authors hope to verify the findings with human annotators
  • Aggregate analysis provides high-level insights; the authors hope to explore individual variations in future work

Conclusion

  • First large-scale quantitative study of how humans perform the task of dubbing video content
  • Results challenge assumptions in qualitative and machine learning literature
  • Analysis provides insights on research directions to address weaknesses in current automatic dubbing approaches