Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Investigated how humans dub video content from one language to another
Leveraged a novel corpus of 319.57 hours of video from 54 professionally produced titles
Challenged assumptions made in qualitative and machine-learning literature on dubbing
Argued for importance of vocal naturalness and translation quality over isometric and lip-sync constraints
Found influence of source-side audio on human dubs beyond words of translation

Paper Content

Introduction

Considerable attention has been paid to the dubbing of video content from one language to another
Human dubbing has been studied from a qualitative perspective
Machine-learning practitioners have taken up the task of building multimodal systems for automatic dubbing
Human dubbing involves a sequence of contributors with control over different aspects of the process
A data-driven examination of the way humans actually perform this task is missing
Human dubbing is a “constrained translation”
Questions about isochrony, isometry, speech tempo, lip sync, translation quality, and source influence are explored
Insights are provided on research directions to address weaknesses in current automatic dubbing approaches

Related work

Qualitative

Dubbing is a type of constrained translation
Dubs need to match the original video track
Dubs need to maintain isochronic, phonetic, and kinesic synchrony
Dubs need to be intelligible to the target language and culture
Dubs should sound natural
Dubs should preserve the semantic meaning of the source
Dubbing is a form of non-literal translation called “transcreation”
Scholars have investigated the role of power, ideology, identity, and similar considerations in dubbing

Automatic dubbing

Automatic dub generation has been explored with a variety of constraints
Lip sync constraints have been integrated into dub generation
Adjusting mouth movements in the original video to match a dubbed audio track has been explored
Isometric machine translation has been used to produce a translation with similar length to the input
Controlling speaking rate in automatic dubbing systems to achieve prosodic alignment has been studied
Time-boundary relaxation has been used to control speaking rate and speech fluency
Integrating pause constraints directly into MT has been examined
End-to-end dubbing has been explored

Empirical studies

Studies have attempted to examine human dubbing through a quantitative lens
Di Giovanni and Romero-Fresco found that audiences may not be as sensitive to lip sync as traditionally believed
Karakanta et al. concluded that on-screen human dubs have lower translation quality than off-screen dubs

Corpus description & preprocessing

Dataset consists of 674 episodes of 54 shows
Dataset contains 319.57 hours of content from 9,215 distinct speakers
Subsets of data used for analysis are Drama, Kids, Comedy, and Suspense
Dataset includes audio and video for English originals and audio tracks for Spanish and German dubs
35.68 hours of content with both Spanish and German dubs
Quality filtering performed prior to analysis

Segmentation and forced alignment

Script timecodes are used to segment audio tracks
Dialogue lines are associated with audio between start time and start time of next line
Lines are roughly the same as speaker turns
234,322 dialogue lines for English, 29,210 for German, and 28,720 for Spanish
Montreal Forced Aligner is used to force align each dialogue line with its corresponding audio
87.37% of English lines, 89.35% of German lines, and 80.81% of Spanish lines successfully aligned
Speaker fundamental frequency and energy are extracted and averaged on a per-phone basis
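
The segmentation rule above (each dialogue line owns the audio from its own start time up to the start time of the next line) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `DialogueLine` type, the `segment_lines` helper, and the `track_end` parameter for the last line are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DialogueLine:
    start: float                 # script timecode in seconds
    text: str
    end: Optional[float] = None  # filled in by segment_lines


def segment_lines(lines: List[DialogueLine], track_end: float) -> List[DialogueLine]:
    """Assign each line the audio span [start, start of next line).

    The final line has no successor, so it is assumed (hypothetically)
    to run until track_end.
    """
    ordered = sorted(lines, key=lambda line: line.start)
    segments = []
    for i, line in enumerate(ordered):
        end = ordered[i + 1].start if i + 1 < len(ordered) else track_end
        segments.append(DialogueLine(line.start, line.text, end))
    return segments


segments = segment_lines(
    [DialogueLine(0.0, "Hello."), DialogueLine(2.5, "Hi there."), DialogueLine(4.0, "Bye.")],
    track_end=6.0,
)
for seg in segments:
    print(f"{seg.start:.1f}-{seg.end:.1f}: {seg.text}")
```

Each resulting span would then be handed to the forced aligner together with its text; that step is not shown because it depends on the aligner's own tooling.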

Filtering

Filtering out foreign-language text and audio
Detecting and excluding overlapping speech
Excluding dialogue lines with incorrect alignments
355.36 hours of source and target content
Manual inspection suggests high alignment quality

Cross-lingual alignment

Sets of force-aligned and filtered content in each language need to be aligned across languages to create a single corpus
Offset finding is used to identify segments from audio tracks
Sentence alignment is done using Vecalign algorithm on multilingual LASER embeddings
Final dataset contains 42,850 aligned dialogue line pairs from 49 episodes and 11 shows

Gender annotations

Extract gender information from dramatis personae lists
Not all characters are listed in the scripts, so some gender information is lost

On-screen annotations

Used annotations to identify on-screen and off-screen speech in 49 episodes
9.68% of aligned pairs have on-screen/off-screen annotations
Tested for differences in duration and complexity of on-screen and off-screen lines
No significant differences found

Data release considerations

Content licensing restrictions prevent data from being released.
German and Spanish are target language subsets.
Quality filtering and cross-lingual alignment are described in sections 3.1, 3.2 and 3.3.
Manual on-screen/off-screen annotations are described in section 3.5.
Prior works have released smaller datasets relying on a permissive interpretation of “fair use”.

Analysis

Isochrony

Dubbed speech should line up in time with the original speech.
Automatic dubbing work has explored integrating isochronic constraints.
Source duration is a strong predictor of dub duration.
Overlap fraction measures how much time both the original and dubbed speech occur at the same time.
On-screen dubs are more isochronic than off-screen, but to a small degree.
Gap in on- vs. off-screen isochrony may be partially explained by annotations.
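
To make the overlap measure concrete, here is a rough sketch of one way to compute it. The summary only says that overlap fraction captures how much time source and dubbed speech occur together; normalizing by the time during which either track contains speech (intersection over union) is an assumption made here, and the paper's exact definition may differ. The function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) of a speech span, in seconds


def total_duration(intervals: List[Interval]) -> float:
    return sum(end - start for start, end in intervals)


def shared_duration(a: List[Interval], b: List[Interval]) -> float:
    """Time during which both tracks contain speech.

    Assumes the intervals inside each list do not overlap one another.
    """
    return sum(
        max(0.0, min(end_a, end_b) - max(start_a, start_b))
        for start_a, end_a in a
        for start_b, end_b in b
    )


def overlap_fraction(source: List[Interval], dub: List[Interval]) -> float:
    """Shared speech time divided by time where either track has speech (assumed normalization)."""
    shared = shared_duration(source, dub)
    either = total_duration(source) + total_duration(dub) - shared
    return shared / either if either > 0 else 0.0


# The dub starts and ends half a second later than the source.
print(overlap_fraction([(0.0, 2.0)], [(0.5, 2.5)]))
```

A perfectly isochronic dub scores 1.0 under this definition, and a dub that never coincides with source speech scores 0.0.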

Isometry

Past work has examined similarity of text length as a way to constrain translation for automatic dubbing.
This practice is called “isometric machine translation”.
Aim is to test how good a proxy isometry is for isochrony in human dubs and how much human dubbers preserve character length.
Results show isometry is a weak-to-moderate proxy for isochrony.
Human dubs are largely non-isometric in both Spanish and German.
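
As an illustration of what an isometry check looks like, the sketch below compares character lengths of a source line and its dub. The 10% tolerance is an assumption taken from common practice in isometric MT work, not a figure stated in this summary, and the helper names are hypothetical.

```python
def length_ratio(source_text: str, target_text: str) -> float:
    """Character length of the dub divided by character length of the source."""
    return len(target_text) / len(source_text)


def is_isometric(source_text: str, target_text: str, tolerance: float = 0.10) -> bool:
    """True if the dub length is within `tolerance` of the source length.

    The default tolerance of 10% is an assumed threshold for illustration.
    """
    difference = abs(len(target_text) - len(source_text))
    return difference <= tolerance * len(source_text)


source = "Where are you going?"
dub = "Wohin gehst du?"
print(length_ratio(source, dub), is_isometric(source, dub))
```

Running a check like this over aligned line pairs, and comparing the length ratio against the duration ratio of the corresponding audio, is the kind of analysis that would show whether isometry tracks isochrony.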

Speaking rate

Previous literature has found that dubbed speech often sounds “artificial and contrived”
Isometric MT literature suggests that TTS models require isometric input to produce natural sounding isochronic output
The paper focuses on examining speaking rates
It was found that the duration ratio is more closely related to relative length of content than the dub speaking rate
Standard deviation of dubbing voice actor speaking rate is lower than that of the source speech
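
The relationship between duration, content length, and speaking rate follows from the definition of rate as content per unit time: the duration ratio equals the content-length ratio divided by the speaking-rate ratio. The sketch below shows that decomposition for a single line pair. Measuring content in phones is an assumption made here (phones are what a forced aligner provides); the numbers are invented for illustration.

```python
def speaking_rate(n_phones: int, duration_s: float) -> float:
    """Phones per second (the unit of content is an assumption)."""
    return n_phones / duration_s


def decompose(src_phones: int, src_dur: float, tgt_phones: int, tgt_dur: float) -> dict:
    """Split the dub/source duration ratio into content and rate components."""
    content_ratio = tgt_phones / src_phones
    rate_ratio = speaking_rate(tgt_phones, tgt_dur) / speaking_rate(src_phones, src_dur)
    return {
        "duration_ratio": tgt_dur / src_dur,
        "content_ratio": content_ratio,
        "rate_ratio": rate_ratio,
    }


# Invented example: the dub has 20% more phones and is spoken slightly faster.
parts = decompose(src_phones=30, src_dur=2.0, tgt_phones=36, tgt_dur=2.25)
print(parts)
# duration_ratio equals content_ratio / rate_ratio by construction
print(parts["content_ratio"] / parts["rate_ratio"])
```

A dubber can therefore hit a target duration either by changing how much is said or by changing how fast it is said; the finding above suggests the first lever dominates in human dubs.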

Lip sync

Qualitative and technical work have considered “lip sync” constraints in human and automatic dubbing
Failing to match mouth movements may reduce dub quality
Recent empirical studies have found that this constraint may not be as binding as previously assumed
Use notion of “viseme” to capture alignment between source and human dub mouth movements
Average within-viseme cooccurrence rate is 1.575, with an average across-viseme rate of 0.981
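
The summary does not say how the cooccurrence rate is normalized. One reading consistent with values near 1 is a ratio of observed to expected co-occurrence, where the expectation assumes source and dub visemes are independent. The sketch below implements that reading only; it is an assumption, not the paper's confirmed method. The inputs are assumed to be viseme labels sampled at the same time points in the source and the dub, and the label names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence, Tuple


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def cooccurrence_rates(src: Sequence[str], tgt: Sequence[str]) -> Tuple[float, float]:
    """Average observed/expected co-occurrence for same-viseme and different-viseme pairs.

    src[i] and tgt[i] are the visemes active at the same time point.
    A rate above 1 means the pair co-occurs more often than chance.
    """
    n = len(src)
    joint = Counter(zip(src, tgt))
    src_counts, tgt_counts = Counter(src), Counter(tgt)
    labels = sorted(set(src) | set(tgt))
    within, across = [], []
    for a, b in product(labels, labels):
        expected = src_counts[a] * tgt_counts[b] / n
        if expected == 0:
            continue  # pair cannot occur; skip rather than divide by zero
        rate = joint[(a, b)] / expected
        (within if a == b else across).append(rate)
    return _mean(within), _mean(across)


src_visemes = ["bilabial", "open", "open", "rounded", "bilabial", "open"]
dub_visemes = ["bilabial", "open", "rounded", "rounded", "open", "open"]
print(cooccurrence_rates(src_visemes, dub_visemes))
```

Under this reading, a within-viseme average above 1 alongside an across-viseme average close to 1 would indicate that matching mouth shapes line up more often than chance, though not overwhelmingly so.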

Translation quality

Human dubbing process is complicated
Translation is modified to satisfy isochrony, lip-sync, and other constraints
Question is how faithful the resulting translation is to the source material
Automatic MT metrics used to measure quality of human dubs
No substantial differences between on-screen and off-screen speech for either metric

Non-text transfer

Source audio properties explain a substantial fraction of target variance.
Source speaking rate correlates with target speaking rate.
Line-level mean pitch is strongly related to source audio.
Standard deviation of pitch is less related to source audio.
Mean and standard deviation of energy are weakly related to source audio.
Gender of source character is a weak predictor of line-level mean pitch.
Speaker identity is a good predictor of target audio characteristics.
Dialogue line-level variables increase predictive power.
Human dubbers imitate properties of source audio at a semantic level.
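
"Fraction of target variance explained" is the usual R² of a regression from source-side features to a target-side property. The sketch below computes R² for the simplest case, one source feature predicting one target feature (for example, line-level mean pitch). It is only an illustration: the analysis summarized above also involves speaker identity and line-level variables, and the pitch values here are invented.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R^2 of an ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1.0 - residual / syy


# Invented line-level mean pitch values in Hz (source line, dubbed line).
source_pitch = [110.0, 125.0, 190.0, 210.0, 230.0]
dub_pitch = [118.0, 120.0, 200.0, 205.0, 240.0]
print(r_squared(source_pitch, dub_pitch))
```

A value near 1 would mean the source property almost fully determines the dub property, while a value near 0 would mean the dub varies independently of the source.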

Insights for automatic dubbing

Translation quality and speech naturalness are important for automatic dubbing
Input to the dubbing process is mostly dialogue with gender and number issues
Little literature on automatic translation of dialogues
Naturalness for TTS systems is challenging
Need for mechanism to encode emotion/emphasis
Isochrony is important for automatic dubbing
Isometry is not a good proxy for isochrony
Lip sync is marginally useful for automatic dubbing

Future work

Analyzed two language pairs: English-German and English-Spanish
Hoping to analyze more distant language pairs and non-English source material
Isometry is a poor proxy for isochrony in human dubs
Automatic metrics used in this work, hoping to verify findings with human annotators
Aggregate analysis provides high-level insights, hoping to explore individual variations in future work

Conclusion

First large-scale quantitative study of how humans perform the task of dubbing video content
Results challenge assumptions in qualitative and machine learning literature
Analysis provides insights on research directions to address weaknesses in current automatic dubbing approaches
