Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Investigated how humans dub video content from one language to another
- Leveraged a novel corpus of 319.57 hours of video from 54 professionally produced titles
- Challenged assumptions made in qualitative and machine-learning literature on dubbing
- Argued for importance of vocal naturalness and translation quality over isometric and lip-sync constraints
- Found influence of source-side audio on human dubs beyond words of translation
Paper Content
Introduction
- Considerable attention has been paid to the dubbing of video content from one language to another
- Human dubbing has been studied from a qualitative perspective
- Machine-learning practitioners have taken up the task of building multimodal systems for automatic dubbing
- Human dubbing involves a sequence of contributors with control over different aspects of the process
- A data-driven examination of the way humans actually perform this task is missing
- Human dubbing is a “constrained translation”
- Questions about isochrony, isometry, speech tempo, lip sync, translation quality, and source influence are explored
- Insights are provided on research directions to address weaknesses in current automatic dubbing approaches
Related work
Qualitative
- Dubbing is a type of constrained translation
- Dubs need to match the original video track
- Dubs need to maintain isochronic, phonetic, and kinesic synchrony
- Dubs need to be intelligible to the target language and culture
- Dubs should sound natural
- Dubs should preserve the semantic meaning of the source
- Dubbing is a form of non-literal translation called “transcreation”
- Scholars have investigated the role of power, ideology, identity, and similar considerations in dubbing
Automatic dubbing
- Automatic dub generation has been explored with a variety of constraints
- Lip sync constraints have been integrated into dub generation
- Adjusting mouth movements in the original video to match a dubbed audio track has been explored
- Isometric machine translation has been used to produce a translation with similar length to the input
- Controlling speaking rate in automatic dubbing systems to achieve prosodic alignment has been studied
- Time-boundary relaxation has been used to control speaking rate and speech fluency
- Integrating pause constraints directly into MT has been examined
- End-to-end dubbing has been explored
Empirical studies
- Studies have attempted to examine human dubbing through a quantitative lens
- Di Giovanni and Romero-Fresco found that audiences may not be as sensitive to lip sync as traditionally believed
- Karakanta et al. concluded that on-screen human dubs have lower translation quality than off-screen dubs
Corpus description & preprocessing
- Dataset consists of 674 episodes of 54 shows
- Dataset contains 319.57 hours of content from 9,215 distinct speakers
- Subsets of data used for analysis are Drama, Kids, Comedy, and Suspense
- Dataset includes audio and video for English originals and audio tracks for Spanish and German dubs
- 35.68 hours of content with both Spanish and German dubs
- Quality filtering performed prior to analysis
Segmentation and forced alignment
- Script timecodes are used to segment audio tracks
- Each dialogue line is associated with the audio between its start time and the start time of the next line
- Lines are roughly the same as speaker turns
- 234,322 dialogue lines for English, 29,210 for German, and 28,720 for Spanish
- Montreal Forced Aligner is used to force align each dialogue line with its corresponding audio
- 87.37% of English lines, 89.35% of German lines, and 80.81% of Spanish lines successfully aligned
- Speaker fundamental frequency and energy are extracted and averaged on a per-phone basis
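The segmentation and per-phone averaging steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the `(time, value)` frame format, and the `(label, start, end)` phone format are all assumptions made for the example, and the forced alignment itself (done with the Montreal Forced Aligner in the paper) is not reproduced here.

```python
from typing import List, Optional, Tuple


def segment_lines(start_times: List[float], track_end: float) -> List[Tuple[float, float]]:
    """Turn script start timecodes into (start, end) spans.

    Each line ends where the next line starts; the last line ends at
    track_end (a hypothetical parameter for the end of the audio track).
    """
    starts = sorted(start_times)
    ends = starts[1:] + [track_end]
    return list(zip(starts, ends))


def mean_per_phone(
    frames: List[Tuple[float, float]],
    phones: List[Tuple[str, float, float]],
) -> List[Tuple[str, Optional[float]]]:
    """Average a frame-level feature (e.g. F0 or energy) within each phone.

    frames: (time_in_seconds, value) pairs from some feature extractor.
    phones: (label, start, end) intervals from a forced aligner.
    Phones that contain no frames get None instead of a mean.
    """
    result = []
    for label, start, end in phones:
        values = [v for t, v in frames if start <= t < end]
        result.append((label, sum(values) / len(values) if values else None))
    return result


spans = segment_lines([0.0, 2.5, 4.0], track_end=6.0)
f0 = mean_per_phone(
    [(0.0, 100.0), (0.1, 110.0), (0.2, 200.0)],
    [("AH", 0.0, 0.2), ("T", 0.2, 0.3)],
)
```

Ending each line at the next line's start time means trailing silence is attributed to the preceding line, which is one reason a forced-alignment pass is needed afterwards to find where speech actually occurs.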
Filtering
- Filtering out foreign-language text and audio
- Detecting and excluding overlapping speech
- Excluding dialogue lines with incorrect alignments
- 355.36 hours of source and target content
- Manual inspection suggests high alignment quality
Cross-lingual alignment
- Sets of force-aligned and filtered content in each language need to be aligned across languages to create a single corpus
- Offset finding is used to identify segments from audio tracks
- Sentence alignment is done using Vecalign algorithm on multilingual LASER embeddings
- Final dataset contains 42,850 aligned dialogue line pairs from 49 episodes and 11 shows
Gender annotations
- Extract gender information from dramatis personae lists
- Not all characters are listed in the scripts, so some gender information is lost
On-screen annotations
- Used annotations to identify on-screen and off-screen speech in 49 episodes
- 9.68% of aligned pairs have on-screen/off-screen annotations
- Tested for differences in duration and complexity between on-screen and off-screen lines
- No significant differences found
Data release considerations
- Content licensing restrictions prevent data from being released.
- German and Spanish are target language subsets.
- Quality filtering and cross-lingual alignment are described in sections 3.1, 3.2 and 3.3.
- Manual on-screen/off-screen annotations are described in section 3.5.
- Prior works have released smaller datasets relying on a permissive interpretation of “fair use”.
Analysis
Isochrony
- Dubbed speech should line up in time with the original speech.
- Automatic dubbing work has explored integrating isochronic constraints.
- Source duration is a strong predictor of dub duration.
- Overlap fraction measures how much time both the original and dubbed speech occur at the same time.
- On-screen dubs are more isochronic than off-screen, but to a small degree.
- Gap in on- vs. off-screen isochrony may be partially explained by annotations.
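To make the overlap fraction concrete, here is a small sketch. The summary only says the measure captures how much time the original and dubbed speech occur together; the normalisation used below (intersection divided by union of the two speech intervals) is an assumption for illustration and may differ from the paper's exact definition.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def overlap_fraction(src: Interval, dub: Interval) -> float:
    """Fraction of time in which source and dub speech coincide.

    Assumed definition: intersection / union of the two intervals.
    Returns 0.0 when the intervals are disjoint or both are empty.
    """
    intersection = max(0.0, min(src[1], dub[1]) - max(src[0], dub[0]))
    union = (src[1] - src[0]) + (dub[1] - dub[0]) - intersection
    return intersection / union if union > 0 else 0.0


# A dub that starts and ends half a second late on a two-second line.
print(overlap_fraction((0.0, 2.0), (0.5, 2.5)))
```

Under this definition a perfectly isochronic dub scores 1.0, and any shift, truncation, or overrun lowers the score.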
Isometry
- Past work has examined similarity of text length as a way to constrain translation for automatic dubbing.
- This practice is called “isometric machine translation”.
- Aim is to test how good a proxy isometry is for isochrony in human dubs and how much human dubbers preserve character length.
- Results show isometry is a weak-to-moderate proxy for isochrony.
- Human dubs are largely non-isometric in both Spanish and German.
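As an illustration of what an isometry check looks like, the sketch below compares character lengths of a source line and its dub. The `tolerance` parameter and its default of 10% are assumptions chosen for the example, not a threshold taken from this summary.

```python
def length_ratio(source_text: str, target_text: str) -> float:
    """Character-length ratio of the dub relative to the source."""
    if not source_text:
        raise ValueError("source_text must be non-empty")
    return len(target_text) / len(source_text)


def is_isometric(source_text: str, target_text: str, tolerance: float = 0.10) -> bool:
    """True if the dub's character count is within `tolerance` of the source's.

    `tolerance` is a hypothetical parameter for this sketch.
    """
    return abs(length_ratio(source_text, target_text) - 1.0) <= tolerance


print(length_ratio("How are you?", "Wie geht es dir heute?"))
print(is_isometric("How are you?", "Wie geht es dir heute?"))
```

Testing isometry as a proxy for isochrony then amounts to asking how well this character-length ratio tracks the ratio of spoken durations across many line pairs.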
Speaking rate
- Previous literature has found that dubbed speech often sounds “artificial and contrived”
- Isometric MT literature suggests that TTS models require isometric input to produce natural sounding isochronic output
- The paper focuses on examining speaking rates
- The duration ratio was found to be more closely related to the relative length of the content than to the dub speaking rate
- The standard deviation of the dubbing voice actors' speaking rate is lower than that of the source speech
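The relationship between duration, content length, and speaking rate in the bullets above follows from a simple identity: duration equals the number of spoken units divided by the speaking rate, so the dub-to-source duration ratio equals the content ratio divided by the rate ratio. The sketch below works through that arithmetic. The choice of unit (phones, syllables, or something else) is an assumption here; the summary does not say which unit the paper uses.

```python
def speaking_rate(n_units: int, duration_s: float) -> float:
    """Units (e.g. phones) spoken per second."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return n_units / duration_s


def decompose_duration_ratio(src_units: int, src_dur: float,
                             dub_units: int, dub_dur: float) -> dict:
    """Split the dub/source duration ratio into content and rate components.

    duration_ratio == content_ratio / rate_ratio (up to rounding error).
    """
    content_ratio = dub_units / src_units
    rate_ratio = speaking_rate(dub_units, dub_dur) / speaking_rate(src_units, src_dur)
    return {
        "duration_ratio": dub_dur / src_dur,
        "content_ratio": content_ratio,
        "rate_ratio": rate_ratio,
    }


# Source: 20 units in 2.0 s; dub: 30 units in 2.5 s.
parts = decompose_duration_ratio(20, 2.0, 30, 2.5)
print(parts)
```

In this toy example the dub carries 50% more content but is spoken only 20% faster, so it runs 25% longer. A dubber who keeps speaking rate steady has to absorb length differences in duration rather than in tempo.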
Lip sync
- Qualitative and technical work have considered “lip sync” constraints in human and automatic dubbing
- Failing to match mouth movements may reduce dub quality
- Recent empirical studies have found that this constraint may not be as binding as previously assumed
- Use notion of “viseme” to capture alignment between source and human dub mouth movements
- Average within-viseme cooccurrence rate is 1.575, with an average across-viseme rate of 0.981
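One plausible way to read the co-occurrence rates above is as an observed-to-expected ratio: how often a source viseme and a dub viseme occur at the same time, divided by how often they would coincide if the two tracks were independent. A value above 1 means the pair co-occurs more than chance. The sketch below implements that reading on time-aligned viseme label sequences. This interpretation, the frame-level pairing, and the simple averaging over label pairs are all assumptions; the paper's actual computation may differ.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple


def cooccurrence_rates(src: List[str], dub: List[str]) -> Dict[Tuple[str, str], float]:
    """Observed/expected co-occurrence for every (source, dub) viseme pair.

    src and dub are viseme labels sampled at the same time points.
    """
    if len(src) != len(dub) or not src:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(src)
    joint = Counter(zip(src, dub))
    src_counts, dub_counts = Counter(src), Counter(dub)
    rates = {}
    for a, b in product(src_counts, dub_counts):
        expected = (src_counts[a] / n) * (dub_counts[b] / n)
        rates[(a, b)] = (joint[(a, b)] / n) / expected
    return rates


def within_across(rates: Dict[Tuple[str, str], float]) -> Tuple[float, float]:
    """Mean rate over matching pairs (a == b) and over mismatched pairs."""
    within = [r for (a, b), r in rates.items() if a == b]
    across = [r for (a, b), r in rates.items() if a != b]

    def mean(xs: List[float]) -> float:
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(within), mean(across)


# Dub mouth shapes unrelated to the source: both averages sit at chance.
print(within_across(cooccurrence_rates(["A", "A", "B", "B"], ["A", "B", "A", "B"])))
```

Read this way, a within-viseme average well above 1 alongside an across-viseme average near 1 would indicate that dubs match source mouth shapes somewhat more often than chance.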
Translation quality
- Human dubbing process is complicated
- Translation is modified to satisfy isochrony, lip-sync, and other constraints
- Question is how faithful the resulting translation is to the source material
- Automatic MT metrics used to measure quality of human dubs
- No substantial differences between on-screen and off-screen speech for either metric
Non-text transfer
- Source audio properties explain a substantial fraction of target variance.
- Source speaking rate correlates with target speaking rate.
- Line-level mean pitch is strongly related to source audio.
- Standard deviation of pitch is less related to source audio.
- Mean and standard deviation of energy are weakly related to source audio.
- Gender of source character is a weak predictor of line-level mean pitch.
- Speaker identity is a good predictor of target audio characteristics.
- Dialogue line-level variables increase predictive power.
- Human dubbers imitate properties of source audio at a semantic level.
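The phrase "explains a fraction of target variance" can be illustrated with the coefficient of determination from a one-predictor linear regression, which equals the squared Pearson correlation between the source and target values. The sketch below is a simplified stand-in: the paper's analysis uses additional predictors such as speaker identity and line-level variables, which are not modelled here, and the sample values are invented for illustration.

```python
from typing import Sequence


def r_squared(source: Sequence[float], target: Sequence[float]) -> float:
    """Share of target variance explained by a linear fit on the source.

    For a single predictor this is the squared Pearson correlation.
    """
    if len(source) != len(target) or len(source) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    s_st = sum((s - mean_s) * (t - mean_t) for s, t in zip(source, target))
    s_ss = sum((s - mean_s) ** 2 for s in source)
    s_tt = sum((t - mean_t) ** 2 for t in target)
    if s_ss == 0 or s_tt == 0:
        raise ValueError("both sequences must vary")
    return (s_st * s_st) / (s_ss * s_tt)


# Invented line-level mean pitch values (Hz) for source and dub.
source_pitch = [110.0, 180.0, 210.0, 125.0, 195.0]
dub_pitch = [118.0, 175.0, 220.0, 130.0, 190.0]
print(r_squared(source_pitch, dub_pitch))
```

Applying the same measure to each property in turn (speaking rate, mean pitch, pitch variability, energy) gives a comparable scale for how much of the dub's behaviour can be predicted from the source audio alone.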
Insights for automatic dubbing
- Translation quality and speech naturalness are important for automatic dubbing
- Input to the dubbing process is mostly dialogue with gender and number issues
- Little literature on automatic translation of dialogues
- Naturalness for TTS systems is challenging
- Need for mechanism to encode emotion/emphasis
- Isochrony is important for automatic dubbing
- Isometry is not a good proxy for isochrony
- Lip sync is marginally useful for automatic dubbing
Future work
- Analyzed two language pairs: English-German and English-Spanish
- Hoping to analyze more distant language pairs and non-English source material
- Isometry is a poor proxy for isochrony in human dubs
- Automatic metrics were used in this work; the authors hope to verify the findings with human annotators
- Aggregate analysis provides high-level insights, hoping to explore individual variations in future work
Conclusion
- First large-scale quantitative study of how humans perform the task of dubbing video content
- Results challenge assumptions in qualitative and machine learning literature
- Analysis provides insights on research directions to address weaknesses in current automatic dubbing approaches