Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Direct speech-to-speech translation (S2ST) is advantageous for fast inference with a simplified pipeline.
  • UnitY is a novel two-pass direct S2ST architecture.
  • UnitY is enhanced by subword prediction, advanced two-pass decoder architecture design and search strategy, and better training regularization.
  • UnitY is pre-trained with a self-supervised denoising auto-encoding task.
  • UnitY outperforms a single-pass speech-to-unit translation model.
  • UnitY achieves 2.51x decoding speed-up compared to predicting spectrogram in the second pass.

Paper Content

Introduction

  • Automatic speech translation is an important technology for international communication.
  • Traditional approach to speech-to-speech translation is to use separate components for speech recognition, machine translation, and text-to-speech.
  • Sequence-to-sequence models have made it possible to use a single architecture with fewer components.
  • Direct approach is attractive for low-latency systems and reducing development costs.
  • Poor performance of direct S2ST models is due to data scarcity.
  • Data shortage has been addressed by pre-training, multi-task learning, pseudo labeling, and knowledge distillation.
  • Recent works propose to model discrete acoustic units instead of a continuous speech signal.
  • UnitY is a two-pass direct S2ST model that generates subwords and discrete acoustic units.
  • UnitY achieves better translation quality and decoding efficiency than other models.
  • UnitY is pre-trained with multilingual BART at the subword level.

Architecture

  • N 1st , N 2nd , and N t2u are the depths of the first-pass decoder, second-pass decoder, and T2U encoder of UnitY
  • Training objective of the first pass is to minimize the direct S2TT loss
  • Generating subwords instead of phonemes has five advantages
  • T2U encoder bridges the gap in representations between text and unit decoders
  • Training objective of the second pass is to minimize L s2u while being conditioned on Y

Training with r-drop

  • UnitY introduces an intermediate S2TT sub-task to make optimization tractable while maintaining end-to-end differentiability.
  • S2TT task is more likely to overfit than S2UT task.
  • Regularization based on R-Drop is applied to first-pass decoder to tackle overfitting.
  • R-Drop reduces inconsistency of model predictions between training and inference, improving generalization ability.

Text decoder pre-training

  • ASR and S2TT studies benefit from self-supervised pre-training
  • Speech encoder can be pre-trained with wav2vec2.0
  • Unit decoder can be initialized with unit-based mBART pre-trained with unlabeled speech data
  • Text decoder can be initialized with text-based mBART pre-trained with unlabeled text data
  • FFN of text decoder is frozen during S2ST fine-tuning
  • T2U encoder and second-pass unit decoder are initialized randomly

Search algorithm

  • Perform two-pass beam search decoding during inference
  • First-pass decoder uses beam search with beam size B 1st
  • Second-pass decoder uses beam search with beam size B 2nd
  • Assign larger beam size to first pass (B 1st > B 2nd ) for more diversity and reduced computation time

Deep-shallow two-pass decoders

  • Assigning more model capacities to the first-pass decoder than the second-pass decoder is referred to as deepshallow two-pass decoders.
  • This capacity assignment improves translation quality and inference efficiency.

Experimental setting

  • Described experimental settings for experiments in §4
  • Details of experimental settings provided

Data

  • Used 3 datasets: Fisher Es→En, CVSS-C, and mutlidomain En↔Es
  • Fisher Es→En contains 170-hour Spanish conversational telephone speech with transcriptions and English translations
  • CVSS-C is a multilingual S2ST corpus based on CoVoST2
  • Popuri et al. (2022) used multiple public S2TT corpora and augmented with ASR corpora
  • En→Es augmented with Europarl-ST, Must-C, TEDLIUM3, Librispeech, and Common Voice
  • Es→En augmented with CoVoST2, Europarl-ST, mTEDx, Common Voice, and multilingual Librispeech

Pre-processing

  • Pre-processing for acoustic feature extraction, discrete unit extraction, and text normalization
  • Discarded over-generated target speech/unit by TTS/T2U models

Pre-training

  • Used same models as Popuri et al. (2022)
  • Trained multilingual w2v-BERT model on 51 languages with same setting as Jia et al. (2022a)
  • Used same En-Es and 50-language mBART models as Wang et al. (2022) and Tang et al. (2020)

Baseline

  • Built two cascaded S2ST systems and four direct S2ST systems
  • All speech encoders based on Conformer
  • Pre-trained speech encoder with wav2vec2.0/w2v-BERT
  • Applied R-Drop to all models predicting discrete symbols
  • Combined Conformer ASR, Transformer MT, and Transformer TTS model
  • Pre-trained S2TT’s decoder with mBART
  • Improved version of Translatotron2
  • Replaced phoneme targets with subwords, LSTM decoders with Transformer decoders, and added T2S encoder
  • Autoregressive Transformer decoder instead of NAT for second-pass spectrogram decoder
  • Applied R-Drop to first-pass decoder

Vocoder

  • HiFi-GAN vocoder is used to convert spectrograms to waveform for TTS and direct speech-to-spectrogram models
  • Unit-based HiFi-GAN vocoder is used to convert discrete units to waveform for direct speech-to-unit models
  • Both vocoders are trained separately

Training

  • Optimized models with mixed precision training
  • Implemented models based on Fairseq toolkit
  • Detailed training hyperparameters in Appendix H

Decoding

  • ASR, S2TT, and S2UT models use a beam width of 10
  • UnitY uses a beam width of 10 for B 1st and 1 for B 2nd
  • Translatotron2+ uses a beam width of 10 for the first-pass decoder

Evaluation

  • Used pre-trained ASR model to transcribe generated target speech
  • Calculated BLEU scores using sacrebleu toolkit
  • ASR model fine-tuned from wav2vec2.0 with CTC objective
  • Reference target translation normalized with lowercasing, removal of punctuation, conversion of digits to spoken forms, and removal of non-verbal words

Experimental results

  • Three corpora studied from perspective of target representation and decoder architectures
  • Table 1 shows results on Fisher
  • Four direct systems trained from scratch outperformed previous studies
  • UnitY achieved best ASR-BLEU scores
  • Two-pass decoding improved results, but targeting discrete units also helped
  • Direct models outperformed cascaded system
  • Pretraining all models benefited, Translatotron2+ gained most
  • Translatotron2+ achieved new state-of-the-art S2ST result
  • UnitY had advantage of decoding efficiency

Cvss-c

  • Table 2 and 3 show results of different models
  • UnitY outperformed S2UT model by 1.6 and 2.9 ASR-BLEU
  • Encoder pre-training with S2TT model improved ASR-BLEU scores of all direct S2ST models
  • Translatotron2+ achieved similar translation quality to UnitY
  • UnitY with text decoder pre-training improved S2UT model by 1.3 and 2.5 ASR-BLEU
  • UnitY and Translatotron2+ showed mixed results in different directions
  • Text decoder pre-training helped Translatotron2+ performance
  • UnitY approached performance of strong cascaded system and outperformed it on Must-C

Decoding efficiency

Analysis

  • Conducted analyses to understand improvements in UnitY
  • Studied if same techniques used for UnitY are helpful for Translatotron2+
  • Used multidomain Es→En corpus, excluding pseudo-labeled ASR data
  • Reported average dev scores over three runs with different random seeds

Ablation study

  • Conducted an ablation study for two-pass direct S2ST models
  • Additional T2U/T2S encoder essential for bridging gap in representations
  • R-Drop beneficial for boosting translation quality of first-pass decoder
  • Investigated adding another cross-attention over speech encoder output to unit decoder
  • Parallel and sequential cross-attention did not show any improvement

Output unit for first-pass decoder

  • Studied optimal granularity of output unit for first-pass decoder in two-pass direct S2ST models
  • Subword unit (E6) most effective for first-pass decoder, better translation quality and largest decoding speed-up

Pre-training first-pass decoder

  • Investigated pre-training text decoder with MT model
  • Found pre-training with vanilla mBART or unsupervised MT model most effective
  • Pre-training with supervised MT models did not improve performance
  • Pretraining part of UnitY with S2TT model not helpful

Capacity assignment to two-pass decoders

  • Assigning model capacity to two decoders in UnitY improved translation quality.
  • 12-layer text decoder with two-layer unit decoder was best in translation quality and decoding speed.
  • Pretraining unit decoder with unit-based mBART did not improve performance further.
  • Most effective to pre-train deep text decoder only and keep unit decoder shallow.

Data scale

  • UnitY outperformed Translatotron2+ and S2UT models when data size was 50 hours or more.
  • Text decoder pre-training became less effective as data size increased.
  • Pre-training text decoder of UnitY was essential for decent performance in low-resource settings (≤ 50 hours).

Human evaluation

  • Conducted an audio-only human evaluation to assess translation quality
  • Used crosslingual semantic textual similarity (XSTS)
  • Used mTEDx test set (989 samples)
  • Generated target audio from S2ST systems and reference translations
  • Evaluated by three bilingual annotators, assigned score from one to five
  • UnitY outperformed cascaded and S2UT models in both metrics

Two-pass sequence generation

  • Incorporate additional search process to find better output
  • Rescore intermediate hypotheses using external module
  • Inject specific information to bias output
  • Provide intermediate output to users for streaming applications
  • Two-pass approach makes optimization tractable, improving performance of speech translation models

Direct speech-to-spectrogram translation

  • Direct speech-to-spectrogram translation models predict spectrogram in the target language from the source speech.
  • Translatotron (Jia et al., 2019b) was the first direct S2ST model but had poor performance.
  • Kano et al., 2021 pre-trained the components with ASR and S2TT models, which was more effective.
  • Translatotron2 (Jia et al., 2022b) improved Translatotron by incorporating two-pass decoding.

Direct speech-to-unit translation

  • Direct speech-to-unit translation models predict discrete units instead of spectrograms.
  • Lee et al. (2022b) normalizes speaker identity of real target speech using a CTC-based speech-to-unit model.
  • Huang et al. (2022) further improves normalization by considering rhythm, pitch, and energy.

Conclusion

  • Proposed UnitY, a novel efficient two-pass direct S2ST model
  • Improved model performance by predicting subwords in the first pass, bridging decoder representations by an additional encoder, deep-shallow two-pass decoders, regularizing the training with R-Drop, and pre-training the first-pass decoder with mBART
  • UnitY outperformed a single-pass S2UT model consistently in translation accuracy and inference speed
  • Limitation: target audio quality depends on quality of discrete units generated by self-supervised discrete models
  • Algorithm 1: Two-pass beam search decoding
  • Mathematical formulation of R-Drop
  • Speech: 16kHz, 22kHz, 80-dimensional coefficients, utterance-level cepstral mean-variance normalization
  • Text: lowercase, remove punctuation, 1k, 6k, 2k unigram subword units
  • Data filtering: thresholding with ratio of sequence length of discrete units over number of corresponding source speech frames
  • Pre-trained self-supervised models and TTS models: wav2vec2.0, w2v-BERT, mBART, unit-based mBART
  • Training objectives: primary S2ST/S2UT task, auxiliary S2TT and ASR tasks
  • Autoregressive decoder of Transformer TTS as spectrogram decoder
  • CTC loss on top of unit decoder
  • Model architecture of UnitY
  • Runtime of direct S2ST models
  • Dev ASR-BLEU at different data scales
  • Ablation study for two-pass direct S2ST models
  • Pre-training strategies for first-pass decoder in UnitY
  • Capacity assignment to two-pass decoders in UnitY
  • Statistics for multi-domain En↔Es corpora