Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Direct speech-to-speech translation (S2ST) is advantageous for fast inference with a simplified pipeline.
UnitY is a novel two-pass direct S2ST architecture.
UnitY is enhanced by subword prediction, advanced two-pass decoder architecture design and search strategy, and better training regularization.
UnitY is pre-trained with a self-supervised denoising auto-encoding task.
UnitY outperforms a single-pass speech-to-unit translation model.
UnitY achieves 2.51x decoding speed-up compared to predicting spectrogram in the second pass.

Paper Content

Introduction

Automatic speech translation is an important technology for international communication.
Traditional approach to speech-to-speech translation is to use separate components for speech recognition, machine translation, and text-to-speech.
Sequence-to-sequence models have made it possible to use a single architecture with fewer components.
Direct approach is attractive for low-latency systems and reducing development costs.
Poor performance of direct S2ST models is due to data scarcity.
Data shortage has been addressed by pre-training, multi-task learning, pseudo labeling, and knowledge distillation.
Recent works propose to model discrete acoustic units instead of a continuous speech signal.
UnitY is a two-pass direct S2ST model that generates subwords and discrete acoustic units.
UnitY achieves better translation quality and decoding efficiency than other models.
UnitY is pre-trained with multilingual BART at the subword level.

Architecture

N 1st , N 2nd , and N t2u are the depths of the first-pass decoder, second-pass decoder, and T2U encoder of UnitY
Training objective of the first pass is to minimize the direct S2TT loss
Generating subwords instead of phonemes has five advantages
T2U encoder bridges the gap in representations between text and unit decoders
Training objective of the second pass is to minimize L s2u while being conditioned on Y

Training with r-drop

UnitY introduces an intermediate S2TT sub-task to make optimization tractable while maintaining end-to-end differentiability.
S2TT task is more likely to overfit than S2UT task.
Regularization based on R-Drop is applied to first-pass decoder to tackle overfitting.
R-Drop reduces inconsistency of model predictions between training and inference, improving generalization ability.

Text decoder pre-training

ASR and S2TT studies benefit from self-supervised pre-training
Speech encoder can be pre-trained with wav2vec2.0
Unit decoder can be initialized with unit-based mBART pre-trained with unlabeled speech data
Text decoder can be initialized with text-based mBART pre-trained with unlabeled text data
FFN of text decoder is frozen during S2ST fine-tuning
T2U encoder and second-pass unit decoder are initialized randomly

Search algorithm

Perform two-pass beam search decoding during inference
First-pass decoder uses beam search with beam size B 1st
Second-pass decoder uses beam search with beam size B 2nd
Assign larger beam size to first pass (B 1st > B 2nd ) for more diversity and reduced computation time

Deep-shallow two-pass decoders

Assigning more model capacities to the first-pass decoder than the second-pass decoder is referred to as deepshallow two-pass decoders.
This capacity assignment improves translation quality and inference efficiency.

Experimental setting

Described experimental settings for experiments in §4
Details of experimental settings provided

Data

Used 3 datasets: Fisher Es→En, CVSS-C, and mutlidomain En↔Es
Fisher Es→En contains 170-hour Spanish conversational telephone speech with transcriptions and English translations
CVSS-C is a multilingual S2ST corpus based on CoVoST2
Popuri et al. (2022) used multiple public S2TT corpora and augmented with ASR corpora
En→Es augmented with Europarl-ST, Must-C, TEDLIUM3, Librispeech, and Common Voice
Es→En augmented with CoVoST2, Europarl-ST, mTEDx, Common Voice, and multilingual Librispeech

Pre-processing

Pre-processing for acoustic feature extraction, discrete unit extraction, and text normalization
Discarded over-generated target speech/unit by TTS/T2U models

Pre-training

Used same models as Popuri et al. (2022)
Trained multilingual w2v-BERT model on 51 languages with same setting as Jia et al. (2022a)
Used same En-Es and 50-language mBART models as Wang et al. (2022) and Tang et al. (2020)

Baseline

Built two cascaded S2ST systems and four direct S2ST systems
All speech encoders based on Conformer
Pre-trained speech encoder with wav2vec2.0/w2v-BERT
Applied R-Drop to all models predicting discrete symbols
Combined Conformer ASR, Transformer MT, and Transformer TTS model
Pre-trained S2TT’s decoder with mBART
Improved version of Translatotron2
Replaced phoneme targets with subwords, LSTM decoders with Transformer decoders, and added T2S encoder
Autoregressive Transformer decoder instead of NAT for second-pass spectrogram decoder
Applied R-Drop to first-pass decoder

Vocoder

HiFi-GAN vocoder is used to convert spectrograms to waveform for TTS and direct speech-to-spectrogram models
Unit-based HiFi-GAN vocoder is used to convert discrete units to waveform for direct speech-to-unit models
Both vocoders are trained separately

Training

Optimized models with mixed precision training
Implemented models based on Fairseq toolkit
Detailed training hyperparameters in Appendix H

Decoding

ASR, S2TT, and S2UT models use a beam width of 10
UnitY uses a beam width of 10 for B 1st and 1 for B 2nd
Translatotron2+ uses a beam width of 10 for the first-pass decoder

Evaluation

Used pre-trained ASR model to transcribe generated target speech
Calculated BLEU scores using sacrebleu toolkit
ASR model fine-tuned from wav2vec2.0 with CTC objective
Reference target translation normalized with lowercasing, removal of punctuation, conversion of digits to spoken forms, and removal of non-verbal words

Experimental results

Three corpora studied from perspective of target representation and decoder architectures
Table 1 shows results on Fisher
Four direct systems trained from scratch outperformed previous studies
UnitY achieved best ASR-BLEU scores
Two-pass decoding improved results, but targeting discrete units also helped
Direct models outperformed cascaded system
Pretraining all models benefited, Translatotron2+ gained most
Translatotron2+ achieved new state-of-the-art S2ST result
UnitY had advantage of decoding efficiency

Cvss-c

Table 2 and 3 show results of different models
UnitY outperformed S2UT model by 1.6 and 2.9 ASR-BLEU
Encoder pre-training with S2TT model improved ASR-BLEU scores of all direct S2ST models
Translatotron2+ achieved similar translation quality to UnitY
UnitY with text decoder pre-training improved S2UT model by 1.3 and 2.5 ASR-BLEU
UnitY and Translatotron2+ showed mixed results in different directions
Text decoder pre-training helped Translatotron2+ performance
UnitY approached performance of strong cascaded system and outperformed it on Must-C

Decoding efficiency

Analysis

Conducted analyses to understand improvements in UnitY
Studied if same techniques used for UnitY are helpful for Translatotron2+
Used multidomain Es→En corpus, excluding pseudo-labeled ASR data
Reported average dev scores over three runs with different random seeds

Ablation study

Conducted an ablation study for two-pass direct S2ST models
Additional T2U/T2S encoder essential for bridging gap in representations
R-Drop beneficial for boosting translation quality of first-pass decoder
Investigated adding another cross-attention over speech encoder output to unit decoder
Parallel and sequential cross-attention did not show any improvement

Output unit for first-pass decoder

Studied optimal granularity of output unit for first-pass decoder in two-pass direct S2ST models
Subword unit (E6) most effective for first-pass decoder, better translation quality and largest decoding speed-up

Pre-training first-pass decoder

Investigated pre-training text decoder with MT model
Found pre-training with vanilla mBART or unsupervised MT model most effective
Pre-training with supervised MT models did not improve performance
Pretraining part of UnitY with S2TT model not helpful

Capacity assignment to two-pass decoders

Assigning model capacity to two decoders in UnitY improved translation quality.
12-layer text decoder with two-layer unit decoder was best in translation quality and decoding speed.
Pretraining unit decoder with unit-based mBART did not improve performance further.
Most effective to pre-train deep text decoder only and keep unit decoder shallow.

Data scale

UnitY outperformed Translatotron2+ and S2UT models when data size was 50 hours or more.
Text decoder pre-training became less effective as data size increased.
Pre-training text decoder of UnitY was essential for decent performance in low-resource settings (≤ 50 hours).

Human evaluation

Conducted an audio-only human evaluation to assess translation quality
Used crosslingual semantic textual similarity (XSTS)
Used mTEDx test set (989 samples)
Generated target audio from S2ST systems and reference translations
Evaluated by three bilingual annotators, assigned score from one to five
UnitY outperformed cascaded and S2UT models in both metrics

Two-pass sequence generation

Incorporate additional search process to find better output
Rescore intermediate hypotheses using external module
Inject specific information to bias output
Provide intermediate output to users for streaming applications
Two-pass approach makes optimization tractable, improving performance of speech translation models

Direct speech-to-spectrogram translation

Direct speech-to-spectrogram translation models predict spectrogram in the target language from the source speech.
Translatotron (Jia et al., 2019b) was the first direct S2ST model but had poor performance.
Kano et al., 2021 pre-trained the components with ASR and S2TT models, which was more effective.
Translatotron2 (Jia et al., 2022b) improved Translatotron by incorporating two-pass decoding.

Direct speech-to-unit translation

Direct speech-to-unit translation models predict discrete units instead of spectrograms.
Lee et al. (2022b) normalizes speaker identity of real target speech using a CTC-based speech-to-unit model.
Huang et al. (2022) further improves normalization by considering rhythm, pitch, and energy.

Conclusion

Proposed UnitY, a novel efficient two-pass direct S2ST model
Improved model performance by predicting subwords in the first pass, bridging decoder representations by an additional encoder, deep-shallow two-pass decoders, regularizing the training with R-Drop, and pre-training the first-pass decoder with mBART
UnitY outperformed a single-pass S2UT model consistently in translation accuracy and inference speed
Limitation: target audio quality depends on quality of discrete units generated by self-supervised discrete models
Algorithm 1: Two-pass beam search decoding
Mathematical formulation of R-Drop
Speech: 16kHz, 22kHz, 80-dimensional coefficients, utterance-level cepstral mean-variance normalization
Text: lowercase, remove punctuation, 1k, 6k, 2k unigram subword units
Data filtering: thresholding with ratio of sequence length of discrete units over number of corresponding source speech frames
Pre-trained self-supervised models and TTS models: wav2vec2.0, w2v-BERT, mBART, unit-based mBART
Training objectives: primary S2ST/S2UT task, auxiliary S2TT and ASR tasks
Autoregressive decoder of Transformer TTS as spectrogram decoder
CTC loss on top of unit decoder
Model architecture of UnitY
Runtime of direct S2ST models
Dev ASR-BLEU at different data scales
Ablation study for two-pass direct S2ST models
Pre-training strategies for first-pass decoder in UnitY
Capacity assignment to two-pass decoders in UnitY
Statistics for multi-domain En↔Es corpora

Link to paper#

Abstract#

Paper Content#

Introduction#

Architecture#

Training with r-drop#

Text decoder pre-training#

Search algorithm#

Deep-shallow two-pass decoders#

Experimental setting#

Data#

Pre-processing#

Pre-training#

Baseline#

Vocoder#

Training#

Decoding#

Evaluation#

Experimental results#

Cvss-c#

Decoding efficiency#

Analysis#

Ablation study#

Output unit for first-pass decoder#

Pre-training first-pass decoder#

Capacity assignment to two-pass decoders#

Data scale#

Human evaluation#

Two-pass sequence generation#

Direct speech-to-spectrogram translation#

Direct speech-to-unit translation#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Architecture

Training with r-drop

Text decoder pre-training

Search algorithm

Deep-shallow two-pass decoders

Experimental setting

Data

Pre-processing

Pre-training

Baseline

Vocoder

Training

Decoding

Evaluation

Experimental results

Cvss-c

Decoding efficiency

Analysis

Ablation study

Output unit for first-pass decoder

Pre-training first-pass decoder

Capacity assignment to two-pass decoders

Data scale

Human evaluation

Two-pass sequence generation

Direct speech-to-spectrogram translation

Direct speech-to-unit translation

Conclusion