Abstract
- Direct speech-to-speech translation (S2ST) is advantageous for fast inference with a simplified pipeline.
- UnitY is a novel two-pass direct S2ST architecture.
- UnitY is enhanced by subword prediction, advanced two-pass decoder architecture design and search strategy, and better training regularization.
- UnitY is pre-trained with a self-supervised denoising auto-encoding task.
- UnitY outperforms a single-pass speech-to-unit translation model.
- UnitY achieves 2.51x decoding speed-up compared to predicting spectrogram in the second pass.
Paper Content
Introduction
- Automatic speech translation is an important technology for international communication.
- Traditional approach to speech-to-speech translation is to use separate components for speech recognition, machine translation, and text-to-speech.
- Sequence-to-sequence models have made it possible to use a single architecture with fewer components.
- Direct approach is attractive for low-latency systems and reducing development costs.
- Poor performance of direct S2ST models is due to data scarcity.
- Data shortage has been addressed by pre-training, multi-task learning, pseudo labeling, and knowledge distillation.
- Recent works propose to model discrete acoustic units instead of a continuous speech signal.
- UnitY is a two-pass direct S2ST model that generates subwords and discrete acoustic units.
- UnitY achieves better translation quality and decoding efficiency than other models.
- UnitY is pre-trained with multilingual BART at the subword level.
Architecture
- $N_{1st}$, $N_{2nd}$, and $N_{t2u}$ are the depths of the first-pass decoder, second-pass decoder, and T2U encoder of UnitY, respectively
- Training objective of the first pass is to minimize the direct S2TT loss
- Generating subwords instead of phonemes has five advantages
- T2U encoder bridges the gap in representations between text and unit decoders
- Training objective of the second pass is to minimize $\mathcal{L}_{s2u}$ while being conditioned on $Y$
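The two training objectives above can be written out as a combined loss. This is a sketch under assumptions, not a quotation from the paper: $X$ denotes the source speech, $Y$ the first-pass subword sequence, $U$ the discrete unit sequence, $\mathcal{L}_{s2t}$ the first-pass S2TT loss, and $w_{s2t}$ a weight on that loss. The name $\mathcal{L}_{s2t}$, the weight symbol, and the per-token normalization are assumed here, following the usual negative log-likelihood form.

```latex
\begin{align}
\mathcal{L}_{total} &= \mathcal{L}_{s2u}(U \mid X, Y) + w_{s2t}\,\mathcal{L}_{s2t}(Y \mid X) \\
\mathcal{L}_{s2t}(Y \mid X) &= -\frac{1}{|Y|}\sum_{i=1}^{|Y|} \log P(y_i \mid X, Y_{<i}) \\
\mathcal{L}_{s2u}(U \mid X, Y) &= -\frac{1}{|U|}\sum_{i=1}^{|U|} \log P(u_i \mid X, Y, U_{<i})
\end{align}
```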
Training with R-Drop
- UnitY introduces an intermediate S2TT sub-task to make optimization tractable while maintaining end-to-end differentiability.
- S2TT task is more likely to overfit than S2UT task.
- Regularization based on R-Drop is applied to first-pass decoder to tackle overfitting.
- R-Drop reduces inconsistency of model predictions between training and inference, improving generalization ability.
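To make the R-Drop idea concrete, here is a minimal pure-Python sketch for a single classification step. It assumes the standard R-Drop recipe: two forward passes of the same input with independent dropout masks, the sum of both negative log-likelihoods, and a symmetric KL term between the two predicted distributions. The one-layer `toy_forward` model, the weight `alpha`, and the dropout rate are hypothetical and only illustrate the loss; they are not the UnitY first-pass decoder.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(p, target):
    # Negative log-likelihood of the target class.
    return -math.log(p[target] + 1e-12)


def toy_forward(features, weights, drop_prob, rng):
    # Hypothetical one-layer model with inverted dropout on the inputs.
    kept = [0.0 if rng.random() < drop_prob else f / (1.0 - drop_prob) for f in features]
    logits = [sum(w * x for w, x in zip(row, kept)) for row in weights]
    return softmax(logits)


def r_drop_loss(features, weights, target, alpha=1.0, drop_prob=0.1, rng=None):
    rng = rng or random.Random(0)
    # Two passes over the same input; dropout makes the predictions differ.
    p1 = toy_forward(features, weights, drop_prob, rng)
    p2 = toy_forward(features, weights, drop_prob, rng)
    nll = cross_entropy(p1, target) + cross_entropy(p2, target)
    # Symmetric KL penalizes disagreement between the two passes.
    kl = 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))
    return nll + alpha * kl
```

With dropout disabled the two passes coincide, the KL term vanishes, and the loss reduces to twice the ordinary cross-entropy. That is one way to see that the extra term only penalizes dropout-induced inconsistency.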
Text decoder pre-training
- ASR and S2TT studies benefit from self-supervised pre-training
- Speech encoder can be pre-trained with wav2vec2.0
- Unit decoder can be initialized with unit-based mBART pre-trained with unlabeled speech data
- Text decoder can be initialized with text-based mBART pre-trained with unlabeled text data
- FFN of text decoder is frozen during S2ST fine-tuning
- T2U encoder and second-pass unit decoder are initialized randomly
Search algorithm
- Perform two-pass beam search decoding during inference
- First-pass decoder uses beam search with beam size $B_{1st}$
- Second-pass decoder uses beam search with beam size $B_{2nd}$
- Assign larger beam size to first pass ($B_{1st} > B_{2nd}$) for more diversity and reduced computation time
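The two-pass search described above can be sketched with toy stand-ins for the two decoders. The sketch assumes that the single best first-pass text hypothesis is passed on to the second pass, and that beam search is applied without length normalization. The lookup tables, `toy_text_step`, and `make_toy_unit_step` are hypothetical. In the real model the second pass is conditioned on first-pass decoder states through the T2U encoder rather than on token strings.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]
StepFn = Callable[[Prefix], Dict[str, float]]  # prefix -> {next token: log-prob}
EOS = "</s>"


def beam_search(step_fn: StepFn, beam_size: int, max_len: int) -> List[str]:
    """Plain beam search without length normalization; returns the best hypothesis."""
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    finished: List[Tuple[Prefix, float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                hyp = (prefix + (token,), score + logp)
                # Hypotheses ending in EOS leave the beam and are kept aside.
                (finished if token == EOS else candidates).append(hyp)
        if not candidates:
            break
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_size]
    best, _ = max(finished or beams, key=lambda h: h[1])
    return [t for t in best if t != EOS]


# Hypothetical first-pass model: the greedy first token ("a") is not on the best path.
_TEXT_TABLE = {
    (): {"a": math.log(0.6), "the": math.log(0.4)},
    ("a",): {"cat": math.log(0.55), "dog": math.log(0.45)},
    ("the",): {"cat": math.log(0.9), "dog": math.log(0.1)},
}


def toy_text_step(prefix: Prefix) -> Dict[str, float]:
    return _TEXT_TABLE.get(prefix, {EOS: 0.0})


# Hypothetical word-to-unit lexicon standing in for the T2U encoder and unit decoder.
_UNIT_LEXICON = {"a": ["5"], "the": ["11", "7"], "cat": ["42", "42", "3"], "dog": ["8", "19"]}


def make_toy_unit_step(text: List[str]) -> StepFn:
    target = [u for word in text for u in _UNIT_LEXICON[word]]

    def step(prefix: Prefix) -> Dict[str, float]:
        if len(prefix) < len(target):
            return {target[len(prefix)]: math.log(0.9), "0": math.log(0.1)}
        return {EOS: 0.0}

    return step


def two_pass_decode(text_step, make_unit_step, b_1st=10, b_2nd=1,
                    max_text_len=20, max_unit_len=100):
    # First pass: wide beam over subwords; second pass: narrow beam over units.
    text = beam_search(text_step, b_1st, max_text_len)
    units = beam_search(make_unit_step(text), b_2nd, max_unit_len)
    return text, units
```

In this toy setup a first-pass beam of 1 commits to "a" and ends at "a cat", while a beam of 2 recovers the higher-scoring "the cat". The unit sequence is far longer than the text, so a small second-pass beam keeps the cost of the longer loop low.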
Deep-shallow two-pass decoders
- Assigning more model capacity to the first-pass decoder than to the second-pass decoder is referred to as deep-shallow two-pass decoders.
- This capacity assignment improves translation quality and inference efficiency.
Experimental setting
- Describes the experimental settings for the experiments in §4; details are given in the subsections below
Data
- Used three datasets: Fisher Es→En, CVSS-C, and multidomain En↔Es
- Fisher Es→En contains 170-hour Spanish conversational telephone speech with transcriptions and English translations
- CVSS-C is a multilingual S2ST corpus based on CoVoST2
- Popuri et al. (2022) used multiple public S2TT corpora and augmented with ASR corpora
- En→Es augmented with Europarl-ST, Must-C, TEDLIUM3, Librispeech, and Common Voice
- Es→En augmented with CoVoST2, Europarl-ST, mTEDx, Common Voice, and multilingual Librispeech
Pre-processing
- Pre-processing for acoustic feature extraction, discrete unit extraction, and text normalization
- Discarded over-generated target speech/unit by TTS/T2U models
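The filtering step can be illustrated with a short sketch. It follows the criterion stated later in these notes: threshold the ratio of the discrete-unit sequence length to the number of source speech frames. The `Sample` record, the field names, and the threshold value used in the test are hypothetical, since the notes do not give the actual cutoff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    num_source_frames: int  # number of source speech frames
    units: List[int]        # discrete units produced for the target side


def filter_over_generated(samples: List[Sample], max_ratio: float) -> List[Sample]:
    """Keep samples whose unit-to-frame length ratio does not exceed max_ratio."""
    kept = []
    for s in samples:
        if s.num_source_frames <= 0:
            continue  # skip degenerate entries instead of dividing by zero
        ratio = len(s.units) / s.num_source_frames
        if ratio <= max_ratio:
            kept.append(s)
    return kept
```

A target that is far longer than its source suggests the TTS or T2U model looped or babbled, so a simple length ratio is a cheap proxy for over-generation.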
Pre-training
- Used same models as Popuri et al. (2022)
- Trained multilingual w2v-BERT model on 51 languages with same setting as Jia et al. (2022a)
- Used same En-Es and 50-language mBART models as Wang et al. (2022) and Tang et al. (2020)
Baseline
- Built two cascaded S2ST systems and four direct S2ST systems
- All speech encoders based on Conformer
- Pre-trained speech encoder with wav2vec2.0/w2v-BERT
- Applied R-Drop to all models predicting discrete symbols
- Combined Conformer ASR, Transformer MT, and Transformer TTS model
- Pre-trained S2TT’s decoder with mBART
- Improved version of Translatotron2
- Replaced phoneme targets with subwords, LSTM decoders with Transformer decoders, and added T2S encoder
- Autoregressive Transformer decoder instead of NAT for second-pass spectrogram decoder
- Applied R-Drop to first-pass decoder
Vocoder
- HiFi-GAN vocoder is used to convert spectrograms to waveform for TTS and direct speech-to-spectrogram models
- Unit-based HiFi-GAN vocoder is used to convert discrete units to waveform for direct speech-to-unit models
- Both vocoders are trained separately
Training
- Optimized models with mixed precision training
- Implemented models based on Fairseq toolkit
- Detailed training hyperparameters in Appendix H
Decoding
- ASR, S2TT, and S2UT models use a beam width of 10
- UnitY uses a beam width of 10 for $B_{1st}$ and 1 for $B_{2nd}$
- Translatotron2+ uses a beam width of 10 for the first-pass decoder
Evaluation
- Used pre-trained ASR model to transcribe generated target speech
- Calculated BLEU scores using sacrebleu toolkit
- ASR model fine-tuned from wav2vec2.0 with CTC objective
- Reference target translation normalized with lowercasing, removal of punctuation, conversion of digits to spoken forms, and removal of non-verbal words
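The reference normalization listed above can be sketched as a small function. Two simplifications are assumed: digits are spelled out one at a time (so "42" becomes "four two", not "forty two"), and non-verbal words are taken to be bracketed or parenthesized tags such as "(laughter)". Both are hypothetical conventions for illustration, not the normalizer used in the paper.

```python
import re
import string

# Assumed convention: non-verbal tokens appear as (laughter) or [noise].
_NON_VERBAL = re.compile(r"\([^)]*\)|\[[^\]]*\]")
_DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_reference(text: str) -> str:
    text = text.lower()
    text = _NON_VERBAL.sub(" ", text)
    # Spell out each ASCII digit separately (a simplification, see lead-in).
    text = re.sub(r"[0-9]", lambda m: f" {_DIGITS[m.group()]} ", text)
    text = text.translate(_STRIP_PUNCT)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())
```

The normalized references and the ASR transcripts of the generated speech would then be scored together, for example with `sacrebleu.corpus_bleu(hypotheses, [references])`.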
Experimental results
- Three corpora studied from perspective of target representation and decoder architectures
- Table 1 shows results on Fisher
- Four direct systems trained from scratch outperformed previous studies
- UnitY achieved best ASR-BLEU scores
- Two-pass decoding improved results, but targeting discrete units also helped
- Direct models outperformed cascaded system
- All models benefited from pre-training; Translatotron2+ gained the most
- Translatotron2+ achieved new state-of-the-art S2ST result
- UnitY had advantage of decoding efficiency
CVSS-C
- Tables 2 and 3 show results of different models
- UnitY outperformed S2UT model by 1.6 and 2.9 ASR-BLEU
- Encoder pre-training with S2TT model improved ASR-BLEU scores of all direct S2ST models
- Translatotron2+ achieved similar translation quality to UnitY
- UnitY with text decoder pre-training improved S2UT model by 1.3 and 2.5 ASR-BLEU
- UnitY and Translatotron2+ showed mixed results in different directions
- Text decoder pre-training helped Translatotron2+ performance
- UnitY approached performance of strong cascaded system and outperformed it on Must-C
Decoding efficiency
Analysis
- Conducted analyses to understand improvements in UnitY
- Studied if same techniques used for UnitY are helpful for Translatotron2+
- Used multidomain Es→En corpus, excluding pseudo-labeled ASR data
- Reported average dev scores over three runs with different random seeds
Ablation study
- Conducted an ablation study for two-pass direct S2ST models
- Additional T2U/T2S encoder essential for bridging gap in representations
- R-Drop beneficial for boosting translation quality of first-pass decoder
- Investigated adding another cross-attention over speech encoder output to unit decoder
- Parallel and sequential cross-attention did not show any improvement
Output unit for first-pass decoder
- Studied optimal granularity of output unit for first-pass decoder in two-pass direct S2ST models
- Subword unit (E6) was most effective for the first-pass decoder, giving better translation quality and the largest decoding speed-up
Pre-training first-pass decoder
- Investigated pre-training text decoder with MT model
- Found pre-training with vanilla mBART or unsupervised MT model most effective
- Pre-training with supervised MT models did not improve performance
- Pretraining part of UnitY with S2TT model not helpful
Capacity assignment to two-pass decoders
- Assigning more model capacity to the text decoder than to the unit decoder in UnitY improved translation quality.
- 12-layer text decoder with two-layer unit decoder was best in translation quality and decoding speed.
- Pretraining unit decoder with unit-based mBART did not improve performance further.
- Most effective to pre-train deep text decoder only and keep unit decoder shallow.
Data scale
- UnitY outperformed Translatotron2+ and S2UT models when data size was 50 hours or more.
- Text decoder pre-training became less effective as data size increased.
- Pre-training text decoder of UnitY was essential for decent performance in low-resource settings (≤ 50 hours).
Human evaluation
- Conducted an audio-only human evaluation to assess translation quality
- Used crosslingual semantic textual similarity (XSTS)
- Used mTEDx test set (989 samples)
- Generated target audio from S2ST systems and reference translations
- Evaluated by three bilingual annotators, assigned score from one to five
- UnitY outperformed cascaded and S2UT models in both metrics
Two-pass sequence generation
- Incorporate additional search process to find better output
- Rescore intermediate hypotheses using external module
- Inject specific information to bias output
- Provide intermediate output to users for streaming applications
- Two-pass approach makes optimization tractable, improving performance of speech translation models
Direct speech-to-spectrogram translation
- Direct speech-to-spectrogram translation models predict spectrogram in the target language from the source speech.
- Translatotron (Jia et al., 2019b) was the first direct S2ST model but had poor performance.
- Kano et al. (2021) pre-trained the components with ASR and S2TT models, which was more effective.
- Translatotron2 (Jia et al., 2022b) improved Translatotron by incorporating two-pass decoding.
Direct speech-to-unit translation
- Direct speech-to-unit translation models predict discrete units instead of spectrograms.
- Lee et al. (2022b) normalize speaker identity of real target speech using a CTC-based speech-to-unit model.
- Huang et al. (2022) further improve normalization by considering rhythm, pitch, and energy.
Conclusion
- Proposed UnitY, a novel efficient two-pass direct S2ST model
- Improved model performance by predicting subwords in the first pass, bridging decoder representations by an additional encoder, deep-shallow two-pass decoders, regularizing the training with R-Drop, and pre-training the first-pass decoder with mBART
- UnitY outperformed a single-pass S2UT model consistently in translation accuracy and inference speed
- Limitation: target audio quality depends on quality of discrete units generated by self-supervised discrete models
- Algorithm 1: Two-pass beam search decoding
- Mathematical formulation of R-Drop
- Speech: 16kHz, 22kHz, 80-dimensional coefficients, utterance-level cepstral mean-variance normalization
- Text: lowercase, remove punctuation, 1k, 6k, 2k unigram subword units
- Data filtering: thresholding with ratio of sequence length of discrete units over number of corresponding source speech frames
- Pre-trained self-supervised models and TTS models: wav2vec2.0, w2v-BERT, mBART, unit-based mBART
- Training objectives: primary S2ST/S2UT task, auxiliary S2TT and ASR tasks
- Autoregressive decoder of Transformer TTS as spectrogram decoder
- CTC loss on top of unit decoder
- Model architecture of UnitY
- Runtime of direct S2ST models
- Dev ASR-BLEU at different data scales
- Ablation study for two-pass direct S2ST models
- Pre-training strategies for first-pass decoder in UnitY
- Capacity assignment to two-pass decoders in UnitY
- Statistics for multi-domain En↔Es corpora