Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Mu$^{2}$SLAM is a multilingual sequence-to-sequence model pre-trained on speech, text and supervised data.
Mu$^{2}$SLAM uses a sequence-to-sequence masked denoising objective and a masked language modeling objective.
Mu$^{2}$SLAM establishes a new state-of-the-art on CoVoST AST for models trained on public datasets.
Mu$^{2}$SLAM matches the performance of an mSLAM model on VoxPopuli ASR.
Mu$^{2}$SLAM improves by more than 6% over mSLAM on XNLI.

Paper Content

Introduction

NLP has seen success in unified text models for understanding and generation tasks
Pre-training methods in speech have moved towards unified models
Most models focus on speech-related tasks and ignore text-related benchmarks
Few studies investigate multilingual modeling with both speech and text
Multitask learning is understudied in speech-text pre-training
Models design modality-specific blocks and losses to yield high performance
Proposed multi-task multilingual pre-training method for speech and text
Language coverage spans more than 100 mainstream spoken languages
Unify pre-training losses for unlabeled and labeled data
Minimize number of modality-specific layers
Gradual fine-tuning and noisy fine-tuning proposed
Results show competitive performance on speech and text tasks

Approach

Proposed a multi-task multilingual pre-training method for speech and text
Considered four types of data: speech-only, text-only, speech-text, and text-text
Unified training examples into sequence-to-sequence format
Applied similar optimization objectives on encoder and decoder
Combined losses on unlabeled and labeled data to pre-train speech-text models

Model architecture

Mu$^{2}$SLAM is a multi-task multilingual speech-text pre-training method
Speech inputs are converted into a sequence of latent speech representations via a CNN block
Text inputs go through a token embedding layer
Language and modality embeddings are added to word embeddings or speech representations
Shared multi-modal encoder-decoder model is used
Deep encoder with 24 Conformer layers and shallow decoder with 6 Transformer layers
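The input path described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the dimension `MODEL_DIM`, the table sizes, the modality ids, and all function names are hypothetical, the embedding tables are random stand-ins for learned weights, and the CNN block is assumed to have already produced the latent speech frames.

```python
import random

MODEL_DIM = 8          # hypothetical; the real model dimension is not given here
SPEECH, TEXT = 0, 1    # hypothetical modality ids


def make_table(num_rows, dim, seed):
    """Randomly initialised embedding table (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


token_table = make_table(100, MODEL_DIM, seed=0)    # text token embeddings
language_table = make_table(4, MODEL_DIM, seed=1)   # one row per language
modality_table = make_table(2, MODEL_DIM, seed=2)   # one row per modality


def add_side_embeddings(vectors, language_id, modality_id):
    """Add the language and modality embeddings to every position."""
    lang = language_table[language_id]
    mod = modality_table[modality_id]
    return [[v + l + m for v, l, m in zip(vec, lang, mod)] for vec in vectors]


def embed_text(token_ids, language_id):
    """Text path: token embedding lookup, then language + modality embeddings."""
    return add_side_embeddings([token_table[t] for t in token_ids], language_id, TEXT)


def embed_speech(latent_frames, language_id):
    """Speech path: latent frames are assumed to come from the CNN block already."""
    return add_side_embeddings(latent_frames, language_id, SPEECH)


text_inputs = embed_text([5, 17, 42], language_id=1)
speech_inputs = embed_speech([[0.0] * MODEL_DIM for _ in range(4)], language_id=1)
```

Both paths end in vectors of the same width, which is what lets a single shared encoder-decoder consume either modality.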

Speech tokenization

Proposed speech-text pre-training approach treats speech inputs as an additional language.
Speech tokenizer network quantizes continuous speech representations into discrete ids.
Speech representation vector is projected into a discrete id by finding the nearest neighbour in the speech codebook.
Parameters of the speech tokenizer are learned from scratch by a contrastive loss.
The pretrained speech tokenizer from mSLAM is reused and kept constant during model training.
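The nearest-neighbour lookup described above can be illustrated with a toy codebook. This is a minimal sketch, not the tokenizer used in the paper: the squared Euclidean distance, the two-dimensional vectors, and the function names are assumptions made for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance (the metric is an assumption of this sketch)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    return min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))


def tokenize_speech(representations, codebook):
    """Map a sequence of continuous speech vectors to discrete ids."""
    return [quantize(vec, codebook) for vec in representations]


# Toy codebook with three entries and three speech frames (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]
speech_ids = tokenize_speech(frames, codebook)  # -> [0, 1, 2]
```

The resulting ids play the same role for speech that vocabulary ids play for text, which is what allows speech to be handled like an additional language.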

Pre-training objectives

Four different training sets related to speech and/or text: $D_s$, $D_t$, $D_{st}$, $D_{tt}$
Unify pre-training losses for unlabeled and labeled data
Masking vector m randomly constructed from prior distribution
Loss on unlabeled data computed using source-target pairs (x, x)
Loss on labeled data computed using forward and backward sequence-to-sequence loss
Alignment loss on encoder and decoder to align representations between different languages and modalities
CTC loss activated on ASR data
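The way unlabeled and labeled data are cast into the same source-target form can be sketched as follows. This is an illustration under assumptions: the masking vector is drawn with independent per-position probabilities (the source only says it comes from a prior distribution), the `<mask>` symbol and all function names are hypothetical, and the losses themselves are not computed.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def sample_mask(length, ratio, rng):
    """Masking vector m: m[i] is True when position i is masked.

    Independent Bernoulli draws are an assumption of this sketch.
    """
    return [rng.random() < ratio for _ in range(length)]


def apply_mask(tokens, mask):
    """Replace masked positions with the mask symbol."""
    return [MASK if m else tok for tok, m in zip(tokens, mask)]


def unlabeled_example(x, ratio, rng):
    """Unlabeled data: the pair (x, x), with the source side corrupted."""
    m = sample_mask(len(x), ratio, rng)
    return {"source": apply_mask(x, m), "target": list(x), "mask": m}


def labeled_examples(x, y):
    """Labeled data: a forward (x -> y) and a backward (y -> x) pair."""
    return [
        {"source": list(x), "target": list(y)},
        {"source": list(y), "target": list(x)},
    ]


rng = random.Random(0)
denoising = unlabeled_example(["the", "cat", "sat", "down"], ratio=0.5, rng=rng)
translation = labeled_examples(["hello"], ["bonjour"])
```

Because both kinds of data end up as source-target pairs, the same sequence-to-sequence model can be trained on all four datasets without separate task-specific heads.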

Fine-tuning

Fine-tuning method is used to unlock the capability of a strong pre-trained model
Direct fine-tuning is used to adapt a pretrained model to a specific downstream task
Gradual fine-tuning is used to mitigate the discrepancy between pre-training and fine-tuning
Noisy fine-tuning is used to prevent the pretrained model from quickly overfitting
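One possible reading of noisy fine-tuning is sketched below. The summary does not say which inputs are perturbed or what the noise looks like, so everything here is an assumption: the sketch replaces a random fraction of token ids with random vocabulary ids, and the function name and parameters are hypothetical.

```python
import random


def add_noise(token_ids, vocab_size, noise_ratio, rng):
    """Return a copy of `token_ids` with roughly `noise_ratio` of positions
    replaced by random ids (assumed noise model, not taken from the paper)."""
    noisy = list(token_ids)
    for i in range(len(noisy)):
        if rng.random() < noise_ratio:
            noisy[i] = rng.randrange(vocab_size)
    return noisy


rng = random.Random(0)
clean = [11, 12, 13, 14, 15, 16, 17, 18]
noisy = add_noise(clean, vocab_size=100, noise_ratio=0.06, rng=rng)
```

With a small ratio most positions are left untouched, so the model still sees largely clean inputs while being discouraged from memorising them.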

Experiments

Data comes from WMT and TED translation tasks
Data is similar to mSLAM (Bapna et al., 2022)
Because the language distribution is highly skewed, temperature-based data sampling is applied with a temperature of 2
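Temperature-based sampling is commonly implemented by raising each language's data share to the power 1/T and renormalising. The sketch below assumes that common form, which the summary does not spell out; the dataset sizes and the function name are hypothetical.

```python
def temperature_sampling_probs(sizes, temperature=2.0):
    """Sampling probability per language: p_i proportional to (n_i / N) ** (1 / T)."""
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Hypothetical example: one high-resource and one low-resource language.
sizes = {"en": 900, "sw": 100}
probs = temperature_sampling_probs(sizes, temperature=2.0)
```

In this toy case the low-resource language holds 10% of the data but is sampled about 25% of the time, which shows how a temperature above 1 flattens a skewed distribution.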

Models and hyperparameters

Model setup uses Conformer and Transformer layers with model dimension, hidden dimension, attention heads, dropout, learning schedule, and Adam optimizer
Batch sizes per TPU for speech-only, text-only, AST, ASR, and MT data are 4, 8, 1, 1, and 1
Masking ratio for speech frames is 50%, for text inputs is 50%, and for MT tasks is 25%
Loss coefficients for speech-only and text-only data are 1, for text to speech and alignment tasks are 0.1, and for speech to text tasks are 0.3
Pre-train two sets of speech-text models, one with a vocabulary of 4096 characters (Mu$^{2}$SLAM-char) and one with 64k word pieces (Mu$^{2}$SLAM-spm)
Fine-tune on CoVoST-2, VoxPopuli, and XTREME benchmarks
Hyperparameters tuned include batch sizes, learning rates, dropout ratios, and warm-up steps
Batch sizes for AST, ASR, and MT are 4, 2, and 2, respectively
Noise ratio for AST, ASR, and MT is 0.06
Fine-tuning experiments conducted on 64 TPUv4 chips
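The loss coefficients listed above can be read as weights in a combined training objective. The sketch below assumes the total loss is a plain weighted sum, which the summary does not state explicitly; the dictionary keys and the function name are hypothetical labels for the loss terms.

```python
# Coefficients as listed above; the key names are hypothetical labels.
LOSS_COEFFICIENTS = {
    "speech_only": 1.0,
    "text_only": 1.0,
    "text_to_speech": 0.1,
    "alignment": 0.1,
    "speech_to_text": 0.3,
}


def combine_losses(losses, coefficients=LOSS_COEFFICIENTS):
    """Weighted sum of the loss terms present in `losses` (assumed combination rule)."""
    unknown = set(losses) - set(coefficients)
    if unknown:
        raise KeyError(f"no coefficient for loss terms: {sorted(unknown)}")
    return sum(coefficients[name] * value for name, value in losses.items())


# Hypothetical per-batch loss values.
total = combine_losses({"speech_only": 2.0, "speech_to_text": 1.0})
```

Down-weighting the text-to-speech and alignment terms relative to the unlabeled-data losses keeps any single auxiliary objective from dominating the update.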

Multilingual speech translation

Direct multilingual fine-tuning with xx-en or en-xx language pairs can obtain better performance than XLS-R and 0.6B mSLAM models
Multi-task multilingual fine-tuning can achieve better results on xx-en but lower scores on en-xx
Gradual fine-tuning can improve model performance
Mu$^{2}$SLAM-spm gains more from gradual fine-tuning
Mu$^{2}$SLAM-spm is better for xx-en translation, Mu$^{2}$SLAM-char is better for en-xx

Multilingual speech recognition

Model outperforms baseline model in all languages
Average improvement of 8.2%

Multilingual text understanding

Investigated capability of speech-text pretrained models on XTREME multilingual dataset
Mu$^{2}$SLAM-spm performs better than Mu$^{2}$SLAM-char in the zero-shot setting
Mu$^{2}$SLAM models are not able to surpass mT5-Base
Mu$^{2}$SLAM models deliver good results on English but worse results on non-English languages
Issue attributed to language embeddings not specializing in generation outside training data

Analysis

Experiments conducted to study effect of speech-text and text-text labeled data
AST data enabled for fine-tuning pre-trained models
Best model comes from using all available speech and text paired data
Removing ASR data improves model performance on en-xx directions
Multi-task pre-training beneficial to learning general speech-text representations
In noisy fine-tuning, raising the noise ratio from 0 to 0.06 changes xx-en results drastically but brings only subtle improvements for en-xx

Related work

Pre-training methods have been used to exploit unlabeled data in NLP and speech
Examples of pre-training methods include BERT, XLNET, T5, MASS, wav2vec, and Hubert
Research is moving towards speech-text joint training
Mu$^{2}$SLAM is an encoder-decoder backbone model with a CNN block used to extract speech representations
Mu$^{2}$SLAM pre-trains the model from scratch
Multilingual pre-training is being used to learn joint representations across multiple languages
Multi-task learning is used to improve model generalization performance

Conclusion

Proposed the Mu$^{2}$SLAM pre-training method for joint speech and text models
Uses a full encoder-decoder model as the backbone
Pre-trained models span more than 100 languages in both speech and text
Uses unlabeled and labeled data, ranging from speech-only and text-only data to ASR, AST, and MT
Introduces two kinds of training objectives to unify unlabeled and labeled data in pre-training
Proposes gradual fine-tuning and noisy fine-tuning to improve model performance
Achieves strong results on CoVoST and comparable performance on VoxPopuli
Narrows gap between speech-text models and text-only models on text tasks
Future work to explore speech generation based on pre-trained models
Results on CoVoST, VoxPopuli, XNLI, TyDiQA-GoldP
