Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Mu$^{2}$SLAM is a multilingual sequence-to-sequence model pre-trained on speech, text and supervised data.
- Mu$^{2}$SLAM uses a sequence-to-sequence masked denoising objective and a masked language modeling objective.
- Mu$^{2}$SLAM establishes a new state-of-the-art for models trained on public datasets.
- Mu$^{2}$SLAM matches the performance of an mSLAM model on Voxpopuli ASR.
- Mu$^{2}$SLAM improves by more than 6% over mSLAM on XNLI.
Paper Content
Introduction
- NLP has seen success in unified text models for understanding and generation tasks
- Pre-training methods in speech have moved towards unified models
- Most models focus on speech-related tasks and ignore text-related benchmarks
- Few studies investigate multilingual modeling with both speech and text
- Multitask learning is understudied in speech-text pre-training
- Models design modality-specific blocks and losses to yield high performance
- Proposed multi-task multilingual pre-training method for speech and text
- Language type covers more than 100 mainstream spoken languages
- Unify pre-training losses for unlabeled and labeled data
- Minimize number of modality-specific layers
- Gradual fine-tuning and noisy fine-tuning proposed
- Results show competitive performance on speech and text tasks
Approach
- Proposed a multi-task multilingual pre-training method for speech and text
- Considered four types of data: speech-only, text-only, speech-text, and text-text
- Unified training examples into sequence-to-sequence format
- Applied similar optimization objectives on encoder and decoder
- Combined losses on unlabeled and labeled data to pre-train speech-text models
Model architecture
- Mu 2 SLAM is a multi-task multilingual speech-text pre-training method
- Speech inputs are converted into a sequence of latent speech representations via a CNN block
- Text inputs go through a token embedding layer
- Language and modality embeddings are added to word embeddings or speech representations
- Shared multi-modal encoder-decoder model is used
- Deep encoder with 24 Conformer layers and shallow decoder with 6 Transformer layers
Speech tokenization
- Proposed speech-text pre-training approach treats speech inputs as an additional language.
- Speech tokenizer network quantizes continuous speech representations into discrete ids.
- Speech representation vector is projected into a discrete id by finding the nearest neighbour in the speech codebook.
- Parameters of the speech tokenizer are learned from scratch by a contrastive loss.
- Pretrained speech tokenizer in mSLAM is kept constant during model training.
Pre-training objectives
- Four different training sets related to speech and/or text: D s , D t , D st , D tt
- Unify pre-training losses for unlabeled and labeled data
- Masking vector m randomly constructed from prior distribution
- Loss on unlabeled data computed using source-target pairs (x, x)
- Loss on labeled data computed using forward and backward sequence-to-sequence loss
- Alignment loss on encoder and decoder to align representations between different languages and modalities
- CTC loss activated on ASR data
Fine-tuning
- Fine-tuning method is used to unlock the capability of a strong pre-trained model
- Direct fine-tuning is used to adapt a pretrained model to a specific downstream task
- Gradual fine-tuning is used to mitigate the discrepancy between pre-training and fine-tuning
- Noisy fine-tuning is used to prevent the pretrained model from quickly overfitting
Experiments
- Data comes from WMT and TED translation tasks
- Data is similar to mSLAM (Bapna et al., 2022)
- Language distribution is highly skewed, temperature-based data sampling applied with temperature of 2
Models and hyperparameters
- Model setup uses Conformer and Transformer layers with model dimension, hidden dimension, attention heads, dropout, learning schedule, and Adam optimizer
- Batch sizes per TPU for speech-only, text-only, AST, ASR, and MT data are 4, 8, 1, 1, and 1
- Masking ratio for speech frames is 50%, for text inputs is 50%, and for MT tasks is 25%
- Loss coefficients for speech-only and text-only data are 1, for text to speech and alignment tasks are 0.1, and for speech to text tasks are 0.3
- Pre-train two sets of speech-text models with 4096 chars and 64k word pieces
- Fine-tune on CoVoST-2, VoxPopuli, and XTREME benchmarks
- Hyperparameters tuned include batch sizes, learning rates, dropout ratios, and warm-up steps
- AST, ASR, and MT batch size is 4, 2, and 2
- Noise ratio for AST, ASR, and MT is 0.06
- Fine-tuning experiments conducted on 64 TPUv4 chips
Multilingual speech translation
- Direct multilingual fine-tuning with xx-en or en-xx language pairs can obtain better performance than XLS-R and 0.6B mSLAM models
- Multi-task multilingual fine-tuning can achieve better results on xx-en but lower scores on en-xx
- Gradual fine-tuning can improve model performance
- Mu 2 SLAM-spam gains more from gradual fine-tuning
- Mu 2 SLAM-spam is better for xx-en translation, Mu 2 SLAM-char is better for en-xx
Multilingual speech recognition
- Model outperforms baseline model in all languages
- Average improvement of 8.2%
Multilingual text understanding
- Investigated capability of speech-text pretrained models on XTREME multilingual dataset
- Mu 2 SLAMspm performs better than Mu 2 SLAM-char in zero-shot setting
- Mu 2 SLAM models not able to surpass mT5base
- Mu 2 SLAM models deliver good results on English but worse results on non-English languages
- Issue attributed to language embeddings not specializing in generation outside training data
Analysis
- Experiments conducted to study effect of speech-text and text-text labeled data
- AST data enabled for fine-tuning pre-trained models
- Best model comes from using all available speech and text paired data
- Removing ASR data improves model performance on en-xx directions
- Multi-task pre-training beneficial to learning general speech-text representations
- Noisy fine-tuning has drastic change from 0 to 0.06 for xx-en, subtle improvements for en-xx
Related work
- Pre-training methods have been used to exploit unlabeled data in NLP and speech
- Examples of pre-training methods include BERT, XLNET, T5, MASS, wav2vec, and Hubert
- Research is moving towards speech-text joint training
- Mu 2 SLAM is an encoder-decoder backbone model with a CNN block used to extract speech representations
- Mu 2 SLAM pre-trains the model from scratch
- Multilingual pre-training is being used to learn joint representations across multiple languages
- Multi-task learning is used to improve model generalization performance
Conclusion
- Proposed Mu 2 SLAM pre-training method for speech and text joint models
- Utilizes fully encoder-decoder model as backbone
- Pre-training models span more than 100 languages in both speech and text
- Involves unlabeled and labeled data from speech/text-only data, ASR, AST to MT
- Introduces two kinds of training objectives to unify unlabeled and labeled data in pre-training
- Proposes gradual fine-tuning and noisy fine-tuning to improve model performance
- Achieves strong results on CoVoST and comparable performance on VoxPopuli
- Narrows gap between speech-text models and text-only models on text tasks
- Future work to explore speech generation based on pre-trained models
- Results on CoVoST, VoxPopuli, XNLI, TyDiQA-GoldP