Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Mu$^{2}$SLAM is a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text, and supervised data (ASR, AST, and MT).
  • Mu$^{2}$SLAM uses a sequence-to-sequence masked denoising objective and a masked language modeling objective.
  • On CoVoST AST, Mu$^{2}$SLAM establishes a new state-of-the-art for models trained on public datasets.
  • Mu$^{2}$SLAM matches the performance of an mSLAM model on Voxpopuli ASR.
  • Mu$^{2}$SLAM improves by more than 6% over mSLAM on XNLI.

Paper Content

Introduction

  • NLP has seen success in unified text models for understanding and generation tasks
  • Pre-training methods in speech have moved towards unified models
  • Most models focus on speech-related tasks and ignore text-related benchmarks
  • Few studies investigate multilingual modeling with both speech and text
  • Multitask learning is understudied in speech-text pre-training
  • Models design modality-specific blocks and losses to yield high performance
  • Proposed multi-task multilingual pre-training method for speech and text
  • Language coverage spans more than 100 mainstream spoken languages
  • Unify pre-training losses for unlabeled and labeled data
  • Minimize number of modality-specific layers
  • Gradual fine-tuning and noisy fine-tuning proposed
  • Results show competitive performance on speech and text tasks

Approach

  • Proposed a multi-task multilingual pre-training method for speech and text
  • Considered four types of data: speech-only, text-only, speech-text, and text-text
  • Unified training examples into sequence-to-sequence format
  • Applied similar optimization objectives on encoder and decoder
  • Combined losses on unlabeled and labeled data to pre-train speech-text models
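
The unification of the four data types into one sequence-to-sequence format can be illustrated with a minimal sketch. It assumes that unlabeled examples become (x, x) pairs and that paired examples are used in both directions, as the summary of the pre-training objectives suggests. The `Example` class, the `kind` strings, and `to_seq2seq` are hypothetical names introduced only for illustration, not the paper's code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Sequence = List[str]


@dataclass
class Example:
    """One training example (hypothetical container).

    kind is one of "speech", "text", "speech-text", "text-text".
    """
    kind: str
    x: Sequence
    y: Optional[Sequence] = None


def to_seq2seq(example: Example) -> List[Tuple[Sequence, Sequence]]:
    """Map any of the four data types to (source, target) pairs."""
    if example.kind in ("speech", "text"):
        # Unlabeled data: the sequence is paired with itself.
        return [(example.x, example.x)]
    if example.kind in ("speech-text", "text-text"):
        if example.y is None:
            raise ValueError("paired data needs both x and y")
        # Labeled data: forward and backward directions.
        return [(example.x, example.y), (example.y, example.x)]
    raise ValueError(f"unknown data type: {example.kind}")


# Usage: a speech-text pair yields two training directions.
pairs = to_seq2seq(Example("speech-text", ["s1", "s2"], ["hello"]))
```

With every example in this shape, the same encoder-decoder losses can be applied regardless of where the data came from.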

Model architecture

  • Mu$^{2}$SLAM is a multi-task multilingual speech-text pre-training method
  • Speech inputs are converted into a sequence of latent speech representations via a CNN block
  • Text inputs go through a token embedding layer
  • Language and modality embeddings are added to word embeddings or speech representations
  • Shared multi-modal encoder-decoder model is used
  • Deep encoder with 24 Conformer layers and shallow decoder with 6 Transformer layers
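
To make the input side concrete, the sketch below adds a language embedding and a modality embedding to every position of an input sequence, whether the positions hold word embeddings or CNN speech representations. Plain Python lists stand in for tensors, and `embed_inputs`, `add_vectors`, and the lookup tables are hypothetical names; the real model would use learned embedding layers.

```python
from typing import Dict, List

Vector = List[float]


def add_vectors(*vectors: Vector) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def embed_inputs(
    features: List[Vector],
    lang: str,
    modality: str,
    lang_emb: Dict[str, Vector],
    modality_emb: Dict[str, Vector],
) -> List[Vector]:
    """Add the language and modality embeddings to each position."""
    return [
        add_vectors(frame, lang_emb[lang], modality_emb[modality])
        for frame in features
    ]


# Usage with toy 2-dimensional vectors (all values are made up).
encoder_inputs = embed_inputs(
    features=[[1.0, 2.0], [3.0, 4.0]],
    lang="en",
    modality="text",
    lang_emb={"en": [0.5, 0.5]},
    modality_emb={"text": [0.25, -0.25], "speech": [0.0, 1.0]},
)
```

Because both modalities end up as sequences of vectors of the same size, a single shared encoder-decoder can consume them.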

Speech tokenization

  • Proposed speech-text pre-training approach treats speech inputs as an additional language.
  • Speech tokenizer network quantizes continuous speech representations into discrete ids.
  • Speech representation vector is projected into a discrete id by finding the nearest neighbour in the speech codebook.
  • Parameters of the speech tokenizer are learned from scratch by a contrastive loss.
  • Pretrained speech tokenizer in mSLAM is kept constant during model training.
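
The nearest-neighbour lookup described above can be sketched in a few lines. This is a toy illustration with a hand-written codebook and squared Euclidean distance as an assumed metric; `quantize` and `squared_distance` are hypothetical helper names, and the actual tokenizer is a learned network.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: List[Sequence[float]], codebook: List[Sequence[float]]
) -> List[int]:
    """Map each speech frame to the id of its nearest codebook entry."""
    ids = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        ids.append(distances.index(min(distances)))
    return ids


# Usage: three toy frames, each closest to a different code vector.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7)]
speech_ids = quantize(frames, codebook)
```

The resulting discrete ids play the same role for speech that token ids play for text, which is what lets speech be treated as an additional language.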

Pre-training objectives

  • Four different training sets related to speech and/or text: $D_{s}$, $D_{t}$, $D_{st}$, $D_{tt}$
  • Unify pre-training losses for unlabeled and labeled data
  • Masking vector $m$ is randomly sampled from a prior distribution
  • Loss on unlabeled data computed using source-target pairs (x, x)
  • Loss on labeled data computed using forward and backward sequence-to-sequence loss
  • Alignment loss on encoder and decoder to align representations between different languages and modalities
  • CTC loss activated on ASR data
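
As a rough sketch of the unlabeled-data case, the code below samples a masking vector and builds a masked source with the original sequence as target. It assumes each position is masked independently with a fixed probability, which is a simplification chosen for illustration; the summary does not specify the prior. `MASK`, `sample_mask`, and `make_denoising_pair` are hypothetical names.

```python
import random
from typing import List, Tuple

MASK = "<mask>"  # hypothetical mask symbol


def sample_mask(length: int, ratio: float, rng: random.Random) -> List[bool]:
    """Sample a masking vector m (assumed: independent per position)."""
    return [rng.random() < ratio for _ in range(length)]


def make_denoising_pair(
    x: List[str], ratio: float, rng: random.Random
) -> Tuple[List[str], List[str], List[bool]]:
    """Build (masked source, target, m) from an unlabeled sequence x.

    The decoder is trained to reconstruct x from the masked source,
    and the encoder can be trained to predict the masked positions.
    """
    m = sample_mask(len(x), ratio, rng)
    source = [MASK if masked else token for token, masked in zip(x, m)]
    return source, list(x), m


# Usage with a fixed seed so the run is repeatable.
rng = random.Random(0)
source, target, m = make_denoising_pair(["a", "b", "c", "d"], 0.5, rng)
```

For labeled pairs, the same construction can be applied to the source side of each direction before computing the sequence-to-sequence loss.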

Fine-tuning

  • Fine-tuning method is used to unlock the capability of a strong pre-trained model
  • Direct fine-tuning is used to adapt a pretrained model to a specific downstream task
  • Gradual fine-tuning is used to mitigate the discrepancy between pre-training and fine-tuning
  • Noisy fine-tuning is used to prevent the pretrained model from quickly overfitting

Experiments

  • Data comes from WMT and TED translation tasks
  • Data is similar to mSLAM (Bapna et al., 2022)
  • Language distribution is highly skewed, temperature-based data sampling applied with temperature of 2

Models and hyperparameters

  • Model setup uses Conformer and Transformer layers with model dimension, hidden dimension, attention heads, dropout, learning schedule, and Adam optimizer
  • Batch sizes per TPU for speech-only, text-only, AST, ASR, and MT data are 4, 8, 1, 1, and 1, respectively
  • Masking ratio is 50% for speech frames, 50% for text inputs, and 25% for MT tasks
  • Loss coefficients are 1 for speech-only and text-only data, 0.1 for text-to-speech and alignment tasks, and 0.3 for speech-to-text tasks
  • Pre-train two sets of speech-text models, one with a 4096-character vocabulary and one with 64k word pieces
  • Fine-tune on CoVoST-2, VoxPopuli, and XTREME benchmarks
  • Hyperparameters tuned include batch sizes, learning rates, dropout ratios, and warm-up steps
  • Batch sizes for AST, ASR, and MT are 4, 2, and 2, respectively
  • Noise ratio for AST, ASR, and MT is 0.06
  • Fine-tuning experiments conducted on 64 TPUv4 chips
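
The loss coefficients listed above suggest a weighted sum of per-task losses. The sketch below shows one way such a combination might look; the task names in `LOSS_WEIGHTS` and the function `combine_losses` are hypothetical, and tasks whose coefficient is not stated in this summary are deliberately left out.

```python
from typing import Dict

# Coefficients as listed in the summary; key names are illustrative.
LOSS_WEIGHTS: Dict[str, float] = {
    "speech_only": 1.0,
    "text_only": 1.0,
    "text_to_speech": 0.1,
    "alignment": 0.1,
    "speech_to_text": 0.3,
}


def combine_losses(losses: Dict[str, float]) -> float:
    """Weighted sum of the per-task losses present in this step."""
    unknown = set(losses) - set(LOSS_WEIGHTS)
    if unknown:
        raise KeyError(f"no coefficient defined for: {sorted(unknown)}")
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())


# Usage: a step that saw speech-only and speech-to-text batches.
total_loss = combine_losses({"speech_only": 2.0, "speech_to_text": 1.0})
```

Down-weighting the supervised and alignment terms keeps the much smaller labeled batches from dominating the gradient.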

Multilingual speech translation

  • Direct multilingual fine-tuning with xx-en or en-xx language pairs can obtain better performance than XLS-R and 0.6B mSLAM models
  • Multi-task multilingual fine-tuning can achieve better results on xx-en but lower scores on en-xx
  • Gradual fine-tuning can improve model performance
  • Mu$^{2}$SLAM-spm gains more from gradual fine-tuning
  • Mu$^{2}$SLAM-spm is better for xx-en translation, while Mu$^{2}$SLAM-char is better for en-xx

Multilingual speech recognition

  • Model outperforms baseline model in all languages
  • Average improvement of 8.2%

Multilingual text understanding

  • Investigated capability of speech-text pretrained models on XTREME multilingual dataset
  • Mu$^{2}$SLAM-spm performs better than Mu$^{2}$SLAM-char in the zero-shot setting
  • Mu$^{2}$SLAM models are not able to surpass mT5-Base
  • Mu$^{2}$SLAM models deliver good results on English but worse results on non-English languages
  • Issue attributed to language embeddings not specializing in generation outside training data

Analysis

  • Experiments conducted to study effect of speech-text and text-text labeled data
  • AST data enabled for fine-tuning pre-trained models
  • Best model comes from using all available speech and text paired data
  • Removing ASR data improves model performance on en-xx directions
  • Multi-task pre-training beneficial to learning general speech-text representations
  • In noisy fine-tuning, raising the noise ratio from 0 to 0.06 changes xx-en results drastically but yields only subtle improvements for en-xx

Related work

  • Pre-training methods have been used to exploit unlabeled data in NLP and speech
  • Examples of pre-training methods include BERT, XLNET, T5, MASS, wav2vec, and Hubert
  • Research is moving towards speech-text joint training
  • Mu$^{2}$SLAM is an encoder-decoder backbone model with a CNN block used to extract speech representations
  • Mu$^{2}$SLAM pre-trains the model from scratch
  • Multilingual pre-training is being used to learn joint representations across multiple languages
  • Multi-task learning is used to improve model generalization performance

Conclusion

  • Proposed Mu$^{2}$SLAM, a pre-training method for joint speech and text models
  • Uses a full encoder-decoder model as the backbone
  • Pre-trained models span more than 100 languages in both speech and text
  • Uses unlabeled and labeled data, ranging from speech-only and text-only data to ASR, AST, and MT
  • Introduces two kinds of training objectives to unify unlabeled and labeled data in pre-training
  • Proposes gradual fine-tuning and noisy fine-tuning to improve model performance
  • Achieves strong results on CoVoST and comparable performance on VoxPopuli
  • Narrows gap between speech-text models and text-only models on text tasks
  • Future work to explore speech generation based on pre-trained models
  • Results on CoVoST, VoxPopuli, XNLI, TyDiQA-GoldP