Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Introduce the Universal Speech Model (USM) for Automatic Speech Recognition (ASR) across 100+ languages.
  • Pre-train the encoder on a large unlabeled multilingual dataset of 12 million hours and fine-tune on a smaller labeled dataset.
  • Use multilingual pre-training with random-projection quantization and speech-text modality matching.
  • Achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
  • Despite using a labeled training set one-seventh the size of the one used for Whisper, the model achieves comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.

Paper Content


  • Recent advances in self-supervised learning have enabled new possibilities for speech recognition
  • Recent studies have focused on creating “universal” models that can cover multiple tasks, domains, and languages
  • This work explores the frontiers of language expansion, with the goal of training a universal ASR model that covers all spoken languages
  • Obtaining enough data to train high-quality models is a challenge, but recent developments in semi-supervised algorithms make it possible to leverage untranscribed data for pre-training
  • A single large model can utilize large data sets more effectively than smaller models

Our approach

  • Produce large “Universal Speech Models” (USMs) through a training pipeline
  • Utilize three types of datasets: Unpaired Audio, Unpaired Text, and Paired ASR Data
  • 2B-parameter Conformer models built using these datasets
  • Unsupervised Pre-training and MOST (Multi-Objective Supervised pre-Training)
  • Produce generic ASR models and pre-trained models
  • Evaluate USMs on two public benchmarks and CoVoST 2
  • Explore possibility of attaching additional “adapter” units
  • Training pipeline enables building of both generic multilingual ASR systems and domain specific models

Key findings

  • USM models achieve state-of-the-art performance for multilingual ASR and AST
  • YouTube captions model performs better than Whisper
  • BEST-RQ pre-training can effectively scale to large data regime
  • MOST is an effective method for utilizing large scale text data
  • USM establishes new state-of-the-art on FLEURS and CoVoST 2
  • Chunk-wise attention for robust long-form speech recognition


  • Analyzed effects of key components of work
  • Compared performance against existing methods
  • Extensive literature on pre-training and self-training for ASR
  • Large speech models studied in monolingual and multilingual contexts
  • Large multi-modal speech models explored
  • Unsupervised pre-training methods for speech models proposed and applied
  • This work extends research efforts studying semi-supervised learning for ASR
  • Large speech models (> 1B) studied in previous work
  • Self-supervised learning algorithm (BEST-RQ) and multi-modal pre-training (text-injection) used to improve methods
  • Multi-softmax loss used to improve BEST-RQ
  • Multi-Objective Supervised pre-Training (MOST: BEST-RQ with text-injection) used to improve quality of speech representations
  • Chunk-wise attention proposed as alternative to chunk-based decoding
  • Scalable self-supervised training framework for multilingual ASR proposed


Model architecture: Conformer

  • Used convolution-augmented transformer (Conformer) with relative attention as encoder model
  • Features produced by Conformer used as input to CTC, RNN-T or LAS unit
  • BEST-RQ pre-training applied to encoder only
  • Considered two models with 600M and 2B parameters
  • Features of models listed in Table 2
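
As an illustration of how encoder features can feed a CTC unit, here is a minimal sketch of greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop blanks. The blank index, the toy scores, and the function name are assumptions made for this example; the summary does not say which decoding procedure the paper uses.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Collapse per-frame argmax labels into an output label sequence."""
    # Pick the highest-scoring label for every encoder frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge consecutive repeats, then remove blanks.
    collapsed = [label for label, _ in groupby(best_path)]
    return [label for label in collapsed if label != blank]


# Toy scores for 6 frames over a 3-symbol vocabulary (0 = blank).
toy_scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.2, 0.1, 0.7],
]
decoded = ctc_greedy_decode(toy_scores)
print(decoded)  # [1, 1, 2]
```

The blank between the two runs of label 1 is what lets the decoder emit the same label twice in a row.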

Pre-training: BEST-RQ

  • BEST-RQ is used to pre-train networks with speech audio
  • BEST-RQ has a small number of hyperparameters
  • BEST-RQ uses a BERT-style training task to predict masked speech features
  • Speech features are quantized and the task requires predicting the quantized label
  • Codebook vectors are chosen in an embedding space
  • Cosine similarity is used to determine the code
  • BEST-RQ does not require a learned quantization module: the random projection and the codebook stay frozen, which makes it more scalable
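
The quantization step in this list can be sketched as follows: project a speech feature with a frozen random matrix, then pick the codebook entry with the highest cosine similarity. This is a small pure-Python sketch, not the paper's implementation; the dimensions, the Gaussian initialization, and the class name are assumptions chosen for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class RandomProjectionQuantizer:
    """Frozen random projection plus frozen random codebook (never trained)."""

    def __init__(self, feature_dim, code_dim, codebook_size, seed=0):
        rng = random.Random(seed)
        # Both tensors are sampled once and kept fixed.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
            for _ in range(code_dim)
        ]
        self.codebook = [
            l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(codebook_size)
        ]

    def quantize(self, feature):
        """Return the index of the codebook vector closest in cosine similarity."""
        projected = l2_normalize(
            [sum(w * x for w, x in zip(row, feature)) for row in self.projection]
        )
        # For unit-norm vectors the dot product equals the cosine similarity.
        sims = [sum(c * p for c, p in zip(code, projected)) for code in self.codebook]
        return max(range(len(sims)), key=sims.__getitem__)


quantizer = RandomProjectionQuantizer(feature_dim=8, code_dim=4, codebook_size=16)
frame_rng = random.Random(1)
frames = [[frame_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
labels = [quantizer.quantize(frame) for frame in frames]
print(labels)  # one integer target per frame
```

During pre-training, the labels of masked frames would serve as prediction targets for the encoder, in the BERT-style setup described above.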

Self-training: noisy student training

  • Utilize noisy student training to generate pseudo-labeled data
  • Teacher model is trained with augmentation on supervised set
  • Teacher model used to generate transcripts for unlabeled audio data
  • Heuristic filtering method used to filter pseudo-labeled data
  • Pseudo-labeled data mixed with supervised data to train student model
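
The steps above can be sketched as a small data-preparation routine. The summary does not specify the filtering heuristic, so the confidence threshold and the empty-transcript check below are hypothetical stand-ins, as are `toy_teacher` and the clip names.

```python
import random


def build_student_training_set(labeled, unlabeled_audio, teacher,
                               min_confidence=0.9, seed=0):
    """Pseudo-label unlabeled audio with a teacher, filter, and mix with labeled data."""
    pseudo_labeled = []
    for audio in unlabeled_audio:
        transcript, confidence = teacher(audio)
        # Hypothetical heuristic: keep only confident, non-empty transcripts.
        if transcript and confidence >= min_confidence:
            pseudo_labeled.append((audio, transcript))
    mixed = list(labeled) + pseudo_labeled
    random.Random(seed).shuffle(mixed)
    return mixed


def toy_teacher(audio):
    """Stand-in for a trained teacher model; returns (transcript, confidence)."""
    canned = {
        "clip_a": ("hello world", 0.97),
        "clip_b": ("", 0.99),
        "clip_c": ("uh", 0.40),
    }
    return canned[audio]


labeled = [("clip_x", "good morning")]
student_data = build_student_training_set(
    labeled, ["clip_a", "clip_b", "clip_c"], toy_teacher
)
print(sorted(student_data))
```

Only `clip_a` survives the filter here, so the student would train on one supervised and one pseudo-labeled example.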

Chunk-wise attention for long-form ASR

  • ASR systems are usually trained on short segments, typically less than 30 seconds
  • Local self attention is widely used, but stacking many layers creates a mismatch between training and inference
  • This mismatch causes high deletion errors, referred to as the “long-form (performance) degradation” problem
  • Chunk-wise attention is proposed to solve this problem, allowing other layers in the encoder to process contextual frames beyond the current chunk
  • Text-injection loss is used to produce joint, co-aligned embeddings of speech and text
  • Three types of data are used to train the model: unlabeled speech, paired speech-text, and unlabeled text
  • Model is trained in two stages, first on paired data and then on unlabeled text
  • Text machine translation data can be used during the fine-tuning stage of AST tasks
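
To make the contrast between local self attention and chunk-wise attention concrete, here is a sketch of the two attention masks. It assumes that chunk-wise attention lets a frame attend only to frames in its own fixed-size, non-overlapping chunk; the function names and sizes are illustrative, not taken from the paper.

```python
def chunkwise_attention_mask(num_frames, chunk_size):
    """mask[q][k] is True when query frame q may attend to key frame k."""
    return [
        [(q // chunk_size) == (k // chunk_size) for k in range(num_frames)]
        for q in range(num_frames)
    ]


def local_attention_mask(num_frames, left, right):
    """Sliding-window mask: each frame sees `left` past and `right` future frames."""
    return [
        [-left <= (k - q) <= right for k in range(num_frames)]
        for q in range(num_frames)
    ]


chunk_mask = chunkwise_attention_mask(num_frames=8, chunk_size=4)
local_mask = local_attention_mask(num_frames=8, left=1, right=1)

for row in chunk_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a sliding window, the receptive field grows with every stacked layer, so its extent at inference on long audio differs from what was seen on short training segments. With the chunk mask, the attention context stays bounded by the chunk regardless of depth.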

Residual adaptation with a frozen encoder

  • Pre-trained USM is expensive to fine-tune for various domains and tasks
  • Lightweight alternative of adding residual adapters with a small number of parameters to each Conformer block
  • Adapters are dynamically loaded according to the tasks within the input batch
  • Training the adapter can reduce over-fitting when training data is limited
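
A residual adapter of this kind can be sketched as a small bottleneck whose output is added back to the hidden state. The bottleneck shape, the ReLU, and the zero-initialized up-projection are assumptions for this sketch, and the task names are hypothetical; the summary only states that adapters are small, residual, and selected per task.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ResidualAdapter:
    """Bottleneck adapter: hidden + up(relu(down(hidden)))."""

    def __init__(self, model_dim, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        self.down = [
            [rng.gauss(0.0, 0.02) for _ in range(model_dim)]
            for _ in range(bottleneck_dim)
        ]
        # Assumed zero init: the adapter starts as an identity mapping.
        self.up = [[0.0] * bottleneck_dim for _ in range(model_dim)]

    def __call__(self, hidden):
        bottleneck = [max(0.0, v) for v in matvec(self.down, hidden)]
        update = matvec(self.up, bottleneck)
        return [h + u for h, u in zip(hidden, update)]

    def num_parameters(self):
        return sum(len(row) for row in self.down) + sum(len(row) for row in self.up)


# One adapter per task; the frozen encoder itself is not modeled here.
adapters = {
    "asr_en": ResidualAdapter(model_dim=8, bottleneck_dim=2, seed=1),
    "ast_fr_en": ResidualAdapter(model_dim=8, bottleneck_dim=2, seed=2),
}
batch = [("asr_en", [1.0] * 8), ("ast_fr_en", [0.5] * 8)]
outputs = [adapters[task](hidden) for task, hidden in batch]
print(adapters["asr_en"].num_parameters())  # 32
```

Because only the adapter weights would be trained, the number of task-specific parameters stays small relative to the encoder.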

Training details

  • Audio is uniformly sampled at 16 kHz
  • Audio is featurized into 128-dimensional log-mel filterbank coefficients
  • Graphemes used to tokenize text for FLEURS in-domain fine-tuning
  • Word-piece models used for tokenization for all other tasks
  • 16 codebook multi-softmax loss used to stabilize training and improve performance
  • Text encoder and decoder architecture described in [13] used
  • Single 1536-dimensional Conformer layer used as speech encoder
  • Un-transcribed speech, unspoken text, and transcribed speech mixed in each batch
  • Model initialized with BEST-RQ pre-trained encoder
  • Curriculum learning schedule used
  • Two separate optimizers for encoder and decoder parameters
  • GShard framework with GSPMD backend used to train large models on TPUs
  • 3 datasets used to train models

Audio data

  • YT-SUP+: 90k hours of segmented, labeled audio across 75 languages, plus 100k hours of segmented, pseudo-labeled en-US audio
  • YT-55-U: 12M hours of segmented, unlabeled audio in 55 high-resource languages
  • YT-513-U: 100k hours of segmented, unlabeled audio across 513 tail languages
  • Pub-U: 429k hours of unlabeled speech data in 51 languages
  • Pub-S: 1.3k hours of speech and transcript data spanning 14 languages
  • Public data only used for in-domain pre-training

Text data

  • Pre-training with unlabeled text uses a web-crawled corpus of monolingual text with 28B sentences.
  • Dataset spans 1140 languages, 205 with over 1M sentences and 199 with 100k-1M sentences.

Downstream benchmarks

  • Presented results on two public tasks (SpeechStew and FLEURS) and an internal benchmark on YouTube
  • SpeechStew dataset assembled from seven public speech corpora
  • FLEURS is a multi-way parallel dataset with about 10 hours of read speech per language in 102 languages
  • 62 languages selected to compare generic ASR system with Whisper
  • WER and CER metrics reported
  • YouTube domain test set consists of utterances from 73 languages
  • CoVoST 2 used to benchmark multilingual speech translation
  • Covers translation from 21 source languages into English, with training data ranging from 1 to 264 hours per source language
  • Text translation data from CoVoST 2 combined with WMT and TED Talks data
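
For reference, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below uses the standard Levenshtein recurrence and assumes a non-empty reference; the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes the reference has at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; assumes the reference is non-empty."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("hi", "hi there my friend"))    # 3.0
print(cer("cat", "cut"))
```

The second call shows why WER is not capped at 100%: three inserted words against a one-word reference give a WER of 300%.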

Robust speech recognition for massively multilingual tasks

  • Whisper is trained on 680k hours of weakly supervised data
  • Whisper hallucinates in many languages, resulting in WER exceeding 100%
  • USM-LAS and USM-CTC outperform Whisper on YouTube en-US
  • USM-CTC and USM-LAS outperform Whisper on CORAAL and SpeechStew
  • USM-LAS and USM-CTC outperform Whisper by 66% relative WER on FLEURS

Massively multilingual results beyond 100 languages

  • Pre-trained model improves FLEURS benchmark significantly with only 10 hours per language
  • Model achieves 30% relative improvement in terms of WER across 102 languages
  • Performance is maximized by in-domain fine-tuning

MOST produces robust representations that generalize to new domains

  • MOST training aligns speech and text representations by training on both modalities.
  • Adding 2% of parameters to the network produces competitive performance on downstream tasks.
  • USM-LAS-Adapter uses FLEURS data to transcribe YouTube data in unseen languages.
  • USM-LAS-Adapter leads to improvements of up to 30% in some languages.

USMs are strong AST models

  • USM fine-tuning shows comparable performance to CoVoST 2 SoTA BLEU score
  • Previous SoTA used 125k hours of supervised speech translation data, whereas USM used 859 hours
  • USM-M can use both speech and text as training input
  • USM-M achieved > 30 BLEU on CoVoST, 1 BLEU increase from SoTA

Multi-softmax loss for BEST-RQ

  • ASR and AST benchmarks improved by > 5% when increasing the number of softmax groups from 1 to 16
  • Using multiple softmax groups reduces performance variation and improves convergence speed
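
One way to read the multi-softmax loss is as several independent softmax cross-entropies, one per codebook, combined into a single training loss. The sketch below averages them with equal weights, which is an assumption of this example, and uses toy logits.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one prediction, computed stably via log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def multi_softmax_loss(logits_per_group, targets_per_group):
    """Equal-weight average of one cross-entropy per softmax group (assumed weighting)."""
    losses = [
        cross_entropy(logits, target)
        for logits, target in zip(logits_per_group, targets_per_group)
    ]
    return sum(losses) / len(losses)


# Two toy groups with a 4-entry codebook each; uniform logits give log(4) per group.
toy_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
toy_targets = [1, 3]
loss = multi_softmax_loss(toy_logits, toy_targets)
print(loss)
```

Each group would get its targets from a separate random codebook, so a masked frame contributes several independent prediction tasks instead of one.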

Model and language scaling

  • Scaling up model size and increasing language coverage of pre-training dataset improves performance of USMs.
  • Relative gains on newly covered languages are more substantial than on other languages.
  • BEST-RQ outperforms or is comparable to other prominent pre-training methods for speech recognition.
  • BEST-RQ obtains greater gains when scaled up.

BEST-RQ is a scalable self-supervised learner

  • Chunk-wise attention can address long-form degradation issues.
  • Chunk-wise attention models outperform local self attention models with 128 context frames.
  • Increasing context window size of local self attention models results in high deletion error rates.

TPU serving capacity of USM-CTC models

  • USM-CTC models are powerful generic ASR models with reliable long-form transcription performance and excellent generalization properties
  • USM-CTC model is only 3.9x slower than the 100M-parameter streaming model
  • USM-CTC can be used as an offline transcriber efficiently on TPUs (or GPUs)


  • Unlabeled data is more practical than weakly labeled data for tail languages
  • In-domain data is best for optimizing performance
  • Different transducers can be tested quickly and selected for a given purpose
  • Training is split into three stages
  • Noisy student training for unseen languages
  • Comparing BEST-RQ against W2v-BERT
  • Chunk-wise attention