Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce the Universal Speech Model (USM) for Automatic Speech Recognition (ASR) across 100+ languages.
- Pre-train the encoder on a large unlabeled multilingual dataset of 12 million hours and fine-tune on a smaller labeled dataset.
- Use multilingual pre-training with random-projection quantization and speech-text modality matching.
- Achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
- Despite using a labeled training set 1/7-th the size of that used for the Whisper model, comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
Paper Content
Introduction
- Recent advances in self-supervised learning have enabled new possibilities for speech recognition
- Recent studies have focused on creating “universal” models that can cover multiple tasks, domains, and languages
- This work explores the frontiers of language expansion, with the goal of training a universal ASR model that covers all spoken languages
- Obtaining enough data to train high-quality models is a challenge, but recent developments in semisupervised algorithms make it possible to leverage untranscribed data for pre-training
- A single large model can utilize large data sets more effectively than smaller models
Our approach
- Produce large “Universal Speech Models” (USMs) through a training pipeline
- Utilize three types of datasets: Unpaired Audio, Unpaired Text, and Paired ASR Data
- 2B-parameter Conformer models built using these datasets
- Unsupervised Pre-training and MOST (Multi-Objective Supervised pre-Training)
- Produce generic ASR models and pre-trained models
- Evaluate USMs on two public benchmarks and CoVoST 2
- Explore possibility of attaching additional “adapter” units
- Training pipeline enables building of both generic multilingual ASR systems and domain specific models
Key findings
- USM models achieve state-of-the-art performance for multilingual ASR and AST
- YouTube captions model performs better than Whisper
- BEST-RQ pre-training can effectively scale to large data regime
- MOST is an effective method for utilizing large scale text data
- USM establishes new state-of-the-art on FLEURS and CoVoST 2
- Chunk-wise attention for robust long-form speech recognition
Outline
- Analyzed effects of key components of work
- Compared performance against existing methods
Related work
- Extensive literature on pre-training and self-training for ASR
- Large speech models studied in monolingual and multilingual contexts
- Large multi-modal speech models explored
- Unsupervised pre-training methods for speech models proposed and applied
- Work an extension of research efforts studying semi-supervised learning for ASR
- Large speech models (> 1B) studied in previous work
- Self-supervised learning algorithm (BEST-RQ) and multi-modal pre-training (text-injection) used to improve methods
- Multi-softmax loss used to improve BEST-RQ
- Multi-Objective Supervised Training (BEST-RQ with text-injection) used to improve quality of speech representations
- Chunk-wise attention proposed as alternative to chunk-based decoding
- Scalable self-supervised training framework for multilingual ASR proposed
Methods
Model architecture: conformer
- Used convolution-augmented transformer (Conformer) with relative attention as encoder model
- Features produced by Conformer used as input to CTC, RNN-T or LAS unit
- BEST-RQ pre-training applied to encoder only
- Considered two models with 600M and 2B parameters
- Features of models listed in Table 2
Pre-training: best-rq
- BEST-RQ is used to pre-train networks with speech audio
- BEST-RQ has a small number of hyperparameters
- BEST-RQ uses a BERT-style training task to predict masked speech features
- Speech features are quantized and the task requires predicting the quantized label
- Codebook vectors are chosen in an embedding space
- Cosine similarity is used to determine the code
- BEST-RQ does not require a quantization module, making it more scalable
Self-training: noisy student training
- Utilize noisy student training to generate pseudo-labeled data
- Teacher model is trained with augmentation on supervised set
- Teacher model used to generate transcripts for unlabeled audio data
- Heuristic filtering method used to filter pseudo-labeled data
- Pseudo-labeled data mixed with supervised data to train student model
Chunk-wise attention for long-form asr
- ASR systems are usually trained on short segments, typically less than 30 seconds
- Local self attention is widely used, but stacking many layers creates a mismatch between training and inference
- This mismatch causes high deletion errors, referred to as the “long-form (performance) degradation” problem
- Chunk-wise attention is proposed to solve this problem, allowing other layers in the encoder to process contextual frames beyond the current chunk
- Text-injection loss is used to produce joint, co-aligned embeddings of speech and text
- Three types of data are used to train the model: unlabeled speech, paired speech-text, and unlabeled text
- Model is trained in two stages, first on paired data and then on unlabeled text
- Text machine translation data can be used during the fine-tuning stage of AST tasks
Residual adaptation with a frozen encoder
- Pre-trained USM is expensive to fine-tune for various domains and tasks
- Lightweight alternative of adding residual adapters with a small number of parameters to each Conformer block
- Adapters are dynamically loaded according to the tasks within the input batch
- Training the adapter can reduce over-fitting when training data is limited
Training details
- Audio is sampled to 16 kHz quality
- Audio is featurized into 128-dimensional log-mel filterbank coefficients
- Graphemes used to tokenize text for FLEURS in-domain fine-tuning
- Word-piece models used for tokenization for all other tasks
- 16 codebook multi-softmax loss used to stabilize training and improve performance
- Text encoder and decoder architecture described in [13] used
- Single 1536-dimensional Conformer layer used as speech encoder
- Un-transcribed speech, unspoken text, and transcribed speech mixed in each batch
- Model initialized with BEST-RQ pre-trained encoder
- Curriculum learning schedule used
- Two separate optimizers for encoder and decoder parameters
- GShard framework with GSPMD backend used to train large models on TPUs
- 3 datasets used to train models
Audio data
- YT-SUP+: 90k hours of segmented, labeled audio across 75 languages, plus 100k hours of segmented, pseudo-labeled en-US audio
- YT-55-U: 12M hours of segmented, unlabeled audio on 55 rich resource languages
- YT-513-U: 100k hours of segmented, unlabeled audio across 513 tail languages
- Pub-U: 429k hours of unlabeled speech data in 51 languages
- Pub-S: 1.3k hours of speech and transcript data spanning 14 languages
- Public data only used for in-domain pre-training
Text data
- Pre-training with unlabeled text uses a web-crawled corpus of monolingual text with 28B sentences.
- Dataset spans 1140 languages, 205 with over 1M sentences and 199 with 100k-1M sentences.
Downstream benchmarks
- Presented results on two public tasks (SpeechStew and FLEURS) and an internal benchmark on YouTube
- SpeechStew dataset assembled from seven public speech corpora
- FLEURS dataset is multi-way parallel dataset of 10 hours of read speech in 102 languages
- 62 languages selected to compare generic ASR system with Whisper
- WER and CER metrics reported
- YouTube domain test set consists of utterances from 73 languages
- CoVoST 2 used to benchmark multilingual speech translation
- 21 source languages into English, training data ranges from 1-264 hours
- Text translation data from CoVoST 2 combined with WMT and TED Talks data
Robust speech recognition for massively multilingual tasks
- Whisper6 trained on 680k hours of weakly supervised data
- Whisper hallucinates in many languages, resulting in WER exceeding 100%
- USM-LAS and USM-CTC outperform Whisper on YouTube en-US
- USM-CTC and USM-LAS outperform Whisper on CORAAL and SpeechStew
- USM-LAS and USM-CTC outperform Whisper by 66% relative WER on FLEURS
Massively multilingual results beyond 100 languages
- Pre-trained model improves FLEURS benchmark significantly with only 10 hours per language
- Model achieves 30% relative improvement in terms of WER across 102 languages
- Performance is maximized by in-domain fine-tuning
Most produces robust representations that generalize to new domains
- MOST training aligns speech and text representations by training on both modalities.
- Adding 2% of parameters to the network produces competitive performance on downstream tasks.
- USM-LAS-Adapter uses FLEURS data to transcribe YouTube data in unseen languages.
- USM-LAS-Adapter leads to improvements of up to 30% in some languages.
Usms are strong ast models
- USM fine-tuning shows comparable performance to CoVoST 2 SoTA BLEU score
- Previous SoTA used 125k hours of supervised speech translation data, USM used 859 hours
- USM-M can use both speech and text as training input
- USM-M achieved > 30 BLEU on CoVoST, 1 BLEU increase from SoTA
Multi-softmax loss for best-rq
- ASR and AST benchmarks improved by > 5% when increasing the number of softmax groups from 1 to 16
- Using multiple softmax groups reduces performance variation and improves convergence speed
Model and language scaling
- Scaling up model size and increasing language coverage of pre-training dataset improves performance of USMs.
- Relative gains on newly covered languages are more substantial than on other languages.
- BEST-RQ outperforms or is comparable to other prominent pre-training methods for speech recognition.
- BEST-RQ obtains greater gains when scaled up.
Best-rq is a scalable self-supervised learner
- Chunk-wise attention can address long-form degradation issues.
- Chunk-wise attention models outperform local self attention models with 128 context frames.
- Increasing context window size of local self attention models results in high deletion error rates.
Tpu serving capacity of usm-ctc models
- USM-CTC models are powerful generic ASR models with reliable long-form transcription performance and excellent generalization properties
- USM-CTC model is only 3.9x slower than the 100M-parameter streaming model
- USM-CTC can be used as an offline transcriber efficiently on TPUs (or GPUs)
Discussion
- Unlabeled data is more practical than weakly labeled data for tail languages
- In-domain data is best for optimizing performance
- Different transducers can be tested quickly and selected for a given purpose
- Training is split into three stages
- Noisy student training for unseen languages
- Comparing BEST-RQ against W2v-BERT
- Chunk-wise attention