Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introduce the Universal Speech Model (USM) for Automatic Speech Recognition (ASR) across 100+ languages.
Pre-train the encoder on a large unlabeled multilingual dataset of 12 million hours and fine-tune on a smaller labeled dataset.
Use multilingual pre-training with random-projection quantization and speech-text modality matching.
Achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
Despite using a labeled training set 1/7-th the size of that used for the Whisper model, comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.

Paper Content

Introduction

Recent advances in self-supervised learning have enabled new possibilities for speech recognition
Recent studies have focused on creating “universal” models that can cover multiple tasks, domains, and languages
This work explores the frontiers of language expansion, with the goal of training a universal ASR model that covers all spoken languages
Obtaining enough data to train high-quality models is a challenge, but recent developments in semisupervised algorithms make it possible to leverage untranscribed data for pre-training
A single large model can utilize large data sets more effectively than smaller models

Our approach

Produce large “Universal Speech Models” (USMs) through a training pipeline
Utilize three types of datasets: Unpaired Audio, Unpaired Text, and Paired ASR Data
2B-parameter Conformer models built using these datasets
Unsupervised Pre-training and MOST (Multi-Objective Supervised pre-Training)
Produce generic ASR models and pre-trained models
Evaluate USMs on two public benchmarks and CoVoST 2
Explore possibility of attaching additional “adapter” units
Training pipeline enables building of both generic multilingual ASR systems and domain specific models

Key findings

USM models achieve state-of-the-art performance for multilingual ASR and AST
YouTube captions model performs better than Whisper
BEST-RQ pre-training can effectively scale to large data regime
MOST is an effective method for utilizing large scale text data
USM establishes new state-of-the-art on FLEURS and CoVoST 2
Chunk-wise attention for robust long-form speech recognition

Outline

Analyzed effects of key components of work
Compared performance against existing methods

Extensive literature on pre-training and self-training for ASR
Large speech models studied in monolingual and multilingual contexts
Large multi-modal speech models explored
Unsupervised pre-training methods for speech models proposed and applied
Work an extension of research efforts studying semi-supervised learning for ASR
Large speech models (> 1B) studied in previous work
Self-supervised learning algorithm (BEST-RQ) and multi-modal pre-training (text-injection) used to improve methods
Multi-softmax loss used to improve BEST-RQ
Multi-Objective Supervised Training (BEST-RQ with text-injection) used to improve quality of speech representations
Chunk-wise attention proposed as alternative to chunk-based decoding
Scalable self-supervised training framework for multilingual ASR proposed

Methods

Model architecture: conformer

Used convolution-augmented transformer (Conformer) with relative attention as encoder model
Features produced by Conformer used as input to CTC, RNN-T or LAS unit
BEST-RQ pre-training applied to encoder only
Considered two models with 600M and 2B parameters
Features of models listed in Table 2

Pre-training: best-rq

BEST-RQ is used to pre-train networks with speech audio
BEST-RQ has a small number of hyperparameters
BEST-RQ uses a BERT-style training task to predict masked speech features
Speech features are quantized and the task requires predicting the quantized label
Codebook vectors are chosen in an embedding space
Cosine similarity is used to determine the code
BEST-RQ does not require a quantization module, making it more scalable

Self-training: noisy student training

Utilize noisy student training to generate pseudo-labeled data
Teacher model is trained with augmentation on supervised set
Teacher model used to generate transcripts for unlabeled audio data
Heuristic filtering method used to filter pseudo-labeled data
Pseudo-labeled data mixed with supervised data to train student model

Chunk-wise attention for long-form asr

ASR systems are usually trained on short segments, typically less than 30 seconds
Local self attention is widely used, but stacking many layers creates a mismatch between training and inference
This mismatch causes high deletion errors, referred to as the “long-form (performance) degradation” problem
Chunk-wise attention is proposed to solve this problem, allowing other layers in the encoder to process contextual frames beyond the current chunk
Text-injection loss is used to produce joint, co-aligned embeddings of speech and text
Three types of data are used to train the model: unlabeled speech, paired speech-text, and unlabeled text
Model is trained in two stages, first on paired data and then on unlabeled text
Text machine translation data can be used during the fine-tuning stage of AST tasks

Residual adaptation with a frozen encoder

Pre-trained USM is expensive to fine-tune for various domains and tasks
Lightweight alternative of adding residual adapters with a small number of parameters to each Conformer block
Adapters are dynamically loaded according to the tasks within the input batch
Training the adapter can reduce over-fitting when training data is limited

Training details

Audio is sampled to 16 kHz quality
Audio is featurized into 128-dimensional log-mel filterbank coefficients
Graphemes used to tokenize text for FLEURS in-domain fine-tuning
Word-piece models used for tokenization for all other tasks
16 codebook multi-softmax loss used to stabilize training and improve performance
Text encoder and decoder architecture described in [13] used
Single 1536-dimensional Conformer layer used as speech encoder
Un-transcribed speech, unspoken text, and transcribed speech mixed in each batch
Model initialized with BEST-RQ pre-trained encoder
Curriculum learning schedule used
Two separate optimizers for encoder and decoder parameters
GShard framework with GSPMD backend used to train large models on TPUs
3 datasets used to train models

Audio data

YT-SUP+: 90k hours of segmented, labeled audio across 75 languages, plus 100k hours of segmented, pseudo-labeled en-US audio
YT-55-U: 12M hours of segmented, unlabeled audio on 55 rich resource languages
YT-513-U: 100k hours of segmented, unlabeled audio across 513 tail languages
Pub-U: 429k hours of unlabeled speech data in 51 languages
Pub-S: 1.3k hours of speech and transcript data spanning 14 languages
Public data only used for in-domain pre-training

Text data

Pre-training with unlabeled text uses a web-crawled corpus of monolingual text with 28B sentences.
Dataset spans 1140 languages, 205 with over 1M sentences and 199 with 100k-1M sentences.

Downstream benchmarks

Presented results on two public tasks (SpeechStew and FLEURS) and an internal benchmark on YouTube
SpeechStew dataset assembled from seven public speech corpora
FLEURS dataset is multi-way parallel dataset of 10 hours of read speech in 102 languages
62 languages selected to compare generic ASR system with Whisper
WER and CER metrics reported
YouTube domain test set consists of utterances from 73 languages
CoVoST 2 used to benchmark multilingual speech translation
21 source languages into English, training data ranges from 1-264 hours
Text translation data from CoVoST 2 combined with WMT and TED Talks data

Robust speech recognition for massively multilingual tasks

Whisper6 trained on 680k hours of weakly supervised data
Whisper hallucinates in many languages, resulting in WER exceeding 100%
USM-LAS and USM-CTC outperform Whisper on YouTube en-US
USM-CTC and USM-LAS outperform Whisper on CORAAL and SpeechStew
USM-LAS and USM-CTC outperform Whisper by 66% relative WER on FLEURS

Massively multilingual results beyond 100 languages

Pre-trained model improves FLEURS benchmark significantly with only 10 hours per language
Model achieves 30% relative improvement in terms of WER across 102 languages
Performance is maximized by in-domain fine-tuning

Most produces robust representations that generalize to new domains

MOST training aligns speech and text representations by training on both modalities.
Adding 2% of parameters to the network produces competitive performance on downstream tasks.
USM-LAS-Adapter uses FLEURS data to transcribe YouTube data in unseen languages.
USM-LAS-Adapter leads to improvements of up to 30% in some languages.

Usms are strong ast models

USM fine-tuning shows comparable performance to CoVoST 2 SoTA BLEU score
Previous SoTA used 125k hours of supervised speech translation data, USM used 859 hours
USM-M can use both speech and text as training input
USM-M achieved > 30 BLEU on CoVoST, 1 BLEU increase from SoTA

Multi-softmax loss for best-rq

ASR and AST benchmarks improved by > 5% when increasing the number of softmax groups from 1 to 16
Using multiple softmax groups reduces performance variation and improves convergence speed

Model and language scaling

Scaling up model size and increasing language coverage of pre-training dataset improves performance of USMs.
Relative gains on newly covered languages are more substantial than on other languages.
BEST-RQ outperforms or is comparable to other prominent pre-training methods for speech recognition.
BEST-RQ obtains greater gains when scaled up.

Best-rq is a scalable self-supervised learner

Chunk-wise attention can address long-form degradation issues.
Chunk-wise attention models outperform local self attention models with 128 context frames.
Increasing context window size of local self attention models results in high deletion error rates.

Tpu serving capacity of usm-ctc models

USM-CTC models are powerful generic ASR models with reliable long-form transcription performance and excellent generalization properties
USM-CTC model is only 3.9x slower than the 100M-parameter streaming model
USM-CTC can be used as an offline transcriber efficiently on TPUs (or GPUs)

Discussion

Unlabeled data is more practical than weakly labeled data for tail languages
In-domain data is best for optimizing performance
Different transducers can be tested quickly and selected for a given purpose
Training is split into three stages
Noisy student training for unseen languages
Comparing BEST-RQ against W2v-BERT
Chunk-wise attention

Link to paper#

Abstract#

Paper Content#

Introduction#

Our approach#

Key findings#

Outline#

Related work#

Methods#

Model architecture: conformer#

Pre-training: best-rq#

Self-training: noisy student training#

Chunk-wise attention for long-form asr#

Residual adaptation with a frozen encoder#

Training details#

Audio data#

Text data#

Downstream benchmarks#

Robust speech recognition for massively multilingual tasks#

Massively multilingual results beyond 100 languages#

Most produces robust representations that generalize to new domains#

Usms are strong ast models#

Multi-softmax loss for best-rq#

Model and language scaling#

Best-rq is a scalable self-supervised learner#

Tpu serving capacity of usm-ctc models#

Discussion#

Link to paper

Abstract

Paper Content

Introduction

Our approach

Key findings

Outline

Related work

Methods

Model architecture: conformer

Pre-training: best-rq

Self-training: noisy student training

Chunk-wise attention for long-form asr

Residual adaptation with a frozen encoder

Training details

Audio data

Text data

Downstream benchmarks

Robust speech recognition for massively multilingual tasks

Massively multilingual results beyond 100 languages

Most produces robust representations that generalize to new domains

Usms are strong ast models

Multi-softmax loss for best-rq

Model and language scaling

Best-rq is a scalable self-supervised learner

Tpu serving capacity of usm-ctc models

Discussion