Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Studied capabilities of speech processing systems trained to predict audio transcripts
Models generalized well to standard benchmarks and competitive with prior results without fine-tuning
Models approach accuracy and robustness of humans

Paper Content

Introduction

Progress in speech recognition has been improved by unsupervised pre-training techniques
These methods can use large datasets of unlabeled speech
Pre-trained audio encoders learn high-quality representations of speech, but lack an equivalently performant decoder
Fine-tuning can be complex and can lead to models exploiting dataset-specific quirks
Recent efforts have created larger datasets for speech recognition using automated pipelines
Moving beyond gold-standard datasets to larger weakly supervised datasets improves robustness and generalization
This work scales weakly supervised speech recognition to 680,000 hours of labeled audio data
This dataset includes 96 languages and 125,000 hours of X→en translation data
Joint multilingual and multitask training is beneficial
Inference code and models are available online

Approach

Data processing

Minimalist approach to data pre-processing
Training models to predict raw text of transcripts without standardization
Construct dataset from audio paired with transcripts on the internet
Automated filtering methods to improve transcript quality
Heuristics to detect and remove machine-generated transcripts

Model

Used an off-the-shelf architecture to avoid confounding findings
Audio re-sampled to 16,000 Hz and 80-channel logmagnitude Mel spectrogram representation computed on 25-millisecond windows with a stride of 10 milliseconds

Multitask format

Speech recognition involves many components, making it complex
We want a single model to perform the entire speech processing pipeline
We use a simple format to specify tasks and conditioning information
We use a sequence-to-sequence Transformer model for many different speech processing tasks
Tasks are represented as a sequence of tokens to be predicted by the decoder
Special tokens are used as task specifiers or classification targets

Training details

Trained suite of models of various sizes
Used data parallelism, FP16, dynamic loss scaling, activation checkpointing
Trained with AdamW and gradient norm clipping
Batch size of 256 segments, trained for 2-3 passes over dataset
No data augmentation or regularization
Fine-tuned models on subset of transcripts without speaker annotations
Additional Large model trained for 2.5X more epochs with SpecAugment, Stochastic Depth, and BPE Dropout

Experiments

Zero-shot evaluation

Goal of Whisper is to develop a single robust speech processing system.
Evaluate Whisper in a zero-shot setting without using any of the training data.

Evaluation metrics

Speech recognition systems are typically evaluated using the word error rate (WER) metric.
WER penalizes all differences between the model’s output and the reference transcript, including formatting differences.
There is active research into evaluation metrics that better correlate with human judgement.
To address this problem, the authors developed a text normalizer to minimize penalization of non-semantic differences.
The authors observed WER drops of up to 50% due to formatting quirks.

English speech recognition

Deep Speech 2 reported human-level performance on LibriSpeech test-clean in 2015
SOTA WER on LibriSpeech test-clean has dropped 73% since then
Speech recognition models trained on LibriSpeech remain far above human error rates when used in other settings
Gap between human and machine performance is due to different capabilities being measured by human and machine performance on a test set
Whisper models, trained on a broad and diverse distribution of audio, could potentially match human behavior better than existing systems
Zero-shot Whisper models match accuracy and robustness of humans on LibriSpeech

Multi-lingual speech recognition

Comparing to prior work on multilingual speech recognition, results are reported on two low-data benchmarks
Whisper performs well on Multilingual LibriSpeech, outperforming XLS-R
On VoxPopuli, Whisper underperforms prior work and only beats the VP-10K+FT baseline
Strong squared correlation coefficient of 0.83 between the log of the word error rate and the log of the amount of training data per language

Translation

Whisper models are studied for their translation capabilities
Performance is measured on the X→en subset of CoVoST2
Compared with Maestro, mSLAM, and XLS-R
Achieved a new state of the art of 29
Re-purposed Fleurs as a translation dataset
Visualized correlation between amount of translation training data and BLEU score
Correlation coefficient is lower than 0.83 observed for speech recognition
Welsh (CY) is an outlier with much worse than expected performance
Majority of Welsh translation data is actually English audio with English captions

Language identification

Fleurs dataset used to evaluate language identification
Whisper underperforms supervised SOTA by 13.6%
Whisper dataset contains no training data for 20 of 102 languages in Fleurs, upperbounding accuracy at 80.4%

Robustness to additive noise

Tested noise robustness of Whisper models and 14 LibriSpeech-trained models
Measured WER when white noise or pub noise added to audio
Pub noise represents natural noisy environment
12 models pre-trained/fine-tuned on LibriSpeech, 2 NVIDIA STT models trained on mixture dataset
Performance degrades as noise becomes more intensive
All models perform worse than Whisper model under pub noise of SNR below 10 dB
Showcases Whisper’s robustness to noise

Long-form transcription

Whisper models are trained on 30-second audio chunks
Real-world applications often require transcribing longer audio
Strategy developed to transcribe long audio by consecutively transcribing 30-second segments
Beam search and temperature scheduling used to reliably transcribe long audio
Evaluated performance on seven datasets of various lengths and recording conditions

Comparison with human performance

Ambiguous or indistinct speech and labeling errors lead to different levels of irreducible error in datasets.
WER metrics from ASR systems alone are not enough to measure how much room for improvement exists.
25 recordings from the Kincaid46 dataset were selected and 5 services were used to obtain transcripts.
The audio selection covers various recording conditions.
Computer-assisted transcription had the lowest aggregate WER, only 1.15% better than Whisper’s.
Pure-human performance was only a fraction of a percentage point better than Whisper’s.
Results indicate Whisper’s English ASR performance is close to human-level accuracy.

Analysis and ablations

Model scaling

Weakly supervised training approaches can use larger datasets than traditional supervised learning.
Data used in weakly supervised training is noisier and lower quality than gold-standard supervision.
Models trained on this kind of data may not reach human-level performance.
Models may learn to exploit the idiosyncrasies of the dataset, reducing their ability to generalize.

Dataset scaling

Whisper dataset is one of the largest ever created in supervised speech recognition
Trained a series of medium-sized models on subsampled versions of the dataset
Evaluated performance on English and multilingual speech recognition and X→en translation
Increases in dataset size result in improved performance on all tasks
Performance improves rapidly on English speech recognition from 3,000 to 13,000 hours
Diminishing returns observed with model size scaling for English speech recognition
Performance follows a power-law trend for multilingual speech recognition till 54,000 hours
Performance follows a log-linear improvement trend for X→en translation till 54,000 hours

Multitask and multilingual transfer

Jointly training a single model on many tasks and languages can lead to negative transfer.
Comparing performance of models trained on English speech recognition with multitask and multilingual training setup showed negative transfer for small models.
Multitask and multilingual models benefit more from scale and eventually outperform models trained on English data only.
For largest experiments, joint models also slightly outperform English-only models.

Text normalization

Developed text normalizer with Whisper to discount innocuous word errors
Risk of normalizer being overfitted to Whisper’s peculiarities
Compared performance of Whisper using normalizer vs. FairSpeech project
Most datasets perform similarly, without significant differences in WER reduction
On some datasets, normalizer reduces WER of Whisper models significantly more
Differences in reduction can be traced to different formats used by ground truth

Strategies for reliable long-form transcription

Use beam search with 5 beams to reduce repetition looping
Increase temperature from 0 to 1.0 when average log probability is lower than -1 or gzip compression rate is higher than 2.4
Provide transcribed text from preceding window when temperature is below 0.5
Combine no-speech probability threshold of 0.6 and average log-probability threshold of -1 to improve voice activity detection
Constrain initial timestamp token to be between 0.0 and 1.0 second

Deep learning improved speech recognition performance with model depth and size
Dataset size improved performance from 3 hours of TIMIT training data to 2,000 hours of Switchboard dataset
Multitask learning has been studied for a long time
Multi-domain training increases robustness and generalization

Limitations and future work

Improved decoding strategies can reduce errors in long-form transcription
Increase training data for lower-resource languages
Study fine-tuning to compare with prior work
Investigate benefits of training encoder, decoder, or both

Conclusion

Scaling weakly supervised pretraining has been underappreciated in speech recognition research
Results achieved without self-supervision and self-training techniques
Training on large and diverse supervised dataset and focusing on zero-shot transfer can improve robustness of speech recognition system
Evaluation datasets include LibriSpeech, TED-LIUM 3, Common Voice 5.1, Artie bias corpus, CallHome and Switchboard, WSJ, CORAAL, CHiME-6, AMI-IHM and AMI-SDM1
Long-form English-only datasets include TED-LIUM 3, Meanwhile, Rev16, CORAAL, Multilingual LibriSpeech, VoxPopuli, Common Voice 9, CoVOST 2
Models from HuggingFace used for comparison
Text normalization steps to standardize English texts
Sequence-to-sequence Transformer model trained on many different speech processing tasks
Zero-shot Whisper models close the gap to human robustness
Correlation of pre-training supervision amount with downstream speech recognition and translation performance
Whisper competitive with state-of-the-art commercial and open-source ASR systems in long-form transcription
Performance close to professional human transcribers
Multitask and multilingual transfer improves with scale
Kincaid46 dataset used for human transcription benchmark
Architecture details of Whisper model family
Text normalizer has similar effect on reducing WERs between Whisper models and other open-source models
Long-form transcription performance improves incrementally with additional decoding heuristics
MUTE English English transcription WER and BLEU scores on Fleurs

Link to paper#

Abstract#

Paper Content#

Introduction#

Approach#

Data processing#

Model#

Multitask format#

Training details#

Experiments#

Zero-shot evaluation#

Evaluation metrics#

English speech recognition#

Multi-lingual speech recognition#

Translation#

Language identification#

Robustness to additive noise#

Long-form transcription#

Comparison with human performance#

Analysis and ablations#

Model scaling#

Dataset scaling#

Multitask and multilingual transfer#

Text normalization#

Strategies for reliable long-form transcription#

Related work#

Limitations and future work#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Approach

Data processing

Model

Multitask format

Training details

Experiments

Zero-shot evaluation

Evaluation metrics

English speech recognition

Multi-lingual speech recognition

Translation

Language identification

Robustness to additive noise

Long-form transcription

Comparison with human performance

Analysis and ablations

Model scaling

Dataset scaling

Multitask and multilingual transfer

Text normalization

Strategies for reliable long-form transcription

Related work

Limitations and future work

Conclusion