Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Studied capabilities of speech processing systems trained to predict audio transcripts
  • Models generalized well to standard benchmarks and competitive with prior results without fine-tuning
  • Models approach accuracy and robustness of humans

Paper Content

Introduction

  • Progress in speech recognition has been improved by unsupervised pre-training techniques
  • These methods can use large datasets of unlabeled speech
  • Pre-trained audio encoders learn high-quality representations of speech, but lack an equivalently performant decoder
  • Fine-tuning can be complex and can lead to models exploiting dataset-specific quirks
  • Recent efforts have created larger datasets for speech recognition using automated pipelines
  • Moving beyond gold-standard datasets to larger weakly supervised datasets improves robustness and generalization
  • This work scales weakly supervised speech recognition to 680,000 hours of labeled audio data
  • This dataset includes 96 languages and 125,000 hours of X→en translation data
  • Joint multilingual and multitask training is beneficial
  • Inference code and models are available online

Approach

Data processing

  • Minimalist approach to data pre-processing
  • Training models to predict raw text of transcripts without standardization
  • Construct dataset from audio paired with transcripts on the internet
  • Automated filtering methods to improve transcript quality
  • Heuristics to detect and remove machine-generated transcripts

Model

  • Used an off-the-shelf architecture to avoid confounding findings
  • Audio re-sampled to 16,000 Hz and 80-channel logmagnitude Mel spectrogram representation computed on 25-millisecond windows with a stride of 10 milliseconds

Multitask format

  • Speech recognition involves many components, making it complex
  • We want a single model to perform the entire speech processing pipeline
  • We use a simple format to specify tasks and conditioning information
  • We use a sequence-to-sequence Transformer model for many different speech processing tasks
  • Tasks are represented as a sequence of tokens to be predicted by the decoder
  • Special tokens are used as task specifiers or classification targets

Training details

  • Trained suite of models of various sizes
  • Used data parallelism, FP16, dynamic loss scaling, activation checkpointing
  • Trained with AdamW and gradient norm clipping
  • Batch size of 256 segments, trained for 2-3 passes over dataset
  • No data augmentation or regularization
  • Fine-tuned models on subset of transcripts without speaker annotations
  • Additional Large model trained for 2.5X more epochs with SpecAugment, Stochastic Depth, and BPE Dropout

Experiments

Zero-shot evaluation

  • Goal of Whisper is to develop a single robust speech processing system.
  • Evaluate Whisper in a zero-shot setting without using any of the training data.

Evaluation metrics

  • Speech recognition systems are typically evaluated using the word error rate (WER) metric.
  • WER penalizes all differences between the model’s output and the reference transcript, including formatting differences.
  • There is active research into evaluation metrics that better correlate with human judgement.
  • To address this problem, the authors developed a text normalizer to minimize penalization of non-semantic differences.
  • The authors observed WER drops of up to 50% due to formatting quirks.

English speech recognition

  • Deep Speech 2 reported human-level performance on LibriSpeech test-clean in 2015
  • SOTA WER on LibriSpeech test-clean has dropped 73% since then
  • Speech recognition models trained on LibriSpeech remain far above human error rates when used in other settings
  • Gap between human and machine performance is due to different capabilities being measured by human and machine performance on a test set
  • Whisper models, trained on a broad and diverse distribution of audio, could potentially match human behavior better than existing systems
  • Zero-shot Whisper models match accuracy and robustness of humans on LibriSpeech

Multi-lingual speech recognition

  • Comparing to prior work on multilingual speech recognition, results are reported on two low-data benchmarks
  • Whisper performs well on Multilingual LibriSpeech, outperforming XLS-R
  • On VoxPopuli, Whisper underperforms prior work and only beats the VP-10K+FT baseline
  • Strong squared correlation coefficient of 0.83 between the log of the word error rate and the log of the amount of training data per language

Translation

  • Whisper models are studied for their translation capabilities
  • Performance is measured on the X→en subset of CoVoST2
  • Compared with Maestro, mSLAM, and XLS-R
  • Achieved a new state of the art of 29
  • Re-purposed Fleurs as a translation dataset
  • Visualized correlation between amount of translation training data and BLEU score
  • Correlation coefficient is lower than 0.83 observed for speech recognition
  • Welsh (CY) is an outlier with much worse than expected performance
  • Majority of Welsh translation data is actually English audio with English captions

Language identification

  • Fleurs dataset used to evaluate language identification
  • Whisper underperforms supervised SOTA by 13.6%
  • Whisper dataset contains no training data for 20 of 102 languages in Fleurs, upperbounding accuracy at 80.4%

Robustness to additive noise

  • Tested noise robustness of Whisper models and 14 LibriSpeech-trained models
  • Measured WER when white noise or pub noise added to audio
  • Pub noise represents natural noisy environment
  • 12 models pre-trained/fine-tuned on LibriSpeech, 2 NVIDIA STT models trained on mixture dataset
  • Performance degrades as noise becomes more intensive
  • All models perform worse than Whisper model under pub noise of SNR below 10 dB
  • Showcases Whisper’s robustness to noise

Long-form transcription

  • Whisper models are trained on 30-second audio chunks
  • Real-world applications often require transcribing longer audio
  • Strategy developed to transcribe long audio by consecutively transcribing 30-second segments
  • Beam search and temperature scheduling used to reliably transcribe long audio
  • Evaluated performance on seven datasets of various lengths and recording conditions

Comparison with human performance

  • Ambiguous or indistinct speech and labeling errors lead to different levels of irreducible error in datasets.
  • WER metrics from ASR systems alone are not enough to measure how much room for improvement exists.
  • 25 recordings from the Kincaid46 dataset were selected and 5 services were used to obtain transcripts.
  • The audio selection covers various recording conditions.
  • Computer-assisted transcription had the lowest aggregate WER, only 1.15% better than Whisper’s.
  • Pure-human performance was only a fraction of a percentage point better than Whisper’s.
  • Results indicate Whisper’s English ASR performance is close to human-level accuracy.

Analysis and ablations

Model scaling

  • Weakly supervised training approaches can use larger datasets than traditional supervised learning.
  • Data used in weakly supervised training is noisier and lower quality than gold-standard supervision.
  • Models trained on this kind of data may not reach human-level performance.
  • Models may learn to exploit the idiosyncrasies of the dataset, reducing their ability to generalize.

Dataset scaling

  • Whisper dataset is one of the largest ever created in supervised speech recognition
  • Trained a series of medium-sized models on subsampled versions of the dataset
  • Evaluated performance on English and multilingual speech recognition and X→en translation
  • Increases in dataset size result in improved performance on all tasks
  • Performance improves rapidly on English speech recognition from 3,000 to 13,000 hours
  • Diminishing returns observed with model size scaling for English speech recognition
  • Performance follows a power-law trend for multilingual speech recognition till 54,000 hours
  • Performance follows a log-linear improvement trend for X→en translation till 54,000 hours

Multitask and multilingual transfer

  • Jointly training a single model on many tasks and languages can lead to negative transfer.
  • Comparing performance of models trained on English speech recognition with multitask and multilingual training setup showed negative transfer for small models.
  • Multitask and multilingual models benefit more from scale and eventually outperform models trained on English data only.
  • For largest experiments, joint models also slightly outperform English-only models.

Text normalization

  • Developed text normalizer with Whisper to discount innocuous word errors
  • Risk of normalizer being overfitted to Whisper’s peculiarities
  • Compared performance of Whisper using normalizer vs. FairSpeech project
  • Most datasets perform similarly, without significant differences in WER reduction
  • On some datasets, normalizer reduces WER of Whisper models significantly more
  • Differences in reduction can be traced to different formats used by ground truth

Strategies for reliable long-form transcription

  • Use beam search with 5 beams to reduce repetition looping
  • Increase temperature from 0 to 1.0 when average log probability is lower than -1 or gzip compression rate is higher than 2.4
  • Provide transcribed text from preceding window when temperature is below 0.5
  • Combine no-speech probability threshold of 0.6 and average log-probability threshold of -1 to improve voice activity detection
  • Constrain initial timestamp token to be between 0.0 and 1.0 second
  • Deep learning improved speech recognition performance with model depth and size
  • Dataset size improved performance from 3 hours of TIMIT training data to 2,000 hours of Switchboard dataset
  • Multitask learning has been studied for a long time
  • Multi-domain training increases robustness and generalization

Limitations and future work

  • Improved decoding strategies can reduce errors in long-form transcription
  • Increase training data for lower-resource languages
  • Study fine-tuning to compare with prior work
  • Investigate benefits of training encoder, decoder, or both

Conclusion

  • Scaling weakly supervised pretraining has been underappreciated in speech recognition research
  • Results achieved without self-supervision and self-training techniques
  • Training on large and diverse supervised dataset and focusing on zero-shot transfer can improve robustness of speech recognition system
  • Evaluation datasets include LibriSpeech, TED-LIUM 3, Common Voice 5.1, Artie bias corpus, CallHome and Switchboard, WSJ, CORAAL, CHiME-6, AMI-IHM and AMI-SDM1
  • Long-form English-only datasets include TED-LIUM 3, Meanwhile, Rev16, CORAAL, Multilingual LibriSpeech, VoxPopuli, Common Voice 9, CoVOST 2
  • Models from HuggingFace used for comparison
  • Text normalization steps to standardize English texts
  • Sequence-to-sequence Transformer model trained on many different speech processing tasks
  • Zero-shot Whisper models close the gap to human robustness
  • Correlation of pre-training supervision amount with downstream speech recognition and translation performance
  • Whisper competitive with state-of-the-art commercial and open-source ASR systems in long-form transcription
  • Performance close to professional human transcribers
  • Multitask and multilingual transfer improves with scale
  • Kincaid46 dataset used for human transcription benchmark
  • Architecture details of Whisper model family
  • Text normalizer has similar effect on reducing WERs between Whisper models and other open-source models
  • Long-form transcription performance improves incrementally with additional decoding heuristics
  • MUTE English English transcription WER and BLEU scores on Fleurs