Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introducing a new collection of spoken English audio for training speech recognition systems
  • Audio is derived from open-source audio books from the LibriVox project
  • Contains over 60K hours of audio, which the authors describe as, to their knowledge, the largest freely available corpus of speech
  • Audio segmented using voice activity detection and tagged with SNR, speaker ID and genre descriptions
  • Baseline systems and evaluation metrics for 3 settings: zero resource/unsupervised, semi-supervised, distant supervision
  • Evaluated on standard LibriSpeech dev and test sets

Paper Content

Introduction

  • Automatic Speech Recognition (ASR) has made progress in recent years with deep neural networks and large datasets
  • Costs of annotating larger datasets are prohibitive
  • Interest in weakly supervised solutions with fewer human annotations
  • Semi-supervised setting: fraction of dataset labelled, rest unlabelled
  • Distant supervision setting: dataset mostly or entirely unlabelled, but large quantities of unaligned text
  • Pretraining with labels from other languages or unsupervised objectives
  • Zero resource ASR discovers its own units from raw speech
  • Libri-light: open-source corpus of unlabelled speech and common set of metrics to evaluate three settings
  • Test sets are identical to LibriSpeech, so weakly supervised results can be compared with the state of the art
  • Baseline system and datasets open source
  • Open source software and datasets facilitate machine learning progress
  • LibriSpeech is a large open-source dataset with audio books and sentence-level annotations
  • CommonVoice project contains 2900 hours of read speech in 37 languages
  • Wilderness dataset contains text of Bible read in 750 languages
  • Zero Resource Challenge has released datasets and metrics for unsupervised setting
  • IARPA Babel program has initiated push towards limited supervision for less studied languages
  • Dataset contains 10 hours of transcribed speech and larger amounts of untranscribed audio
  • Libri-light contains 4 parts: unlabelled speech, limited labels, dev/test sets, and unaligned text
  • Unlabelled Speech Training Set obtained from LibriVox repository
  • Dataset splits based on different sizes: unlab-60k, unlab-6k, unlab-600

Limited-resource training set

  • Selected 3 subsets of the LibriSpeech training set: a 10-hour set, a 1-hour set, and six 10-minute sets
  • In each subset, half of the utterances come from the clean training set and half from the other training set
  • Orthographic and phonetic transcriptions from LibriSpeech
  • Dev and test sets same as LibriSpeech
  • LM corpus provided in LibriSpeech with 800M tokens and 200k vocabulary size

Metrics

  • Aim of unsupervised setting is to extract speech representations that encode phonetic content while ignoring irrelevant information
  • Unsupervised setting evaluated using ABX error, a distance-based metric
  • For semi-supervised setting, quality of learned acoustic representations evaluated with little annotated data
  • Distant supervision evaluated by Word Error Rate (WER)
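
The ABX error mentioned above can be illustrated with a minimal sketch. It assumes each speech token has already been collapsed to a single fixed-size vector and uses cosine distance; real ABX evaluations typically compare variable-length frame sequences, so the function names and the simplification here are illustrative assumptions, not the paper's evaluation code. For every triplet (A, B, X), where A and X belong to the same category and B to a different one, an error is counted when X is not closer to A than to B.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_error(items_a, items_b, distance=cosine_distance):
    """Fraction of (A, B, X) triplets in which X is closer to B than to A.

    items_a: vectors for tokens of category a (A and X are drawn from here).
    items_b: vectors for tokens of category b (B is drawn from here).
    Ties count as half an error.
    """
    if len(items_a) < 2 or len(items_b) < 1:
        raise ValueError("need at least two tokens of a and one token of b")
    errors = 0.0
    total = 0
    # A and X are two distinct tokens of the same category.
    for a, x in permutations(items_a, 2):
        for b in items_b:
            d_ax = distance(a, x)
            d_bx = distance(b, x)
            if d_ax > d_bx:
                errors += 1.0
            elif d_ax == d_bx:
                errors += 0.5
            total += 1
    return errors / total
```

A representation that separates the two categories well gives an error near 0, while a representation that carries no category information gives an error near 0.5.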

Baseline systems

  • PyTorch implementation of Contrastive Predictive Coding (CPC) system used to predict hidden states of future speech frames
  • Encoder maps waveforms to hidden states using 5 convolutional layers
  • Sequence model encodes hidden states into 512-dimensional phonetic embedding with one layer of Gated Recurrent Units (GRUs)
  • Predictor maps last phonetic embedding onto future hidden state using linear projection
  • Model trained discriminatively to avoid trivial solution
  • Original paper obtained 65.5% accuracy on phoneme classification
  • Modified system obtained 68.9% accuracy with 4 times fewer parameters
  • Semi-supervised setting uses baseline pretrained CPC system with linear classifier trained with CTC loss
  • Distant supervision setting uses pretrained CPC system with improved CTC layer and pseudo-labels generated by beam-search decoding
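
The discriminative objective described above can be sketched as a contrastive (InfoNCE-style) loss: the linearly projected context embedding has to score the true future hidden state higher than a set of distractor states. The snippet below is a toy, pure-Python illustration with plain lists standing in for tensors. The function names, the dot-product scoring, and the way negatives are supplied are assumptions made for illustration, not the baseline's PyTorch code.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_predict(weights, context):
    """Linear projection (matrix-vector product) of the context embedding."""
    return [dot(row, context) for row in weights]


def info_nce_loss(prediction, positive, negatives):
    """Cross-entropy of picking the true future state among distractors.

    prediction: projected context embedding.
    positive:   encoder hidden state of the true future frame.
    negatives:  encoder hidden states sampled from other frames.
    """
    scores = [dot(prediction, positive)]
    scores += [dot(prediction, n) for n in negatives]
    # Numerically stable log-sum-exp over all candidate scores.
    max_score = max(scores)
    log_sum_exp = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    # Negative log-probability assigned to the positive candidate.
    return log_sum_exp - scores[0]


# Toy usage: an identity projection and a 2-dimensional embedding.
weights = [[1.0, 0.0], [0.0, 1.0]]
context = [1.0, 0.0]
prediction = linear_predict(weights, context)
loss = info_nce_loss(prediction, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
```

Because the loss only rewards telling the true future state apart from the distractors, mapping every frame to the same constant vector does not minimise it, which is why this kind of objective avoids the trivial solution.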

Results

  • Table 2 shows that CPC embeddings have good ABX scores compared to an MFCC baseline.
  • Results in the semi-supervised setting (Table 3) show gains in PER when using unsupervised pretraining.
  • Results on distant supervision (Table 4) show that increasing the amount of unsupervised pretraining helps.
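
The PER and WER figures referred to above are both edit-distance rates: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch follows, assuming transcripts are already tokenised into lists of words (for WER) or phonemes (for PER); the function names are illustrative and this is not the paper's scoring tool.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_token == hyp_token else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalised by the reference length (WER or PER)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    return edit_distance(reference, hypothesis) / len(reference)


# Word-level usage: one substitution and one deletion over six reference words.
wer = error_rate(
    "the cat sat on the mat".split(),
    "the cat sit on mat".split(),
)
```

The same function gives a PER when the token lists hold phonemes instead of words.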

Conclusion

  • Introduced a new large dataset for benchmarking ASR systems
  • Unsupervised training with larger dataset yields better features
  • Performance of systems trained with limited labels (10 min to 10 hours) improved
  • Baselines are provided as a proof of concept and still trail fully supervised systems by a significant margin
  • Possible improvements include larger models, speaker-adversarial losses, fine-tuning the entire system, and retraining on pseudo-labels
  • Active learning could select useful parts of dataset
  • Language modeling techniques could be applied to unlabelled audio to improve representations
  • Dataset constructed with data download, exclusion of bad files, conversion to flac, extraction of VAD, SNR, and Perplexity
  • Voice Activity Detection accomplished using TDS acoustic model
  • SNR calculated using VAD labels
  • JSON files constructed with metadata, SNR, perplexity, macro-genre tags, VAD information
  • Files split into cuts of different sizes
  • ABX task quantifies how well the features separate phonemes, i.e. whether tokens of the same phoneme lie closer together than tokens of different phonemes
  • Unsupervised feature model trained using Contrastive Predictive Coding algorithm
  • TDS model used for VAD has 100 million parameters
  • TDS model used for training has 20 million parameters
  • Training took approximately two days on NVIDIA Tesla V100-SXM2-16GB
  • Training data limited, smaller TDS model used
  • Model optimized with plain SGD with momentum
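
The step of computing SNR from VAD labels, listed above, can be sketched as follows. The notes only state that the SNR is calculated using the VAD labels, so the exact formula is an assumption here: the mean power of samples labelled as speech is compared with the mean power of samples labelled as non-speech, and the ratio is expressed in decibels. The function name and the per-sample boolean mask are also illustrative choices.

```python
import math


def snr_db(samples, speech_mask):
    """Estimate SNR in dB from a waveform and per-sample VAD labels.

    samples:     sequence of audio sample values.
    speech_mask: sequence of booleans, True where the VAD detected speech.
    Assumption: non-speech regions contain only background noise.
    """
    speech_power = [s * s for s, is_speech in zip(samples, speech_mask) if is_speech]
    noise_power = [s * s for s, is_speech in zip(samples, speech_mask) if not is_speech]
    if not speech_power or not noise_power:
        raise ValueError("need both speech and non-speech samples")
    mean_speech = sum(speech_power) / len(speech_power)
    mean_noise = sum(noise_power) / len(noise_power)
    if mean_noise == 0.0:
        return math.inf   # perfectly silent background
    if mean_speech == 0.0:
        return -math.inf  # no energy in the speech regions
    return 10.0 * math.log10(mean_speech / mean_noise)


# Toy usage: speech samples with amplitude 1.0, noise with amplitude 0.1.
estimate = snr_db([1.0, -1.0, 0.1, -0.1], [True, True, False, False])
```

A per-file value of this kind is what would allow noisy recordings to be filtered out using the JSON metadata.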