Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introducing a new collection of spoken English audio for training speech recognition systems
Audio is derived from open-source audio books from the LibriVox project
Contains over 60K hours of audio, largest freely-available corpus of speech
Audio segmented using voice activity detection and tagged with SNR, speaker ID and genre descriptions
Baseline systems and evaluation metrics for 3 settings: zero resource/unsupervised, semi-supervised, distant supervision
Evaluated on standard LibriSpeech dev and test sets

Automatic Speech Recognition (ASR) has made progress in recent years with deep neural networks and large datasets
Costs of annotating larger datasets are prohibitive
Interest in weakly supervised solutions with fewer human annotations
Semi-supervised setting: fraction of dataset labelled, rest unlabelled
Distant supervision setting: dataset mostly or entirely unlabelled, but large quantities of unaligned text
Pretraining with labels from other languages or unsupervised objectives
Zero resource ASR discovers its own units from raw speech
Libri-light: open-source corpus of unlabelled speech and common set of metrics to evaluate three settings
Test sets identical to LibriSpeech to compare weakly supervised results with state-of-the art
Baseline system and datasets open source

Open source software and datasets facilitate machine learning progress
LibriSpeech is a large open-source dataset with audio books and sentence-level annotations
CommonVoice project contains 2900 hours of read speech in 37 languages
Wilderness dataset contains text of Bible read in 750 languages
Zero Resource Challenge has released datasets and metrics for unsupervised setting
IARPA Babel program has initiated push towards limited supervision for less studied languages
Dataset contains 10 hours of transcribed speech and larger amounts of untranscribed audio
Dataset contains 4 parts: unlabelled speech, limited labels, dev/test sets, and unaligned text
Unlabelled Speech Training Set obtained from LibriVox repository
Dataset splits based on different sizes: unlab-60k, unlab-6k, unlab-600

Selected 3 subsets of LibriSpeech training set: 10 hour, 1 hour, and 6 10-minute sets
Half of utterances from clean and other training sets
Orthographic and phonetic transcriptions from LibriSpeech
Dev and test sets same as LibriSpeech
LM corpus provided in LibriSpeech with 800M tokens and 200k vocabulary size

Aim of unsupervised setting is to extract speech representations that encode phonetic content while ignoring irrelevant information
Unsupervised setting evaluated using ABX error, a distance-based metric
For semi-supervised setting, quality of learned acoustic representations evaluated with little annotated data
Distant supervision evaluated by Word Error Rate (WER)

PyTorch implementation of Contrastive Predictive Coding (CPC) system used to predict hidden states of future speech frames
Encoder maps waveforms to hidden states using 5 convolutional layers
Sequence model encodes hidden states into 512-dimensional phonetic embedding with one layer of Gated Recurrent Units (GRUs)
Predictor maps last phonetic embedding onto future hidden state using linear projection
Model trained discriminatively to avoid trivial solution
Original paper obtained 65.5% accuracy on phoneme classification
Modified system obtained 68.9% accuracy with 4 times fewer parameters
Semi-supervised setting uses baseline pretrained CPC system with linear classifier trained with CTC loss
Distant supervision setting uses pretrained CPC system with improved CTC layer and pseudo-labels generated by beam-search decoding

Table 2 shows that CPC embeddings have good ABX scores compared to an MFCC baseline.
Results in the semi-supervised setting (Table 3) show gains in PER when using unsupervised pretraining.
Results on distant supervision (Table 4) show that increasing the amount of unsupervised pretraining helps.

Introduced a new large dataset for benchmarking ASR systems
Unsupervised training with larger dataset yields better features
Performance of systems trained with limited labels (10 min to 10 hours) improved
Baselines provided as proof-of-concept, significant margin with fully-supervised systems
Improvements include larger models, speaker-adversarial losses, fine tuning entire system, pseudo-labels retraining
Active learning could select useful parts of dataset
Language modeling techniques applied on unlabelled audio to improve representations
Dataset constructed with data download, exclusion of bad files, conversion to flac, extraction of VAD, SNR, and Perplexity
Voice Activity Detection accomplished using TDS acoustic model
SNR ratio calculated using VAD labels
JSON files constructed with metadata, SNR, perplexity, macro-genre tags, VAD information
Files split into cuts of different sizes
ABX task used to quantify property of features coding for same phonemes
Unsupervised feature model trained using Contrastive Predictive Coding algorithm
TDS model used for VAD has 100 million parameters
TDS model used for training has 20 million parameters
Training took approximatly two days on NVIDIA Tesla V100-SXM2-16GB
Training data limited, smaller TDS model used
Model optimized with plain SGD with momentum