Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introducing a language modeling approach for text to speech synthesis
  • Training a neural codec language model using discrete codes
  • Regard TTS as a conditional language modeling task
  • Pre-training stage scales up TTS training data to 60K hours of English speech
  • Vall-E can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording
  • Vall-E outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity
  • Vall-E preserves the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis

Paper Content

Introduction

  • Speech synthesis has advanced through neural networks and end-to-end modeling in the last decade
  • Current TTS systems use a pipeline with an acoustic model and a vocoder using mel spectrograms as intermediate representations
  • High-quality clean data from recording studios is needed for advanced TTS systems
  • Existing work leverages speaker adaptation and speaker encoding methods to tackle the zero-shot TTS problem
  • Recent years have seen notable performance improvement for data increase in the text language model
  • VALL-E is the first language model based TTS framework leveraging large, diverse, and multi-speaker speech data
  • VALL-E generates acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt
  • VALL-E is trained with LibriLight, a corpus consisting of 60K hours of English speech with over 7000 unique speakers
  • VALL-E significantly outperforms the state-of-the-art zero-shot TTS system on LibriSpeech and VCTK
  • VALL-E is able to provide diverse outputs with the same input text and keep the acoustic environment and speaker’s emotion of the acoustic prompt
  • Cascaded TTS systems use a pipeline with an acoustic model and a vocoder
  • End-to-end TTS models jointly optimize the acoustic model and vocoder
  • Zero-shot multi-speaker TTS techniques are used to customize a TTS system to an arbitrary voice
  • Speaker adaptation and speaker encoding approaches are used
  • Advanced speaker embedding models can be employed
  • Advanced but complex speaker encoder can be designed
  • Diffusion model based TTS is extended to zero-shot TTS
  • Audio codec code is used as intermediate representations
  • Self-supervised learning is used in speech understanding and speech-to-speech generation
  • HuBERT codes, VQVAE codes, and a speaker encoder are combined
  • Audio codecs are used to synthesize speech without training a vocoder
  • µ-law transformation is used to quantize audio
  • Vector quantization is used for feature extraction
  • Neural codec models are used to encode waveform into discrete acoustic codes

Vall-e 4.1 problem formulation: regarding tts as conditional codec language modeling

  • Dataset consists of audio samples and corresponding phoneme transcriptions
  • Neural codec model used to encode audio samples into discrete acoustic codes
  • Acoustic codes used to reconstruct waveform
  • Zero-shot TTS regarded as conditional codec language modeling task
  • Neural language model trained to generate acoustic code matrix conditioned on phoneme sequence and acoustic prompt

Training: conditional codec language modeling

  • Neural speech codec model operates on discrete audio representations
  • Tokens have hierarchical structure
  • Each quantizer is trained to model residual from previous quantizers
  • Two conditional language models are designed in hierarchical manner
  • Autoregressive (AR) decoder-only language model is trained for tokens from first quantizer
  • Non-autoregressive (NAR) language model is trained for tokens from second to last quantizers
  • AR model is used for length prediction, NAR model is used for time complexity
  • AR model has phoneme embedding, acoustic embedding, transformer decoder, and prediction layer
  • NAR model has eight separate acoustic embedding layers
  • Adaptive Layer Normalization is used in NAR model

Inference: in-context learning via prompting

  • In-context learning is the ability of a text-based language model to predict labels for unseen inputs without additional parameter updates.
  • Existing TTS systems have weak in-context learning capability, requiring additional fine-tuning or degrading for unseen speakers.
  • VALL-E is a model that generates given content for unseen speakers, using a phoneme prompt and an acoustic prefix.

Experiment

Experiment setup

  • LibriLight is a dataset of 60K hours of unlabelled English speech from audiobooks with around 7000 distinct speakers
  • 960 hours of LibriSpeech is used to train a hybrid DNN-HMM ASR model
  • EnCodec model is used to generate acoustic code matrix for 60K hours of data
  • AR and NAR models have same transformer architecture
  • Waveform length is randomly cropped between 10 and 20 seconds
  • Models are trained on 16 NVIDIA TESLA V100 32GB GPUs
  • YourTTS is used as baseline
  • WavLM-TDNN is used to evaluate speaker similarity
  • HuBERT-Large is used to calculate WER
  • CMOS and SMOS are used for human evaluation

Librispeech evaluation

  • LibriSpeech used for zero-shot TTS evaluation

Vctk evaluation

  • Evaluated model on VCTK with 108 speakers, none seen during training
  • Compared model with baseline, VALL-E outperformed baseline
  • Performance gap larger when only 3s prompts available
  • Model able to generate more similar speech with longer prompts
  • Human evaluation of 60 speakers, 11 unseen, 49 seen
  • VALL-E better speaker similarity than baseline
  • VALL-E +0.23 CMOS over YourTTS
  • VALL-E +0.04 CMOS over ground-truth

Qualitative analysis

  • Previous TTS systems have a one-one mapping between input and output
  • VALL-E uses sampling-based method to generate tokens with randomness
  • Outputs have different lengths and phrase durations
  • Acoustic environment consistency between prompt and generation
  • Emotion of prompt is preserved in speech synthesis

Conclusion, limitations, and future work

  • Introduced VALL-E, a language model approach for TTS
  • Pre-trained VALL-E with 60K hours of speech data
  • Achieved new state-of-the-art zero-shot TTS results
  • VALL-E can keep acoustic environment and speaker’s emotion
  • Disordered attention alignments exist in phoneme-to-acoustic language part
  • Insufficient data coverage of accent speakers
  • Future work to leverage non-autoregressive models and modify attention mechanism