Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Introducing a language modeling approach for text to speech synthesis
Training a neural codec language model using discrete codes
Regard TTS as a conditional language modeling task
Pre-training stage scales up TTS training data to 60K hours of English speech
VALL-E can synthesize high-quality personalized speech from only a 3-second enrolled recording of an unseen speaker
VALL-E outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity
VALL-E preserves the speaker’s emotion and the acoustic environment of the acoustic prompt in synthesis

Paper Content

Introduction

Speech synthesis has advanced through neural networks and end-to-end modeling in the last decade
Current TTS systems use a pipeline with an acoustic model and a vocoder using mel spectrograms as intermediate representations
High-quality clean data from recording studios is needed for advanced TTS systems
Existing work leverages speaker adaptation and speaker encoding methods to tackle the zero-shot TTS problem
In text language modeling, recent years have seen notable performance improvements from scaling up training data
VALL-E is the first language model based TTS framework leveraging large, diverse, and multi-speaker speech data
VALL-E generates acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt
VALL-E is trained with LibriLight, a corpus consisting of 60K hours of English speech with over 7000 unique speakers
VALL-E significantly outperforms the state-of-the-art zero-shot TTS system on LibriSpeech and VCTK
VALL-E is able to provide diverse outputs with the same input text and keep the acoustic environment and speaker’s emotion of the acoustic prompt

Related work

Cascaded TTS systems use a pipeline with an acoustic model and a vocoder
End-to-end TTS models jointly optimize the acoustic model and vocoder
Zero-shot multi-speaker TTS techniques are used to customize a TTS system to an arbitrary voice
Speaker adaptation and speaker encoding approaches are used
Advanced speaker embedding models can be employed
Advanced but complex speaker encoder can be designed
Diffusion model based TTS is extended to zero-shot TTS
Audio codec code is used as intermediate representations
Self-supervised learning is used in speech understanding and speech-to-speech generation
HuBERT codes, VQVAE codes, and a speaker encoder are combined
Audio codecs are used to synthesize speech without training a vocoder
µ-law transformation is used to quantize audio
Vector quantization is used for feature extraction
Neural codec models are used to encode waveform into discrete acoustic codes

VALL-E problem formulation: regarding TTS as conditional codec language modeling

Dataset consists of audio samples and corresponding phoneme transcriptions
Neural codec model used to encode audio samples into discrete acoustic codes
Acoustic codes used to reconstruct waveform
Zero-shot TTS regarded as conditional codec language modeling task
Neural language model trained to generate acoustic code matrix conditioned on phoneme sequence and acoustic prompt
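
In symbols, the formulation above can be written roughly as follows. The notation is introduced here for illustration: y is an audio sample, x its phoneme sequence, C the acoustic code matrix with T time steps, and C̃ the code matrix of the acoustic prompt. Writing the number of quantizers as 8 is an assumption, chosen to match the eight acoustic embedding layers described for the NAR model.

```latex
\begin{aligned}
C &= \operatorname{Codec}(y), \qquad C \in \mathbb{Z}^{T \times 8}, \\
\hat{y} &= \operatorname{Decodec}(C) \approx y, \\
\theta^{*} &= \arg\max_{\theta} \; p\left(C \mid x, \tilde{C};\, \theta\right).
\end{aligned}
```

At inference time the same conditional distribution is used with x taken from the text to be spoken and C̃ taken from the enrolled recording.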

Training: conditional codec language modeling

Neural speech codec model operates on discrete audio representations
Tokens have hierarchical structure
Each quantizer is trained to model residual from previous quantizers
Two conditional language models are designed in hierarchical manner
Autoregressive (AR) decoder-only language model is trained for tokens from first quantizer
Non-autoregressive (NAR) language model is trained for tokens from second to last quantizers
AR model offers flexible prediction of the output length; NAR model reduces the time complexity of generating the remaining quantizers' tokens from O(T) to O(1)
AR model has phoneme embedding, acoustic embedding, transformer decoder, and prediction layer
NAR model has eight separate acoustic embedding layers
Adaptive Layer Normalization is used in NAR model
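
The hierarchical token structure noted above, where each quantizer models the residual left by the previous ones, can be illustrated with a toy residual vector quantizer. This is a minimal sketch, not the codec used in the paper: the codebooks are random rather than trained, and the sizes (`DIM`, `NUM_QUANTIZERS`, `CODEBOOK_SIZE`) and function names are assumptions chosen for illustration.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen codeword so the next stage sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct a frame by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


rng = random.Random(0)
DIM, NUM_QUANTIZERS, CODEBOOK_SIZE = 4, 8, 16  # toy sizes, not the paper's
# Random (untrained) codebooks; later stages use a smaller scale to mimic finer detail.
codebooks = [
    [[rng.gauss(0, 1.0 / (stage + 1)) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for stage in range(NUM_QUANTIZERS)
]
frame = [rng.gauss(0, 1) for _ in range(DIM)]
codes, residual = rvq_encode(frame, codebooks)
reconstruction = rvq_decode(codes, codebooks)
```

Here `codes` holds one token per quantizer for a single frame. In the split described above, the first entry would be the AR model's target and the remaining entries the NAR model's targets.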

Inference: in-context learning via prompting

In-context learning is the ability of a text-based language model to predict labels for unseen inputs without additional parameter updates.
Existing TTS systems have weak in-context learning capability, requiring additional fine-tuning or degrading for unseen speakers.
VALL-E is a model that generates given content for unseen speakers, using a phoneme prompt and an acoustic prefix.
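
Prompting with a phoneme prompt and an acoustic prefix can be sketched as a plain continuation loop with no parameter updates. Everything below is hypothetical scaffolding: `toy_next_token` stands in for the AR model, the `EOS` marker, the vocabulary size of 1024, and the stopping rule are assumptions, and prepending the enrolled recording's phonemes to the target phonemes is one plausible way to build the phoneme prompt.

```python
import random

EOS = -1  # hypothetical end-of-sequence marker


def toy_next_token(phonemes, acoustic_tokens, rng):
    """Hypothetical stand-in for the AR model's next first-quantizer token.

    A real model would run a transformer decoder over the phonemes and the
    acoustic tokens generated so far.
    """
    # Arbitrary toy stopping rule: about three acoustic tokens per phoneme.
    if len(acoustic_tokens) >= 3 * len(phonemes):
        return EOS
    return rng.randrange(1024)  # toy vocabulary size, an assumption


def synthesize_first_layer(prompt_phonemes, target_phonemes, prompt_codes,
                           seed=0, max_len=1000):
    """Continue the acoustic prefix so that it covers the target phonemes."""
    rng = random.Random(seed)
    phonemes = prompt_phonemes + target_phonemes  # phoneme prompt
    tokens = list(prompt_codes)                   # acoustic prefix from the enrolled recording
    while len(tokens) < max_len:
        nxt = toy_next_token(phonemes, tokens, rng)
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens[len(prompt_codes):]             # keep only the newly generated tokens


generated = synthesize_first_layer(
    prompt_phonemes=["HH", "AH", "L", "OW"],
    target_phonemes=["W", "ER", "L", "D"],
    prompt_codes=[17, 512, 3, 99, 640, 8],
)
```

The speaker is never seen in training in this setup; the only speaker information the loop receives is the acoustic prefix, which is what makes the procedure in-context learning rather than fine-tuning.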

Experiment

Experiment setup

LibriLight is a dataset of 60K hours of unlabelled English speech from audiobooks with around 7000 distinct speakers
A hybrid DNN-HMM ASR model trained on 960 hours of LibriSpeech generates transcriptions for the unlabelled audio
EnCodec model is used to generate acoustic code matrix for 60K hours of data
AR and NAR models have same transformer architecture
Waveform length is randomly cropped between 10 and 20 seconds
Models are trained on 16 NVIDIA TESLA V100 32GB GPUs
YourTTS is used as baseline
WavLM-TDNN is used to evaluate speaker similarity
HuBERT-Large is used as the ASR model that transcribes the generated speech to calculate WER
CMOS and SMOS are used for human evaluation
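
For reference, word error rate (WER) is the word-level edit distance between a reference transcript and the ASR transcript of the synthesized audio, divided by the number of reference words. The function below is a minimal sketch of that standard definition; the ASR step itself is not shown, and the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A lower WER indicates that the synthesized speech is easier for the recognizer to transcribe, which serves as a proxy for robustness to deleted, inserted, or mispronounced words.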

LibriSpeech evaluation

LibriSpeech used for zero-shot TTS evaluation

VCTK evaluation

Evaluated model on VCTK with 108 speakers, none seen during training
Compared model with baseline, VALL-E outperformed baseline
Performance gap larger when only 3s prompts available
Model able to generate more similar speech with longer prompts
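Human evaluation on 60 sampled speakers, of which 11 are unseen and 49 were seen by the YourTTS baseline during its training
VALL-E achieves better speaker similarity than the baseline
VALL-E scores +0.23 CMOS over YourTTS
VALL-E scores +0.04 CMOS over the ground truth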

Qualitative analysis

Previous TTS systems have a one-to-one mapping between input and output
VALL-E uses sampling-based method to generate tokens with randomness
Outputs have different lengths and phrase durations
Acoustic environment consistency between prompt and generation
Emotion of prompt is preserved in speech synthesis
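
The contrast between a one-to-one mapping and sampling-based generation can be shown with a toy token picker. This is only a sketch: the logits are made up, and the temperature parameter and function names are assumptions rather than settings reported in the source.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits into probabilities; a higher temperature flattens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Deterministic choice: always the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, rng, temperature=1.0):
    """Stochastic choice: draw a token in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.3, -1.0]  # made-up scores for four candidate tokens
rng = random.Random(0)
greedy_runs = {greedy_pick(logits) for _ in range(100)}
sampled_runs = {sample_pick(logits, rng) for _ in range(100)}
```

Greedy decoding returns the same token on every run, whereas sampling spreads its choices over several tokens. Repeated over a whole sequence, that randomness is what produces outputs with different lengths and phrase durations for the same text.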

Conclusion, limitations, and future work

Introduced VALL-E, a language model approach for TTS
Pre-trained VALL-E with 60K hours of speech data
Achieved new state-of-the-art zero-shot TTS results
VALL-E can keep acoustic environment and speaker’s emotion
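Synthesis robustness: disordered attention alignments in the autoregressive phoneme-to-acoustic language model can leave some words unclear, missed, or duplicated
Data coverage: even 60K hours of training data does not cover every voice, especially speakers with accents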
Future work to leverage non-autoregressive models and modify attention mechanism
