Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introducing a language modeling approach for text to speech synthesis
- Training a neural codec language model using discrete codes
- Regard TTS as a conditional language modeling task
- Pre-training stage scales up TTS training data to 60K hours of English speech
- Vall-E can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording
- Vall-E outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity
- Vall-E preserves the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis
Paper Content
Introduction
- Speech synthesis has advanced through neural networks and end-to-end modeling in the last decade
- Current TTS systems use a pipeline with an acoustic model and a vocoder using mel spectrograms as intermediate representations
- High-quality clean data from recording studios is needed for advanced TTS systems
- Existing work leverages speaker adaptation and speaker encoding methods to tackle the zero-shot TTS problem
- Recent years have seen notable performance improvement for data increase in the text language model
- VALL-E is the first language model based TTS framework leveraging large, diverse, and multi-speaker speech data
- VALL-E generates acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt
- VALL-E is trained with LibriLight, a corpus consisting of 60K hours of English speech with over 7000 unique speakers
- VALL-E significantly outperforms the state-of-the-art zero-shot TTS system on LibriSpeech and VCTK
- VALL-E is able to provide diverse outputs with the same input text and keep the acoustic environment and speaker’s emotion of the acoustic prompt
Related work
- Cascaded TTS systems use a pipeline with an acoustic model and a vocoder
- End-to-end TTS models jointly optimize the acoustic model and vocoder
- Zero-shot multi-speaker TTS techniques are used to customize a TTS system to an arbitrary voice
- Speaker adaptation and speaker encoding approaches are used
- Advanced speaker embedding models can be employed
- Advanced but complex speaker encoder can be designed
- Diffusion model based TTS is extended to zero-shot TTS
- Audio codec code is used as intermediate representations
- Self-supervised learning is used in speech understanding and speech-to-speech generation
- HuBERT codes, VQVAE codes, and a speaker encoder are combined
- Audio codecs are used to synthesize speech without training a vocoder
- µ-law transformation is used to quantize audio
- Vector quantization is used for feature extraction
- Neural codec models are used to encode waveform into discrete acoustic codes
Vall-e 4.1 problem formulation: regarding tts as conditional codec language modeling
- Dataset consists of audio samples and corresponding phoneme transcriptions
- Neural codec model used to encode audio samples into discrete acoustic codes
- Acoustic codes used to reconstruct waveform
- Zero-shot TTS regarded as conditional codec language modeling task
- Neural language model trained to generate acoustic code matrix conditioned on phoneme sequence and acoustic prompt
Training: conditional codec language modeling
- Neural speech codec model operates on discrete audio representations
- Tokens have hierarchical structure
- Each quantizer is trained to model residual from previous quantizers
- Two conditional language models are designed in hierarchical manner
- Autoregressive (AR) decoder-only language model is trained for tokens from first quantizer
- Non-autoregressive (NAR) language model is trained for tokens from second to last quantizers
- AR model is used for length prediction, NAR model is used for time complexity
- AR model has phoneme embedding, acoustic embedding, transformer decoder, and prediction layer
- NAR model has eight separate acoustic embedding layers
- Adaptive Layer Normalization is used in NAR model
Inference: in-context learning via prompting
- In-context learning is the ability of a text-based language model to predict labels for unseen inputs without additional parameter updates.
- Existing TTS systems have weak in-context learning capability, requiring additional fine-tuning or degrading for unseen speakers.
- VALL-E is a model that generates given content for unseen speakers, using a phoneme prompt and an acoustic prefix.
Experiment
Experiment setup
- LibriLight is a dataset of 60K hours of unlabelled English speech from audiobooks with around 7000 distinct speakers
- 960 hours of LibriSpeech is used to train a hybrid DNN-HMM ASR model
- EnCodec model is used to generate acoustic code matrix for 60K hours of data
- AR and NAR models have same transformer architecture
- Waveform length is randomly cropped between 10 and 20 seconds
- Models are trained on 16 NVIDIA TESLA V100 32GB GPUs
- YourTTS is used as baseline
- WavLM-TDNN is used to evaluate speaker similarity
- HuBERT-Large is used to calculate WER
- CMOS and SMOS are used for human evaluation
Librispeech evaluation
- LibriSpeech used for zero-shot TTS evaluation
Vctk evaluation
- Evaluated model on VCTK with 108 speakers, none seen during training
- Compared model with baseline, VALL-E outperformed baseline
- Performance gap larger when only 3s prompts available
- Model able to generate more similar speech with longer prompts
- Human evaluation of 60 speakers, 11 unseen, 49 seen
- VALL-E better speaker similarity than baseline
- VALL-E +0.23 CMOS over YourTTS
- VALL-E +0.04 CMOS over ground-truth
Qualitative analysis
- Previous TTS systems have a one-one mapping between input and output
- VALL-E uses sampling-based method to generate tokens with randomness
- Outputs have different lengths and phrase durations
- Acoustic environment consistency between prompt and generation
- Emotion of prompt is preserved in speech synthesis
Conclusion, limitations, and future work
- Introduced VALL-E, a language model approach for TTS
- Pre-trained VALL-E with 60K hours of speech data
- Achieved new state-of-the-art zero-shot TTS results
- VALL-E can keep acoustic environment and speaker’s emotion
- Disordered attention alignments exist in phoneme-to-acoustic language part
- Insufficient data coverage of accent speakers
- Future work to leverage non-autoregressive models and modify attention mechanism