Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- YourTTS is a multilingual approach to zero-shot multi-speaker TTS.
- Achieved SOTA results in zero-shot multi-speaker TTS and comparable results in zero-shot voice conversion.
- Promising results in target language with single-speaker dataset.
- Can be fine-tuned with less than 1 minute of speech to achieve SOTA results in voice similarity.
Paper Content
Introduction
- Text-to-Speech (TTS) systems have advanced with deep learning approaches
- Zero-shot multi-speaker TTS (ZS-TTS) proposed by [5] and extended by [6]
- Tacotron 2 adapted with external speaker embeddings [1] and LDE embeddings [2]
- Attentron proposed a finegrained encoder with an attention mechanism [3]
- ZSM-SS used a Transformer-based architecture with a normalization architecture and an external speaker encoder [11]
- SC-GlowTTS used flow-based models and improved voice similarity for unseen speakers [4]
- Multilingual TTS has evolved aiming at learning models for multiple languages at the same time [15-18]
Yourtts model
- YourTTS builds upon VITS
- Raw text is used as input instead of phonemes
- Transformer-based text encoder is used
- Language embeddings are concatenated into input characters
- Number of transformer blocks and hidden channels increased
- Decoder is stack of 4 affine coupling layers
- Vocoder is HiFi-GAN version 1
- VAE connects TTS model and vocoder
- Stochastic duration predictor used to generate diverse rhythms
- Zero-shot multi-speaker capabilities enabled by conditioning layers on external speaker embeddings
- SCL used to maximize cosine similarity between generated and ground truth audio
Experiments
Speaker encoder
- H/ASP model used as speaker encoder
- Model trained with Prototypical Angular plus Softmax loss functions
- Model evaluated in VoxCeleb 2 and Multilingual LibriSpeech datasets
- Model achieved state-of-the-art results in Vox-Celeb 1 test subset
- Model achieved average Equal Error Rate of 1.967
Audio datasets
- Investigated 3 languages using 1 dataset per language
- Pre-processing to make samples of similar loudness and remove long periods of silence
- Audios resampled to 16KHz and voice activity detection used to trim silences
- Normalized audio to -27dB using RMS-based normalization
- Used VCTK dataset for English with 44 hours of speech and 109 speakers
- Divided VCTK dataset into train, development and test sets
- Used TTS-Portuguese Corpus for Portuguese with 10 hours of speech
- Used fr FR set of M-AILABS dataset for French with 2 female and 3 male speakers
- Used 11 VCTK speakers for testing English zero-shot multi-speaker capabilities
- Used 10 speakers from LibriTTS dataset for testing outside of VCTK domain
- Used 10 speakers from MLS dataset for testing Portuguese
- Used 4 speakers from Common Voice dataset for speaker adaptation experiments
Experimental setup
- 4 training experiments were conducted using YourTTS
- Experiment 1 used VCTK dataset (monolingual)
- Experiment 2 used VCTK and TTS-Portuguese datasets (bilingual)
- Experiment 3 used VCTK, TTS-Portuguese and M-AILABS french datasets (trilingual)
- Experiment 4 used 1151 additional English speakers from LibriTTS partitions
Results and discussion
- Mean Opinion Score (MOS) study was used to evaluate synthesized speech quality
- Speaker Encoder Cosine Similarity (SECS) was calculated between speaker embeddings of two audios
- SECS ranges from -1 to 1, with larger value indicating stronger similarity
- Similarity MOS (Sim-MOS) was also reported
- Experiments involved 3 languages, but only 2 languages were used to compute metrics
- MOS scores were obtained with rigorous crowdsourcing
- Reference audio was the fifth sentence of the VCTK dataset
- 5 sentences per speaker were synthesized for inference
Vctk dataset
- Experiments 1 and 2 + SCL achieved the same SECS and similar Sim-MOS
- Use of SCL improved similarity in 2 out of 3 experiments
- SECS for all experiments higher than ground truth
- Similarity and quality results similar to ground truth
- Superior results in similarity and quality compared to other studies
Libritts dataset
- Experiment 4 achieved the best LibriTTS similarity
- Experiment 1 achieved the best MOS result
Portuguese mls dataset
- Experiment 3+SCL achieved highest MOS metric (4.11±0.07)
- Model trained with single-speaker dataset of medium quality achieved good quality in zero-shot multi-speaker synthesis
- Sim-MOS highest for experiment 3 (3.19±0.10)
- SECS highest for experiment 4+SCL
- Performance of model in Portuguese affected by gender
Speaker consistency loss
- Use of Speaker Consistency Loss (SCL) improved similarity measured by SECS
- SCL can help the generalization in recording characteristics not seen in training
- SCL slightly decreases the quality of generated audio
Zero-shot voice conversion
- SC-GlowTTS model does not provide speaker identity information to the encoder.
- YourTTS model uses external speaker embeddings to enable zero-shot voice conversion.
- 8 speakers (4M/4F) from VCTK and MLS Portuguese datasets were used for analysis.
Intra-lingual results
- Our model achieved a MOS of 4.20±0.05 and a Sim-MOS of 4.07±0.06 for zero-shot voice conversion from one English-speaker to another English-speaker.
- Our model achieved a MOS of 3.64 ± 0.09 and a Sim-MOS of 3.43 ± 0.09 for zero-shot voice conversion from one Portuguese speaker to another Portuguese speaker.
Cross-lingual results
- Transfer between English and Portuguese speakers works well
- Transfer from Portuguese to English speakers has lower quality
- Low quality of voice conversion from Portuguese male to English female speakers
- Lack of female speakers in training of model hinders generalization
- Gender does not significantly influence model’s performance in English
Speaker adaptation
- Challenge for generalization of zero-shot multi-speaker TTS models
- 4 speakers used for fine-tuning from Common Voice dataset
- Weighted random sampling used to guarantee samples from adapted speakers appear in a quarter of the batch
- Model trained for 1500 steps
- Fine-tuning with less than 1 minute of speech from speakers achieved promising results
- Direct relationship between amount of speech used and naturalness of speech
Conclusions, limitations and future work
- Presented YourTTS, achieved SOTA results in zero-shot multi-speaker TTS and zero-shot voice conversion
- Can achieve promising results in a target language using only a single speaker dataset
- Can be adjusted to a new voice using less than 1 minute of speech
- Model presents instability in the stochastic duration predictor
- Mispronunciations occur for some words, especially in Portuguese
- Gender significantly influences the model’s performance in Portuguese voice conversion