Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

YourTTS is a multilingual approach to zero-shot multi-speaker TTS.
Achieved SOTA results in zero-shot multi-speaker TTS and comparable results in zero-shot voice conversion.
Promising results in target language with single-speaker dataset.
Can be fine-tuned with less than 1 minute of speech to achieve SOTA results in voice similarity.

Paper Content

Introduction

Text-to-Speech (TTS) systems have advanced with deep learning approaches
Zero-shot multi-speaker TTS (ZS-TTS) proposed by [5] and extended by [6]
Tacotron 2 adapted with external speaker embeddings [1] and LDE embeddings [2]
Attentron proposed a finegrained encoder with an attention mechanism [3]
ZSM-SS used a Transformer-based architecture with a normalization architecture and an external speaker encoder [11]
SC-GlowTTS used flow-based models and improved voice similarity for unseen speakers [4]
Multilingual TTS has evolved aiming at learning models for multiple languages at the same time [15-18]

Yourtts model

YourTTS builds upon VITS
Raw text is used as input instead of phonemes
Transformer-based text encoder is used
Language embeddings are concatenated into input characters
Number of transformer blocks and hidden channels increased
Decoder is stack of 4 affine coupling layers
Vocoder is HiFi-GAN version 1
VAE connects TTS model and vocoder
Stochastic duration predictor used to generate diverse rhythms
Zero-shot multi-speaker capabilities enabled by conditioning layers on external speaker embeddings
SCL used to maximize cosine similarity between generated and ground truth audio

Experiments

Speaker encoder

H/ASP model used as speaker encoder
Model trained with Prototypical Angular plus Softmax loss functions
Model evaluated in VoxCeleb 2 and Multilingual LibriSpeech datasets
Model achieved state-of-the-art results in Vox-Celeb 1 test subset
Model achieved average Equal Error Rate of 1.967

Audio datasets

Investigated 3 languages using 1 dataset per language
Pre-processing to make samples of similar loudness and remove long periods of silence
Audios resampled to 16KHz and voice activity detection used to trim silences
Normalized audio to -27dB using RMS-based normalization
Used VCTK dataset for English with 44 hours of speech and 109 speakers
Divided VCTK dataset into train, development and test sets
Used TTS-Portuguese Corpus for Portuguese with 10 hours of speech
Used fr FR set of M-AILABS dataset for French with 2 female and 3 male speakers
Used 11 VCTK speakers for testing English zero-shot multi-speaker capabilities
Used 10 speakers from LibriTTS dataset for testing outside of VCTK domain
Used 10 speakers from MLS dataset for testing Portuguese
Used 4 speakers from Common Voice dataset for speaker adaptation experiments

Experimental setup

4 training experiments were conducted using YourTTS
Experiment 1 used VCTK dataset (monolingual)
Experiment 2 used VCTK and TTS-Portuguese datasets (bilingual)
Experiment 3 used VCTK, TTS-Portuguese and M-AILABS french datasets (trilingual)
Experiment 4 used 1151 additional English speakers from LibriTTS partitions

Results and discussion

Mean Opinion Score (MOS) study was used to evaluate synthesized speech quality
Speaker Encoder Cosine Similarity (SECS) was calculated between speaker embeddings of two audios
SECS ranges from -1 to 1, with larger value indicating stronger similarity
Similarity MOS (Sim-MOS) was also reported
Experiments involved 3 languages, but only 2 languages were used to compute metrics
MOS scores were obtained with rigorous crowdsourcing
Reference audio was the fifth sentence of the VCTK dataset
5 sentences per speaker were synthesized for inference

Vctk dataset

Experiments 1 and 2 + SCL achieved the same SECS and similar Sim-MOS
Use of SCL improved similarity in 2 out of 3 experiments
SECS for all experiments higher than ground truth
Similarity and quality results similar to ground truth
Superior results in similarity and quality compared to other studies

Libritts dataset

Experiment 4 achieved the best LibriTTS similarity
Experiment 1 achieved the best MOS result

Portuguese mls dataset

Experiment 3+SCL achieved highest MOS metric (4.11±0.07)
Model trained with single-speaker dataset of medium quality achieved good quality in zero-shot multi-speaker synthesis
Sim-MOS highest for experiment 3 (3.19±0.10)
SECS highest for experiment 4+SCL
Performance of model in Portuguese affected by gender

Speaker consistency loss

Use of Speaker Consistency Loss (SCL) improved similarity measured by SECS
SCL can help the generalization in recording characteristics not seen in training
SCL slightly decreases the quality of generated audio

Zero-shot voice conversion

SC-GlowTTS model does not provide speaker identity information to the encoder.
YourTTS model uses external speaker embeddings to enable zero-shot voice conversion.
8 speakers (4M/4F) from VCTK and MLS Portuguese datasets were used for analysis.

Intra-lingual results

Our model achieved a MOS of 4.20±0.05 and a Sim-MOS of 4.07±0.06 for zero-shot voice conversion from one English-speaker to another English-speaker.
Our model achieved a MOS of 3.64 ± 0.09 and a Sim-MOS of 3.43 ± 0.09 for zero-shot voice conversion from one Portuguese speaker to another Portuguese speaker.

Cross-lingual results

Transfer between English and Portuguese speakers works well
Transfer from Portuguese to English speakers has lower quality
Low quality of voice conversion from Portuguese male to English female speakers
Lack of female speakers in training of model hinders generalization
Gender does not significantly influence model’s performance in English

Speaker adaptation

Challenge for generalization of zero-shot multi-speaker TTS models
4 speakers used for fine-tuning from Common Voice dataset
Weighted random sampling used to guarantee samples from adapted speakers appear in a quarter of the batch
Model trained for 1500 steps
Fine-tuning with less than 1 minute of speech from speakers achieved promising results
Direct relationship between amount of speech used and naturalness of speech

Conclusions, limitations and future work

Presented YourTTS, achieved SOTA results in zero-shot multi-speaker TTS and zero-shot voice conversion
Can achieve promising results in a target language using only a single speaker dataset
Can be adjusted to a new voice using less than 1 minute of speech
Model presents instability in the stochastic duration predictor
Mispronunciations occur for some words, especially in Portuguese
Gender significantly influences the model’s performance in Portuguese voice conversion

Link to paper#

Abstract#

Paper Content#

Introduction#

Yourtts model#

Experiments#

Speaker encoder#

Audio datasets#

Experimental setup#

Results and discussion#

Vctk dataset#

Libritts dataset#

Portuguese mls dataset#

Speaker consistency loss#

Zero-shot voice conversion#

Intra-lingual results#

Cross-lingual results#

Speaker adaptation#

Conclusions, limitations and future work#

Link to paper

Abstract

Paper Content

Introduction

Yourtts model

Experiments

Speaker encoder

Audio datasets

Experimental setup

Results and discussion

Vctk dataset

Libritts dataset

Portuguese mls dataset

Speaker consistency loss

Zero-shot voice conversion

Intra-lingual results

Cross-lingual results

Speaker adaptation

Conclusions, limitations and future work