Abstract

  • YourTTS is a multilingual approach to zero-shot multi-speaker TTS.
  • Achieved SOTA results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion.
  • Promising results in a target language using only a single-speaker dataset.
  • Can be fine-tuned with less than 1 minute of speech to achieve SOTA results in voice similarity.

Paper Content

Introduction

  • Text-to-Speech (TTS) systems have advanced with deep learning approaches
  • Zero-shot multi-speaker TTS (ZS-TTS) proposed by [5] and extended by [6]
  • Tacotron 2 adapted with external speaker embeddings [1] and LDE embeddings [2]
  • Attentron proposed a fine-grained encoder with an attention mechanism [3]
  • ZSM-SS used a Transformer-based architecture with a normalization architecture and an external speaker encoder [11]
  • SC-GlowTTS used flow-based models and improved voice similarity for unseen speakers [4]
  • Multilingual TTS has evolved aiming at learning models for multiple languages at the same time [15-18]

YourTTS model

  • YourTTS builds upon VITS
  • Raw text is used as input instead of phonemes
  • Transformer-based text encoder is used
  • Trainable language embeddings are concatenated to the embedding of each input character
  • Number of transformer blocks and hidden channels increased
  • Decoder is a stack of 4 affine coupling layers
  • Vocoder is HiFi-GAN version 1
  • VAE connects TTS model and vocoder
  • Stochastic duration predictor used to generate diverse rhythms
  • Zero-shot multi-speaker capabilities enabled by conditioning layers on external speaker embeddings
  • Speaker Consistency Loss (SCL) used to maximize cosine similarity between speaker embeddings of generated and ground-truth audio
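
The language conditioning in the list above can be illustrated with a small sketch. It only shows the concatenation step: one language vector is appended to every character embedding before the text encoder. The embedding sizes, the toy vocabulary, the random tables, and the name `embed_with_language` are all assumptions made for illustration, not details from the paper.

```python
import random

def embed_with_language(char_ids, lang_id, char_table, lang_table):
    """Append the language embedding to each character embedding."""
    lang_vec = lang_table[lang_id]
    # List concatenation: [char_dim values] + [lang_dim values]
    return [char_table[c] + lang_vec for c in char_ids]

random.seed(0)
CHAR_DIM, LANG_DIM = 6, 4  # illustrative sizes (assumption)
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}

# Randomly initialised lookup tables stand in for trainable embeddings.
char_table = [[random.gauss(0, 1) for _ in range(CHAR_DIM)] for _ in vocab]
lang_table = [[random.gauss(0, 1) for _ in range(LANG_DIM)] for _ in range(3)]

text = "bom dia"
char_ids = [vocab[ch] for ch in text]
encoder_input = embed_with_language(char_ids, 1, char_table, lang_table)

# Every position now carries CHAR_DIM + LANG_DIM values.
print(len(encoder_input), len(encoder_input[0]))
```

Because the same language vector is repeated at every position, the encoder sees the language identity alongside each character rather than only once per utterance.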

Experiments

Speaker encoder

  • H/ASP model used as speaker encoder
  • Model trained with Prototypical Angular plus Softmax loss functions
  • Model evaluated in VoxCeleb 2 and Multilingual LibriSpeech datasets
  • Model achieved state-of-the-art results on the VoxCeleb 1 test subset
  • Model achieved an average Equal Error Rate (EER) of 1.967
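
For readers unfamiliar with the metric, the Equal Error Rate is the operating point at which the false acceptance rate equals the false rejection rate of a speaker verification system. The sketch below approximates it by sweeping every observed score as a threshold. The function name and the toy scores are hypothetical, and reading the reported 1.967 as a percentage is an assumption, since the summary does not state the unit.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: a same-speaker pair scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker pair scored at or above it.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy similarity scores (assumption), one per trial pair.
same_speaker = [0.9, 0.8, 0.75, 0.4]
different_speaker = [0.5, 0.3, 0.2, 0.1]
eer_percent = 100 * equal_error_rate(same_speaker, different_speaker)
print(f"EER: {eer_percent:.1f}%")
```

A lower EER means the encoder separates speakers better, which matters here because the same encoder supplies the embeddings used for conditioning and for the similarity metrics.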

Audio datasets

  • Investigated 3 languages using 1 dataset per language
  • Pre-processing to make samples of similar loudness and remove long periods of silence
  • Audio resampled to 16 kHz and voice activity detection used to trim silences
  • Normalized audio to -27 dB using RMS-based normalization
  • Used VCTK dataset for English with 44 hours of speech and 109 speakers
  • Divided VCTK dataset into train, development and test sets
  • Used TTS-Portuguese Corpus for Portuguese with 10 hours of speech
  • Used fr_FR set of M-AILABS dataset for French with 2 female and 3 male speakers
  • Used 11 VCTK speakers for testing English zero-shot multi-speaker capabilities
  • Used 10 speakers from LibriTTS dataset for testing outside of VCTK domain
  • Used 10 speakers from MLS dataset for testing Portuguese
  • Used 4 speakers from Common Voice dataset for speaker adaptation experiments
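
The RMS-based loudness normalization mentioned in the pre-processing steps above can be sketched as follows. This is a generic illustration, not the tool used in the paper: it assumes floating-point samples in the range [-1, 1], treats the -27 dB target as a level relative to full scale, and uses hypothetical function names.

```python
import math

def rms_dbfs(samples):
    """RMS level in dB relative to full scale (1.0). Fails on pure silence."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20 * math.log10(rms)

def normalize_rms(samples, target_db=-27.0):
    """Scale the samples so that their RMS level equals target_db."""
    gain = 10 ** ((target_db - rms_dbfs(samples)) / 20)
    return [x * gain for x in samples]

# One second of a 220 Hz sine at 16 kHz as a stand-in for a speech clip.
sr = 16000
clip = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
normalized = normalize_rms(clip)

print(f"before: {rms_dbfs(clip):.2f} dB, after: {rms_dbfs(normalized):.2f} dB")
```

Applying one gain per clip brings clips from different corpora to a similar loudness without changing their dynamics.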

Experimental setup

  • 4 training experiments were conducted using YourTTS
  • Experiment 1 used VCTK dataset (monolingual)
  • Experiment 2 used VCTK and TTS-Portuguese datasets (bilingual)
  • Experiment 3 used VCTK, TTS-Portuguese and M-AILABS French datasets (trilingual)
  • Experiment 4 used 1151 additional English speakers from LibriTTS partitions

Results and discussion

  • Mean Opinion Score (MOS) study was used to evaluate synthesized speech quality
  • Speaker Encoder Cosine Similarity (SECS) was calculated between speaker embeddings of two audios
  • SECS ranges from -1 to 1, with larger value indicating stronger similarity
  • Similarity MOS (Sim-MOS) was also reported
  • Experiments involved 3 languages, but only 2 languages were used to compute metrics
  • MOS scores were obtained with rigorous crowdsourcing
  • Reference audio was the fifth sentence of the VCTK dataset
  • 5 sentences per speaker were synthesized for inference
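
SECS as described above is a cosine similarity between two speaker embeddings. A minimal sketch, assuming the embeddings have already been extracted by a speaker encoder; toy 3-dimensional vectors stand in for them here and the function name `secs` is illustrative.

```python
import math

def secs(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a reference clip and a synthesized clip.
reference = [0.2, 0.9, -0.1]
synthesized = [0.25, 0.8, 0.0]
score = secs(reference, synthesized)
print(f"SECS: {score:.3f}")
```

Because the score depends only on the angle between the vectors, it is insensitive to the overall scale of the embeddings.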

VCTK dataset

  • Experiments 1 and 2+SCL achieved the same SECS and similar Sim-MOS
  • Use of SCL improved similarity in 2 out of 3 experiments
  • SECS for all experiments higher than ground truth
  • Similarity and quality results similar to ground truth
  • Superior results in similarity and quality compared to other studies

LibriTTS dataset

  • Experiment 4 achieved the best LibriTTS similarity
  • Experiment 1 achieved the best MOS result

Portuguese MLS dataset

  • Experiment 3+SCL achieved highest MOS metric (4.11±0.07)
  • Model trained with single-speaker dataset of medium quality achieved good quality in zero-shot multi-speaker synthesis
  • Sim-MOS highest for experiment 3 (3.19±0.10)
  • SECS highest for experiment 4+SCL
  • Performance of model in Portuguese affected by gender

Speaker consistency loss

  • Use of Speaker Consistency Loss (SCL) improved similarity measured by SECS
  • SCL can help the generalization in recording characteristics not seen in training
  • SCL slightly decreases the quality of generated audio
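
One way to write a loss of this kind is as the negative mean cosine similarity between speaker embeddings of ground-truth and generated audio. The symbols below are assumptions introduced for illustration: $\phi$ is the speaker encoder, $g_i$ and $h_i$ are the ground-truth and generated audio for item $i$, $n$ is the batch size, and $\alpha$ is a positive weight on the term.

```latex
L_{SCL} = -\frac{\alpha}{n} \sum_{i=1}^{n}
          \operatorname{cos\_sim}\bigl(\phi(g_i),\, \phi(h_i)\bigr)
```

Minimizing this term pushes the mean cosine similarity toward 1, which is consistent with the improvement in SECS noted above.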

Zero-shot voice conversion

  • SC-GlowTTS model does not provide speaker identity information to the encoder.
  • YourTTS model uses external speaker embeddings to enable zero-shot voice conversion.
  • 8 speakers (4M/4F) from VCTK and MLS Portuguese datasets were used for analysis.

Intra-lingual results

  • Our model achieved a MOS of 4.20±0.05 and a Sim-MOS of 4.07±0.06 for zero-shot voice conversion from one English speaker to another English speaker.
  • Our model achieved a MOS of 3.64±0.09 and a Sim-MOS of 3.43±0.09 for zero-shot voice conversion from one Portuguese speaker to another Portuguese speaker.
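
The summary does not say how the ± values attached to MOS and Sim-MOS were computed. The sketch below assumes they are 95% confidence intervals under a normal approximation over individual listener ratings; the ratings and the function name are hypothetical.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and half-width of a normal-approximation interval."""
    mean = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical 1-to-5 ratings from ten listeners for one system.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]
mos, ci = mos_with_ci(ratings)
print(f"{mos:.2f}±{ci:.2f}")
```

The interval narrows with the square root of the number of ratings, so small half-widths such as those reported above require many ratings per system.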

Cross-lingual results

  • Transfer between English and Portuguese speakers works well
  • Transfer from Portuguese to English speakers has lower quality
  • Low quality of voice conversion from Portuguese male to English female speakers
  • Lack of female Portuguese speakers in the training data hinders generalization
  • Gender does not significantly influence model’s performance in English

Speaker adaptation

  • Recording conditions and voices that differ from those seen in training are a challenge for generalization of zero-shot multi-speaker TTS models
  • 4 speakers used for fine-tuning from Common Voice dataset
  • Weighted random sampling used to guarantee samples from adapted speakers appear in a quarter of the batch
  • Model trained for 1500 steps
  • Fine-tuning with less than 1 minute of speech from speakers achieved promising results
  • Direct relationship between amount of speech used and naturalness of speech
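
The weighted random sampling step above can be sketched with per-sample weights chosen so that clips from the adapted speakers receive a quarter of the total sampling probability. The dataset layout, the speaker names, and the function name are hypothetical. Note that independent sampling like this yields a quarter of each batch only in expectation; a hard per-batch guarantee would need a stratified sampler instead.

```python
import random

def sampling_weights(speaker_ids, adapted_speakers, adapted_share=0.25):
    """Per-sample weights giving adapted speakers adapted_share of the mass."""
    is_adapted = [s in adapted_speakers for s in speaker_ids]
    n_adapted = sum(is_adapted)
    n_other = len(speaker_ids) - n_adapted
    return [
        adapted_share / n_adapted if flag else (1 - adapted_share) / n_other
        for flag in is_adapted
    ]

# Hypothetical data: 4 clips from the new speaker, 96 from training speakers.
speaker_ids = ["new_speaker"] * 4 + ["train_speaker"] * 96
weights = sampling_weights(speaker_ids, {"new_speaker"})

random.seed(0)
batch = random.choices(range(len(speaker_ids)), weights=weights, k=32)
n_new = sum(speaker_ids[i] == "new_speaker" for i in batch)
print(f"{n_new} of {len(batch)} samples come from the adapted speaker")
```

Without such weights, 4 clips among 100 would appear in only about 4% of draws, so the few seconds of target speech would rarely influence a training step.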

Conclusions, limitations and future work

  • Presented YourTTS, which achieved SOTA results in zero-shot multi-speaker TTS and zero-shot voice conversion
  • Can achieve promising results in a target language using only a single speaker dataset
  • Can be adjusted to a new voice using less than 1 minute of speech
  • Model presents instability in the stochastic duration predictor
  • Mispronunciations occur for some words, especially in Portuguese
  • Gender significantly influences the model’s performance in Portuguese voice conversion