Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Voice Conversion (VC) is the task of making a spoken utterance by one speaker sound like it was uttered by a different speaker.
Current VC methods focus on spectral features like timbre, but ignore speaking style.
This study introduces a method for converting not only timbre, but also prosodic information.
The proposed approach is based on a pretrained, self-supervised model.
The many-to-many setting with no paired data is considered.
Evaluation metrics are introduced and baselines are evaluated.
Code and samples can be found online.

Paper Content

Introduction

Humans recognize familiar people and voices by their voice texture and speaking style
Traditional voice conversion methods focused on changing the timbre of the source speaker while leaving the speaking style unchanged
Recent methods propose to additionally convert speaking style
These methods use continuous speech representations
Another line of work considers the usage of text transcriptions
Discrete self-supervised speech representations provide superior performance on downstream tasks
Proposed method uses discrete self-supervised and partially disentangled speech representations
Proposed approach is greatly superior to evaluated baselines
The model can be trained on a single GPU within a couple of hours

Model

The proposed approach is named DISSC
Decomposed representation of speech signal used to synthesize speech in target style
Three components in decomposition: phonetic-content, prosodic features, speaker identity
Cascaded pipeline proposed: extract content, predict prosody, synthesize speech

Speech input representation

Speech phonetic-content is represented using a pre-trained SSL model, HuBERT
HuBERT is used to avoid limitation to transcribed samples and to support non-written languages
Audio waveform is represented as a sequence of samples
Content encoder is a HuBERT model pre-trained on the LibriSpeech corpus
HuBERT outputs continuous representations which are then quantized into a discrete unit sequence
Repetitions of units indicate the rhythm of the speaker
Speaker representation is constructed using a fixed size look-up-table
Pitch contour is extracted using YAAPT and denoted z_F0
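
The quantization and repetition points above can be illustrated with a minimal sketch. It assumes the continuous features are quantized by nearest-centroid lookup (k-means style), which the summary does not state explicitly, and the function names `quantize` and `deduplicate`, the toy centroids, and the toy frames are all hypothetical. It shows how collapsing repeated units yields a content sequence plus per-unit durations, the latter carrying the rhythm information.

```python
import numpy as np

def quantize(features, centroids):
    """Assign each frame to its nearest centroid (assumed k-means style lookup)."""
    # features: (T, D), centroids: (K, D) -> unit ids: (T,)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def deduplicate(units):
    """Collapse runs of repeated units into (deduplicated units, durations)."""
    units = np.asarray(units)
    if units.size == 0:
        return units, np.array([], dtype=int)
    # Indices where a new run starts.
    change = np.flatnonzero(np.diff(units) != 0) + 1
    starts = np.concatenate(([0], change))
    # Run length = distance between consecutive run starts.
    durations = np.diff(np.concatenate((starts, [units.size])))
    return units[starts], durations

# Toy stand-ins for continuous frame features and a learned codebook.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
frames = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 1.1],
                   [4.2, 3.9], [3.8, 4.1], [4.0, 4.0]])

units = quantize(frames, centroids)
dedup_units, durations = deduplicate(units)
print(units.tolist(), dedup_units.tolist(), durations.tolist())
```

Here the durations are counted in frames, so a slower speaker would show longer runs for the same deduplicated unit sequence.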

Speaking style conversion

Several methods exist to change the prosody of a spoken utterance.
Many of these methods only convert based on a single target utterance.
Other methods only linearly change the speaking rate.
Leading methods focus on rhythm but operate over a continuous speech representation.

Speech synthesis

HiFi-GAN neural vocoder is used (Kong et al., 2020)
Generator component takes input of phonetic-content units, pitch contour and speaker-embedding
Vocoder is trained independently for reconstruction
At inference time, the unit sequence is inflated by repeating each unit according to its predicted duration, then passed to the generator together with the predicted pitch contour and the target speaker-embedding
The discriminators comprise two sets: Multi-Scale Discriminators and Multi-Period Discriminators
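
The inference-time inflation step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the names `inflate` and `build_vocoder_inputs` are hypothetical, predicted durations are assumed to be real-valued frame counts that get rounded and clamped to at least one frame, and the pitch contour is assumed to have one value per inflated unit frame.

```python
import numpy as np

def inflate(dedup_units, durations):
    """Repeat each unit by its (rounded) predicted duration in frames."""
    # Assumption: round to the nearest frame and keep every unit at least once.
    frames = np.maximum(np.rint(durations).astype(int), 1)
    return np.repeat(np.asarray(dedup_units), frames)

def build_vocoder_inputs(dedup_units, durations, frame_f0, speaker_id):
    """Bundle the three generator inputs named above (hypothetical layout)."""
    units = inflate(dedup_units, durations)
    f0 = np.asarray(frame_f0, dtype=float)
    if f0.shape[0] != units.shape[0]:
        raise ValueError("pitch contour must have one value per inflated frame")
    return {"units": units, "f0": f0, "speaker_id": speaker_id}

inflated = inflate([5, 9, 5], [2.4, 0.2, 3.0])
inputs = build_vocoder_inputs([5, 9, 5], [2.4, 0.2, 3.0],
                              frame_f0=[110, 112, 0, 120, 121, 119],
                              speaker_id=3)
print(inflated.tolist())
```

Because rhythm lives entirely in the durations, swapping in durations predicted for the target speaker changes the timing of the output without touching the content units.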

Experimental setup

No existing framework for evaluating speaking style conversion (SSC)
Setup uses datasets and metrics for timbre, pitch, and rhythm

Datasets

Previous work focused on single datasets
VCTK dataset is monotonous and lacks distinct speaking style
Used VCTK and ESD datasets
Selected two fastest and two slowest speakers from each dataset
Found that speakers in datasets do not have clear pitch patterns
Created synthetic dataset based on VCTK

Metrics

Introduced new metrics to capture rhythm in a fine-grained manner
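Used EER (equal error rate of speaker verification) to evaluate speaker identity and WER/CER (word and character error rates) to evaluate content preservation
Used PLE, WLE, and TLE (phoneme, word, and total length errors) to evaluate rhythm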
Used VDE and FFE to evaluate F0 similarity
Used EMD to measure pitch contour errors
Used MOS to evaluate quality and naturalness
Used human evaluations to measure speaking style conversion
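
As an illustration of the F0 metrics listed above, the sketch below uses the commonly cited definitions of Voicing Decision Error (VDE) and F0 Frame Error (FFE). It assumes two frame-aligned contours of equal length, that unvoiced frames are marked with 0 Hz, and a 20% gross pitch error threshold; the paper's exact alignment and threshold choices are not given in this summary, so treat these as assumptions.

```python
import numpy as np

def vde(ref_f0, pred_f0):
    """Fraction of frames whose voiced/unvoiced decision differs."""
    ref = np.asarray(ref_f0, dtype=float)
    pred = np.asarray(pred_f0, dtype=float)
    return float(np.mean((ref > 0) != (pred > 0)))

def ffe(ref_f0, pred_f0, tol=0.2):
    """Fraction of frames with a voicing error or a gross pitch error."""
    ref = np.asarray(ref_f0, dtype=float)
    pred = np.asarray(pred_f0, dtype=float)
    ref_voiced, pred_voiced = ref > 0, pred > 0
    voicing_err = ref_voiced != pred_voiced
    # Gross pitch error: both voiced, but F0 deviates by more than tol * reference.
    both_voiced = ref_voiced & pred_voiced
    gross_err = both_voiced & (np.abs(pred - ref) > tol * ref)
    return float(np.mean(voicing_err | gross_err))

ref = [0, 100, 100, 200, 0]    # toy reference contour in Hz, 0 = unvoiced
pred = [0, 110, 130, 0, 50]    # toy converted contour in Hz
vde_score = vde(ref, pred)
ffe_score = ffe(ref, pred)
print(vde_score, ffe_score)
```

In this toy case two of five frames disagree on voicing and one further frame is off by 30%, so FFE is always at least as large as VDE.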

Results

Speech Resynthesis (SR) is a strong VC baseline for the setup
AutoPST is the state of the art in prosodic-aware VC
Three variants of the approach are compared: DISSC_Rhythm, DISSC_Pitch and DISSC_Both
DISSC improves length errors across all scales
DISSC has a minor decrease in content quality compared to SR
DISSC has comparable naturalness to SR and outperforms AutoPST
DISSC can learn pitch patterns when they exist
DISSC can correct abnormal rhythm patterns

Related work

Unpaired voice conversion

Existing methods for VC use unpaired utterances
Vocoders generate audio from speaker and other representations
Disentanglement is encouraged through information bottlenecks, mutual information, adversarial losses, pretrained models, or combinations of these
Polyak et al. (2021) used discrete HuBERT tokens, pitch representations, and learned speaker representations to resynthesise audio waveform
Our approach additionally introduces speaking style modeling and conversion

Conclusion & future work

Proposed method for speaking style conversion
Evaluation functions to analyse and evaluate speech characteristics
Proposed approach superior to evaluated baselines
Future work to improve robustness and disentanglement of speech representation
Results compared for content, rhythm, F0, and speaker identification
