Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Voice Conversion (VC) is the task of making a spoken utterance by one speaker sound like it was uttered by a different speaker.
  • Current VC methods focus on spectral features like timbre, but ignore speaking style.
  • This study introduces a method for converting not only timbre, but also prosodic information.
  • The proposed approach is based on a pretrained, self-supervised model.
  • The many-to-many setting with no paired data is considered.
  • Evaluation metrics are introduced and baselines are evaluated.
  • Code and samples can be found online.

Paper Content

Introduction

  • Humans recognize familiar people by their voice texture (timbre) and speaking style
  • Traditional voice conversion methods focused on changing the timbre of the source speaker while leaving the speaking style unchanged
  • Recent methods propose to additionally convert speaking style
  • These methods use continuous speech representations
  • Another line of work considers the usage of text transcriptions
  • Discrete self-supervised speech representations provide superior performance on downstream tasks
  • Proposed method uses discrete self-supervised and partially disentangled speech representations
  • Proposed approach outperforms the evaluated baselines
  • Can be trained on a single GPU within a couple of hours

Model

  • The proposed approach is called DISSC
  • Decomposed representation of speech signal used to synthesize speech in target style
  • Three components in decomposition: phonetic-content, prosodic features, speaker identity
  • Cascaded pipeline proposed: extract content, predict prosody, synthesize speech
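
The cascade above can be pictured as a small data flow. The following is a minimal sketch, not the authors' implementation: the class `SpeechDecomposition`, the function `convert_style`, and the two predictor callables are hypothetical names introduced only for illustration, and the final vocoder step is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpeechDecomposition:
    """Hypothetical container for the three decomposed components."""
    content_units: List[int]  # phonetic content as discrete units
    durations: List[int]      # frames per unit (rhythm part of prosody)
    f0: List[float]           # pitch contour, one value per frame
    speaker_id: int           # speaker identity


def convert_style(
    source: SpeechDecomposition,
    target_speaker: int,
    predict_durations: Callable[[List[int], int], List[int]],
    predict_f0: Callable[[List[int], List[int], int], List[float]],
) -> SpeechDecomposition:
    """Keep the content, re-predict prosody for the target, swap the speaker."""
    durations = predict_durations(source.content_units, target_speaker)
    f0 = predict_f0(source.content_units, durations, target_speaker)
    return SpeechDecomposition(source.content_units, durations, f0, target_speaker)


# Toy predictors standing in for trained models (assumptions, not real models).
source = SpeechDecomposition([5, 9], [3, 1], [100.0] * 4, speaker_id=0)
converted = convert_style(
    source,
    target_speaker=1,
    predict_durations=lambda units, spk: [2] * len(units),
    predict_f0=lambda units, durs, spk: [120.0] * sum(durs),
)
```

Only the content units are carried over from the source; durations, pitch, and speaker identity all come from the target side, which is what separates speaking style conversion from timbre-only conversion.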

Speech input representation

  • Speech phonetic-content is represented using a pre-trained SSL model, HuBERT
  • HuBERT is used to avoid limitation to transcribed samples and to support non-written languages
  • Audio waveform is represented as a sequence of samples
  • Content encoder is a HuBERT model pre-trained on the LibriSpeech corpus
  • HuBERT outputs continuous representations which are then quantized into a discrete unit sequence
  • Repetitions of units indicate the rhythm of the speaker
  • Speaker representation is constructed using a fixed size look-up-table
  • Pitch contour is encoded using YAAPT and denoted z_{F0}
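
Because repeated units carry the rhythm, a frame-level unit sequence can be split into a content part and a duration part by run-length encoding. The sketch below illustrates that idea only; the function name `deduplicate_units` and the example unit IDs are made up for illustration and do not come from the paper's code.

```python
from itertools import groupby


def deduplicate_units(units):
    """Collapse runs of repeated units into (content, durations) lists."""
    content, durations = [], []
    for unit, run in groupby(units):
        content.append(unit)
        durations.append(sum(1 for _ in run))  # run length in frames
    return content, durations


# Hypothetical frame-level units, one per fixed-length audio frame.
frame_units = [12, 12, 12, 7, 7, 31, 31, 31, 31, 7]
content, durations = deduplicate_units(frame_units)
# content   -> [12, 7, 31, 7]
# durations -> [3, 2, 4, 1]
```

The same unit can reappear later in the sequence (unit 7 above), so only consecutive repeats are merged.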

Speaking style conversion

  • Several methods exist to change the prosody of a spoken utterance.
  • Many of these methods only convert based on a single target utterance.
  • Other methods only linearly change the speaking rate.
  • Leading methods focus on rhythm but operate over a continuous speech representation.

Speech synthesis

  • HiFi-GAN neural vocoder is used (Kong et al., 2020)
  • Generator component takes input of phonetic-content units, pitch contour and speaker-embedding
  • Vocoder is trained independently for reconstruction
  • At inference time, the unit sequence is inflated according to the predicted durations and passed to the vocoder together with the predicted pitch contour and the target speaker-embedding
  • The discriminators comprise two sets: Multi-Scale Discriminators and Multi-Period Discriminators
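
The inflation step at inference time amounts to repeating each unit by its predicted duration. This is a minimal sketch under stated assumptions: durations are expressed in frames, predictions may be fractional and are rounded, and every unit is kept for at least one frame. The function name `inflate_units` and the one-frame minimum are illustrative choices, not details confirmed by the source.

```python
def inflate_units(content, durations):
    """Repeat each unit by its (rounded) predicted duration in frames."""
    if len(content) != len(durations):
        raise ValueError("content and durations must have the same length")
    frames = []
    for unit, duration in zip(content, durations):
        # Assumption: keep every unit for at least one frame.
        frames.extend([unit] * max(1, round(duration)))
    return frames


# Hypothetical units with fractional predicted durations.
inflated = inflate_units([12, 7, 31], [2.6, 0.2, 1.0])
# inflated -> [12, 12, 12, 7, 31]
```

The inflated sequence has one unit per frame again, so it lines up with a frame-level pitch contour before both are given to the vocoder.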

Experimental setup

  • No existing framework for evaluating speaking style conversion (SSC)
  • Setup uses datasets and metrics for timbre, pitch, and rhythm

Datasets

  • Previous work focused on single datasets
  • VCTK dataset is monotonous and lacks distinct speaking style
  • Used VCTK and ESD datasets
  • Selected two fastest and two slowest speakers from each dataset
  • Found that speakers in datasets do not have clear pitch patterns
  • Created synthetic dataset based on VCTK

Metrics

  • Introduced new metrics to capture rhythm in a fine-grained manner
  • Used EER (speaker verification) to evaluate timbre and WER/CER to evaluate content preservation
  • Used PLE, WLE, and TLE to evaluate rhythm
  • Used VDE and FFE to evaluate F0 similarity
  • Used EMD to measure pitch contour errors
  • Used MOS to evaluate quality and naturalness
  • Used human evaluations to measure speaking style conversion

Results

  • Speech Resynthesis (SR) is a strong VC baseline for the setup
  • AutoPST is the state of the art in prosodic-aware VC
  • Three variants of the approach are compared: DISSC_Rhythm, DISSC_Pitch and DISSC_Both
  • DISSC improves length errors across all scales
  • DISSC has a minor decrease in content quality compared to SR
  • DISSC has comparable naturalness to SR and outperforms AutoPST
  • DISSC can learn pitch patterns when they exist
  • DISSC can correct abnormal rhythm patterns

Unpaired voice conversion

  • Existing VC methods are trained on unpaired (non-parallel) utterances
  • Vocoders generate audio from speaker and other representations
  • Disentanglement is encouraged through information bottlenecks, mutual information, adversarial losses, pretrained models, or combinations of these
  • Polyak et al. (2021) used discrete HuBERT tokens, pitch representations, and learned speaker representations to resynthesise audio waveform
  • Our approach additionally introduces speaking style modeling and conversion

Conclusion & future work

  • Proposed method for speaking style conversion
  • Evaluation functions to analyse and evaluate speech characteristics
  • Proposed approach superior to evaluated baselines
  • Future work to improve robustness and disentanglement of speech representation
  • Results compared for content, rhythm, F0, and speaker identification