Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Voice Conversion (VC) is the task of making a spoken utterance by one speaker sound like it was uttered by a different speaker.
- Current VC methods focus on spectral features like timbre, but ignore speaking style.
- This study introduces a method for converting not only timbre, but also prosodic information.
- The proposed approach is based on a pretrained, self-supervised model.
- The many-to-many setting with no paired data is considered.
- Evaluation metrics are introduced and baselines are evaluated.
- Code and samples can be found online.
Paper Content
Introduction
- Humans recognize familiar people and voices by their voice texture and speaking style
- Traditional voice conversion methods focused on changing the timbre of the source speaker while leaving the speaking style unchanged
- Recent methods propose to additionally convert speaking style
- These methods use continuous speech representations
- Another line of work considers the usage of text transcriptions
- Discrete self-supervised speech representations provide superior performance on downstream tasks
- Proposed method uses discrete self-supervised and partially disentangled speech representations
- Proposed approach is greatly superior to evaluated baselines
- Can be trained on a single GPU within a couple of hours
Model
- Approach based on DISSC
- Decomposed representation of speech signal used to synthesize speech in target style
- Three components in decomposition: phonetic-content, prosodic features, speaker identity
- Cascaded pipeline proposed: extract content, predict prosody, synthesize speech
Speech input representation
- Speech phonetic-content is represented using a pre-trained SSL model, HuBERT
- HuBERT is used to avoid limitation to transcribed samples and to support non-written languages
- Audio waveform is represented as a sequence of samples
- Content encoder is a HuBERT model pre-trained on the LibriSpeech corpus
- HuBERT outputs continuous representations which are then quantized into a discrete unit sequence
- Repetitions of units indicate the rhythm of the speaker
- Speaker representation is constructed using a fixed size look-up-table
- Pitch contour is encoded using YAAPT and marked as z F 0
Speaking style conversion
- Several methods exist to change the prosody of a spoken utterance.
- Many of these methods only convert based on a single target utterance.
- Other methods only linearly change the speaking rate.
- Leading methods focus on rhythm but operate over a continuous speech representation.
Speech synthesis
- HiFi-GAN neural vocoder is used (Kong et al., 2020)
- Generator component takes input of phonetic-content units, pitch contour and speaker-embedding
- Vocoder is trained independently for reconstruction
- At inference time, unit sequence is inflated with predicted durations, pitch contour and target speaker-embedding
- Discriminators are comprised of two sets: Multi-Scale Discriminators and Multi-Period Discriminators
Experimental setup
- No existing framework for evaluating SSC
- Setup uses datasets and metrics for timbre, pitch, and rhythm
Datasets
- Previous work focused on single datasets
- VCTK dataset is monotonous and lacks distinct speaking style
- Used VCTK and ESD datasets
- Selected two fastest and two slowest speakers from each dataset
- Found that speakers in datasets do not have clear pitch patterns
- Created synthetic dataset based on VCTK
Metrics
- Introduced new metrics to capture rhythm in a fine-grained manner
- Used EER and WER/CER to evaluate pitch contour
- Used PLE, WLE, and TLE to evaluate rhythm
- Used VDE and FFE to evaluate F0 similarity
- Used EMD to measure pitch contour errors
- Used MOS to evaluate quality and naturalness
- Used human evaluations to measure speaking style conversion
Results
- Speech Resynthesis (SR) is a strong VC baseline for the setup
- AutoPST is the state of the art in prosodic-aware VC
- Three variants of the approach are compared: DISSC_Rhythm, DISSC_Pitch and DISSC_Both
- DISSC improves length errors across all scales
- DISSC has a minor decrease in content quality compared to SR
- DISSC has comparable naturalness to SR and outperforms AutoPST
- DISSC can learn pitch patterns when they exist
- DISSC can correct abnormal rhythm patterns
Related work
Unpaired voice conversion
- Existing methods for VC use unpaired utterances
- Vocoders generate audio from speaker and other representations
- Disentanglement is encouraged through information bottlenecks, mutual information, adversarial losses, pretrained models, or combinations of these
- Polyak et al. (2021) used discrete HuBERT tokens, pitch representations, and learned speaker representations to resynthesise audio waveform
- Our approach additionally introduces speaking style modeling and conversion
Conclusion & future work
- Proposed method for speaking style conversion
- Evaluation functions to analyse and evaluate speech characteristics
- Proposed approach superior to evaluated baselines
- Future work to improve robustness and disentanglement of speech representation
- Results compared for content, rhythm, F0, and speaker identification