Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposes unifying the separate speech enhancement tasks into a single problem, Generalized Speech Enhancement
- Goal is to improve certain aspects of speech, such as intelligibility, quality, and synchronization with the video
- Model is composed of two steps: pseudo audio-visual speech recognition and pseudo text-to-speech synthesis
- Model is called ReVISE and is evaluated on EasyCom, an audio-visual benchmark
Paper Content
Introduction
- Speech in-the-wild is often corrupted with natural and non-natural sounds.
- Recording devices and networks can also introduce distortion.
- Distortion makes it hard for humans and machines to comprehend speech.
- Improving the quality and intelligibility of corrupted speech is essential.
- Speech enhancement is the process of generating clean speech from its corrupted version.
- Audio-visual speech enhancement uses visual speech to provide auxiliary information.
- Prior work often treats enhancement from each type of distortion as a separate problem.
- This paper advocates a more holistic approach to audio-visual speech enhancement.
- The goal of this approach is to enhance a predefined set of attributes.
- This paper proposes ReVISE, which is composed of a pseudo audio-visual speech recognition (P-AVSR) model and a pseudo text-to-speech synthesis model.
- ReVISE is evaluated on four types of corrupted speech.
- The authors present ReVISE as the first model capable of high-quality in-the-wild video-to-speech synthesis.
- ReVISE is also evaluated on a challenging audio-visual speech dataset.
Background
- Introduces the audio-visual speech enhancement tasks
- Reviews prior studies, which approach these tasks separately or jointly
- Contrasts this work with the literature
- Introduces self-supervised speech resynthesis
Audio-visual speech enhancement tasks
- Audio-visual speech enhancement is the task of improving the quality of corrupted speech
- Training involves providing tuples of clean speech, corrupted speech, and corresponding talking head video
- Tasks are divided depending on type of distortion applied to clean speech
- Masking-based methods are widely adopted to predict a mask
- Speech inpainting and video-to-speech synthesis require generation-based methods
- Existing universal enhancement models are audio-based and do not leverage auxiliary input such as video
- In contrast, the proposed model predicts a self-supervised representation of the reference clean speech
Self-supervised speech resynthesis
- Previous studies show that HuBERT encodes mostly phonetic information and less about speaker and noise characteristics.
- HuBERT is pre-trained with a masked cluster prediction objective.
- HiFi-GAN is an end-to-end model that converts SSL units to waveform.
Method
- Formulate the generalized speech enhancement problem
- Introduce the proposed model
Problem formulation
- The original speech and the auxiliary view are generated from a set of underlying factors through a bijective mapping
- Corrupted speech is generated from the original speech with a corruption function
- Audio-visual speech enhancement amounts to estimating the probability of the factors given the corrupted speech
- When distortion is high, multiple sets of factors can render the same noisy speech
- Reconstructing the exact clean reference signal is therefore an ill-posed problem
- Goal is to generate an enhanced signal that lies on the manifold of clean speech and preserves the factors of interest
- Faithfulness is measured by the discrepancy between the inverse mappings of the clean and corrupted speech
- Quality is measured by metrics commonly used for text-to-speech synthesis
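The formulation above can be written compactly as follows. This is a sketch in assumed notation, not the paper's own symbols: z denotes the underlying factors, g the bijective mapping, c the corruption function, f a hypothetical enhancement model, S the index set of the factors of interest, and d an unspecified discrepancy. The faithfulness line reads the corresponding bullet as comparing factors recovered from the clean reference with factors recovered from the model's output on the corrupted input, which is also an assumption. The block needs the amsmath package.

```latex
% Assumed notation, introduced for illustration only.
\begin{gather*}
(x, v) = g(z), \qquad z = (z_1, \dots, z_K)
  \quad \text{(clean speech $x$ and auxiliary view $v$)} \\
\tilde{x} = c(x)
  \quad \text{(corrupted speech)} \\
p(z \mid \tilde{x}, v)
  \quad \text{(quantity estimated by enhancement)} \\
\exists\, z' \neq z :\; c(x') = c(x), \quad (x', v') = g(z')
  \quad \text{(ill-posedness under heavy distortion)} \\
\hat{x} = f(\tilde{x}, v), \qquad
\text{faithfulness}(\hat{x}) = \sum_{i \in S} d\big(z_i, \hat{z}_i\big), \quad
\hat{z} = g^{-1}(\hat{x}, v)
\end{gather*}
```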
Model
- The SSL speech tokenizer uses a BASE HuBERT model composed of a convolutional encoder and 12 Transformer layers
- 221K hours of unlabeled speech data from 8 languages are used for pre-training
- SSL units are generated by clustering the third-iteration features at the last layer with a codebook size of 2000
- The HiFi-GAN model is trained for 400K updates on 8 GPUs on the LJSpeech dataset
- The LARGE AV-HuBERT model is used by default, taking video and/or audio as input
- P-AVSR models are fine-tuned on 8 GPUs for fewer than 45K updates
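To illustrate what turning features into SSL units means, the sketch below assigns each frame-level feature vector to its nearest codebook centroid and emits the centroid index as the unit. It is a toy example under stated assumptions: the function names are hypothetical, the codebook has 3 two-dimensional centroids rather than 2000 learned ones, and the features are made-up numbers instead of HuBERT outputs.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_to_units(
    features: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each frame to the index of its nearest centroid (hypothetical helper)."""
    units = []
    for frame in features:
        distances = [squared_distance(frame, centroid) for centroid in codebook]
        units.append(distances.index(min(distances)))
    return units


# Toy codebook and features; a real tokenizer would use learned k-means centroids.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.8, 0.7]]

units = quantize_to_units(features, codebook)
print(units)  # [0, 1, 2, 1]
```

The resulting integer sequence is the kind of discrete target a unit-to-waveform vocoder would consume.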
Evaluation
- The model and baselines are evaluated on content, synchronization, quality, and low-level detail reconstruction
- Content is evaluated using word error rate (WER), computed with a speech recognition model
- Synchronization is evaluated using SyncNet metrics
- Quality is evaluated using subjective mean opinion score (MOS) studies with a scale from 1 to 5
- Low-level detail reconstruction is evaluated using ESTOI and MCD
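As a reminder of how the content metric works, here is a minimal sketch of WER as word-level edit distance divided by the reference length. It assumes the transcript of the enhanced audio is already available as a string; in the paper that transcript comes from a speech recognition model, which is not reproduced here. The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.3f}")  # 0.167
```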
Results
- Quantitative results are presented in the paper
- Supplementary material provides samples for comparison to baselines
Ground truth and resynthesis performance
- Reference clean speech and resynthesized clean speech were evaluated with proposed metrics
- Results show that intelligibility, synchronization, and quality are slightly degraded when tokenizing speech into SSL units
- Performance of ReVISE model is roughly upper-bounded by performance of resynthesized speech
Video-to-speech synthesis
- ReVISE is compared to SVTS, the state-of-the-art model for video-to-speech synthesis
- SVTS is composed of a video-to-spectrogram predictor and a neural vocoder
- SVTS has strong results on two constrained datasets
- Audio quality generated by SVTS is mediocre
- ReVISE generates higher-quality audio and achieves lower WER
Audio-visual speech inpainting
- ReVISE is evaluated on audio-visual speech inpainting task
- Previous work only evaluated on constrained GRID dataset
- ReVISE is compared with Demucs, a publicly available audio enhancement model
- Resynthesis used as baseline
- Intelligibility and synchronization degrade as percentage of dropped frames increases
- Existing audio enhancement approaches fail to generalize to inpainting
- ReVISE effectively improves intelligibility and synchronization
- ReVISE uses additional audio information to improve content reconstruction
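To make the inpainting setup concrete, the sketch below simulates dropped audio frames at a chosen ratio. It is an illustration only: the summary does not say how the paper selects or fills dropped frames, so zeroing a random subset of frames is an assumption, and `drop_frames` is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple


def drop_frames(
    frames: Sequence[Sequence[float]],
    drop_ratio: float,
    seed: int = 0,
) -> Tuple[List[List[float]], List[bool]]:
    """Zero out a random subset of frames; return corrupted frames and a drop mask."""
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    n_drop = round(len(frames) * drop_ratio)
    dropped = set(rng.sample(range(len(frames)), n_drop))

    corrupted = [
        [0.0] * len(frame) if i in dropped else list(frame)
        for i, frame in enumerate(frames)
    ]
    mask = [i in dropped for i in range(len(frames))]
    return corrupted, mask


# Ten toy frames with two features each; 30% of them are dropped.
frames = [[float(i), float(i) + 0.5] for i in range(10)]
corrupted, mask = drop_frames(frames, drop_ratio=0.3)
print(sum(mask), "of", len(frames), "frames dropped")
```

Sweeping `drop_ratio` upward reproduces the kind of test conditions under which intelligibility and synchronization are reported to degrade.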
Audio-visual speech denoising
- Tab. 3 presents results of audio-visual speech denoising on four test splits.
- Two audio-based baselines and VisualVoice are compared.
- VisualVoice predicts a complex ideal ratio mask (IRM) given noisy audio, lip video, and speaker face image as input.
- ReVISE is more robust to higher levels of noise.
Audio-visual source separation
- Four noisy test sets were created with different levels of SNR.
- Audio model (Demucs) performs worse with speech noise than with non-speech noise at the same SNR level.
- Audio-visual models report similar performance on denoising and separation at high SNR, and better performance on separation at low SNR.
- It is easier to remove speech noise if the model has auxiliary information to identify the target speech.
- Comparing VisualVoice and ReVISE, ReVISE is better at low SNR.
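Test sets at a given SNR are typically built by scaling the interfering signal before adding it to the clean speech. The sketch below shows that scaling with the standard definition of SNR in decibels. The function names and the synthetic sine and cosine signals are assumptions for illustration; the paper's actual mixing pipeline is not described in this summary.

```python
import math
from typing import List, Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Average squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Scale the noise so that speech power / noise power matches snr_db, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("signals must have non-zero power")
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech: Sequence[float], mixture: Sequence[float]) -> float:
    """SNR of a mixture, treating everything that is not the speech as noise."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Synthetic stand-ins for a target utterance and an interfering signal.
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [math.cos(0.37 * t) for t in range(1000)]

mixture = mix_at_snr(speech, noise, snr_db=5.0)
print(round(measured_snr_db(speech, mixture), 3))  # 5.0
```

Using an interfering speaker as `noise` gives a separation-style condition, while a non-speech recording gives a denoising-style condition.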
Universal audio-visual speech enhancement
- Single ReVISE model trained on all four types of distortion
- Universal model beats or matches distortion-specific model on almost all tasks
- Exception is video-to-speech synthesis, where universal model is 0.5% WER worse
AV enhancement on real data: EasyCom
- ReVISE substantially enhances the intelligibility of noisy speech compared to baseline methods
- ReVISE produces audio of higher quality than the target audio itself
- Pre-training P-AVSR is important for performance
- Predicting SSL units is better than predicting spectrograms
- ReVISE performs decently on video-to-speech synthesis without being trained on the task
- Pre-training AV-HuBERT is important for performance
- Mouth cropping helps bridge gap between pre-training and fine-tuning
- Visual input is important for performance
- Fine-tuning data does not have a large impact
- ReVISE does not focus on reconstructing exact reference signal
- ReVISE does not infer speaker identity