Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposes unifying the separate speech enhancement tasks and studying them together as Generalized Speech Enhancement
Goal is to improve certain aspects of speech, such as intelligibility, quality, and video synchronization
Model is composed of two steps: pseudo audio-visual speech recognition and pseudo text-to-speech synthesis
Model is called ReVISE and is evaluated on EasyCom, an audio-visual benchmark

Paper Content

Introduction

Speech in-the-wild is often corrupted with natural and non-natural sounds.
Recording devices and networks can also introduce distortion.
Distortion makes it hard for humans and machines to comprehend speech.
Improving the quality and intelligibility of corrupted speech is essential.
Speech enhancement is the process of generating clean speech from its corrupted version.
Audio-visual speech enhancement uses visual speech to provide auxiliary information.
Prior work often treats enhancement from each type of distortion as a separate problem.
This paper advocates a more holistic approach to audio-visual speech enhancement.
The goal of this approach is to enhance a predefined set of attributes.
This paper proposes ReVISE, which chains a pseudo audio-visual speech recognition (P-AVSR) model and a pseudo text-to-speech synthesis model.
ReVISE is evaluated on four types of corrupted speech.
ReVISE is the first model capable of high-quality in-the-wild video-to-speech synthesis.
ReVISE is also evaluated on a challenging audio-visual speech dataset.

Background

Audio-visual speech enhancement tasks are introduced
Prior studies approach these tasks separately or jointly
Contrast with literature
Introduce self-supervised speech resynthesis

Audio-visual speech enhancement tasks

Audio-visual speech enhancement is the task of improving the quality of corrupted speech
Training involves providing tuples of clean speech, corrupted speech, and corresponding talking head video
Tasks are divided depending on type of distortion applied to clean speech
Masking-based methods are widely adopted to predict a mask
Speech inpainting and video-to-speech synthesis require generation-based methods
Universal enhancement models are audio-based and do not leverage auxiliary input
Model predicts a self-supervised representation of the reference clean speech
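To make the masking-based approach mentioned above concrete, here is a minimal sketch of a ratio mask applied to a magnitude spectrum. It is not taken from the paper: the function names `oracle_ratio_mask` and `apply_mask` are hypothetical, the mask is an oracle computed from the clean signal (a real model has to predict it), and magnitudes are assumed to add, which ignores phase.

```python
from typing import List


def oracle_ratio_mask(clean_mag: List[float], noise_mag: List[float],
                      eps: float = 1e-8) -> List[float]:
    """Per-bin ratio mask in [0, 1]: the share of each bin that is speech.

    Assumption: this is one simple magnitude-domain variant; other
    definitions (power-domain, complex-valued) exist.
    """
    return [s / (s + n + eps) for s, n in zip(clean_mag, noise_mag)]


def apply_mask(mask: List[float], noisy_mag: List[float]) -> List[float]:
    """Scale each bin of the noisy magnitude spectrum by its mask value."""
    return [m * y for m, y in zip(mask, noisy_mag)]


# Toy magnitudes for four frequency bins (illustrative values only).
clean_mag = [0.0, 2.0, 1.0, 0.5]
noise_mag = [1.0, 1.0, 0.0, 0.5]
# Assumption: magnitudes are additive, which holds only approximately.
noisy_mag = [s + n for s, n in zip(clean_mag, noise_mag)]

mask = oracle_ratio_mask(clean_mag, noise_mag)
enhanced_mag = apply_mask(mask, noisy_mag)
```

A mask can only rescale energy that is already present in the noisy input, which is one way to see why inpainting and video-to-speech synthesis, where the speech is missing, call for generation-based methods instead.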

Self-supervised speech resynthesis

Previous studies show that HuBERT encodes mostly phonetic information and less about speaker and noise characteristics.
HuBERT is pre-trained with a masked cluster prediction objective.
HiFi-GAN is an end-to-end model that converts SSL units to waveform.
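As an illustration of how continuous SSL features can become discrete units, the sketch below assigns each feature frame to its nearest centroid in a codebook. This is a generic nearest-centroid quantizer, not the authors' code: the function names, the 2-D toy features, and the 3-entry codebook are assumptions made for readability (real features are high-dimensional and the codebook is learned by clustering).

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_to_units(features: Sequence[Sequence[float]],
                      codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest centroid."""
    units = []
    for frame in features:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(frame, codebook[k]))
        units.append(nearest)
    return units


# Hypothetical toy codebook and features (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.1, -0.1], [0.9, 1.2], [3.5, 0.2], [3.9, -0.3]]

units = quantize_to_units(features, codebook)
print(units)  # [0, 1, 2, 2]
```

The resulting integer sequence is the kind of input a unit-to-waveform vocoder consumes in place of a spectrogram.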

Method

Formulate the generalized speech enhancement problem
Introduce the proposed model

Problem formulation

Original speech and auxiliary view are generated with a bijective mapping
Corrupted speech is generated with a corruption function
Audio-visual speech enhancement amounts to estimating the probability of the factors given the corrupted speech
When distortion is high, multiple sets of factors can render same noisy speech
Reconstructing the exact clean reference signal is an ill-posed problem
Goal is to generate enhanced signal that is on manifold of clean speech and preserves factors of interest
Faithfulness is measured by discrepancy between inverse mapping of clean and corrupted speech
Quality is measured by metrics commonly used for text-to-speech synthesis

Model

SSL speech tokenizer uses a BASE HuBERT model composed of a convolutional encoder and 12 Transformer layers
221K hours of unlabeled speech data from 8 languages is used for pre-training
SSL units are generated by clustering the third iteration feature at the last layer with a codebook size of 2000
HiFi-GAN model is trained for 400K updates on 8 GPUs on the LJSpeech dataset
LARGE AV-HuBERT model is used by default, taking video and/or audio as input
P-AVSR models are fine-tuned on 8 GPUs for less than 45K updates

Evaluation

Model and baselines are evaluated on content, synchronization, quality, and low-level detail reconstruction
Content evaluated using WER computed with speech recognition model
Synchronization evaluated using SyncNet metrics
Quality evaluated using subjective MOS studies with a scale from 1 to 5
Low-level detail reconstruction evaluated using ESTOI and MCD
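For reference, the content metric above can be sketched as follows. WER is the word-level edit distance between the reference transcript and the recognizer's output, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the sketch assumes whitespace tokenization with no text normalization, which real evaluation pipelines usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,                       # deletion
                          curr[j - 1] + 1,                   # insertion
                          prev[j - 1] + substitution_cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.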

Results

Quantitative results are presented in the paper
Supplementary material provides samples for comparison to baselines

Ground truth and resynthesis performance

Reference clean speech and resynthesized clean speech were evaluated with proposed metrics
Results show that intelligibility, synchronization, and quality are slightly degraded when tokenizing speech into SSL units
Performance of ReVISE model is roughly upper-bounded by performance of resynthesized speech

Video-to-speech synthesis

ReVISE is compared to SVTS, the state-of-the-art model for video-to-speech synthesis
SVTS is composed of a video-to-spectrogram predictor and a neural vocoder
SVTS has strong results on two constrained datasets
Quality of the audio generated by SVTS is mediocre
ReVISE generates higher-quality audio and achieves lower WER

Audio-visual speech inpainting

ReVISE is evaluated on audio-visual speech inpainting task
Previous work only evaluated on constrained GRID dataset
ReVISE is compared with Demucs, a publicly available audio enhancement model
Resynthesis used as baseline
Intelligibility and synchronization degrade as percentage of dropped frames increases
Existing audio enhancement approaches fail to generalize to inpainting
ReVISE effectively improves intelligibility and synchronization
ReVISE uses additional audio information to improve content reconstruction
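The notes above do not say how frames are dropped, so the following is only one plausible way to simulate inpainting corruption: zero out a single contiguous span covering a given fraction of the audio frames. The function name `drop_frames`, the single-span choice, and zero-filling are all assumptions for illustration.

```python
import random
from typing import List, Tuple


def drop_frames(frames: List[float], drop_ratio: float,
                seed: int = 0) -> Tuple[List[float], List[bool]]:
    """Zero out one contiguous span covering about `drop_ratio` of the frames.

    Returns the corrupted frames and a boolean mask marking dropped positions.
    """
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be in [0, 1]")
    n = len(frames)
    n_drop = round(n * drop_ratio)
    rng = random.Random(seed)
    # randint is inclusive on both ends, so the span always fits.
    start = rng.randint(0, n - n_drop)
    dropped = [start <= i < start + n_drop for i in range(n)]
    corrupted = [0.0 if is_dropped else value
                 for value, is_dropped in zip(frames, dropped)]
    return corrupted, dropped


# Toy "audio" of ten frames with 30% dropped (illustrative values only).
frames = [float(i + 1) for i in range(10)]
corrupted, dropped = drop_frames(frames, drop_ratio=0.3, seed=0)
```

In the audio-visual setting the video frames for the dropped span remain available, which is the auxiliary information an audio-only enhancer lacks.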

Audio-visual speech denoising

Tab. 3 presents results of audio-visual speech denoising on four test splits.
Two audio-based baselines and VisualVoice are compared.
VisualVoice predicts a complex IRM given noisy audio, lip video, and speaker face image as input.
ReVISE is more robust to higher levels of noise.

Audio-visual source separation

Four noisy test sets were created with different levels of SNR.
Audio model (Demucs) performs worse with speech noise than with non-speech noise at the same SNR level.
Audio-visual models perform similarly on the two tasks at high SNR, and perform better on separation at low SNR.
It is easier to remove speech noise if the model has auxiliary information to identify the target speech.
Comparing VisualVoice and ReVISE, ReVISE is better at low SNR.
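Noisy test sets at a fixed SNR are typically built by rescaling the interfering signal before adding it to the clean speech. The sketch below shows that standard calculation; it is not the paper's data pipeline, and the names `power` and `mix_at_snr` and the toy signals are assumptions. The interferer can be non-speech noise (denoising) or another talker (source separation).

```python
import math
from typing import List, Tuple


def power(signal: List[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech: List[float], noise: List[float],
               snr_db: float) -> Tuple[List[float], List[float]]:
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals `snr_db`, then add it.

    Returns the mixture and the scaled noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # Solve P_speech / (scale**2 * P_noise) = 10**(snr_db / 10) for scale.
    scale = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = [scale * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Toy signals standing in for speech and an interferer (illustrative only).
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [0.3 * (-1) ** i for i in range(1000)]
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)
measured_snr_db = 10 * math.log10(power(speech) / power(scaled_noise))
```

Lower `snr_db` values give a louder interferer, which is the regime where the notes above report the largest gap between ReVISE and VisualVoice.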

Universal audio-visual speech enhancement

Single ReVISE model trained on all four types of distortion
Universal model beats or matches distortion-specific model on almost all tasks
Exception is video-to-speech synthesis, where universal model is 0.5% WER worse

AV enhancement on real data: EasyCom

ReVISE substantially enhances the intelligibility of noisy speech compared to baseline methods
ReVISE produces audio of higher quality than the target audio itself
Pre-training P-AVSR is important for performance
Predicting SSL units is better than predicting spectrograms
ReVISE performs decently on video-to-speech synthesis without being trained on the task
Pre-training AV-HuBERT is important for performance
Mouth cropping helps bridge gap between pre-training and fine-tuning
Visual input is important for performance
Fine-tuning data does not have a large impact
ReVISE does not focus on reconstructing exact reference signal
ReVISE does not infer speaker identity
