Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposes unifying the separately studied speech enhancement tasks under a single problem, Generalized Speech Enhancement
  • The goal is to improve selected aspects of speech (intelligibility, quality, and video synchronization) rather than to reconstruct the exact reference signal
  • The approach has two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS)
  • Model is called ReVISE and is evaluated on EasyCom, an audio-visual benchmark

Paper Content

Introduction

  • Speech in-the-wild is often corrupted with natural and non-natural sounds.
  • Recording devices and networks can also introduce distortion.
  • Distortion makes it hard for humans and machines to comprehend speech.
  • Improving the quality and intelligibility of corrupted speech is essential.
  • Speech enhancement is the process of generating clean speech from its corrupted version.
  • Audio-visual speech enhancement uses visual speech to provide auxiliary information.
  • Prior work often treats enhancement from each type of distortion as a separate problem.
  • This paper advocates a more holistic approach to audio-visual speech enhancement.
  • The goal of this approach is to enhance a predefined set of attributes.
  • This paper proposes ReVISE, which combines a pseudo audio-visual speech recognition (P-AVSR) model with a pseudo text-to-speech synthesis (P-TTS) model.
  • ReVISE is evaluated on four types of corrupted speech.
  • ReVISE is the first model capable of high-quality in-the-wild video-to-speech synthesis.
  • ReVISE is also evaluated on a challenging audio-visual speech dataset.

Background

  • Introduces the audio-visual speech enhancement tasks
  • Reviews prior studies, which approach these tasks separately or jointly
  • Contrasts the proposed approach with the literature
  • Introduces self-supervised speech resynthesis

Audio-visual speech enhancement tasks

  • Audio-visual speech enhancement is the task of improving the quality of corrupted speech
  • Training involves providing tuples of clean speech, corrupted speech, and corresponding talking head video
  • Tasks are divided depending on type of distortion applied to clean speech
  • Masking-based methods are widely adopted to predict a mask
  • Speech inpainting and video-to-speech synthesis require generation-based methods
  • Universal enhancement models are audio-based and do not leverage auxiliary input
  • In contrast, the proposed model predicts a self-supervised representation of the reference clean speech
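
The masking-based approach mentioned in the list above can be illustrated with a minimal sketch. It assumes magnitude spectrograms stored as nested lists and uses a real-valued ratio mask with exponent `beta`. The function names and toy values are hypothetical, and the masks used by the cited systems (for example, complex masks) are more elaborate.

```python
def ratio_mask(clean_mag, noise_mag, beta=0.5):
    """Real-valued ratio mask per time-frequency bin: (S^2 / (S^2 + N^2)) ** beta."""
    mask = []
    for clean_row, noise_row in zip(clean_mag, noise_mag):
        row = []
        for s, n in zip(clean_row, noise_row):
            denom = s * s + n * n
            # Bins with no energy at all get a mask value of 0.
            row.append((s * s / denom) ** beta if denom > 0 else 0.0)
        mask.append(row)
    return mask


def apply_mask(mixture_mag, mask):
    """Scale each bin of the mixture magnitude by its mask value."""
    return [[m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture_mag)]


# Toy example: one frame, two frequency bins (illustrative values only).
clean_mag = [[3.0, 0.0]]
noise_mag = [[4.0, 2.0]]
mixture_mag = [[5.0, 2.0]]

mask = ratio_mask(clean_mag, noise_mag)
enhanced = apply_mask(mixture_mag, mask)
```

During training the mask is computed from known clean and noise signals; at test time a model has to predict it from the corrupted input. A mask can only rescale what is already in the mixture, which is why the summary notes that inpainting and video-to-speech synthesis need generation-based methods.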

Self-supervised speech resynthesis

  • Previous studies show that HuBERT encodes mostly phonetic information and less about speaker and noise characteristics.
  • HuBERT is pre-trained with a masked cluster prediction objective.
  • HiFi-GAN is an end-to-end model that converts SSL units to waveform.

Method

  • Formulate the generalized speech enhancement problem
  • Introduce the proposed model

Problem formulation

  • Original speech and auxiliary view are assumed to be generated from a set of underlying factors through a bijective mapping
  • Corrupted speech is generated from the original speech with a corruption function
  • Audio-visual speech enhancement is framed as estimating the probability of the factors given the corrupted speech
  • When distortion is high, multiple sets of factors can render same noisy speech
  • Reconstructing exact clean reference signal is ill-posed problem
  • Goal is to generate enhanced signal that is on manifold of clean speech and preserves factors of interest
  • Faithfulness is measured by discrepancy between inverse mapping of clean and corrupted speech
  • Quality is measured by metrics commonly used for text-to-speech synthesis
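
The formulation above can be written compactly as follows. The symbols $z$, $g$, $c$, $x$, $y$, and $\tilde{x}$ are notation assumed here for illustration and may not match the paper's.

```latex
\begin{aligned}
(x, y) &= g(z)
  && \text{clean speech } x \text{ and auxiliary view } y \text{ from factors } z,\ g \text{ bijective} \\
\tilde{x} &= c(x)
  && \text{corrupted speech produced by a corruption function } c \\
\hat{z} &\sim p(z \mid \tilde{x})
  && \text{enhancement as estimating the factors from the corrupted speech}
\end{aligned}
```

When $c$ discards a lot of information, many values of $z$ map to the same $\tilde{x}$, so $p(z \mid \tilde{x})$ is broad. This is the sense in which exact reconstruction of $x$ is ill-posed.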

Model

  • SSL speech tokenizer uses a BASE HuBERT model composed of a convolutional encoder and 12 Transformer layers
  • 221K hours of unlabeled speech data from 8 languages are used for pre-training
  • SSL units are generated by clustering the third iteration feature at the last layer with a codebook size of 2000
  • HiFi-GAN model is trained for 400K updates on 8 GPUs on the LJSpeech dataset
  • LARGE AV-HuBERT model is used by default, taking video and/or audio as input
  • P-AVSR models are fine-tuned on 8 GPUs for less than 45K updates
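
The unit-generation step in the list above amounts to assigning each frame-level feature to its nearest cluster centroid. The sketch below shows only that assignment, assuming a pre-computed codebook. The function names, the 2-dimensional features, and the 3-entry codebook are hypothetical stand-ins for HuBERT features and the 2000-entry codebook.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame-level feature vector to the index of its nearest centroid."""
    units = []
    for frame in features:
        distances = [squared_distance(frame, centroid) for centroid in codebook]
        units.append(distances.index(min(distances)))
    return units


# Toy codebook with 3 centroids (the summary describes a codebook of size 2000).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
# Toy "features": one 2-dimensional vector per frame.
features = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.8, 0.7]]

units = quantize(features, codebook)
print(units)  # [0, 1, 2, 1]
```

The resulting integer sequence plays the role of pseudo text: the P-AVSR model is trained to predict it, and the unit-based vocoder converts it back to a waveform.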

Evaluation

  • The model and baselines are evaluated on content, synchronization, quality, and low-level detail reconstruction
  • Content evaluated using WER computed with speech recognition model
  • Synchronization evaluated using SyncNet metrics
  • Quality evaluated using subjective MOS studies with a scale from 1 to 5
  • Low-level detail reconstruction evaluated using ESTOI and MCD
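
Two of the metrics above have simple closed forms, sketched here. WER is computed as word-level edit distance divided by reference length. MCD uses the common definition (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²), averaged over frames. The sketch assumes the frames are already time-aligned and the energy coefficient is already excluded. The paper's exact evaluation setup (ASR model, text normalization, alignment) is not specified in this summary.

```python
import math


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def mcd(ref_frames, syn_frames) -> float:
    """Mean mel-cepstral distortion in dB over time-aligned cepstral frames."""
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("frame sequences must be non-empty and time-aligned")
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, syn)))
    return const * total / len(ref_frames)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion out of six words
print(mcd([[1.0, 0.0]], [[0.0, 0.0]]))
```

Lower is better for both. WER measures whether the content survived enhancement, while MCD measures how closely the spectral detail matches the reference.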

Results

  • Quantitative results are presented in the paper
  • Supplementary material provides samples for comparison to baselines

Ground truth and resynthesis performance

  • Reference clean speech and resynthesized clean speech were evaluated with proposed metrics
  • Results show that intelligibility, synchronization, and quality are slightly degraded when tokenizing speech into SSL units
  • Performance of ReVISE model is roughly upper-bounded by performance of resynthesized speech

Video-to-speech synthesis

  • ReVISE is compared to SVTS, the state-of-the-art model for video-to-speech synthesis
  • SVTS is composed of a video-to-spectrogram predictor and a neural vocoder
  • SVTS has strong results on two constrained datasets
  • Audio quality generated by SVTS is mediocre
  • ReVISE generates higher-quality audio and achieves lower WER

Audio-visual speech inpainting

  • ReVISE is evaluated on audio-visual speech inpainting task
  • Previous work only evaluated on constrained GRID dataset
  • ReVISE is compared with Demucs, a publicly available audio enhancement model
  • Resynthesis is also used as a baseline
  • Intelligibility and synchronization degrade as percentage of dropped frames increases
  • Existing audio enhancement approaches fail to generalize to inpainting
  • ReVISE effectively improves intelligibility and synchronization
  • ReVISE uses additional audio information to improve content reconstruction
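
To make "percentage of dropped frames" concrete, the sketch below simulates an inpainting-style corruption by zeroing a random subset of audio feature frames. This summary does not state how the paper selects the dropped frames, so the random selection, the function name, and the seed parameter are assumptions made for illustration.

```python
import random


def drop_frames(frames, drop_ratio, seed=0):
    """Zero out a random subset of frames; returns (corrupted frames, dropped indices)."""
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    n_drop = round(len(frames) * drop_ratio)
    dropped = set(rng.sample(range(len(frames)), n_drop))
    corrupted = [
        [0.0] * len(frame) if index in dropped else list(frame)
        for index, frame in enumerate(frames)
    ]
    return corrupted, sorted(dropped)


# Ten toy frames with two features each (illustrative values only).
frames = [[float(i + 1), float(i + 1) * 0.5] for i in range(10)]
corrupted, dropped = drop_frames(frames, drop_ratio=0.3)
```

Dropped audio frames carry no information about the missing content, while the video stream is unaffected. This is consistent with the observations above that audio-only enhancement fails to generalize to inpainting and that visual input helps recover the content.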

Audio-visual speech denoising

  • Tab. 3 presents results of audio-visual speech denoising on four test splits.
  • Two audio-based baselines and VisualVoice are compared.
  • VisualVoice predicts a complex IRM given noisy audio, lip video, and speaker face image as input.
  • ReVISE is more robust to higher levels of noise.

Audio-visual source separation

  • Four noisy test sets were created with different levels of SNR.
  • Audio model (Demucs) performs worse with speech noise than with non-speech noise at the same SNR level.
  • Audio-visual models report similar performance on the two tasks at high SNR, and better performance on separation at low SNR.
  • It is easier to remove speech noise if the model has auxiliary information to identify the target speech.
  • Comparing VisualVoice and ReVISE, ReVISE is better at low SNR.
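
Test sets at a given SNR are typically built by scaling the interfering signal before adding it to the clean speech. The sketch below uses the standard definition SNR = 10 · log10(P_speech / P_noise). The function names and toy signals are hypothetical, and the paper's exact mixing procedure is not described in this summary.

```python
import math


def power(signal):
    """Average power of a signal given as a list of samples."""
    return sum(sample * sample for sample in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR; returns (mixture, scaled noise)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must have non-zero power")
    # Solve 10 * log10(P_speech / (scale^2 * P_noise)) = snr_db for scale.
    scale = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = [scale * sample for sample in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Toy signals (illustrative values only).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=10)
achieved_snr = 10 * math.log10(power(speech) / power(scaled_noise))
```

The same routine covers both settings discussed above: passing non-speech noise gives a denoising example, and passing another speaker's speech as `noise` gives a source separation example at the same SNR.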

Universal audio-visual speech enhancement

  • Single ReVISE model trained on all four types of distortion
  • Universal model beats or matches distortion-specific model on almost all tasks
  • Exception is video-to-speech synthesis, where universal model is 0.5% WER worse

AV enhancement on real data - EasyCom

  • ReVISE substantially enhances the intelligibility of noisy speech compared to baseline methods
  • ReVISE produces audio of higher quality than the target audio itself
  • Pre-training P-AVSR is important for performance
  • Predicting SSL units is better than predicting spectrograms
  • ReVISE performs decently on video-to-speech synthesis without being trained on the task
  • Pre-training AV-HuBERT is important for performance
  • Mouth cropping helps bridge gap between pre-training and fine-tuning
  • Visual input is important for performance
  • Fine-tuning data does not have a large impact
  • ReVISE does not focus on reconstructing exact reference signal
  • ReVISE does not infer speaker identity