Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Perceptual video quality assessment models are either frame-based or video-based.
Netflix developed the Video Multi-method Assessment Fusion (VMAF) framework to balance between high performance and computational efficiency.
We propose two improvements to the VMAF framework: SpatioTemporal VMAF and Ensemble VMAF.
Both algorithms exploit efficient temporal video features which are fed into a single or multiple regression models.
We designed a large subjective database and evaluated the proposed models against state-of-the-art approaches.

Paper Content

I. introduction

Video traffic from content delivery networks is expected to rise to 71% by 2021.
Perceptual video quality assessment (VQA) is an important component for many video applications.
VQA models can be used for codec comparisons, optimizing video coding protocols, and predicting quality of experience.
VQA models can be classified into three categories: full-reference (FR), reduced-reference (RR) and noreference (NR).
FR VQA models require an entire reference video signal to measure visual quality.
Image-based approaches exploit only spatial information.
Video-based FR VQA models have also been studied.
Data-driven models hold promise for the VQA problem.
Netflix recently announced the Video Multimethod Fusion Approach (VMAF).
Two ways to improve upon the current VMAF framework are proposed.

Ii. background on vmaf

A. how vmaf works

VMAF extracts elementary video quality metrics as features
VMAF includes feature extraction, training/testing and temporal pooling steps
DLM, VIF and TI features are extracted
Average value of each feature map is calculated
Features are aggregated over entire video sequence for training
SVR model is used to predict overall video quality

B. vmaf limitations and advantages

VMAF was developed for a specific application context
Compression artifacts and scaling artifacts are two main video impairments of interest
VMAF has been trained on video sources with film grain noise
TI feature attempts to capture motion masking effects
VMAF must generalize well on unseen video distortions

Iii. spatiotemporal vmaf

A. s-speed and t-speed features

Extracting temporal quality information is important for VQA models.
Space-time VQA models are computationally intensive.
Statistical models of frame differences are used to extract temporal quality measurements.
Entropic differencing is used to predict visual quality.
Conditioning is applied to remove local correlations from band-pass filtered coefficients.
Entropy differences between a reference and a distorted video are calculated.
Weighted block entropy values are differenced to measure the statistical distance between the GSM models of the distorted and reference video frames.

B. spatiotemporal feature integration

SpEED-QA does not account for film grain effects
SpEED-QA is not data-driven
SpEED-QA lags in performance on NFLX dataset
SpEED-QA does not exploit multiscale information
ST-VMAF deploys 12 features to capture spatial and temporal information

Iv. ensemble vmaf

A. why an ensemble model?

VMAF can be improved by integrating temporal quality measurements
Increasing feature dimensionality or using deep neural networks can lead to overfitting
Feature subsets may work well on one dataset but not another
Model fusion combines multiple individual learners
Experiments showed SVR predictions outperformed Random Forest predictions
SVRs have been demonstrated to be effective for image and video quality assessment

B. an ensemble approach to video quality assessment

Propose E-VMAF, an ensemble enhancement to VMAF
Train two SVR models on the VMAF+ database
First model uses 6 VMAF features
Second model uses 3 T-SPEED and 3 S-SPEED features
Weighted averages of individual predictions from models did not yield significant performance gains

V. the vmaf+ subjective dataset

Data-driven approaches to VQA depend on the training data used to train the regressor engines.
A useful training dataset should include diverse and realistic video content and simulate practical distortions.
Collecting consistent subjective data can increase the performance of data-driven VQA models.
A large-scale subjective study was conducted targeting multiple viewing devices and video streams.
29 10-second video clips from Netflix were used, with different resolutions and frame rates.

Vi. experimental analysis

Three different video quality applications were discussed
Three evaluation metrics were used: SROCC, PLCC and RMSE
Non-linear logistic fitting was applied to the objective scores
Nine VQA methods were tested
Experiments used a variety of subjective video databases
Databases contained a range of distortion types
Videos had various resolutions and frame rates

A. video quality prediction

Problem of video quality prediction
Cross-database results used to evaluate performance
PSNR delivered worst performance
ST-MAD did not perform well and was time-consuming
VMAF 0.6.1 performed well on NFLX dataset
ST-RRED and SpEED-QA performed well on most databases but not NFLX
VQM-VFD delivered similar SROCC performance
VMAF does not exploit temporal information and does not generalize well
ST-VMAF and E-VMAF achieved standout performance
ST-VMAF and E-VMAF improved on VMAF

B. monotonicity analysis

E-VMAF does not require an increased feature dimensionality and is less likely to overfit
Monotonicity of ST-VMAF and E-VMAF was studied when encoding resolution and compression ratio were varied
Consistent objective model should satisfy o c1 ≤ o c2 and o r1 ≤ o r2
ST-VMAF loses monotonicity for large CRF values, indicating overfitting
E-VMAF better preserves monotonicity, but still needs improvement

C. jnd prediction

VQA models can be used for JND detection
USC-JND dataset was designed for JND detection
Binary search procedure is used to determine QP values for JND points
3520 videos labeled by subjective scores 0, 1, 2 and 3
ST-VMAF and E-VMAF outperformed other VQA models

D. qoe prediction

Streaming video QoE prediction is an important application of perceptual video quality models.
Predominant video impairments during streaming are compression, scaling artifacts, and rebuffering events.
ST-VMAF and E-VMAF models performed better than other FR-VQA models.
ST-VMAF outperformed other VQA models when used in a continuous-time QoE predictor.

E. observations and takeaways

ST-VMAF and E-VMAF performed well for video quality and JND prediction
Performance of the two is similar
E-VMAF better preserves monotonicity at high compressions
E-VMAF is easier to tune
ST-VMAF requires more careful SVR tuning

Vii. future work

Developed two high-performing, data-driven full reference video quality assessment models
Plan to combine NR source VQA measurements with FR system to account for degradations of original source/reference video

Viii. acknowledgements

Acknowledgement of Anush K. Moorthy, Ioannis Katsavounidis and Anne Aaron
Support from the three individuals mentioned

Ix. appendix

A. cross-database performance for st-vmaf and e-vmaf

Proposed models rely on three components
VMAF+ training subjective data is highly consistent and serves as an excellent training dataset
Aggregate SROCC values achieved by ST-VMAF close to or exceed 0.8
Performance of ST-VMAF and E-VMAF compared to individual models M1 and M2
Performance gains of hysteresis pooling

B. computational analysis for st-vmaf and e-vmaf

VQA models need to have low time complexity to be deployed at global scale
Analysis of compute time was done on videos of 6 different resolutions
ST-MAD had the most compute time, followed by ST-RRED and VQM-VFD
ST-MAD and VQM-VFD had out-of-memory issues on long Full HD videos
ST-VMAF and E-VMAF are memory efficient and consume less compute time

Link to paper#

Abstract#

Paper Content#

I. introduction#

Ii. background on vmaf#

A. how vmaf works#

B. vmaf limitations and advantages#

Iii. spatiotemporal vmaf#

A. s-speed and t-speed features#

B. spatiotemporal feature integration#

Iv. ensemble vmaf#

A. why an ensemble model?#

B. an ensemble approach to video quality assessment#

V. the vmaf+ subjective dataset#

Vi. experimental analysis#

A. video quality prediction#

B. monotonicity analysis#

C. jnd prediction#

D. qoe prediction#

E. observations and takeaways#

Vii. future work#

Viii. acknowledgements#

Ix. appendix#

A. cross-database performance for st-vmaf and e-vmaf#

B. computational analysis for st-vmaf and e-vmaf#

Link to paper

Abstract

Paper Content

I. introduction

Ii. background on vmaf

A. how vmaf works

B. vmaf limitations and advantages

Iii. spatiotemporal vmaf

A. s-speed and t-speed features

B. spatiotemporal feature integration

Iv. ensemble vmaf

A. why an ensemble model?

B. an ensemble approach to video quality assessment

V. the vmaf+ subjective dataset

Vi. experimental analysis

A. video quality prediction

B. monotonicity analysis

C. jnd prediction

D. qoe prediction

E. observations and takeaways

Vii. future work

Viii. acknowledgements

Ix. appendix

A. cross-database performance for st-vmaf and e-vmaf

B. computational analysis for st-vmaf and e-vmaf