Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Perceptual video quality assessment models are either frame-based or video-based.
  • Netflix developed the Video Multi-method Assessment Fusion (VMAF) framework to balance between high performance and computational efficiency.
  • We propose two improvements to the VMAF framework: SpatioTemporal VMAF and Ensemble VMAF.
  • Both algorithms exploit efficient temporal video features which are fed into a single or multiple regression models.
  • We designed a large subjective database and evaluated the proposed models against state-of-the-art approaches.

Paper Content

I. introduction

  • Video traffic from content delivery networks is expected to rise to 71% by 2021.
  • Perceptual video quality assessment (VQA) is an important component for many video applications.
  • VQA models can be used for codec comparisons, optimizing video coding protocols, and predicting quality of experience.
  • VQA models can be classified into three categories: full-reference (FR), reduced-reference (RR) and noreference (NR).
  • FR VQA models require an entire reference video signal to measure visual quality.
  • Image-based approaches exploit only spatial information.
  • Video-based FR VQA models have also been studied.
  • Data-driven models hold promise for the VQA problem.
  • Netflix recently announced the Video Multimethod Fusion Approach (VMAF).
  • Two ways to improve upon the current VMAF framework are proposed.

Ii. background on vmaf

A. how vmaf works

  • VMAF extracts elementary video quality metrics as features
  • VMAF includes feature extraction, training/testing and temporal pooling steps
  • DLM, VIF and TI features are extracted
  • Average value of each feature map is calculated
  • Features are aggregated over entire video sequence for training
  • SVR model is used to predict overall video quality

B. vmaf limitations and advantages

  • VMAF was developed for a specific application context
  • Compression artifacts and scaling artifacts are two main video impairments of interest
  • VMAF has been trained on video sources with film grain noise
  • TI feature attempts to capture motion masking effects
  • VMAF must generalize well on unseen video distortions

Iii. spatiotemporal vmaf

A. s-speed and t-speed features

  • Extracting temporal quality information is important for VQA models.
  • Space-time VQA models are computationally intensive.
  • Statistical models of frame differences are used to extract temporal quality measurements.
  • Entropic differencing is used to predict visual quality.
  • Conditioning is applied to remove local correlations from band-pass filtered coefficients.
  • Entropy differences between a reference and a distorted video are calculated.
  • Weighted block entropy values are differenced to measure the statistical distance between the GSM models of the distorted and reference video frames.

B. spatiotemporal feature integration

  • SpEED-QA does not account for film grain effects
  • SpEED-QA is not data-driven
  • SpEED-QA lags in performance on NFLX dataset
  • SpEED-QA does not exploit multiscale information
  • ST-VMAF deploys 12 features to capture spatial and temporal information

Iv. ensemble vmaf

A. why an ensemble model?

  • VMAF can be improved by integrating temporal quality measurements
  • Increasing feature dimensionality or using deep neural networks can lead to overfitting
  • Feature subsets may work well on one dataset but not another
  • Model fusion combines multiple individual learners
  • Experiments showed SVR predictions outperformed Random Forest predictions
  • SVRs have been demonstrated to be effective for image and video quality assessment

B. an ensemble approach to video quality assessment

  • Propose E-VMAF, an ensemble enhancement to VMAF
  • Train two SVR models on the VMAF+ database
  • First model uses 6 VMAF features
  • Second model uses 3 T-SPEED and 3 S-SPEED features
  • Weighted averages of individual predictions from models did not yield significant performance gains

V. the vmaf+ subjective dataset

  • Data-driven approaches to VQA depend on the training data used to train the regressor engines.
  • A useful training dataset should include diverse and realistic video content and simulate practical distortions.
  • Collecting consistent subjective data can increase the performance of data-driven VQA models.
  • A large-scale subjective study was conducted targeting multiple viewing devices and video streams.
  • 29 10-second video clips from Netflix were used, with different resolutions and frame rates.

Vi. experimental analysis

  • Three different video quality applications were discussed
  • Three evaluation metrics were used: SROCC, PLCC and RMSE
  • Non-linear logistic fitting was applied to the objective scores
  • Nine VQA methods were tested
  • Experiments used a variety of subjective video databases
  • Databases contained a range of distortion types
  • Videos had various resolutions and frame rates

A. video quality prediction

  • Problem of video quality prediction
  • Cross-database results used to evaluate performance
  • PSNR delivered worst performance
  • ST-MAD did not perform well and was time-consuming
  • VMAF 0.6.1 performed well on NFLX dataset
  • ST-RRED and SpEED-QA performed well on most databases but not NFLX
  • VQM-VFD delivered similar SROCC performance
  • VMAF does not exploit temporal information and does not generalize well
  • ST-VMAF and E-VMAF achieved standout performance
  • ST-VMAF and E-VMAF improved on VMAF

B. monotonicity analysis

  • E-VMAF does not require an increased feature dimensionality and is less likely to overfit
  • Monotonicity of ST-VMAF and E-VMAF was studied when encoding resolution and compression ratio were varied
  • Consistent objective model should satisfy o c1 โ‰ค o c2 and o r1 โ‰ค o r2
  • ST-VMAF loses monotonicity for large CRF values, indicating overfitting
  • E-VMAF better preserves monotonicity, but still needs improvement

C. jnd prediction

  • VQA models can be used for JND detection
  • USC-JND dataset was designed for JND detection
  • Binary search procedure is used to determine QP values for JND points
  • 3520 videos labeled by subjective scores 0, 1, 2 and 3
  • ST-VMAF and E-VMAF outperformed other VQA models

D. qoe prediction

  • Streaming video QoE prediction is an important application of perceptual video quality models.
  • Predominant video impairments during streaming are compression, scaling artifacts, and rebuffering events.
  • ST-VMAF and E-VMAF models performed better than other FR-VQA models.
  • ST-VMAF outperformed other VQA models when used in a continuous-time QoE predictor.

E. observations and takeaways

  • ST-VMAF and E-VMAF performed well for video quality and JND prediction
  • Performance of the two is similar
  • E-VMAF better preserves monotonicity at high compressions
  • E-VMAF is easier to tune
  • ST-VMAF requires more careful SVR tuning

Vii. future work

  • Developed two high-performing, data-driven full reference video quality assessment models
  • Plan to combine NR source VQA measurements with FR system to account for degradations of original source/reference video

Viii. acknowledgements

  • Acknowledgement of Anush K. Moorthy, Ioannis Katsavounidis and Anne Aaron
  • Support from the three individuals mentioned

Ix. appendix

A. cross-database performance for st-vmaf and e-vmaf

  • Proposed models rely on three components
  • VMAF+ training subjective data is highly consistent and serves as an excellent training dataset
  • Aggregate SROCC values achieved by ST-VMAF close to or exceed 0.8
  • Performance of ST-VMAF and E-VMAF compared to individual models M1 and M2
  • Performance gains of hysteresis pooling

B. computational analysis for st-vmaf and e-vmaf

  • VQA models need to have low time complexity to be deployed at global scale
  • Analysis of compute time was done on videos of 6 different resolutions
  • ST-MAD had the most compute time, followed by ST-RRED and VQM-VFD
  • ST-MAD and VQM-VFD had out-of-memory issues on long Full HD videos
  • ST-VMAF and E-VMAF are memory efficient and consume less compute time