Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Perceptual video quality assessment models are either frame-based or video-based.
- Netflix developed the Video Multi-method Assessment Fusion (VMAF) framework to balance between high performance and computational efficiency.
- We propose two improvements to the VMAF framework: SpatioTemporal VMAF and Ensemble VMAF.
- Both algorithms exploit efficient temporal video features which are fed into a single or multiple regression models.
- We designed a large subjective database and evaluated the proposed models against state-of-the-art approaches.
Paper Content
I. introduction
- Video traffic from content delivery networks is expected to rise to 71% by 2021.
- Perceptual video quality assessment (VQA) is an important component for many video applications.
- VQA models can be used for codec comparisons, optimizing video coding protocols, and predicting quality of experience.
- VQA models can be classified into three categories: full-reference (FR), reduced-reference (RR) and noreference (NR).
- FR VQA models require an entire reference video signal to measure visual quality.
- Image-based approaches exploit only spatial information.
- Video-based FR VQA models have also been studied.
- Data-driven models hold promise for the VQA problem.
- Netflix recently announced the Video Multimethod Fusion Approach (VMAF).
- Two ways to improve upon the current VMAF framework are proposed.
Ii. background on vmaf
A. how vmaf works
- VMAF extracts elementary video quality metrics as features
- VMAF includes feature extraction, training/testing and temporal pooling steps
- DLM, VIF and TI features are extracted
- Average value of each feature map is calculated
- Features are aggregated over entire video sequence for training
- SVR model is used to predict overall video quality
B. vmaf limitations and advantages
- VMAF was developed for a specific application context
- Compression artifacts and scaling artifacts are two main video impairments of interest
- VMAF has been trained on video sources with film grain noise
- TI feature attempts to capture motion masking effects
- VMAF must generalize well on unseen video distortions
Iii. spatiotemporal vmaf
A. s-speed and t-speed features
- Extracting temporal quality information is important for VQA models.
- Space-time VQA models are computationally intensive.
- Statistical models of frame differences are used to extract temporal quality measurements.
- Entropic differencing is used to predict visual quality.
- Conditioning is applied to remove local correlations from band-pass filtered coefficients.
- Entropy differences between a reference and a distorted video are calculated.
- Weighted block entropy values are differenced to measure the statistical distance between the GSM models of the distorted and reference video frames.
B. spatiotemporal feature integration
- SpEED-QA does not account for film grain effects
- SpEED-QA is not data-driven
- SpEED-QA lags in performance on NFLX dataset
- SpEED-QA does not exploit multiscale information
- ST-VMAF deploys 12 features to capture spatial and temporal information
Iv. ensemble vmaf
A. why an ensemble model?
- VMAF can be improved by integrating temporal quality measurements
- Increasing feature dimensionality or using deep neural networks can lead to overfitting
- Feature subsets may work well on one dataset but not another
- Model fusion combines multiple individual learners
- Experiments showed SVR predictions outperformed Random Forest predictions
- SVRs have been demonstrated to be effective for image and video quality assessment
B. an ensemble approach to video quality assessment
- Propose E-VMAF, an ensemble enhancement to VMAF
- Train two SVR models on the VMAF+ database
- First model uses 6 VMAF features
- Second model uses 3 T-SPEED and 3 S-SPEED features
- Weighted averages of individual predictions from models did not yield significant performance gains
V. the vmaf+ subjective dataset
- Data-driven approaches to VQA depend on the training data used to train the regressor engines.
- A useful training dataset should include diverse and realistic video content and simulate practical distortions.
- Collecting consistent subjective data can increase the performance of data-driven VQA models.
- A large-scale subjective study was conducted targeting multiple viewing devices and video streams.
- 29 10-second video clips from Netflix were used, with different resolutions and frame rates.
Vi. experimental analysis
- Three different video quality applications were discussed
- Three evaluation metrics were used: SROCC, PLCC and RMSE
- Non-linear logistic fitting was applied to the objective scores
- Nine VQA methods were tested
- Experiments used a variety of subjective video databases
- Databases contained a range of distortion types
- Videos had various resolutions and frame rates
A. video quality prediction
- Problem of video quality prediction
- Cross-database results used to evaluate performance
- PSNR delivered worst performance
- ST-MAD did not perform well and was time-consuming
- VMAF 0.6.1 performed well on NFLX dataset
- ST-RRED and SpEED-QA performed well on most databases but not NFLX
- VQM-VFD delivered similar SROCC performance
- VMAF does not exploit temporal information and does not generalize well
- ST-VMAF and E-VMAF achieved standout performance
- ST-VMAF and E-VMAF improved on VMAF
B. monotonicity analysis
- E-VMAF does not require an increased feature dimensionality and is less likely to overfit
- Monotonicity of ST-VMAF and E-VMAF was studied when encoding resolution and compression ratio were varied
- Consistent objective model should satisfy o c1 โค o c2 and o r1 โค o r2
- ST-VMAF loses monotonicity for large CRF values, indicating overfitting
- E-VMAF better preserves monotonicity, but still needs improvement
C. jnd prediction
- VQA models can be used for JND detection
- USC-JND dataset was designed for JND detection
- Binary search procedure is used to determine QP values for JND points
- 3520 videos labeled by subjective scores 0, 1, 2 and 3
- ST-VMAF and E-VMAF outperformed other VQA models
D. qoe prediction
- Streaming video QoE prediction is an important application of perceptual video quality models.
- Predominant video impairments during streaming are compression, scaling artifacts, and rebuffering events.
- ST-VMAF and E-VMAF models performed better than other FR-VQA models.
- ST-VMAF outperformed other VQA models when used in a continuous-time QoE predictor.
E. observations and takeaways
- ST-VMAF and E-VMAF performed well for video quality and JND prediction
- Performance of the two is similar
- E-VMAF better preserves monotonicity at high compressions
- E-VMAF is easier to tune
- ST-VMAF requires more careful SVR tuning
Vii. future work
- Developed two high-performing, data-driven full reference video quality assessment models
- Plan to combine NR source VQA measurements with FR system to account for degradations of original source/reference video
Viii. acknowledgements
- Acknowledgement of Anush K. Moorthy, Ioannis Katsavounidis and Anne Aaron
- Support from the three individuals mentioned
Ix. appendix
A. cross-database performance for st-vmaf and e-vmaf
- Proposed models rely on three components
- VMAF+ training subjective data is highly consistent and serves as an excellent training dataset
- Aggregate SROCC values achieved by ST-VMAF close to or exceed 0.8
- Performance of ST-VMAF and E-VMAF compared to individual models M1 and M2
- Performance gains of hysteresis pooling
B. computational analysis for st-vmaf and e-vmaf
- VQA models need to have low time complexity to be deployed at global scale
- Analysis of compute time was done on videos of 6 different resolutions
- ST-MAD had the most compute time, followed by ST-RRED and VQM-VFD
- ST-MAD and VQM-VFD had out-of-memory issues on long Full HD videos
- ST-VMAF and E-VMAF are memory efficient and consume less compute time