Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Deep learning-based 3D human pose estimation requires large amounts of labeled data.
  • Different datasets have different skeleton formats, making it difficult to combine them.
  • Separate output heads for different skeletons results in inconsistent depth estimates.
  • A novel affine-combining autoencoder (ACAE) method is proposed to reduce the number of landmarks.
  • 28 3D human pose datasets are used to supervise one model, which outperforms prior work.

Paper Content

Introduction

  • Research on 3D human pose estimation has made progress in recent years.
  • Best results are achieved with labeled training data.
  • Individual 3D pose datasets are small and lack diversity.
  • To provide best models for downstream applications, it is important to use many datasets in the training process.
  • Numerous publicly released, labeled datasets exist.
  • Different datasets do not use the same skeleton format for their labels.
  • Prior work has rectified differences through a handful of individually defined rules.
  • We propose a method to capture and exploit the redundancy among the different skeletons.
  • We assemble the largest scale meta-dataset for 3D human pose estimation to date.
  • We release scripts for reproducing the process.
  • 3D Human Pose Estimation is a current trend in computer science
  • Handling Discrepancy in Skeleton Formats is a problem
  • Keypoint Discovery is used to disentangle pose and shape in 2D and 3D
  • Linear Subspace Learning is a long-standing technique
  • Autoencoders are used to learn a latent keypoint space
  • Autoencoders are used to regularize an initial model during fine-tuning

Method

  • Goal is to obtain a strong, monocular RGB-based 3D human pose estimation model
  • Integrating numerous datasets into one mixed training process
  • Different datasets provide annotations according to different skeleton formats
  • Workflow consists of three main steps
  • First step is to train an initial model with separate prediction heads
  • Second step is to train an undercomplete geometry-aware autoencoder
  • Third step is to use the trained autoencoder to make the model output consistent across skeleton formats

Initial model training

  • Trained an initial pose estimator to predict all J joints separately
  • No correspondences or relations across different skeletons assumed
  • Pose loss minimized is L pose = L meanrel + λ proj L proj + λ abs L abs
  • Inconsistencies among output skeletons along the challenging depth axis

Pseudo-ground truth generation

  • We need pose labels for all skeleton formats for the same examples to compare them.
  • We generate pseudo-ground truth using the initial separatehead model.
  • We choose a clean subset of the training data (H36M and MoVi) for this purpose.
  • This yields a set of K pseudo-ground truth poses with J joints.

Affine-combining autoencoder

  • Introduce a dimensionality reduction technique to improve consistency in estimating joints
  • Representation should be equivariant to rotation and translation
  • Representation consists of a list of L latent 3D points
  • Latent points span the overall structure of a pose
  • Latent points should have sparse influence on the joints
  • Adopt a constrained undercomplete linear autoencoder structure (ACAE)
  • ACAE’s encoder takes as input a list of J points and encodes them into L latent points
  • ACAE’s decoder reproduces the original points from the latents
  • ACAE is guaranteed to be rotation and translation equivariant
  • Use 1 regularization to achieve sparsity in the weights
  • Use 1 reconstruction loss to be robust to outliers
  • Adapt the problem formulation to take 2D projection into account
  • Autoencoder should be chirality-equivariant
  • Weight head and facial keypoints higher in the loss

Consistency fine-tuning

  • Autoencoder is trained on pseudo-ground truth
  • Weights are frozen and used to enhance 3D pose estimation outputs
  • Output Regularization: additional loss term measures consistency of prediction with latent space
  • Direct Latent Prediction: latents are directly predicted and fed to frozen decoder
  • Hybrid Student-Teacher: full prediction head and latent head are used, with student-teacher-like loss

Experimental setup

  • Adopted state-of-the-art Me-TRAbs 3D human pose estimator as platform
  • 400k training steps with AdamW and batch size 128
  • Used ghost BatchNorm with size 16
  • Final fine-tuning phase with 40k iterations and smaller learning rate
  • Autoencoder weights trained on pseudo-GT
  • 4 datasets used: MuPoTS, 3DPW, 3DHP, H36M
  • 4 metrics used: MPJPE, PMPJPE, PCK@100mm, CPS@200mm

Benefit of training on many datasets

  • Training models on individual datasets and evaluating on corresponding test splits is a baseline
  • Performance improves when training with more datasets
  • Small dataset combination outperforms single-dataset baselines
  • H36M scores sometimes suffer from additional data
  • Model trained on large dataset combination achieves strong scores
  • Despite good benchmark scores, skeleton outputs can still be inconsistent

Consistent multi-skeleton prediction

  • Merging joints from different skeletons reduces joint count from 555 to 163
  • Merging joints leads to weaker results than predicting all joints separately
  • H36M is an outlier, merging joints works well for it
  • Autoencoder-based regularization improves results and leads to consistent predictions
  • Hybrid approach is better than initial model trained to predict separate joints

Comparison to prior works

  • We compare our results to state-of-the-art models and observe better accuracy
  • We investigate how to best supervise models in a large-scale multi-dataset setting
  • We enforce chirality equivariance on the autoencoder
  • We analyze the effect of latent keypoint count on the reconstruction error
  • We propose a principled, automatic approach to large-scale multi-skeleton training of 3D human pose estimation
  • We regularize a 3D human pose estimator’s output to stay close to the learned latent space
  • We release code for data processing and training, as well as trained models
  • We use a novel formulation of dimensionality reduction of sets of keypoints
  • We use a learning rate schedule with exponential decay
  • We use a weak supervision loss for 2D-annotated examples
  • We use all cameras of 3DHP, and all HD cameras of CMU-Panoptic
  • We initialize with ImageNet-pretrained weights
  • We use YOLOv4 and DeepLabv3 for person bounding boxes and segmentation
  • We evaluate all 24 SMPL joints for 3DPW, and all 17 joints for 3DHP and MuPoTS
  • We use a batch composition based on the total number of examples in each dataset
  • We study the effect of training length on the final model performance
  • We study the effect of fine-tuning length on the final model performance