Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Deep learning-based 3D human pose estimation requires large amounts of labeled data.
- Different datasets have different skeleton formats, making it difficult to combine them.
- Separate output heads for different skeletons results in inconsistent depth estimates.
- A novel affine-combining autoencoder (ACAE) method is proposed to reduce the number of landmarks.
- 28 3D human pose datasets are used to supervise one model, which outperforms prior work.
Paper Content
Introduction
- Research on 3D human pose estimation has made progress in recent years.
- Best results are achieved with labeled training data.
- Individual 3D pose datasets are small and lack diversity.
- To provide best models for downstream applications, it is important to use many datasets in the training process.
- Numerous publicly released, labeled datasets exist.
- Different datasets do not use the same skeleton format for their labels.
- Prior work has rectified differences through a handful of individually defined rules.
- We propose a method to capture and exploit the redundancy among the different skeletons.
- We assemble the largest scale meta-dataset for 3D human pose estimation to date.
- We release scripts for reproducing the process.
Related work
- 3D Human Pose Estimation is a current trend in computer science
- Handling Discrepancy in Skeleton Formats is a problem
- Keypoint Discovery is used to disentangle pose and shape in 2D and 3D
- Linear Subspace Learning is a long-standing technique
- Autoencoders are used to learn a latent keypoint space
- Autoencoders are used to regularize an initial model during fine-tuning
Method
- Goal is to obtain a strong, monocular RGB-based 3D human pose estimation model
- Integrating numerous datasets into one mixed training process
- Different datasets provide annotations according to different skeleton formats
- Workflow consists of three main steps
- First step is to train an initial model with separate prediction heads
- Second step is to train an undercomplete geometry-aware autoencoder
- Third step is to use the trained autoencoder to make the model output consistent across skeleton formats
Initial model training
- Trained an initial pose estimator to predict all J joints separately
- No correspondences or relations across different skeletons assumed
- Pose loss minimized is L pose = L meanrel + λ proj L proj + λ abs L abs
- Inconsistencies among output skeletons along the challenging depth axis
Pseudo-ground truth generation
- We need pose labels for all skeleton formats for the same examples to compare them.
- We generate pseudo-ground truth using the initial separatehead model.
- We choose a clean subset of the training data (H36M and MoVi) for this purpose.
- This yields a set of K pseudo-ground truth poses with J joints.
Affine-combining autoencoder
- Introduce a dimensionality reduction technique to improve consistency in estimating joints
- Representation should be equivariant to rotation and translation
- Representation consists of a list of L latent 3D points
- Latent points span the overall structure of a pose
- Latent points should have sparse influence on the joints
- Adopt a constrained undercomplete linear autoencoder structure (ACAE)
- ACAE’s encoder takes as input a list of J points and encodes them into L latent points
- ACAE’s decoder reproduces the original points from the latents
- ACAE is guaranteed to be rotation and translation equivariant
- Use 1 regularization to achieve sparsity in the weights
- Use 1 reconstruction loss to be robust to outliers
- Adapt the problem formulation to take 2D projection into account
- Autoencoder should be chirality-equivariant
- Weight head and facial keypoints higher in the loss
Consistency fine-tuning
- Autoencoder is trained on pseudo-ground truth
- Weights are frozen and used to enhance 3D pose estimation outputs
- Output Regularization: additional loss term measures consistency of prediction with latent space
- Direct Latent Prediction: latents are directly predicted and fed to frozen decoder
- Hybrid Student-Teacher: full prediction head and latent head are used, with student-teacher-like loss
Experimental setup
- Adopted state-of-the-art Me-TRAbs 3D human pose estimator as platform
- 400k training steps with AdamW and batch size 128
- Used ghost BatchNorm with size 16
- Final fine-tuning phase with 40k iterations and smaller learning rate
- Autoencoder weights trained on pseudo-GT
- 4 datasets used: MuPoTS, 3DPW, 3DHP, H36M
- 4 metrics used: MPJPE, PMPJPE, PCK@100mm, CPS@200mm
Benefit of training on many datasets
- Training models on individual datasets and evaluating on corresponding test splits is a baseline
- Performance improves when training with more datasets
- Small dataset combination outperforms single-dataset baselines
- H36M scores sometimes suffer from additional data
- Model trained on large dataset combination achieves strong scores
- Despite good benchmark scores, skeleton outputs can still be inconsistent
Consistent multi-skeleton prediction
- Merging joints from different skeletons reduces joint count from 555 to 163
- Merging joints leads to weaker results than predicting all joints separately
- H36M is an outlier, merging joints works well for it
- Autoencoder-based regularization improves results and leads to consistent predictions
- Hybrid approach is better than initial model trained to predict separate joints
Comparison to prior works
- We compare our results to state-of-the-art models and observe better accuracy
- We investigate how to best supervise models in a large-scale multi-dataset setting
- We enforce chirality equivariance on the autoencoder
- We analyze the effect of latent keypoint count on the reconstruction error
- We propose a principled, automatic approach to large-scale multi-skeleton training of 3D human pose estimation
- We regularize a 3D human pose estimator’s output to stay close to the learned latent space
- We release code for data processing and training, as well as trained models
- We use a novel formulation of dimensionality reduction of sets of keypoints
- We use a learning rate schedule with exponential decay
- We use a weak supervision loss for 2D-annotated examples
- We use all cameras of 3DHP, and all HD cameras of CMU-Panoptic
- We initialize with ImageNet-pretrained weights
- We use YOLOv4 and DeepLabv3 for person bounding boxes and segmentation
- We evaluate all 24 SMPL joints for 3DPW, and all 17 joints for 3DHP and MuPoTS
- We use a batch composition based on the total number of examples in each dataset
- We study the effect of training length on the final model performance
- We study the effect of fine-tuning length on the final model performance