Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Deep learning-based 3D human pose estimation requires large amounts of labeled data.
Different datasets have different skeleton formats, making it difficult to combine them.
Separate output heads for different skeletons results in inconsistent depth estimates.
A novel affine-combining autoencoder (ACAE) method is proposed to reduce the number of landmarks.
28 3D human pose datasets are used to supervise one model, which outperforms prior work.

Paper Content

Introduction

Research on 3D human pose estimation has made progress in recent years.
Best results are achieved with labeled training data.
Individual 3D pose datasets are small and lack diversity.
To provide best models for downstream applications, it is important to use many datasets in the training process.
Numerous publicly released, labeled datasets exist.
Different datasets do not use the same skeleton format for their labels.
Prior work has rectified differences through a handful of individually defined rules.
We propose a method to capture and exploit the redundancy among the different skeletons.
We assemble the largest scale meta-dataset for 3D human pose estimation to date.
We release scripts for reproducing the process.

3D Human Pose Estimation is a current trend in computer science
Handling Discrepancy in Skeleton Formats is a problem
Keypoint Discovery is used to disentangle pose and shape in 2D and 3D
Linear Subspace Learning is a long-standing technique
Autoencoders are used to learn a latent keypoint space
Autoencoders are used to regularize an initial model during fine-tuning

Method

Goal is to obtain a strong, monocular RGB-based 3D human pose estimation model
Integrating numerous datasets into one mixed training process
Different datasets provide annotations according to different skeleton formats
Workflow consists of three main steps
First step is to train an initial model with separate prediction heads
Second step is to train an undercomplete geometry-aware autoencoder
Third step is to use the trained autoencoder to make the model output consistent across skeleton formats

Initial model training

Trained an initial pose estimator to predict all J joints separately
No correspondences or relations across different skeletons assumed
Pose loss minimized is L pose = L meanrel + λ proj L proj + λ abs L abs
Inconsistencies among output skeletons along the challenging depth axis

Pseudo-ground truth generation

We need pose labels for all skeleton formats for the same examples to compare them.
We generate pseudo-ground truth using the initial separatehead model.
We choose a clean subset of the training data (H36M and MoVi) for this purpose.
This yields a set of K pseudo-ground truth poses with J joints.

Affine-combining autoencoder

Introduce a dimensionality reduction technique to improve consistency in estimating joints
Representation should be equivariant to rotation and translation
Representation consists of a list of L latent 3D points
Latent points span the overall structure of a pose
Latent points should have sparse influence on the joints
Adopt a constrained undercomplete linear autoencoder structure (ACAE)
ACAE’s encoder takes as input a list of J points and encodes them into L latent points
ACAE’s decoder reproduces the original points from the latents
ACAE is guaranteed to be rotation and translation equivariant
Use 1 regularization to achieve sparsity in the weights
Use 1 reconstruction loss to be robust to outliers
Adapt the problem formulation to take 2D projection into account
Autoencoder should be chirality-equivariant
Weight head and facial keypoints higher in the loss

Consistency fine-tuning

Autoencoder is trained on pseudo-ground truth
Weights are frozen and used to enhance 3D pose estimation outputs
Output Regularization: additional loss term measures consistency of prediction with latent space
Direct Latent Prediction: latents are directly predicted and fed to frozen decoder
Hybrid Student-Teacher: full prediction head and latent head are used, with student-teacher-like loss

Experimental setup

Adopted state-of-the-art Me-TRAbs 3D human pose estimator as platform
400k training steps with AdamW and batch size 128
Used ghost BatchNorm with size 16
Final fine-tuning phase with 40k iterations and smaller learning rate
Autoencoder weights trained on pseudo-GT
4 datasets used: MuPoTS, 3DPW, 3DHP, H36M
4 metrics used: MPJPE, PMPJPE, PCK@100mm, CPS@200mm

Benefit of training on many datasets

Training models on individual datasets and evaluating on corresponding test splits is a baseline
Performance improves when training with more datasets
Small dataset combination outperforms single-dataset baselines
H36M scores sometimes suffer from additional data
Model trained on large dataset combination achieves strong scores
Despite good benchmark scores, skeleton outputs can still be inconsistent

Consistent multi-skeleton prediction

Merging joints from different skeletons reduces joint count from 555 to 163
Merging joints leads to weaker results than predicting all joints separately
H36M is an outlier, merging joints works well for it
Autoencoder-based regularization improves results and leads to consistent predictions
Hybrid approach is better than initial model trained to predict separate joints

Comparison to prior works

We compare our results to state-of-the-art models and observe better accuracy
We investigate how to best supervise models in a large-scale multi-dataset setting
We enforce chirality equivariance on the autoencoder
We analyze the effect of latent keypoint count on the reconstruction error
We propose a principled, automatic approach to large-scale multi-skeleton training of 3D human pose estimation
We regularize a 3D human pose estimator’s output to stay close to the learned latent space
We release code for data processing and training, as well as trained models
We use a novel formulation of dimensionality reduction of sets of keypoints
We use a learning rate schedule with exponential decay
We use a weak supervision loss for 2D-annotated examples
We use all cameras of 3DHP, and all HD cameras of CMU-Panoptic
We initialize with ImageNet-pretrained weights
We use YOLOv4 and DeepLabv3 for person bounding boxes and segmentation
We evaluate all 24 SMPL joints for 3DPW, and all 17 joints for 3DHP and MuPoTS
We use a batch composition based on the total number of examples in each dataset
We study the effect of training length on the final model performance
We study the effect of fine-tuning length on the final model performance

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Method#

Initial model training#

Pseudo-ground truth generation#

Affine-combining autoencoder#

Consistency fine-tuning#

Experimental setup#

Benefit of training on many datasets#

Consistent multi-skeleton prediction#

Comparison to prior works#

Link to paper

Abstract

Paper Content

Introduction

Related work

Method

Initial model training

Pseudo-ground truth generation

Affine-combining autoencoder

Consistency fine-tuning

Experimental setup

Benefit of training on many datasets

Consistent multi-skeleton prediction

Comparison to prior works