Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

ViTPose is a simple baseline model that uses a plain and non-hierarchical vision transformer for body pose estimation.
ViTPose can be scaled up from 20M to 1B parameters, providing a new Pareto front for throughput and performance.
ViTPose is flexible regarding attention type, input resolution, and pre-training and fine-tuning strategy.
ViTPose+ is a novel model that deals with heterogeneous body keypoint categories in different types of body pose estimation tasks.
ViTPose outperforms representative methods on the MS COCO Human Keypoint Detection benchmark and achieves state-of-the-art performance on a series of body pose estimation tasks.

Paper Content

Introduction

Body keypoint detection is a fundamental task in computer vision with a wide range of applications
Deep learning-based methods have been developed to deal with challenges such as occlusion, truncation, scales, and human appearances
Vision transformers have been explored for human pose estimation, typically using a CNN backbone and a transformer to refine the extracted features
ViTPose is proposed as a simple baseline model for body pose estimation, achieving a state-of-the-art performance
ViTPose+ is proposed to deal with multiple types of body keypoint detection via knowledge factorization
ViTPose is simple, scalable, flexible, and transferable

Representative pose estimation methods

Top-down methods

Pose estimation has been rapidly developed from CNN to vision transformers
Most methods focus on estimating pose from given human instances
High-resolution features are used in CNN networks
Some methods use transformers as decoders after CNN backbone
HRFormer uses transformers to extract high-resolution features directly
ViTPose is a simple yet effective baseline model based on plain vision transformers

Bottom-up methods

Bottom-up methods focus on locating and grouping human body keypoints
OpenPose uses a two-stage network to predict keypoints and group them
Associate embedding simplifies bottom-up pipelines by jointly estimating keypoints and associating them
HigherHRNet introduces multi-stage supervision
Plain vision transformers are explored for bottom-up pose estimation

Vision transformer pre-training

ViT [7] has been successful, leading to many different vision transformers being proposed
Fully supervised learning requires a large amount of labeled data, which is expensive
Pre-trained models may not generalize well to tasks with different data distributions
Self-supervised pre-training methods have been proposed, such as contrastive and generative
MAE [16] uses a masked image modeling (MIM) pretext task
MIM pre-trained models on ImageNet without labels demonstrate better generalization performance
This paper focuses on pose estimation tasks and explores whether or not using ImageNet for pre-training is necessary
Smaller unlabelled pose datasets can also provide a good initialization

Foundation models

Foundation model is a unified model used to deal with real-world problems.
Vision transformers are popular for building foundation models and leveraging data from multiple modalities.
MoE layers are used to enhance the representation ability of backbone networks.
ViTPose+ is a foundation model for generic body keypoint detection.
ViTPose+ decomposes FFN layers into task-agnostic and task-specific experts.
ViTPose+ outperforms ViTPose and sets new SOTA on public benchmarks.

Comparison to the conference version

Paper extends previous study with 3 major improvements
Develops foundation model for generic body keypoint detection
Provides more experiment results, ablation studies, and analyses
Visual results demonstrate promising performance of ViTPose+

Vitpose

ViTPose uses a vision transformer for feature extraction
ViTPose is compatible with different decoders for keypoint estimation
ViTPose has properties of simplicity, scalability, flexibility, and transferability
ViTPose+ model is proposed and achieves SOTA performance on pose estimation datasets

The simplicity of vitpose

Goal of paper is to provide a simple yet effective vision transformer baseline for body pose estimation tasks.
Structure is kept as simple as possible, avoiding fancy but complex modules.
Patch embedding layer is used to embed image into tokens.
Two kinds of lightweight decoders are used to regress the heatmaps of keypoints.

The scalability of vitpose

ViTPose has structural simplicity, allowing for easy control of model size.
ViTPose can benefit from the development of scalable vision transformers without modification.
ViTPose was tested using pre-trained vision transformers of different model sizes.
Performance increases with increased model size.

The flexibility of vitpose

Pre-training the backbone networks on ImageNet is a common practice, but it requires extra data
Can the data requirements be relaxed by using only pose data during training?
Pre-trained weights are used to initialize the backbone of ViTPose and fine-tuned on MS COCO
Performance of ViTPose increases with higher input resolution or higher feature resolution
Full attention on high-resolution feature maps causes a huge memory footprint and computational cost
Window-based attention with relative position embedding is used to address this issue
Shift window and pooling window techniques are used to improve performance and reduce memory footprint
Pre-trained transformer models can generalize well to other tasks by only tuning partial parameters
Explored potential of plain vision transformers in both top-down and bottom-up paradigms

The transferability of vitpose

Knowledge distillation is a method to improve the performance of small models by transferring knowledge from larger ones.
A token-based distillation method is proposed to bridge the large and small models.
The knowledge token is initialized and appended to the teacher model, then frozen and concatenated with the student model.
The student model is trained with a combination of token distillation loss and output distillation loss.

Vitpose+

A foundation model for generic body keypoint detection should be able to deal with different body pose estimation tasks.
There are differences between body keypoints in different pose estimation tasks.
A naive solution is to train a ViTPose model by multi-task learning.
There may be conflicts between different tasks.
ViTPose+ is proposed to address the challenge from the perspective of knowledge factorization.

Experiments

Datasets and evaluation metrics

Used MS COCO, AIC, MPII, COCO-W, AP-10K, APT-36K, Interhand2.6M, and OCHuman datasets
MS COCO has 118K images and 150K human instances with up to 17 keypoints
COCO-W has up to 133 keypoints for each instance
MPII has 25K images and 40K human instances with up to 16 keypoints
AIC has 200K images and 350 human instances with up to 14 keypoints
AP-10K and APT-36K have 10K and 36K images with up to 17 keypoints
Interhand2.6M has 2.6M labeled hand frames with 2D and 3D annotations
OCHuman has 4K images and 8K instances with heavy occlusions
Used AP and PCKh as evaluation metrics

Implementation details

ViTPose is a top-down approach for human pose estimation
ViTPose is evaluated on MS COCO Keypoint val set
ViT-S, ViT-B, ViT-L, ViT-H and ViTAE-G are used as backbone networks
MAE pretrained weights are used to initialize the backbones
AdamW optimizer with a learning rate of 5e-4 is used
Udp is used for post-processing
Models are trained for 210 epochs with a learning rate decay at 170th and 200th epochs

Ablation studies of vitpose and analysis

ViTPose performs well with a simple decoder and plain vision transformer
ViTPose has excellent scalability
Pre-training on data from downstream tasks has better data efficiency than ImageNet-1k
ViTPose adapts well to different input resolutions
Window-based attention alleviates out-of-memory issue but decreases performance
Partially fine-tuning the MHSA has moderate performance drop, while partially fine-tuning the FFN has significant performance drop
Token-based distillation brings gain of 0.2 AP with marginal cost of extra memory footprint
Output distillation brings gain of 0.5 AP with moderate cost of extra memory footprint
Combining token-based and output distillation brings gain of 0.6 AP

Ablation studies of vitpose+ and analysis

Different settings of vitpose+

ViTPose is a simple and lightweight decoder that can be extended to deal with multiple types of body pose estimation tasks
Training process involves MS COCO, AIC, MPII, COCO-W, AP-10K, and APT-36K datasets
Results on MS COCO val set show performance increases from 75.8 AP to 77.1 AP when using human pose estimation datasets
Using MPII for training brings 0.1 AP increase
Performance remains the same when using COCO-W
Performance drops when using animal datasets
ViTPose+ has 3 variants: Independent FFN, Independent and Shared FFN, and Partially Shared FFN
Results show flexibility and potential for building a foundation model towards generic body pose estimation

Comparison with sota methods

The performance on ms coco

ViTPose and ViTPose+ compared to SOTA methods on MS COCO dataset
ViTPose achieves better trade-off between throughput and accuracy
ViTPose-L outperforms previous SOTA methods based on CNN and transformers
ViTPose-L sets a new SOTA in bottom-up paradigm

The performance on other datasets

ViTPose+ models are tested on 6 different datasets
ViTPose+ outperforms previous methods on MPII val set
ViTPose+ performs better than CNN-based and transformer-based models on AIC val set
ViTPose+ performs better on COCO-W val set
Weights of task-specific FFN parts have larger similarity for human pose estimation than for body pose estimation of different species
Weights of FFN layers in deeper layers are more task-specific than those in shallower layers

Limitations and discussion

ViTPose and ViTPose+ have good properties and performance for body pose estimation
Room to improve further by designing a unified foundation model for simultaneous object detection and pose estimation
Leverage multimodality data and language knowledge of body keypoints to help body pose estimation

Conclusion

ViTPose and ViTPose+ proposed as simple baseline and foundation models for body pose estimation
Properties of ViTPose include simplicity, scalability, flexibility, and transferability
Extensive experiments conducted on MS COCO, AIC, MPII, OCHuman, COCO-W, AP-10K, and APT-36K
New performance records set on these datasets
ViTPose+ obtains SOTA performance on representative benchmarks
ViTPose+ sets new SOTA on OCHuman, COCO-W, and APT-36K
ViTPose+ has strong feature representation ability
ViTPose+ has excellent scalability

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Representative pose estimation methods#

Top-down methods#

Bottom-up methods#

Vision transformer pre-training#

Foundation models#

Comparison to the conference version#

Vitpose#

The simplicity of vitpose#

The scalability of vitpose#

The flexibility of vitpose#

The transferability of vitpose#

Vitpose+#

Experiments#

Datasets and evaluation metrics#

Implementation details#

Ablation studies of vitpose and analysis#

Ablation studies of vitpose+ and analysis#

Different settings of vitpose+#

Comparison with sota methods#

The performance on ms coco#

The performance on other datasets#

Limitations and discussion#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Representative pose estimation methods

Top-down methods

Bottom-up methods

Vision transformer pre-training

Foundation models

Comparison to the conference version

Vitpose

The simplicity of vitpose

The scalability of vitpose

The flexibility of vitpose

The transferability of vitpose

Vitpose+

Experiments

Datasets and evaluation metrics

Implementation details

Ablation studies of vitpose and analysis

Ablation studies of vitpose+ and analysis

Different settings of vitpose+

Comparison with sota methods

The performance on ms coco

The performance on other datasets

Limitations and discussion

Conclusion