Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- ViTPose is a simple baseline model that uses a plain and non-hierarchical vision transformer for body pose estimation.
- ViTPose can be scaled up from 20M to 1B parameters, providing a new Pareto front for throughput and performance.
- ViTPose is flexible regarding attention type, input resolution, and pre-training and fine-tuning strategy.
- ViTPose+ is a novel model that deals with heterogeneous body keypoint categories in different types of body pose estimation tasks.
- ViTPose outperforms representative methods on the MS COCO Human Keypoint Detection benchmark and achieves state-of-the-art performance on a series of body pose estimation tasks.
Paper Content
Introduction
- Body keypoint detection is a fundamental task in computer vision with a wide range of applications
- Deep learning-based methods have been developed to deal with challenges such as occlusion, truncation, scales, and human appearances
- Vision transformers have been explored for human pose estimation, typically using a CNN backbone and a transformer to refine the extracted features
- ViTPose is proposed as a simple baseline model for body pose estimation, achieving a state-of-the-art performance
- ViTPose+ is proposed to deal with multiple types of body keypoint detection via knowledge factorization
- ViTPose is simple, scalable, flexible, and transferable
Related work
Representative pose estimation methods
Top-down methods
- Pose estimation has been rapidly developed from CNN to vision transformers
- Most methods focus on estimating pose from given human instances
- High-resolution features are used in CNN networks
- Some methods use transformers as decoders after CNN backbone
- HRFormer uses transformers to extract high-resolution features directly
- ViTPose is a simple yet effective baseline model based on plain vision transformers
Bottom-up methods
- Bottom-up methods focus on locating and grouping human body keypoints
- OpenPose uses a two-stage network to predict keypoints and group them
- Associate embedding simplifies bottom-up pipelines by jointly estimating keypoints and associating them
- HigherHRNet introduces multi-stage supervision
- Plain vision transformers are explored for bottom-up pose estimation
Vision transformer pre-training
- ViT [7] has been successful, leading to many different vision transformers being proposed
- Fully supervised learning requires a large amount of labeled data, which is expensive
- Pre-trained models may not generalize well to tasks with different data distributions
- Self-supervised pre-training methods have been proposed, such as contrastive and generative
- MAE [16] uses a masked image modeling (MIM) pretext task
- MIM pre-trained models on ImageNet without labels demonstrate better generalization performance
- This paper focuses on pose estimation tasks and explores whether or not using ImageNet for pre-training is necessary
- Smaller unlabelled pose datasets can also provide a good initialization
Foundation models
- Foundation model is a unified model used to deal with real-world problems.
- Vision transformers are popular for building foundation models and leveraging data from multiple modalities.
- MoE layers are used to enhance the representation ability of backbone networks.
- ViTPose+ is a foundation model for generic body keypoint detection.
- ViTPose+ decomposes FFN layers into task-agnostic and task-specific experts.
- ViTPose+ outperforms ViTPose and sets new SOTA on public benchmarks.
Comparison to the conference version
- Paper extends previous study with 3 major improvements
- Develops foundation model for generic body keypoint detection
- Provides more experiment results, ablation studies, and analyses
- Visual results demonstrate promising performance of ViTPose+
Vitpose
- ViTPose uses a vision transformer for feature extraction
- ViTPose is compatible with different decoders for keypoint estimation
- ViTPose has properties of simplicity, scalability, flexibility, and transferability
- ViTPose+ model is proposed and achieves SOTA performance on pose estimation datasets
The simplicity of vitpose
- Goal of paper is to provide a simple yet effective vision transformer baseline for body pose estimation tasks.
- Structure is kept as simple as possible, avoiding fancy but complex modules.
- Patch embedding layer is used to embed image into tokens.
- Two kinds of lightweight decoders are used to regress the heatmaps of keypoints.
The scalability of vitpose
- ViTPose has structural simplicity, allowing for easy control of model size.
- ViTPose can benefit from the development of scalable vision transformers without modification.
- ViTPose was tested using pre-trained vision transformers of different model sizes.
- Performance increases with increased model size.
The flexibility of vitpose
- Pre-training the backbone networks on ImageNet is a common practice, but it requires extra data
- Can the data requirements be relaxed by using only pose data during training?
- Pre-trained weights are used to initialize the backbone of ViTPose and fine-tuned on MS COCO
- Performance of ViTPose increases with higher input resolution or higher feature resolution
- Full attention on high-resolution feature maps causes a huge memory footprint and computational cost
- Window-based attention with relative position embedding is used to address this issue
- Shift window and pooling window techniques are used to improve performance and reduce memory footprint
- Pre-trained transformer models can generalize well to other tasks by only tuning partial parameters
- Explored potential of plain vision transformers in both top-down and bottom-up paradigms
The transferability of vitpose
- Knowledge distillation is a method to improve the performance of small models by transferring knowledge from larger ones.
- A token-based distillation method is proposed to bridge the large and small models.
- The knowledge token is initialized and appended to the teacher model, then frozen and concatenated with the student model.
- The student model is trained with a combination of token distillation loss and output distillation loss.
Vitpose+
- A foundation model for generic body keypoint detection should be able to deal with different body pose estimation tasks.
- There are differences between body keypoints in different pose estimation tasks.
- A naive solution is to train a ViTPose model by multi-task learning.
- There may be conflicts between different tasks.
- ViTPose+ is proposed to address the challenge from the perspective of knowledge factorization.
Experiments
Datasets and evaluation metrics
- Used MS COCO, AIC, MPII, COCO-W, AP-10K, APT-36K, Interhand2.6M, and OCHuman datasets
- MS COCO has 118K images and 150K human instances with up to 17 keypoints
- COCO-W has up to 133 keypoints for each instance
- MPII has 25K images and 40K human instances with up to 16 keypoints
- AIC has 200K images and 350 human instances with up to 14 keypoints
- AP-10K and APT-36K have 10K and 36K images with up to 17 keypoints
- Interhand2.6M has 2.6M labeled hand frames with 2D and 3D annotations
- OCHuman has 4K images and 8K instances with heavy occlusions
- Used AP and PCKh as evaluation metrics
Implementation details
- ViTPose is a top-down approach for human pose estimation
- ViTPose is evaluated on MS COCO Keypoint val set
- ViT-S, ViT-B, ViT-L, ViT-H and ViTAE-G are used as backbone networks
- MAE pretrained weights are used to initialize the backbones
- AdamW optimizer with a learning rate of 5e-4 is used
- Udp is used for post-processing
- Models are trained for 210 epochs with a learning rate decay at 170th and 200th epochs
Ablation studies of vitpose and analysis
- ViTPose performs well with a simple decoder and plain vision transformer
- ViTPose has excellent scalability
- Pre-training on data from downstream tasks has better data efficiency than ImageNet-1k
- ViTPose adapts well to different input resolutions
- Window-based attention alleviates out-of-memory issue but decreases performance
- Partially fine-tuning the MHSA has moderate performance drop, while partially fine-tuning the FFN has significant performance drop
- Token-based distillation brings gain of 0.2 AP with marginal cost of extra memory footprint
- Output distillation brings gain of 0.5 AP with moderate cost of extra memory footprint
- Combining token-based and output distillation brings gain of 0.6 AP
Ablation studies of vitpose+ and analysis
Different settings of vitpose+
- ViTPose is a simple and lightweight decoder that can be extended to deal with multiple types of body pose estimation tasks
- Training process involves MS COCO, AIC, MPII, COCO-W, AP-10K, and APT-36K datasets
- Results on MS COCO val set show performance increases from 75.8 AP to 77.1 AP when using human pose estimation datasets
- Using MPII for training brings 0.1 AP increase
- Performance remains the same when using COCO-W
- Performance drops when using animal datasets
- ViTPose+ has 3 variants: Independent FFN, Independent and Shared FFN, and Partially Shared FFN
- Results show flexibility and potential for building a foundation model towards generic body pose estimation
Comparison with sota methods
The performance on ms coco
- ViTPose and ViTPose+ compared to SOTA methods on MS COCO dataset
- ViTPose achieves better trade-off between throughput and accuracy
- ViTPose-L outperforms previous SOTA methods based on CNN and transformers
- ViTPose-L sets a new SOTA in bottom-up paradigm
The performance on other datasets
- ViTPose+ models are tested on 6 different datasets
- ViTPose+ outperforms previous methods on MPII val set
- ViTPose+ performs better than CNN-based and transformer-based models on AIC val set
- ViTPose+ performs better on COCO-W val set
- Weights of task-specific FFN parts have larger similarity for human pose estimation than for body pose estimation of different species
- Weights of FFN layers in deeper layers are more task-specific than those in shallower layers
Limitations and discussion
- ViTPose and ViTPose+ have good properties and performance for body pose estimation
- Room to improve further by designing a unified foundation model for simultaneous object detection and pose estimation
- Leverage multimodality data and language knowledge of body keypoints to help body pose estimation
Conclusion
- ViTPose and ViTPose+ proposed as simple baseline and foundation models for body pose estimation
- Properties of ViTPose include simplicity, scalability, flexibility, and transferability
- Extensive experiments conducted on MS COCO, AIC, MPII, OCHuman, COCO-W, AP-10K, and APT-36K
- New performance records set on these datasets
- ViTPose+ obtains SOTA performance on representative benchmarks
- ViTPose+ sets new SOTA on OCHuman, COCO-W, and APT-36K
- ViTPose+ has strong feature representation ability
- ViTPose+ has excellent scalability