Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Estimating 6D pose of objects is a major field in 3D computer vision
Research trends are heading towards category-level pose estimation
New dataset HouseCat6D features multi-modality, diverse objects, high-quality pose annotation, large scale scenes, and checkerboard-free environment
Benchmark results of state-of-the-art category-level pose estimation networks provided

Paper Content

Introduction

6D pose estimation is important for computer vision tasks
Many methods have been proposed to solve this task
Most methods focus on instance-level, but generalization is limited
Recent methods focus on category-level, but lack of datasets
HouseCat6D is a new category-level dataset with 194 objects from 10 categories
Includes RGB, depth, and polarimetric images with 23.5k frames and approx. 160k annotated object poses
Uses an accurate external infrared tracking system and post-processing
Provides benchmark evaluation results for SOTA category-level baselines

Instance-level 6D pose estimation focuses on estimating the pose of a single object
Category-level 6D pose estimation focuses on estimating the pose of objects within the same class
Recent state-of-the-art methods are mostly data-driven approaches
Datasets are needed for training and evaluation

Instance-level 6d object pose dataset

Early 6D pose datasets started from single image
LineMOD and LM-Occlusion are popular datasets for level pose estimation
RGBD camera and checkerboards used to annotate pose of objects
HomebrewedDB has better quality object mesh and pose annotation, but lacks scene variety
Video sequences used to simplify pose estimation problem to pose tracking

Category-level object poses and dataset

Category-level pose estimation is used to address generalizability in 6D pose estimation over multiple objects of the same category.
NOCS is an approach and dataset for category-level 6D pose and size estimation.
It contains two datasets: a mixed reality dataset (CAMERA25) and a real RGB-D dataset (REAL275).
kPam is a dataset focusing on the robotic field and uses keypoints.
TOD and PhoCal focus on translucent or transparent and reflective objects.
Wild6D is annotated via tracking and uses multiple iPhones to capture RGB images, depth, and point cloud.

Dataset

34 training scenes, 5 test scenes, 2 validation scenes
194 objects from 10 household categories
Includes photometrically challenging objects
Multiple modalities: RGB images, polarimetric images, depth maps

Objects mesh acquisition

10 household categories were chosen to represent typical household scenarios: bottle, box, can, cup, cutlery, glass, remote, shoe, teapot, tube.
High quality EinScan-SP 3D Scanner was used to scan all objects with a single shot accuracy of ≤ 0.05 mm.
Self-vanishing 3D scanning spray was used for photometrically challenging categories.
Meshes of all objects were provided as obj-file.

Hardware

Utilize external tracker system with four cameras for annotation
Evaluate accuracy of tracking setup with robotic setup
Average 0.67 mm/0.12° error in static case and 0.92 mm/0.16° error in dynamic tracking scenario
Use D435 as depth sensor over Time-of-Flight sensors for robust depth

Object pose annotation

Annotating the 6D pose of an object is essential for a 6D pose dataset.
Annotation pipeline from [50] is adopted and the robotic end-effector pose is replaced with an IR tracking body.
Tip calibration is done to ensure accuracy of annotation.

Camera trajectory annotation

Accurate camera trajectories are necessary for 6D pose dataset annotation
Hand-Eye-Calibration is used to obtain the transformation between the tracker marker body and the center of the camera image sensor
Timestamp calibration between the tracking system and the camera image acquisition time is necessary
Pose Refinement is used to minimize the reprojection error with multiview images
RGBD-based datasets suffer from the standard deviation of the sensor
Multi-view setups improve annotation quality
Checkerboard-based datasets provide more accurate annotations than RGBD-based datasets

Annotation quality evaluation

Reported point-wise RMSE between objects and camera center with and without consideration of 3 systematic errors
Upper bound includes object annotation error and static tracking error, no synchronization error
Lower bound includes all 3 systematic errors with dynamic tracking error as tracking system error
RMSE from 1.35 mm to 1.73 mm

Scene statistics

HouseCat6D contains 41 large-scale scenes with 194 objects in 10 categories.
34 training scenes with 124 objects, 5 test scenes with 50 objects and 2 validation scenes with 20 objects.
20k frames recorded for 34 training scenes, 3k frames for 5 test scenes and 1.4k frames for 2 validation scenes.
10 unseen objects per scene with different categories.
Most diverse number of instances and categories compared to other category-level datasets.

Pose coverage

Figure 9 compares the baseline predictions of FS-Net, GPV-Pose and NOCS on the HouseCat6D test set.
The predictions usually focus on the upper hemisphere, even for large-scale datasets like SterOBJ-1M.
HouseCat6D provides dense and well-distributed poses.
The trajectories of mutual classes are compared against the PhoCal dataset, which provides accurate annotations but is limited in range of motion.

Benchmark and experiments

RGB-D approaches are commonly used for 6D pose estimation.
RGB provides information about the object class, but categories can have high intra-class variance.
NOCS and GPV-Pose use depth information and ICP for 3D prediction.
FS-Net extracts 3D point cloud from depth image and estimates object size and translation.

Evaluation pipeline & results

34 training sequences, 2 validation sequences, and 5 test sequences used for baseline results
Results reported using intersection over union (IoU) with thresholds of 25% and 50%
NOCS [49] results in 37.8% mAP for 3D IoU at 25%
GPV-Pose [10] gives 68.3% mAP for 3D IoU at 25%
FS-Net [8] achieves best average results with 69.4% mAP for 3D IoU at 25%
Dataset contains cluttered scenes with occluded object parts and objects in close proximity to each other

Discussion and conclusion

HouseCat6D is a large-scale 6D pose dataset acquired with a custom multi-modal camera rig and external tracking system.
HouseCat6D provides realistic scenes without markers and objects with photometrically challenging materials.
HouseCat6D has drawbacks from the limitation of the tracking system, including manual annotation and indoor-only scenes.

Object meshes and orientation

HouseCat6D dataset contains 194 objects from 10 categories
Glass objects aligned with y axis and symmetry axis
Bottle objects not fully symmetric, x axis perpendicular to wider side
Can objects not fully symmetric, x axis perpendicular to wider side
Tube objects partially symmetric, x axis perpendicular to wider side
Teapot objects aligned with y axis and x axis from handle to tip
Cup objects aligned with y axis and x axis from handle to other side
Shoe objects aligned with y axis and x axis from handle to other side
Remote objects aligned with y axis and x axis from handle to other side
Cutlery objects aligned with y axis and x axis from handle to other side
Box objects oriented by length of sides, y,x,z for first, second, third longest side

External tracking system evaluation

Evaluated IR-based external tracking system ARTTRACK2 using robotic arm
Used KUKA LBR iiwa 7 R800 robotic arm to evaluate tracking system quality
Co-calibrated robot and tracking system to share common reference frame
Ran example trajectory to calculate difference between robot and tracking system for error evaluation

Robot-tracker co-calibration

Co-calibrate robot and tracking system
Acquire trajectory from two different coordinate bases
Extract static transformation between the two trajectories

Trajectory error evaluation

Co-calibration is used to keep the tracking body on the robotic EE
Evaluation trajectory replicates a scene
Trajectory is repeated twice, once with robot stopping and once without
First trajectory is used to evaluate accuracy in static case
Second trajectory is used to evaluate accuracy in dynamic case
Error of tracking system is calculated as pose difference between robot and tracking system
Error measured is 0.67 mm/0.12° in static case and 0.92 mm/0.16° in dynamic case

Ablation study

Trained FS-Net on data of HouseCat6D dataset
Table 1 summarizes results from reduced dataset against full dataset
Objects with symmetry in shape suffer less from reduced pose coverage
Objects with non-symmetric features show significant drop in accuracy
Dataset comprises two main modalities: Polarimetric RGB image and active stereo depth
Hand-Eye-Calibration to calibrate camera center of tracking body
Post-processing step to reduce synchronization-induced trajectory error
Camera Synchronisation Effect produces offset in dynamic scene
Post-Processing via Bundle Adjustment reduces offset error
Pose Distribution for category-level datasets compared
Pose Distribution per Category compared
Object Meshes from Symmetric and Partially Symmetric Shape Categories
Object Meshes from (Partially) Symmetric Objects With a Handle
Object Meshes from Flat Shape Categories
Object Meshes for Box category
Accuracy Comparison against existing Datasets

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Instance-level 6d object pose dataset#

Category-level object poses and dataset#

Dataset#

Objects mesh acquisition#

Hardware#

Object pose annotation#

Camera trajectory annotation#

Annotation quality evaluation#

Scene statistics#

Pose coverage#

Benchmark and experiments#

Evaluation pipeline & results#

Discussion and conclusion#

Object meshes and orientation#

External tracking system evaluation#

Robot-tracker co-calibration#

Trajectory error evaluation#

Ablation study#

Link to paper

Abstract

Paper Content

Introduction

Related work

Instance-level 6d object pose dataset

Category-level object poses and dataset

Dataset

Objects mesh acquisition

Hardware

Object pose annotation

Camera trajectory annotation

Annotation quality evaluation

Scene statistics

Pose coverage

Benchmark and experiments

Evaluation pipeline & results

Discussion and conclusion

Object meshes and orientation

External tracking system evaluation

Robot-tracker co-calibration

Trajectory error evaluation

Ablation study