Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Estimating 6D pose of objects is a major field in 3D computer vision
- Research trends are heading towards category-level pose estimation
- New dataset HouseCat6D features multi-modality, diverse objects, high-quality pose annotation, large scale scenes, and checkerboard-free environment
- Benchmark results of state-of-the-art category-level pose estimation networks provided
Paper Content
Introduction
- 6D pose estimation is important for computer vision tasks
- Many methods have been proposed to solve this task
- Most methods focus on instance-level pose estimation, but their generalization is limited
- Recent methods focus on category-level pose estimation, but suitable datasets are lacking
- HouseCat6D is a new category-level dataset with 194 objects from 10 categories
- Includes RGB, depth, and polarimetric images with 23.5k frames and approx. 160k annotated object poses
- Uses an accurate external infrared tracking system and post-processing
- Provides benchmark evaluation results for SOTA category-level baselines
Related work
- Instance-level 6D pose estimation focuses on estimating the pose of a single object
- Category-level 6D pose estimation focuses on estimating the pose of objects within the same class
- Recent state-of-the-art methods are mostly data-driven approaches
- Datasets are needed for training and evaluation
Instance-level 6D object pose dataset
- Early 6D pose datasets started from single images
- LineMOD and LM-Occlusion are popular datasets for instance-level pose estimation
- RGBD camera and checkerboards used to annotate pose of objects
- HomebrewedDB has better quality object mesh and pose annotation, but lacks scene variety
- Video sequences used to simplify pose estimation problem to pose tracking
Category-level object poses and dataset
- Category-level pose estimation is used to address generalizability in 6D pose estimation over multiple objects of the same category.
- NOCS is an approach and dataset for category-level 6D pose and size estimation.
- It contains two datasets: a mixed reality dataset (CAMERA25) and a real RGB-D dataset (REAL275).
- kPAM is a dataset focusing on the robotic field and uses keypoints.
- TOD and PhoCal focus on translucent or transparent and reflective objects.
- Wild6D is annotated via tracking and uses multiple iPhones to capture RGB images, depth, and point cloud.
Dataset
- 34 training scenes, 5 test scenes, 2 validation scenes
- 194 objects from 10 household categories
- Includes photometrically challenging objects
- Multiple modalities: RGB images, polarimetric images, depth maps
Objects mesh acquisition
- 10 household categories were chosen to represent typical household scenarios: bottle, box, can, cup, cutlery, glass, remote, shoe, teapot, tube.
- High quality EinScan-SP 3D Scanner was used to scan all objects with a single shot accuracy of ≤ 0.05 mm.
- Self-vanishing 3D scanning spray was used for photometrically challenging categories.
- Meshes of all objects are provided as .obj files.
Hardware
- Utilize external tracker system with four cameras for annotation
- Evaluate accuracy of tracking setup with robotic setup
- Average 0.67 mm/0.12° error in static case and 0.92 mm/0.16° error in dynamic tracking scenario
- Use D435 as depth sensor over Time-of-Flight sensors for robust depth
Object pose annotation
- Annotating the 6D pose of an object is essential for a 6D pose dataset.
- Annotation pipeline from [50] is adopted and the robotic end-effector pose is replaced with an IR tracking body.
- Tip calibration is done to ensure accuracy of annotation.
Camera trajectory annotation
- Accurate camera trajectories are necessary for 6D pose dataset annotation
- Hand-Eye-Calibration is used to obtain the transformation between the tracker marker body and the center of the camera image sensor
- Timestamp calibration between the tracking system and the camera image acquisition time is necessary
- Pose Refinement is used to minimize the reprojection error with multiview images
- RGBD-based datasets suffer from the standard deviation of the sensor
- Multi-view setups improve annotation quality
- Checkerboard-based datasets provide more accurate annotations than RGBD-based datasets
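The role of the Hand-Eye-Calibration result in the list above can be sketched as a chain of rigid transforms: the tracker reports the marker body pose, and the fixed marker-to-camera transform turns it into a camera pose. This is a minimal sketch, not the authors' pipeline. The function names and all numeric values are assumptions chosen for illustration, and timestamp calibration and pose refinement are not modelled.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def camera_pose_in_world(t_world_marker: Matrix, t_marker_camera: Matrix) -> Matrix:
    """Chain the tracked marker pose with the fixed hand-eye transform."""
    return matmul4(t_world_marker, t_marker_camera)


# Illustrative values only (not from the paper): marker body rotated 90 degrees
# about z and located at (1.0, 2.0, 0.0) m in the tracker frame.
t_world_marker = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical hand-eye result: camera sits 0.1 m along the marker's x axis.
t_marker_camera = [
    [1.0, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

t_world_camera = camera_pose_in_world(t_world_marker, t_marker_camera)
camera_position = [row[3] for row in t_world_camera[:3]]
print(camera_position)
```

Because the marker is rotated, the 0.1 m offset along the marker's x axis shows up along the tracker's y axis, which is why an accurate hand-eye transform matters for every frame of the trajectory.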
Annotation quality evaluation
- Reported point-wise RMSE between objects and camera center with and without consideration of 3 systematic errors
- Upper bound includes object annotation error and static tracking error, no synchronization error
- Lower bound includes all 3 systematic errors with dynamic tracking error as tracking system error
- RMSE from 1.35 mm to 1.73 mm
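A point-wise RMSE of the kind reported above can be computed as the root of the mean squared Euclidean distance between corresponding points. The sketch below assumes two equally long lists of corresponding 3D points in millimetres. The function name and the sample points are hypothetical and do not reproduce the paper's measurements.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def pointwise_rmse(reference: Sequence[Point], measured: Sequence[Point]) -> float:
    """Root mean squared Euclidean distance between corresponding points."""
    if len(reference) != len(measured) or not reference:
        raise ValueError("point sets must be non-empty and of equal length")
    squared_distances = [
        sum((r - m) ** 2 for r, m in zip(p, q))
        for p, q in zip(reference, measured)
    ]
    return math.sqrt(sum(squared_distances) / len(squared_distances))


# Made-up points in millimetres: the two offsets are 1 mm and 2 mm.
reference_points = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
measured_points = [(0.0, 0.0, 1.0), (100.0, 2.0, 0.0)]
rmse_mm = pointwise_rmse(reference_points, measured_points)
print(f"RMSE: {rmse_mm:.3f} mm")
```

Squaring before averaging makes the RMSE weight the larger 2 mm offset more heavily than a plain mean distance would.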
Scene statistics
- HouseCat6D contains 41 large-scale scenes with 194 objects in 10 categories.
- 34 training scenes with 124 objects, 5 test scenes with 50 objects and 2 validation scenes with 20 objects.
- 20k frames recorded for 34 training scenes, 3k frames for 5 test scenes and 1.4k frames for 2 validation scenes.
- 10 unseen objects per scene with different categories.
- Most diverse number of instances and categories compared to other category-level datasets.
Pose coverage
- Figure 9 compares the baseline predictions of FS-Net, GPV-Pose and NOCS on the HouseCat6D test set.
- Pose distributions usually concentrate on the upper hemisphere, even for large-scale datasets like StereOBJ-1M.
- HouseCat6D provides dense and well-distributed poses.
- The trajectories of mutual classes are compared against the PhoCal dataset, which provides accurate annotations but is limited in range of motion.
Benchmark and experiments
- RGB-D approaches are commonly used for 6D pose estimation.
- RGB provides information about the object class, but categories can have high intra-class variance.
- NOCS and GPV-Pose use depth information and ICP for 3D prediction.
- FS-Net extracts 3D point cloud from depth image and estimates object size and translation.
Evaluation pipeline & results
- 34 training sequences, 2 validation sequences, and 5 test sequences used for baseline results
- Results reported using intersection over union (IoU) with thresholds of 25% and 50%
- NOCS [49] results in 37.8% mAP for 3D IoU at 25%
- GPV-Pose [10] gives 68.3% mAP for 3D IoU at 25%
- FS-Net [8] achieves best average results with 69.4% mAP for 3D IoU at 25%
- Dataset contains cluttered scenes with occluded object parts and objects in close proximity to each other
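The IoU thresholds above can be illustrated with a simplified 3D IoU. This sketch assumes axis-aligned boxes given as (min corner, max corner), whereas benchmark implementations typically compare oriented boxes derived from the predicted pose and size. The function names and box values are hypothetical.

```python
from typing import Tuple

Corner = Tuple[float, float, float]
Box = Tuple[Corner, Corner]  # (min corner, max corner), axis-aligned


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    lo, hi = box
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, hi[axis] - lo[axis])
    return volume


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        overlap = min(a[1][axis], b[1][axis]) - max(a[0][axis], b[0][axis])
        if overlap <= 0.0:
            return 0.0  # boxes do not overlap along this axis
        intersection *= overlap
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


# Made-up boxes: the prediction is shifted by half a box width along x.
ground_truth = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
prediction = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
iou = iou_3d(ground_truth, prediction)
print(f"IoU = {iou:.3f}, hit@25% = {iou >= 0.25}, hit@50% = {iou >= 0.5}")
```

In this example the prediction counts as correct at the 25% threshold but not at 50%, which shows why the two thresholds can give quite different mAP values for the same detections.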
Discussion and conclusion
- HouseCat6D is a large-scale 6D pose dataset acquired with a custom multi-modal camera rig and external tracking system.
- HouseCat6D provides realistic scenes without markers and objects with photometrically challenging materials.
- HouseCat6D has drawbacks from the limitation of the tracking system, including manual annotation and indoor-only scenes.
Object meshes and orientation
- HouseCat6D dataset contains 194 objects from 10 categories
- Glass objects: y axis aligned with the symmetry axis
- Bottle objects not fully symmetric, x axis perpendicular to wider side
- Can objects not fully symmetric, x axis perpendicular to wider side
- Tube objects partially symmetric, x axis perpendicular to wider side
- Teapot objects aligned with y axis and x axis from handle to tip
- Cup objects aligned with y axis and x axis from handle to other side
- Shoe objects aligned with y axis and x axis from handle to other side
- Remote objects aligned with y axis and x axis from handle to other side
- Cutlery objects aligned with y axis and x axis from handle to other side
- Box objects oriented by side length: y, x, z axes for the first, second, and third longest side
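The box convention above amounts to sorting the three side lengths and assigning the canonical axes in the order y, x, z. The following is a minimal sketch of that rule. The function name and the input format (a mapping from arbitrary source-axis labels to side lengths) are assumptions made for illustration.

```python
from typing import Dict


def assign_box_axes(side_lengths: Dict[str, float]) -> Dict[str, str]:
    """Map canonical axes (y, x, z) to source axes ordered by decreasing length."""
    if len(side_lengths) != 3:
        raise ValueError("a box needs exactly three side lengths")
    ordered = sorted(side_lengths, key=lambda name: side_lengths[name], reverse=True)
    # Longest side -> y, second longest -> x, shortest -> z.
    return dict(zip(("y", "x", "z"), ordered))


# Made-up side lengths in metres along three arbitrary scanner axes.
axes = assign_box_axes({"a": 0.05, "b": 0.20, "c": 0.10})
print(axes)
```

Sides of equal length keep their input order here, so the assignment for cube-like boxes is arbitrary in this sketch.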
External tracking system evaluation
- Evaluated IR-based external tracking system ARTTRACK2 using robotic arm
- Used KUKA LBR iiwa 7 R800 robotic arm to evaluate tracking system quality
- Co-calibrated robot and tracking system to share common reference frame
- Ran example trajectory to calculate difference between robot and tracking system for error evaluation
Robot-tracker co-calibration
- Co-calibrate robot and tracking system
- Acquire trajectory from two different coordinate bases
- Extract static transformation between the two trajectories
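The idea of extracting a static transformation from two trajectories can be sketched with paired poses of the same rigid body seen from two coordinate bases. This sketch makes a strong simplifying assumption that is not taken from the paper: the tracked body frame coincides with the robot end-effector frame, so each pose pair yields the base-to-base transform directly. A real co-calibration also has to solve for the unknown body-to-end-effector offset and average over noisy samples. All names and values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform: rotation R^T, translation -R^T p."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def make_pose(yaw_deg: float, x: float, y: float, z: float) -> Matrix:
    """Pose with a rotation about z (enough for this illustration)."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    return [[c, -s, 0.0, x], [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def static_transform(pose_in_a: Matrix, pose_in_b: Matrix) -> Matrix:
    """Return T_a_b such that pose_in_a = T_a_b @ pose_in_b."""
    return matmul4(pose_in_a, invert_rigid(pose_in_b))


# Synthetic, noise-free data: a known robot-from-tracker transform is applied
# to two tracker poses, then recovered from each pose pair.
t_robot_tracker_true = make_pose(90.0, 0.5, -0.2, 1.0)
tracker_poses = [make_pose(10.0, 0.1, 0.2, 0.3), make_pose(-35.0, 0.4, 0.0, 0.6)]
robot_poses = [matmul4(t_robot_tracker_true, p) for p in tracker_poses]
estimates = [static_transform(r, t) for r, t in zip(robot_poses, tracker_poses)]
```

With noise-free input every pose pair returns the same transform. With real measurements the per-pair estimates differ slightly and would need to be combined, for example by a least-squares fit.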
Trajectory error evaluation
- After co-calibration, the tracking body is kept on the robotic end-effector (EE)
- Evaluation trajectory replicates a scene
- Trajectory is repeated twice, once with robot stopping and once without
- First trajectory is used to evaluate accuracy in static case
- Second trajectory is used to evaluate accuracy in dynamic case
- Error of tracking system is calculated as pose difference between robot and tracking system
- Error measured is 0.67 mm/0.12° in static case and 0.92 mm/0.16° in dynamic case
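A pose difference of the form "mm/°" can be computed as the Euclidean distance between the two translations plus the angle of the relative rotation. This is a sketch of one common way to do it, not necessarily the exact metric used in the paper. The function names and input values are illustrative and are not the reported measurements.

```python
import math
from typing import List, Sequence

Rotation = List[List[float]]  # 3x3 rotation matrix, row-major


def translation_error(t_a: Sequence[float], t_b: Sequence[float]) -> float:
    """Euclidean distance between two translations (same unit as the input)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_a, t_b)))


def rotation_error_deg(r_a: Rotation, r_b: Rotation) -> float:
    """Angle of the relative rotation R_a^T R_b, in degrees."""
    # trace(R_a^T R_b) equals the element-wise product sum of the two matrices.
    trace = sum(r_a[k][i] * r_b[k][i] for i in range(3) for k in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def rot_z(angle_deg: float) -> Rotation:
    """Rotation about the z axis."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Made-up poses: the tracker pose is offset by 0.5 mm and rotated by 0.5 degrees
# relative to the robot pose.
robot_t, robot_r = (0.0, 0.0, 0.0), rot_z(0.0)
tracker_t, tracker_r = (0.3, 0.4, 0.0), rot_z(0.5)

trans_err_mm = translation_error(robot_t, tracker_t)
rot_err_deg = rotation_error_deg(robot_r, tracker_r)
print(f"{trans_err_mm:.2f} mm / {rot_err_deg:.2f} deg")
```

Averaging these two numbers over all samples of the stop-and-go run and of the continuous run would give one static and one dynamic error figure of the kind listed above.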
Ablation study
- Trained FS-Net on data from the HouseCat6D dataset
- Table 1 summarizes results from reduced dataset against full dataset
- Objects with symmetry in shape suffer less from reduced pose coverage
- Objects with non-symmetric features show significant drop in accuracy
Figure and table captions
- Dataset comprises two main modalities: Polarimetric RGB image and active stereo depth
- Hand-Eye-Calibration to calibrate camera center of tracking body
- Post-processing step to reduce synchronization-induced trajectory error
- Camera Synchronisation Effect produces offset in dynamic scene
- Post-Processing via Bundle Adjustment reduces offset error
- Pose Distribution for category-level datasets compared
- Pose Distribution per Category compared
- Object Meshes from Symmetric and Partially Symmetric Shape Categories
- Object Meshes from (Partially) Symmetric Objects With a Handle
- Object Meshes from Flat Shape Categories
- Object Meshes for Box category
- Accuracy Comparison against existing Datasets