Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Estimating 6D pose of objects is a major field in 3D computer vision
  • Research trends are heading towards category-level pose estimation
  • New dataset HouseCat6D features multi-modality, diverse objects, high-quality pose annotation, large scale scenes, and checkerboard-free environment
  • Benchmark results of state-of-the-art category-level pose estimation networks provided

Paper Content

Introduction

  • 6D pose estimation is important for computer vision tasks
  • Many methods have been proposed to solve this task
  • Most methods focus on instance-level, but generalization is limited
  • Recent methods focus on category-level, but lack of datasets
  • HouseCat6D is a new category-level dataset with 194 objects from 10 categories
  • Includes RGB, depth, and polarimetric images with 23.5k frames and approx. 160k annotated object poses
  • Uses an accurate external infrared tracking system and post-processing
  • Provides benchmark evaluation results for SOTA category-level baselines
  • Instance-level 6D pose estimation focuses on estimating the pose of a single object
  • Category-level 6D pose estimation focuses on estimating the pose of objects within the same class
  • Recent state-of-the-art methods are mostly data-driven approaches
  • Datasets are needed for training and evaluation

Instance-level 6d object pose dataset

  • Early 6D pose datasets started from single image
  • LineMOD and LM-Occlusion are popular datasets for level pose estimation
  • RGBD camera and checkerboards used to annotate pose of objects
  • HomebrewedDB has better quality object mesh and pose annotation, but lacks scene variety
  • Video sequences used to simplify pose estimation problem to pose tracking

Category-level object poses and dataset

  • Category-level pose estimation is used to address generalizability in 6D pose estimation over multiple objects of the same category.
  • NOCS is an approach and dataset for category-level 6D pose and size estimation.
  • It contains two datasets: a mixed reality dataset (CAMERA25) and a real RGB-D dataset (REAL275).
  • kPam is a dataset focusing on the robotic field and uses keypoints.
  • TOD and PhoCal focus on translucent or transparent and reflective objects.
  • Wild6D is annotated via tracking and uses multiple iPhones to capture RGB images, depth, and point cloud.

Dataset

  • 34 training scenes, 5 test scenes, 2 validation scenes
  • 194 objects from 10 household categories
  • Includes photometrically challenging objects
  • Multiple modalities: RGB images, polarimetric images, depth maps

Objects mesh acquisition

  • 10 household categories were chosen to represent typical household scenarios: bottle, box, can, cup, cutlery, glass, remote, shoe, teapot, tube.
  • High quality EinScan-SP 3D Scanner was used to scan all objects with a single shot accuracy of ≤ 0.05 mm.
  • Self-vanishing 3D scanning spray was used for photometrically challenging categories.
  • Meshes of all objects were provided as obj-file.

Hardware

  • Utilize external tracker system with four cameras for annotation
  • Evaluate accuracy of tracking setup with robotic setup
  • Average 0.67 mm/0.12° error in static case and 0.92 mm/0.16° error in dynamic tracking scenario
  • Use D435 as depth sensor over Time-of-Flight sensors for robust depth

Object pose annotation

  • Annotating the 6D pose of an object is essential for a 6D pose dataset.
  • Annotation pipeline from [50] is adopted and the robotic end-effector pose is replaced with an IR tracking body.
  • Tip calibration is done to ensure accuracy of annotation.

Camera trajectory annotation

  • Accurate camera trajectories are necessary for 6D pose dataset annotation
  • Hand-Eye-Calibration is used to obtain the transformation between the tracker marker body and the center of the camera image sensor
  • Timestamp calibration between the tracking system and the camera image acquisition time is necessary
  • Pose Refinement is used to minimize the reprojection error with multiview images
  • RGBD-based datasets suffer from the standard deviation of the sensor
  • Multi-view setups improve annotation quality
  • Checkerboard-based datasets provide more accurate annotations than RGBD-based datasets

Annotation quality evaluation

  • Reported point-wise RMSE between objects and camera center with and without consideration of 3 systematic errors
  • Upper bound includes object annotation error and static tracking error, no synchronization error
  • Lower bound includes all 3 systematic errors with dynamic tracking error as tracking system error
  • RMSE from 1.35 mm to 1.73 mm

Scene statistics

  • HouseCat6D contains 41 large-scale scenes with 194 objects in 10 categories.
  • 34 training scenes with 124 objects, 5 test scenes with 50 objects and 2 validation scenes with 20 objects.
  • 20k frames recorded for 34 training scenes, 3k frames for 5 test scenes and 1.4k frames for 2 validation scenes.
  • 10 unseen objects per scene with different categories.
  • Most diverse number of instances and categories compared to other category-level datasets.

Pose coverage

  • Figure 9 compares the baseline predictions of FS-Net, GPV-Pose and NOCS on the HouseCat6D test set.
  • The predictions usually focus on the upper hemisphere, even for large-scale datasets like SterOBJ-1M.
  • HouseCat6D provides dense and well-distributed poses.
  • The trajectories of mutual classes are compared against the PhoCal dataset, which provides accurate annotations but is limited in range of motion.

Benchmark and experiments

  • RGB-D approaches are commonly used for 6D pose estimation.
  • RGB provides information about the object class, but categories can have high intra-class variance.
  • NOCS and GPV-Pose use depth information and ICP for 3D prediction.
  • FS-Net extracts 3D point cloud from depth image and estimates object size and translation.

Evaluation pipeline & results

  • 34 training sequences, 2 validation sequences, and 5 test sequences used for baseline results
  • Results reported using intersection over union (IoU) with thresholds of 25% and 50%
  • NOCS [49] results in 37.8% mAP for 3D IoU at 25%
  • GPV-Pose [10] gives 68.3% mAP for 3D IoU at 25%
  • FS-Net [8] achieves best average results with 69.4% mAP for 3D IoU at 25%
  • Dataset contains cluttered scenes with occluded object parts and objects in close proximity to each other

Discussion and conclusion

  • HouseCat6D is a large-scale 6D pose dataset acquired with a custom multi-modal camera rig and external tracking system.
  • HouseCat6D provides realistic scenes without markers and objects with photometrically challenging materials.
  • HouseCat6D has drawbacks from the limitation of the tracking system, including manual annotation and indoor-only scenes.

Object meshes and orientation

  • HouseCat6D dataset contains 194 objects from 10 categories
  • Glass objects aligned with y axis and symmetry axis
  • Bottle objects not fully symmetric, x axis perpendicular to wider side
  • Can objects not fully symmetric, x axis perpendicular to wider side
  • Tube objects partially symmetric, x axis perpendicular to wider side
  • Teapot objects aligned with y axis and x axis from handle to tip
  • Cup objects aligned with y axis and x axis from handle to other side
  • Shoe objects aligned with y axis and x axis from handle to other side
  • Remote objects aligned with y axis and x axis from handle to other side
  • Cutlery objects aligned with y axis and x axis from handle to other side
  • Box objects oriented by length of sides, y,x,z for first, second, third longest side

External tracking system evaluation

  • Evaluated IR-based external tracking system ARTTRACK2 using robotic arm
  • Used KUKA LBR iiwa 7 R800 robotic arm to evaluate tracking system quality
  • Co-calibrated robot and tracking system to share common reference frame
  • Ran example trajectory to calculate difference between robot and tracking system for error evaluation

Robot-tracker co-calibration

  • Co-calibrate robot and tracking system
  • Acquire trajectory from two different coordinate bases
  • Extract static transformation between the two trajectories

Trajectory error evaluation

  • Co-calibration is used to keep the tracking body on the robotic EE
  • Evaluation trajectory replicates a scene
  • Trajectory is repeated twice, once with robot stopping and once without
  • First trajectory is used to evaluate accuracy in static case
  • Second trajectory is used to evaluate accuracy in dynamic case
  • Error of tracking system is calculated as pose difference between robot and tracking system
  • Error measured is 0.67 mm/0.12° in static case and 0.92 mm/0.16° in dynamic case

Ablation study

  • Trained FS-Net on data of HouseCat6D dataset
  • Table 1 summarizes results from reduced dataset against full dataset
  • Objects with symmetry in shape suffer less from reduced pose coverage
  • Objects with non-symmetric features show significant drop in accuracy
  • Dataset comprises two main modalities: Polarimetric RGB image and active stereo depth
  • Hand-Eye-Calibration to calibrate camera center of tracking body
  • Post-processing step to reduce synchronization-induced trajectory error
  • Camera Synchronisation Effect produces offset in dynamic scene
  • Post-Processing via Bundle Adjustment reduces offset error
  • Pose Distribution for category-level datasets compared
  • Pose Distribution per Category compared
  • Object Meshes from Symmetric and Partially Symmetric Shape Categories
  • Object Meshes from (Partially) Symmetric Objects With a Handle
  • Object Meshes from Flat Shape Categories
  • Object Meshes for Box category
  • Accuracy Comparison against existing Datasets