Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed a novel voxel-based 3D single object tracking (3D SOT) method called Voxel Pseudo Image Tracking (VPIT).
- VPIT uses voxel pseudo images as an input to a 2D-like Siamese SOT method.
- VPIT uses Bird’s-eye View (BEV) coordinates, so only object rotation can change in the new coordinate system.
- VPIT is the fastest 3D SOT method and maintains competitive Success and Precision values.
- VPIT maintains its ability to track the object on embedded devices.
Paper Content
I. introduction
- Robotics usage is increasing in real-world scenarios
- Need to develop methods for understanding 3D world to allow robots to interact with objects
- Efficient methods for 3D object detection, tracking, and active perception needed
- Variety of sensors used for 3D perception tasks, ranging from inexpensive to costly
- Lidar is a well-adopted choice for 3D perception methods
- Point cloud data contains more information than images and is more robust to changes in weather and lightning conditions
- 3D object detection and tracking methods using point clouds provide best combination of accuracy and inference speed
- Lidars usually operate at 10-20 FPS
- Perception methods usually paired with another method that makes use of the perception results
- 3D SOT focuses on tracking a single object of interest with a given initial frame position
- Methods proposed for 3D SOT do not take into consideration limited computational capabilities of embedded computing devices
- Proposed method for 3D SOT receives voxel pseudo images as an input
- Multi-rotation search used to predict both position and rotation of the object
- Evaluation results provided on high-end and embedded devices with consideration of real-time requirements
Ii. related works
- SOT in 3D is an extension of 2D SOT
- Input data is different for 2D and 3D SOT
- 3D SOT commonly uses point cloud data or a combination of point clouds and camera images
- SC3D uses a Siamese approach to compare target and search encodings
- P2B uses a point-wise network to create a similarity map
- BAT uses Box Cloud representations as point features
- 3D-SiamRPN uses a Siamese point-wise network to create features
- F-siamese tracker fuses RGB and point cloud information
- PTT creates a transformer module for point-based SOT methods
- 3D Siam-2D uses two Siamese networks
- VPIT is the first tracker to use a siamese architecture on point cloud pseudo images
Iii. proposed method
- Modified PointPillars architecture for 3D single object tracking
- Proposed modifications to shift from detection to tracking
- Training and inference processes described
- Implementation publicly available in OpenDR toolkit
A. model architecture
- Point cloud data is irregular and cannot be processed with 2D convolutional neural networks directly.
- Methods structure the data using techniques such as voxelization or use neural networks that work well on unordered data.
- Existing datasets for 3D object detection and tracking consider only a single rotation angle around the vertical axis.
- Siamese models use an identical transformation to both inputs and combine them with a similarity measure function.
- Siamese tracking methods select an embedding function and a similarity measure function.
- The output of the model is a similarity score map.
- Tracking is performed by initializing the target region and corresponding search region.
- The predicted output for the frame is computed using a Feature Generation Network.
B. training
- Identical functions κ c (•) and σ(•) are used
C. inference
- Inference is split into two parts: initialization and tracking
- Initial target and its features are given by a detection method, user, or dataset
- Multiple search regions with identical size and different rotations are created to allow for rotation changes
- Target features are compared with all search features and the one with the highest score is selected
- Score map is upscaled and a penalty technique is used for scores far from the center
- Maximum score position is translated back from score coordinates to image coordinates
- Search region is centered on a prediction, assuming the object’s position on the next frame is close to its position on the previous frame
- Linear position extrapolation is used for the search region
- Penalty map is formed using either Hann window or a 2D Gaussian function
- Target features are created at the first frame and used throughout the tracking sequence
- Target feature merge scale is used to balance between initial target features and latest frame’s target features
- Offset interpolation parameter is used to improve prediction stability
Iv. experiments
- KITTI Tracking training dataset split is used to train and test the model
- Tracks 0-18 are used for training and validation, tracks 19-20 for testing
- Precision and Success metrics are used as defined in One Pass Evaluation
- Model has 1 feature block with 4 layers
- Model is trained for 64,000 steps with BCE loss, 1 * 10-5 learning rate and 2 positive label radius
- During inference, rotations count of 3, rotation step of 0.15 and rotation penalty of 0.98 are used
A. comparison with state-of-the-art
- VPIT is the fastest method and achieves competitive Precision and Success
- VPIT outperforms P2B by 67% on 1080Ti GPU, 99% on 2080 GPU and 86% on 2080Ti GPU with 104-core CPU
- For embedded devices, VPIT outperforms P2B by 135% on TX2 and by 98% on Xavier
B. real-time evaluation
- Application of 3D tracking methods is usually done for robotic systems that do not use high-end GPUs.
- Computations are applied on embedded devices, such as TX2 or Xavier.
- A predictive real-time benchmark is implemented.
- A Kalman Filter is used to predict the position of the object at a frame.
- 3D SOT methods are evaluated on embedded devices with both non-predictive and predictive benchmarks.
- VPIT outperforms P2B and PTT in Success, Precision, FPS and Frame drop.
C. ablation study
- Experiments were conducted to measure the effect of hyperparameters on Success and Precision metrics.
- Decreasing the number of feature blocks from the Feature Generation Network leads to better Success values.
- Target and search upscaling with ι t = (127, 127) and ι s = (255, 255) reduces training time but does not work well for voxel pseudo images.
- Positive context leads to better Success with the maximum at c = 0.26.
- Linear position extrapolation leads to a 30% increase in Success compared to a conventional search region placement.
V. conclusions
- Proposed a novel method for 3D SOT called VPIT
- Used structured data instead of point-based approaches
- Used PointPillars’ pseudo images as a search space
- Applied a Siamese 2D-like approach to find position and rotation of object
- VPIT is the fastest method and achieves competitive Precision and Success values
- Implemented a real-time evaluation protocol for 3D single object tracking
- VPIT is suited for embedded devices and has low latency of predictions