Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed a novel voxel-based 3D single object tracking (3D SOT) method called Voxel Pseudo Image Tracking (VPIT).
VPIT uses voxel pseudo images as an input to a 2D-like Siamese SOT method.
VPIT uses Bird’s-eye View (BEV) coordinates, so only object rotation can change in the new coordinate system.
VPIT is the fastest 3D SOT method and maintains competitive Success and Precision values.
VPIT maintains its ability to track the object on embedded devices.

Paper Content

I. introduction

Robotics usage is increasing in real-world scenarios
Need to develop methods for understanding 3D world to allow robots to interact with objects
Efficient methods for 3D object detection, tracking, and active perception needed
Variety of sensors used for 3D perception tasks, ranging from inexpensive to costly
Lidar is a well-adopted choice for 3D perception methods
Point cloud data contains more information than images and is more robust to changes in weather and lightning conditions
3D object detection and tracking methods using point clouds provide best combination of accuracy and inference speed
Lidars usually operate at 10-20 FPS
Perception methods usually paired with another method that makes use of the perception results
3D SOT focuses on tracking a single object of interest with a given initial frame position
Methods proposed for 3D SOT do not take into consideration limited computational capabilities of embedded computing devices
Proposed method for 3D SOT receives voxel pseudo images as an input
Multi-rotation search used to predict both position and rotation of the object
Evaluation results provided on high-end and embedded devices with consideration of real-time requirements

SOT in 3D is an extension of 2D SOT
Input data is different for 2D and 3D SOT
3D SOT commonly uses point cloud data or a combination of point clouds and camera images
SC3D uses a Siamese approach to compare target and search encodings
P2B uses a point-wise network to create a similarity map
BAT uses Box Cloud representations as point features
3D-SiamRPN uses a Siamese point-wise network to create features
F-siamese tracker fuses RGB and point cloud information
PTT creates a transformer module for point-based SOT methods
3D Siam-2D uses two Siamese networks
VPIT is the first tracker to use a siamese architecture on point cloud pseudo images

Iii. proposed method

Modified PointPillars architecture for 3D single object tracking
Proposed modifications to shift from detection to tracking
Training and inference processes described
Implementation publicly available in OpenDR toolkit

A. model architecture

Point cloud data is irregular and cannot be processed with 2D convolutional neural networks directly.
Methods structure the data using techniques such as voxelization or use neural networks that work well on unordered data.
Existing datasets for 3D object detection and tracking consider only a single rotation angle around the vertical axis.
Siamese models use an identical transformation to both inputs and combine them with a similarity measure function.
Siamese tracking methods select an embedding function and a similarity measure function.
The output of the model is a similarity score map.
Tracking is performed by initializing the target region and corresponding search region.
The predicted output for the frame is computed using a Feature Generation Network.

B. training

Identical functions κ c (•) and σ(•) are used

C. inference

Inference is split into two parts: initialization and tracking
Initial target and its features are given by a detection method, user, or dataset
Multiple search regions with identical size and different rotations are created to allow for rotation changes
Target features are compared with all search features and the one with the highest score is selected
Score map is upscaled and a penalty technique is used for scores far from the center
Maximum score position is translated back from score coordinates to image coordinates
Search region is centered on a prediction, assuming the object’s position on the next frame is close to its position on the previous frame
Linear position extrapolation is used for the search region
Penalty map is formed using either Hann window or a 2D Gaussian function
Target features are created at the first frame and used throughout the tracking sequence
Target feature merge scale is used to balance between initial target features and latest frame’s target features
Offset interpolation parameter is used to improve prediction stability

Iv. experiments

KITTI Tracking training dataset split is used to train and test the model
Tracks 0-18 are used for training and validation, tracks 19-20 for testing
Precision and Success metrics are used as defined in One Pass Evaluation
Model has 1 feature block with 4 layers
Model is trained for 64,000 steps with BCE loss, 1 * 10-5 learning rate and 2 positive label radius
During inference, rotations count of 3, rotation step of 0.15 and rotation penalty of 0.98 are used

A. comparison with state-of-the-art

VPIT is the fastest method and achieves competitive Precision and Success
VPIT outperforms P2B by 67% on 1080Ti GPU, 99% on 2080 GPU and 86% on 2080Ti GPU with 104-core CPU
For embedded devices, VPIT outperforms P2B by 135% on TX2 and by 98% on Xavier

B. real-time evaluation

Application of 3D tracking methods is usually done for robotic systems that do not use high-end GPUs.
Computations are applied on embedded devices, such as TX2 or Xavier.
A predictive real-time benchmark is implemented.
A Kalman Filter is used to predict the position of the object at a frame.
3D SOT methods are evaluated on embedded devices with both non-predictive and predictive benchmarks.
VPIT outperforms P2B and PTT in Success, Precision, FPS and Frame drop.

C. ablation study

Experiments were conducted to measure the effect of hyperparameters on Success and Precision metrics.
Decreasing the number of feature blocks from the Feature Generation Network leads to better Success values.
Target and search upscaling with ι t = (127, 127) and ι s = (255, 255) reduces training time but does not work well for voxel pseudo images.
Positive context leads to better Success with the maximum at c = 0.26.
Linear position extrapolation leads to a 30% increase in Success compared to a conventional search region placement.

V. conclusions

Proposed a novel method for 3D SOT called VPIT
Used structured data instead of point-based approaches
Used PointPillars’ pseudo images as a search space
Applied a Siamese 2D-like approach to find position and rotation of object
VPIT is the fastest method and achieves competitive Precision and Success values
Implemented a real-time evaluation protocol for 3D single object tracking
VPIT is suited for embedded devices and has low latency of predictions

Link to paper#

Abstract#

Paper Content#

I. introduction#

Ii. related works#

Iii. proposed method#

A. model architecture#

B. training#

C. inference#

Iv. experiments#

A. comparison with state-of-the-art#

B. real-time evaluation#

C. ablation study#

V. conclusions#

Link to paper

Abstract

Paper Content

I. introduction

Ii. related works

Iii. proposed method

A. model architecture

B. training

C. inference

Iv. experiments

A. comparison with state-of-the-art

B. real-time evaluation

C. ablation study

V. conclusions