Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Generic Object Tracking (GOT) is the problem of tracking target objects in a video.
- Previous research has focused on single object tracking, but multi-object tracking has wider applicability.
- A new large-scale GOT benchmark, LaGOT, is introduced to tackle key remaining challenges in GOT.
- A Transformer-based GOT tracker, TaMOS, is proposed to process multiple objects simultaneously.
Paper Content
Introduction
- Visual object tracking is a fundamental problem in computer vision
- Two task definitions: Generic Object Tracking (GOT) and Multiple Object Tracking (MOT)
- GOT focuses on tracking objects defined by given bounding boxes in the first video frame
- MOT requires a detector separately trained on objects of all classes
- GOT has only focused on the Single Object Tracking (SOT) scenario
- Tracking multiple objects leads to a linear increase in computation
- TaMOs jointly tracks multiple objects, leading to computational savings
- Proposed new GOT benchmark LaGOT with up to 10 tracks per sequence
- TaMOs proposed to tackle challenges of tracking multiple objects
- TaMOs evaluated on LaGOT, LaSOT and TrackingNet
- TaMOs outperforms recent trackers and sets new state of the art on TrackingNet
- TaMOs operates at over 4x faster run-times compared to the baseline when tracking 10 objects
Related work
- Generic object tracking is a well explored topic
- Specialized datasets and challenges focus on short-term or long-term tracking
- GMOT-40 focuses on Generic Multi Object Tracking (GMOT)
- MOT focuses on tracking multiple objects of different predefined classes
- TAO focuses on tracking objects of a long-tailed class distributions
- Open world tracking aims at detecting and tracking any object in a video sequence
- Video Object Segmentation (VOS) datasets provide multiobject annotations
- Global tracking operates on the whole video frame
- GlobalTrack and Siam R-CNN track the target by using global RPNs
- MetaUpdater and SPLT use a redetector to re-localize the target
- Transformers have been used for tracking
- Most trackers focus on encoding a single object per training frame
- AOT encodes each object into the identification embedding
Benchmark
- TAO and Ima-geNetVID are suitable for multi-object GOT evaluation
- TAO has hundreds of classes but only 1fps annotation
- Ima-geNetVID has 30 classes but short tracks
- LaGOT benchmark created with 294 video sequences and 837 tracks
- Annotation process uses interactive tool and manual verification
- 31 additional generic object classes added
- Average track length of LaGOT is 70 seconds
Method
- Tracker TaMOs uses a Transformer to track a set of objects in a video
- ToMP [38] is a Transformer-based single object tracker used as a starting point
- Transformer-based multi-object tracking architecture is introduced
- Training is discussed in Sec. 4.3
Background -transforming model prediction
- ToMP is a high-performance and versatile architecture used as a baseline tracker.
- Visual features are extracted from the training and test image crops.
- A foreground embedding is combined with a Gaussian score map and a LTRB bounding box encoding.
- A Transformer encoder and decoder are used to produce an enhanced test feature and a target appearance model.
Generic multi-object tracker -overview
- Proposed generic multi-object tracker TaMOs operates on full train and test images instead of crops
- Uses pool of learnable object embeddings to encode location and extent of each target object
- Object embedding represents target in entire video sequence
- Computational cost of Transformer operations limited to certain feature resolution
- FPN-based feature fusion of test frame features with higher resolution backbone features to track small objects
- Correlation filter based target localization and bounding box regression mechanism of ToMP applied on higher resolution FPN features
- Novel object encoding allows to encode multiple objects in shared feature map without requiring multiple templates
- Model predictor conditioned on object embeddings to produce target models
- Target models used to localize targets in test frame and regress their bounding boxes
- High-resolution multi-channel score and bounding box prediction maps used during inference
- Losses applied on Transformer encoder features, low- and high-resolution FPN feature maps during training
Training
- Classification and bounding box regression losses are used during training.
- Score maps corresponding to unused object embeddings should produce low scores.
- Bounding box regression loss is enforced only for predictions corresponding to encoded objects.
- Overall training loss is a combination of classification and bounding box regression losses.
Experiments
- Evaluated proposed tracking architecture on GOT benchmark LaGOT
- Compared method to recent trackers on SOT benchmarks
- Presented ablation study to evaluate impact of different components of tracker
Implementation details
- Method is implemented using PyTracking
- Image pair of one training and one test frame is randomly sampled from a training sequence
- Feature maps with stride 16 used for Resnet-50 and SwinBase backbones
- Pretrained weights on ImageNet-1k and ImageNet-22k used
- Training splits of LaSOT, GOT10k, TrackingNet, MS-COCO, ImageNet-Vid, TAO, and YoutubeVOS used
- Random scaling, cropping, color jittering, and flipping used
- 40k image pairs sampled with equal probability from all datasets
- Trained for 300 epochs on 4 Nvidia A100 GPUs
- Memory updating approach used during inference
State-of-the-art evaluation on lagot
- Evaluated tracker with Resnet-50 and Swin-Base backbones
- Compared to 8 other trackers on LaGOT benchmark
- Measured performance with One Pass Evaluation (OPE) setting
- Used GOT Success rate Area Under the Curve (AUC) and VOTLT metrics
- Tracker achieved best AUC, outperforming MixFormerLarge-22k by 1 point
- Highest AUC among all trackers with ResNet-50 backbone
- Outperformed all other trackers in VOTLT by 2.2 points
- TaMOs achieves 4x speedup for 10 concurrent objects
- TaMOs-SwinBase achieved 13.1 FPS for single object and 9.3 FPS for 10 objects
State-of-the-art comparison on sot datasets
- Evaluated TaMOs tracker on GOT benchmarks with single object per video
- No changes to weights or hyper-parameters
- LaSOT dataset consists of 280 test sequences with 2500 frames on average
- TaMOs achieved highest precision and second highest success rate AUC
- TrackingNet dataset consists of 511 test sequences
- TaMOs set new state of the art on TrackingNet in terms of success rate and precision AUC
Ablation study
- Resnet-50 is the backbone for all ablation experiments
- LTRB bounding box encoding is more important than Gaussian score map encoding
- Best results achieved with 10 object embeddings
- SwinBase and FPN improve tracking performance
- Adding a second training frame during inference improves results
Conclusion
- Propose a novel multiple object GOT tracking benchmark, LaGOT
- Propose a Transformer-based tracker capable of processing multiple targets at the same time
- Integrate a novel generic multi object encoding and an FPN
- Outperforms recent trackers on LaGOT benchmark
- Operates 4× faster than SOT baseline when tracking 10 objects
- Excellent results on large-scale SOT benchmarks
- Extract backbone features from Resnet-50 or SwinBase
- Use linear layer to decrease number of channels from 1024 to 256 or 512 to 256
- MLP to project LTRB bounding box encoding map from 4 to 256 channels
- Gradient norm clipping with parameter 0.1 to stabilize training
- Loss weighting parameters set to λ cls = 100 and λ bbreg = 1
- Train all models on four A100 GPUs with batch size of 4 × 12 or 4 × 6
- Fixed size Gaussian when producing score map encoding for each object with σ = 0.25
- 31 new classes added during annotation process
- Most tracks are between 30 and 110 seconds long
- Size distribution of annotated objects shows large objects are rare than small ones
- VOS sequences much shorter than LaGOT benchmark
- ImagenetVID contains shorter sequences, fewer classes and smaller number of average tracks per sequence
- MOT datasets typically focus on fewer classes, shorter sequences or are annotated at low frame rates
- TAO contains many more classes but provides annotations only at 1 FPS
- GMOT-40 dataset contains fewer classes, fewer videos, shorter sequences and provides only annotations of one particular object class per sequence
- SOT datasets provide only a single annotated object per sequence
- LaGOT enables to properly evaluate the robustness and accuracy of multiple object GOT methods
- LaGOT contains longer sequences than most listed SOT datasets