Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Generic Object Tracking (GOT) is the problem of tracking target objects in a video.
Previous research has focused on single object tracking, but multi-object tracking has wider applicability.
A new large-scale GOT benchmark, LaGOT, is introduced to tackle key remaining challenges in GOT.
A Transformer-based GOT tracker, TaMOs, is proposed to process multiple objects simultaneously.

Paper Content

Introduction

Visual object tracking is a fundamental problem in computer vision
Two task definitions: Generic Object Tracking (GOT) and Multiple Object Tracking (MOT)
GOT focuses on tracking objects defined by given bounding boxes in the first video frame
MOT requires a detector separately trained on objects of all classes
GOT has only focused on the Single Object Tracking (SOT) scenario
Tracking multiple objects leads to a linear increase in computation
TaMOs jointly tracks multiple objects, leading to computational savings
Proposed new GOT benchmark LaGOT with up to 10 tracks per sequence
TaMOs proposed to tackle challenges of tracking multiple objects
TaMOs evaluated on LaGOT, LaSOT and TrackingNet
TaMOs outperforms recent trackers and sets new state of the art on TrackingNet
TaMOs runs over 4× faster than the baseline when tracking 10 objects

Related work

Generic object tracking is a well-explored topic
Specialized datasets and challenges focus on short-term or long-term tracking
GMOT-40 focuses on Generic Multi Object Tracking (GMOT)
MOT focuses on tracking multiple objects of different predefined classes
TAO focuses on tracking objects of a long-tailed class distribution
Open world tracking aims at detecting and tracking any object in a video sequence
Video Object Segmentation (VOS) datasets provide multi-object annotations
Global tracking operates on the whole video frame
GlobalTrack and Siam R-CNN track the target by using global RPNs
MetaUpdater and SPLT use a redetector to re-localize the target
Transformers have been used for tracking
Most trackers focus on encoding a single object per training frame
AOT encodes each object into the identification embedding

Benchmark

TAO and ImageNetVID are suitable for multi-object GOT evaluation
TAO has hundreds of classes but is annotated at only 1 FPS
ImageNetVID has 30 classes but short tracks
LaGOT benchmark created with 294 video sequences and 837 tracks
Annotation process uses interactive tool and manual verification
31 additional generic object classes added
Average track length of LaGOT is 70 seconds

Method

Tracker TaMOs uses a Transformer to track a set of objects in a video
ToMP [38] is a Transformer-based single object tracker used as a starting point
Transformer-based multi-object tracking architecture is introduced
Training is discussed in Sec. 4.3

Background - transforming model prediction

ToMP is a high-performance and versatile architecture used as a baseline tracker.
Visual features are extracted from the training and test image crops.
A foreground embedding is combined with a Gaussian score map and an LTRB bounding box encoding.
A Transformer encoder and decoder are used to produce an enhanced test feature and a target appearance model.
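
To make the two target encodings above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the ToMP implementation: the function names, the 18x18 map size, the box coordinates, the normalization of the LTRB distances by the map size, and the value of sigma (given here in feature cells) are all hypothetical.

```python
import math

def gaussian_score_map(height, width, center_xy, sigma):
    """Return a height x width map that peaks at center_xy (feature-map coordinates)."""
    cx, cy = center_xy
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def ltrb_map(height, width, box_xyxy):
    """Return per-location (left, top, right, bottom) distances to the box edges.

    Distances are normalized by the map size (an assumption) and become
    negative for locations that lie outside the box.
    """
    x0, y0, x1, y1 = box_xyxy
    return [
        [((x - x0) / width, (y - y0) / height, (x1 - x) / width, (y1 - y) / height)
         for x in range(width)]
        for y in range(height)
    ]

# Hypothetical target box (x0, y0, x1, y1) on a hypothetical 18x18 feature map.
box = (4.0, 6.0, 10.0, 12.0)
center = ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)
scores = gaussian_score_map(18, 18, center, sigma=1.5)
ltrb = ltrb_map(18, 18, box)
```

The score map marks where the target is, while the LTRB map describes its extent relative to every location; both are indexed as `map[y][x]`.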

Generic multi-object tracker - overview

Proposed generic multi-object tracker TaMOs operates on full train and test images instead of crops
Uses pool of learnable object embeddings to encode location and extent of each target object
Object embedding represents target in entire video sequence
Computational cost of Transformer operations limited to certain feature resolution
FPN-based feature fusion of test frame features with higher resolution backbone features to track small objects
Correlation filter based target localization and bounding box regression mechanism of ToMP applied on higher resolution FPN features
Novel object encoding allows multiple objects to be encoded in a shared feature map without requiring multiple templates
Model predictor conditioned on object embeddings to produce target models
Target models used to localize targets in test frame and regress their bounding boxes
High-resolution multi-channel score and bounding box prediction maps used during inference
Losses applied on Transformer encoder features, low- and high-resolution FPN feature maps during training
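
The idea of encoding several objects in one shared feature map can be sketched as follows. This is a simplified illustration, not the TaMOs code: it assumes each object is written into the map as its embedding weighted by its score map and that the contributions are summed, it draws random vectors in place of learned embeddings, it assigns pool slots at random, and it uses 4 channels and a 3x3 map purely to keep the example small. The pool size of 10 follows the ablation result reported in this summary.

```python
import random

def encode_objects(score_maps, embeddings):
    """Sum embedding_i * score_map_i over all objects into one H x W x C map."""
    height, width = len(score_maps[0]), len(score_maps[0][0])
    channels = len(embeddings[0])
    encoded = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for score_map, emb in zip(score_maps, embeddings):
        for y in range(height):
            for x in range(width):
                s = score_map[y][x]
                if s == 0.0:
                    continue  # this object contributes nothing at this location
                for c in range(channels):
                    encoded[y][x][c] += s * emb[c]
    return encoded

rng = random.Random(0)
POOL_SIZE, CHANNELS = 10, 4  # CHANNELS is a toy value for readability

# Stand-in for the learnable pool: random vectors instead of trained ones.
pool = [[rng.gauss(0.0, 1.0) for _ in range(CHANNELS)] for _ in range(POOL_SIZE)]

# Two hypothetical objects, each with a one-hot score map on a 3x3 grid.
map_a = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
map_b = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

# Each object gets its own slot from the pool (random assignment is an assumption).
slots = rng.sample(range(POOL_SIZE), 2)
encoded = encode_objects([map_a, map_b], [pool[s] for s in slots])
```

Because every object uses a different embedding, a single map of fixed size can carry all targets at once, and the cost of the Transformer does not grow with a per-object template.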

Training

Classification and bounding box regression losses are used during training.
Score maps corresponding to unused object embeddings should produce low scores.
Bounding box regression loss is enforced only for predictions corresponding to encoded objects.
Overall training loss is a combination of classification and bounding box regression losses.
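
A minimal sketch of how the two loss terms could be combined is shown below. The weights 100 and 1 are the values for λ_cls and λ_bbreg listed among the implementation details in this summary; everything else is an assumption made for illustration, including the function name, the use of plain averages, and the placeholder per-slot loss values.

```python
LAMBDA_CLS = 100.0   # classification weight, as listed in this summary
LAMBDA_BBREG = 1.0   # bounding box regression weight, as listed in this summary

def total_loss(cls_losses, bbreg_losses, is_encoded):
    """Combine per-slot losses into one scalar.

    cls_losses:   one classification loss per object-embedding slot. All slots
                  count, so slots without an object are pushed toward low scores.
    bbreg_losses: one regression loss per slot.
    is_encoded:   True for slots that actually encode an object; only these
                  contribute to the regression term.
    """
    cls_term = sum(cls_losses) / len(cls_losses)
    used = [loss for loss, flag in zip(bbreg_losses, is_encoded) if flag]
    bbreg_term = sum(used) / len(used) if used else 0.0
    return LAMBDA_CLS * cls_term + LAMBDA_BBREG * bbreg_term

# Four slots, two of which encode an object (placeholder loss values).
loss = total_loss(
    cls_losses=[0.02, 0.01, 0.0, 0.0],
    bbreg_losses=[0.5, 0.3, 9.9, 9.9],
    is_encoded=[True, True, False, False],
)
```

The large regression values in the unused slots have no effect on the result, which mirrors the rule that box regression is only enforced for encoded objects.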

Experiments

Evaluated proposed tracking architecture on GOT benchmark LaGOT
Compared method to recent trackers on SOT benchmarks
Presented ablation study to evaluate impact of different components of tracker

Implementation details

Method is implemented using PyTracking
Image pair of one training and one test frame is randomly sampled from a training sequence
Feature maps with stride 16 used for ResNet-50 and SwinBase backbones
Pretrained weights on ImageNet-1k and ImageNet-22k used
Training splits of LaSOT, GOT10k, TrackingNet, MS-COCO, ImageNet-Vid, TAO, and YoutubeVOS used
Random scaling, cropping, color jittering, and flipping used
40k image pairs sampled with equal probability from all datasets
Trained for 300 epochs on 4 Nvidia A100 GPUs
Memory updating approach used during inference
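
The pair sampling described above (pick a dataset with equal probability, then one training and one test frame from a sequence) might look roughly like this. The sketch is hypothetical: the toy dataset contents, the function name, and the choice to place the training frame before the test frame are assumptions, and it does not use the PyTracking API.

```python
import random

def sample_pair(datasets, rng):
    """Sample (dataset name, training frame, test frame).

    datasets maps a dataset name to a list of sequences; each sequence is a
    list of frame identifiers in temporal order.
    """
    name = rng.choice(sorted(datasets))        # equal probability per dataset
    sequence = rng.choice(datasets[name])      # uniform over its sequences
    # Two distinct frames; the earlier one serves as the training frame (assumption).
    train_idx, test_idx = sorted(rng.sample(range(len(sequence)), 2))
    return name, sequence[train_idx], sequence[test_idx]

# Toy stand-ins for the real training splits: frame ids are just integers.
toy_datasets = {
    "LaSOT": [list(range(100))],
    "GOT10k": [list(range(50)), list(range(80))],
}

rng = random.Random(0)
pairs = [sample_pair(toy_datasets, rng) for _ in range(1000)]
```

Sampling the dataset first, rather than sampling uniformly over all sequences, keeps small datasets from being drowned out by large ones.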

State-of-the-art evaluation on LaGOT

Evaluated tracker with ResNet-50 and SwinBase backbones
Compared to 8 other trackers on LaGOT benchmark
Measured performance with One Pass Evaluation (OPE) setting
Used GOT Success rate Area Under the Curve (AUC) and VOTLT metrics
Tracker achieved best AUC, outperforming MixFormerLarge-22k by 1 point
Highest AUC among all trackers with ResNet-50 backbone
Outperformed all other trackers in VOTLT by 2.2 points
TaMOs achieves a 4× speedup for 10 concurrent objects
TaMOs-SwinBase achieved 13.1 FPS for single object and 9.3 FPS for 10 objects
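
As a small worked example, the two TaMOs-SwinBase frame rates above can be turned into per-frame times to see how far the cost is from growing linearly with the number of objects. The calculation uses only the two reported figures and says nothing about the speed of the baseline tracker; the variable names are illustrative.

```python
def frame_time_ms(fps):
    """Convert frames per second into milliseconds per frame."""
    return 1000.0 / fps

single_fps = 13.1   # reported for one object
ten_fps = 9.3       # reported for ten objects

# How much more expensive one frame becomes when tracking 10 objects instead of 1.
cost_ratio = frame_time_ms(ten_fps) / frame_time_ms(single_fps)

# Objects processed per second when all ten are tracked jointly.
object_throughput = ten_fps * 10

print(f"per-frame cost grows by a factor of {cost_ratio:.2f}")
print(f"object-frames per second: {object_throughput:.1f}")
```

Tracking ten times as many objects raises the per-frame cost by roughly 1.4×, whereas running a separate tracker per object would raise it by about 10×.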

State-of-the-art comparison on SOT datasets

Evaluated TaMOs tracker on GOT benchmarks with single object per video
No changes to weights or hyper-parameters
LaSOT dataset consists of 280 test sequences with 2500 frames on average
TaMOs achieved highest precision and second highest success rate AUC
TrackingNet dataset consists of 511 test sequences
TaMOs set new state of the art on TrackingNet in terms of success rate and precision AUC
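
For readers unfamiliar with the success-rate AUC used in these comparisons, the sketch below shows one common way to compute it from predicted and ground-truth boxes. The 21 evenly spaced overlap thresholds and the strict "greater than" comparison follow a widespread convention and are assumptions here; benchmark toolkits may differ in such details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Average, over overlap thresholds in [0, 1], of the fraction of frames
    whose IoU exceeds the threshold."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)

# Hypothetical three-frame sequence: perfect, partially overlapping, and lost.
gt = [(0.0, 0.0, 2.0, 2.0)] * 3
pred = [(0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
auc = success_auc(pred, gt)
```

With this definition, a frame where the target is lost contributes zero at every threshold, so long tracking failures lower the score more than slightly loose boxes do.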

Ablation study

ResNet-50 is the backbone for all ablation experiments
LTRB bounding box encoding is more important than Gaussian score map encoding
Best results achieved with 10 object embeddings
SwinBase and FPN improve tracking performance
Adding a second training frame during inference improves results

Conclusion

Propose a novel multiple object GOT tracking benchmark, LaGOT
Propose a Transformer-based tracker capable of processing multiple targets at the same time
Integrate a novel generic multi object encoding and an FPN
Outperforms recent trackers on LaGOT benchmark
Operates 4× faster than SOT baseline when tracking 10 objects
Excellent results on large-scale SOT benchmarks
Extract backbone features from ResNet-50 or SwinBase
Use linear layer to decrease number of channels from 1024 to 256 or 512 to 256
MLP to project LTRB bounding box encoding map from 4 to 256 channels
Gradient norm clipping with parameter 0.1 to stabilize training
Loss weighting parameters set to λ_cls = 100 and λ_bbreg = 1
Train all models on four A100 GPUs with batch size of 4 × 12 or 4 × 6
Fixed size Gaussian when producing score map encoding for each object with σ = 0.25
31 new classes added during annotation process
Most tracks are between 30 and 110 seconds long
Size distribution of annotated objects shows large objects are rarer than small ones
VOS sequences much shorter than LaGOT benchmark
ImageNetVID contains shorter sequences, fewer classes and a smaller average number of tracks per sequence
MOT datasets typically focus on fewer classes, shorter sequences or are annotated at low frame rates
TAO contains many more classes but provides annotations only at 1 FPS
GMOT-40 dataset contains fewer classes, fewer videos, shorter sequences and provides only annotations of one particular object class per sequence
SOT datasets provide only a single annotated object per sequence
LaGOT enables proper evaluation of the robustness and accuracy of multiple object GOT methods
LaGOT contains longer sequences than most listed SOT datasets
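
The gradient norm clipping with parameter 0.1 listed among the details above can be illustrated with a few lines of plain Python. This is a generic sketch of clipping by global L2 norm, assuming that is the variant meant; in practice a deep learning framework's built-in utility would be used, for example `torch.nn.utils.clip_grad_norm_` in PyTorch.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale a flat list of gradients so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Large gradients are scaled down to norm 0.1; their direction is preserved.
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=0.1)

# Gradients that are already small pass through unchanged.
small, _ = clip_grad_norm([0.03, 0.04], max_norm=0.1)
```

Clipping bounds the size of each update without changing its direction, which is why it helps stabilize training when occasional batches produce very large gradients.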
