Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Generic Object Tracking (GOT) is the problem of tracking target objects in a video.
  • Previous research has focused on single object tracking, but multi-object tracking has wider applicability.
  • A new large-scale GOT benchmark, LaGOT, is introduced to tackle key remaining challenges in GOT.
  • A Transformer-based GOT tracker, TaMOS, is proposed to process multiple objects simultaneously.

Paper Content

Introduction

  • Visual object tracking is a fundamental problem in computer vision
  • Two task definitions: Generic Object Tracking (GOT) and Multiple Object Tracking (MOT)
  • GOT focuses on tracking objects defined by given bounding boxes in the first video frame
  • MOT requires a detector separately trained on objects of all classes
  • GOT has only focused on the Single Object Tracking (SOT) scenario
  • Tracking multiple objects leads to a linear increase in computation
  • TaMOs jointly tracks multiple objects, leading to computational savings
  • Proposed new GOT benchmark LaGOT with up to 10 tracks per sequence
  • TaMOs proposed to tackle challenges of tracking multiple objects
  • TaMOs evaluated on LaGOT, LaSOT and TrackingNet
  • TaMOs outperforms recent trackers and sets new state of the art on TrackingNet
  • TaMOs operates at over 4x faster run-times compared to the baseline when tracking 10 objects
  • Generic object tracking is a well explored topic
  • Specialized datasets and challenges focus on short-term or long-term tracking
  • GMOT-40 focuses on Generic Multi Object Tracking (GMOT)
  • MOT focuses on tracking multiple objects of different predefined classes
  • TAO focuses on tracking objects of a long-tailed class distributions
  • Open world tracking aims at detecting and tracking any object in a video sequence
  • Video Object Segmentation (VOS) datasets provide multiobject annotations
  • Global tracking operates on the whole video frame
  • GlobalTrack and Siam R-CNN track the target by using global RPNs
  • MetaUpdater and SPLT use a redetector to re-localize the target
  • Transformers have been used for tracking
  • Most trackers focus on encoding a single object per training frame
  • AOT encodes each object into the identification embedding

Benchmark

  • TAO and Ima-geNetVID are suitable for multi-object GOT evaluation
  • TAO has hundreds of classes but only 1fps annotation
  • Ima-geNetVID has 30 classes but short tracks
  • LaGOT benchmark created with 294 video sequences and 837 tracks
  • Annotation process uses interactive tool and manual verification
  • 31 additional generic object classes added
  • Average track length of LaGOT is 70 seconds

Method

  • Tracker TaMOs uses a Transformer to track a set of objects in a video
  • ToMP [38] is a Transformer-based single object tracker used as a starting point
  • Transformer-based multi-object tracking architecture is introduced
  • Training is discussed in Sec. 4.3

Background -transforming model prediction

  • ToMP is a high-performance and versatile architecture used as a baseline tracker.
  • Visual features are extracted from the training and test image crops.
  • A foreground embedding is combined with a Gaussian score map and a LTRB bounding box encoding.
  • A Transformer encoder and decoder are used to produce an enhanced test feature and a target appearance model.

Generic multi-object tracker -overview

  • Proposed generic multi-object tracker TaMOs operates on full train and test images instead of crops
  • Uses pool of learnable object embeddings to encode location and extent of each target object
  • Object embedding represents target in entire video sequence
  • Computational cost of Transformer operations limited to certain feature resolution
  • FPN-based feature fusion of test frame features with higher resolution backbone features to track small objects
  • Correlation filter based target localization and bounding box regression mechanism of ToMP applied on higher resolution FPN features
  • Novel object encoding allows to encode multiple objects in shared feature map without requiring multiple templates
  • Model predictor conditioned on object embeddings to produce target models
  • Target models used to localize targets in test frame and regress their bounding boxes
  • High-resolution multi-channel score and bounding box prediction maps used during inference
  • Losses applied on Transformer encoder features, low- and high-resolution FPN feature maps during training

Training

  • Classification and bounding box regression losses are used during training.
  • Score maps corresponding to unused object embeddings should produce low scores.
  • Bounding box regression loss is enforced only for predictions corresponding to encoded objects.
  • Overall training loss is a combination of classification and bounding box regression losses.

Experiments

  • Evaluated proposed tracking architecture on GOT benchmark LaGOT
  • Compared method to recent trackers on SOT benchmarks
  • Presented ablation study to evaluate impact of different components of tracker

Implementation details

  • Method is implemented using PyTracking
  • Image pair of one training and one test frame is randomly sampled from a training sequence
  • Feature maps with stride 16 used for Resnet-50 and SwinBase backbones
  • Pretrained weights on ImageNet-1k and ImageNet-22k used
  • Training splits of LaSOT, GOT10k, TrackingNet, MS-COCO, ImageNet-Vid, TAO, and YoutubeVOS used
  • Random scaling, cropping, color jittering, and flipping used
  • 40k image pairs sampled with equal probability from all datasets
  • Trained for 300 epochs on 4 Nvidia A100 GPUs
  • Memory updating approach used during inference

State-of-the-art evaluation on lagot

  • Evaluated tracker with Resnet-50 and Swin-Base backbones
  • Compared to 8 other trackers on LaGOT benchmark
  • Measured performance with One Pass Evaluation (OPE) setting
  • Used GOT Success rate Area Under the Curve (AUC) and VOTLT metrics
  • Tracker achieved best AUC, outperforming MixFormerLarge-22k by 1 point
  • Highest AUC among all trackers with ResNet-50 backbone
  • Outperformed all other trackers in VOTLT by 2.2 points
  • TaMOs achieves 4x speedup for 10 concurrent objects
  • TaMOs-SwinBase achieved 13.1 FPS for single object and 9.3 FPS for 10 objects

State-of-the-art comparison on sot datasets

  • Evaluated TaMOs tracker on GOT benchmarks with single object per video
  • No changes to weights or hyper-parameters
  • LaSOT dataset consists of 280 test sequences with 2500 frames on average
  • TaMOs achieved highest precision and second highest success rate AUC
  • TrackingNet dataset consists of 511 test sequences
  • TaMOs set new state of the art on TrackingNet in terms of success rate and precision AUC

Ablation study

  • Resnet-50 is the backbone for all ablation experiments
  • LTRB bounding box encoding is more important than Gaussian score map encoding
  • Best results achieved with 10 object embeddings
  • SwinBase and FPN improve tracking performance
  • Adding a second training frame during inference improves results

Conclusion

  • Propose a novel multiple object GOT tracking benchmark, LaGOT
  • Propose a Transformer-based tracker capable of processing multiple targets at the same time
  • Integrate a novel generic multi object encoding and an FPN
  • Outperforms recent trackers on LaGOT benchmark
  • Operates 4× faster than SOT baseline when tracking 10 objects
  • Excellent results on large-scale SOT benchmarks
  • Extract backbone features from Resnet-50 or SwinBase
  • Use linear layer to decrease number of channels from 1024 to 256 or 512 to 256
  • MLP to project LTRB bounding box encoding map from 4 to 256 channels
  • Gradient norm clipping with parameter 0.1 to stabilize training
  • Loss weighting parameters set to λ cls = 100 and λ bbreg = 1
  • Train all models on four A100 GPUs with batch size of 4 × 12 or 4 × 6
  • Fixed size Gaussian when producing score map encoding for each object with σ = 0.25
  • 31 new classes added during annotation process
  • Most tracks are between 30 and 110 seconds long
  • Size distribution of annotated objects shows large objects are rare than small ones
  • VOS sequences much shorter than LaGOT benchmark
  • ImagenetVID contains shorter sequences, fewer classes and smaller number of average tracks per sequence
  • MOT datasets typically focus on fewer classes, shorter sequences or are annotated at low frame rates
  • TAO contains many more classes but provides annotations only at 1 FPS
  • GMOT-40 dataset contains fewer classes, fewer videos, shorter sequences and provides only annotations of one particular object class per sequence
  • SOT datasets provide only a single annotated object per sequence
  • LaGOT enables to properly evaluate the robustness and accuracy of multiple object GOT methods
  • LaGOT contains longer sequences than most listed SOT datasets