Track-On: Transformer-based Online Point Tracking with Memory

arXiv | Webpage

This repository is the official implementation of the paper:

Track-On: Transformer-based Online Point Tracking with Memory

Görkay Aydemir, Xiongyi Cai, Weidi Xie, Fatma Güney

International Conference on Learning Representations (ICLR), 2025

Overview

Track-On is an efficient online point-tracking model that tracks points frame by frame using memory. Its transformer-based architecture maintains a compact yet effective memory of previously tracked points.

Track-On Overview


Installation

1. Clone the repository

git clone https://github.com/gorkaydemir/track_on.git 
cd track_on

2. Set up the environment

conda create -n trackon python=3.8 -y
conda activate trackon
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.4/index.html
pip install -r requirements.txt

3. Download Datasets

To obtain the necessary datasets, follow the instructions provided in the TAP-Vid repository:

  • Evaluation Datasets:
    • TAP-Vid Benchmark (DAVIS, RGB-Stacking, Kinetics)
    • RoboTAP
  • Training Dataset:
    • MOVi-F – refer to this GitHub issue for additional guidance.

Quick Demo

Check out the demo notebook for a quick start with the model.

Usage Options

Track-On provides two practical usage modes, both handling frames online but differing in input format:

1. Frame-by-frame input (for streaming videos)

from model.track_on_ff import TrackOnFF

model = TrackOnFF(args)

# Initialize the query points and the memory from the first frame.
model.init_queries_and_memory(queries, first_frame)

# Feed subsequent frames one by one as they arrive.
while True:
    out = model.ff_forward(new_frame)  # new_frame: the latest frame of the stream
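
For reference, below is a minimal sketch of driving the streaming loop end to end. It is not the repository's demo code: the tensor layouts (queries as an (N, 3) tensor of (t, x, y) pixel coordinates, frames as (H, W, 3) arrays), the device handling, and the frame_stream iterable are assumptions; consult the demo notebook for the exact conventions.

import torch
from model.track_on_ff import TrackOnFF

def track_stream(args, frame_stream, queries, device="cuda"):
    # Assumed interface: `frame_stream` yields frames one by one,
    # `queries` holds the points to track as (t, x, y) in pixel coordinates.
    frame_stream = iter(frame_stream)
    first_frame = next(frame_stream)

    model = TrackOnFF(args).to(device).eval()
    model.init_queries_and_memory(queries.to(device),
                                  torch.as_tensor(first_frame).to(device))

    outputs = []
    with torch.no_grad():
        for new_frame in frame_stream:
            out = model.ff_forward(torch.as_tensor(new_frame).to(device))
            outputs.append(out)  # per-frame predictions in the model's output format
    return outputs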

2. Video input (for benchmarking)

from model.track_on import TrackOn

model = TrackOn(args)

# Run over a full video at once; frames are still processed online internally.
out = model.inference(video, queries)
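
As a concrete starting point, the sketch below loads a video file with torchvision and calls the offline API. The file name, the tensor shapes ((1, T, 3, H, W) for the video, (1, N, 3) for queries as (t, x, y)), and the preprocessing are assumptions rather than the repository's documented interface; the demo notebook and the dataloaders define the exact convention.

import torch
from torchvision.io import read_video
from model.track_on import TrackOn

# Decode a video file; read_video returns (T, H, W, 3) uint8 frames.
frames, _, _ = read_video("example.mp4", output_format="THWC")
video = frames.permute(0, 3, 1, 2).float().unsqueeze(0)  # assumed layout: (1, T, 3, H, W)

# One query point at frame 0, pixel location (x, y) = (128.0, 96.0); layout is assumed.
queries = torch.tensor([[[0.0, 128.0, 96.0]]])

model = TrackOn(args).cuda().eval()  # `args` as used for evaluation/training
with torch.no_grad():
    out = model.inference(video.cuda(), queries.cuda())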

Evaluation

1. Download Pretrained Weights

Download the pre-trained checkpoint from Hugging Face.
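
The evaluation command below consumes the checkpoint via --checkpoint_path, so no manual loading is needed there. If you want to use the checkpoint with the Python APIs from the usage section, a generic PyTorch loading sketch looks like the following; the checkpoint's key layout (including the "model" key name) is an assumption and not documented in this README.

import torch
from model.track_on import TrackOn

model = TrackOn(args)
state = torch.load("/path/to/checkpoint", map_location="cpu")
state = state.get("model", state)  # "model" key is a guess; adapt to the actual checkpoint layout
model.load_state_dict(state, strict=False)
model = model.cuda().eval()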

2. Run Evaluation

Given:

  • evaluation_dataset: The dataset to evaluate on
  • tapvid_root: Path to evaluation dataset
  • checkpoint_path: Path to the downloaded checkpoint

Run the following command:

torchrun --master_port=12345 --nproc_per_node=1 main.py \
    --eval_dataset evaluation_dataset \
    --tapvid_root /path/to/eval/data \
    --checkpoint_path /path/to/checkpoint \
    --online_validation

With the dataset and checkpoint paths set correctly, this command reproduces the results reported in the paper.


Training

1. Prepare datasets

  • MOVi-F training dataset: located at /root/to/movi_f
  • TAP-Vid evaluation dataset:
    • Dataset name: eval_dataset
    • Path: /root/to/tap_vid
  • Name of the training run: training_name

2. Run Training

A multi-node training script is provided in train.sh. Default training arguments are set within the script.


📖 Citation

If you find our work useful, please cite:

@InProceedings{Aydemir2025ICLR,
    author    = {Aydemir, G\"orkay and Cai, Xiongyi and Xie, Weidi and G\"uney, Fatma},
    title     = {{Track-On}: Transformer-based Online Point Tracking with Memory},
    booktitle = {The Thirteenth International Conference on Learning Representations},
    year      = {2025}
}

Acknowledgments

This repository incorporates code from several public works, including CoTracker, TAPNet, DINOv2, ViTAdapter, and SPINO. Special thanks to the authors of these projects for making their code available.
