This repository is the official implementation of the paper:
Track-On: Transformer-based Online Point Tracking with Memory
Görkay Aydemir, Xiongyi Cai, Weidi Xie, Fatma Güney
International Conference on Learning Representations (ICLR), 2025
Track-On is an efficient, online point tracking model that tracks points in a frame-by-frame manner using memory. It leverages a transformer-based architecture to maintain a compact yet effective memory of previously tracked points.
```bash
git clone https://github.com/gorkaydemir/track_on.git
cd track_on

conda create -n trackon python=3.8 -y
conda activate trackon

conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.4/index.html
pip install -r requirements.txt
```
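After installing, a quick sanity check (not part of the official setup) that the pinned versions resolved and CUDA is visible:

```python
import torch
import torchvision

print(torch.__version__)          # expected: 2.4.1 (+cu121 build)
print(torchvision.__version__)    # expected: 0.19.1
print(torch.cuda.is_available())  # should be True on a CUDA machine
```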
To obtain the necessary datasets, follow the instructions provided in the TAP-Vid repository:

Evaluation Datasets:
- TAP-Vid Benchmark (DAVIS, RGB-Stacking, Kinetics)
- RoboTAP

Training Dataset:
- MOVi-F (refer to this GitHub issue for additional guidance)
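For reference, the TAP-Vid DAVIS split ships as a single pickle file; the sketch below shows how it can be inspected. The field names follow the TAP-Vid documentation, so verify them against your download:

```python
import pickle

with open("/path/to/tapvid_davis/tapvid_davis.pkl", "rb") as f:
    data = pickle.load(f)

# Each entry maps a video name to its frames and point annotations
name, sample = next(iter(data.items()))
print(name)
print(sample["video"].shape)     # (num_frames, H, W, 3), uint8
print(sample["points"].shape)    # (num_points, num_frames, 2), normalized (x, y)
print(sample["occluded"].shape)  # (num_points, num_frames), bool
```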
Check out the demo notebook for a quick start with the model.
Track-On provides two practical usage modes; both process frames online but differ in input format:

1. Frame-by-frame: initialize with the query points and the first frame, then feed each new frame as it arrives:

```python
from model.track_on_ff import TrackOnFF

model = TrackOnFF(args)
model.init_queries_and_memory(queries, first_frame)

while True:  # streaming loop: one frame at a time
    out = model.ff_forward(new_frame)
```

2. Full-video input: pass the entire video at once; frames are still processed online internally:

```python
from model.track_on import TrackOn

model = TrackOn(args)
out = model.inference(video, queries)
```
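Tying the pieces together, here is a hedged end-to-end sketch of the frame-by-frame mode. The query layout (one (x, y) pixel coordinate per point on the first frame) and the variables `args` and `frames` are assumptions for illustration; see the demo notebook for the exact interface:

```python
import torch
from model.track_on_ff import TrackOnFF

model = TrackOnFF(args)  # args: the repo's config object (assumed available)
model.eval()

first_frame = frames[0]                 # frames: any decoded video (assumed)
queries = torch.tensor([[128.0, 96.0],  # assumed (x, y) pixel coordinates
                        [200.0, 150.0]])
model.init_queries_and_memory(queries, first_frame)

with torch.no_grad():
    for new_frame in frames[1:]:        # online: one frame at a time
        out = model.ff_forward(new_frame)
```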
Download the pre-trained checkpoint from Hugging Face.
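To fetch the checkpoint programmatically, here is a sketch using huggingface_hub; the repo id and filename below are placeholders, not the actual values, so substitute the ones from the Hugging Face page:

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo id and filename: replace with the actual release values
ckpt_path = hf_hub_download(repo_id="<user>/<repo>", filename="track_on.pth")
state_dict = torch.load(ckpt_path, map_location="cpu")
```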
Given:
- `evaluation_dataset`: the dataset to evaluate on
- `tapvid_root`: path to the evaluation dataset
- `checkpoint_path`: path to the downloaded checkpoint
Run the following command:
```bash
torchrun --master_port=12345 --nproc_per_node=1 main.py \
    --eval_dataset evaluation_dataset \
    --tapvid_root /path/to/eval/data \
    --checkpoint_path /path/to/checkpoint \
    --online_validation
```
When configured correctly, this command reproduces the results reported in the paper.
For training, given:
- MOVi-F dataset located at `/root/to/movi_f`
- TAP-Vid evaluation dataset: name `eval_dataset`, located at `/root/to/tap_vid`
- Training name: `training_name`

A multi-node training script is provided in `train.sh`; default training arguments are set within the script.
If you find our work useful, please cite:
```bibtex
@InProceedings{Aydemir2025ICLR,
    author    = {Aydemir, G\"orkay and Cai, Xiongyi and Xie, Weidi and G\"uney, Fatma},
    title     = {{Track-On}: Transformer-based Online Point Tracking with Memory},
    booktitle = {The Thirteenth International Conference on Learning Representations},
    year      = {2025}
}
```
This repository incorporates code from several public works, including CoTracker, TAPNet, DINOv2, ViTAdapter, and SPINO. Special thanks to the authors of these projects for making their code available.