DECOLA Model Zoo

In all our experiments, we used 8 Quadro RTX 6000 and 8 V100 GPUs.

How to read the tables

The "config" column contains a link to the config file. To train a model, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml

To evaluate a trained or pretrained model, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth

Third-party ImageNet-21K pre-trained models

Our paper uses ImageNet-21K pretrained models that are not part of Detectron2 (ResNet-50-21K from MIIL and SwinB-21K from Swin-Transformer). Before training, please download the models, place them under DECOLA_ROOT/weights/, and use this tool to convert the format.

DECOLA and baselines

Here we provide the configs and checkpoints of DECOLA and of Detic, our main baseline. Please refer to Detic to learn more about it. The baseline is trained on a detection dataset (LVIS-base or LVIS) for 4x and then further trained on a weak dataset (ImageNet-21K) for another 4x. DECOLA is trained on the same detection dataset with language conditioning for 4x (phase 1) and finetuned on the same weak dataset for another 4x (phase 2). For more detail, please see training details.

Open-vocabulary LVIS with Deformable DETR

ResNet-50 backbone

| name | box AP_novel | box AP_c | box AP_f | box mAP | model |
|---|---|---|---|---|---|
| baseline | 9.4 | 33.8 | 40.4 | 32.2 | weight |
| baseline + self-train | 23.2 | 36.5 | 41.6 | 36.2 | weight |
| DECOLA [Phase 2] | 27.6 | 38.3 | 42.9 | 38.3 | weight |

Swin-B backbone

| name | box AP_novel | box AP_c | box AP_f | box mAP | model |
|---|---|---|---|---|---|
| baseline | 16.2 | 43.8 | 49.1 | 41.1 | weight |
| baseline + self-train | 30.8 | 43.6 | 45.9 | 42.3 | weight |
| DECOLA [Phase 2] | 35.7 | 47.5 | 49.7 | 46.3 | weight |

Swin-L backbone (w/ O365)

| name | box AP_novel | box AP_c | box AP_f | box mAP | model |
|---|---|---|---|---|---|
| baseline | 21.9 | 53.3 | 57.7 | 49.6 | weight |
| baseline + self-train | 36.5 | 53.5 | 56.5 | 51.8 | weight |
| DECOLA [Phase 2] | 46.9 | 56.0 | 58.0 | 55.2 | weight |
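As a reading aid, the box AP_novel gap between DECOLA [Phase 2] and baseline + self-train can be computed directly from the three open-vocabulary Deformable DETR tables. The sketch below only restates numbers copied from those tables; the dictionary layout and the `novel_gain` helper are illustrative names, not part of the DECOLA codebase.

```python
# box AP_novel copied from the open-vocabulary LVIS Deformable DETR tables.
AP_NOVEL = {
    "ResNet-50": {"baseline + self-train": 23.2, "DECOLA [Phase 2]": 27.6},
    "Swin-B": {"baseline + self-train": 30.8, "DECOLA [Phase 2]": 35.7},
    "Swin-L (w/ O365)": {"baseline + self-train": 36.5, "DECOLA [Phase 2]": 46.9},
}


def novel_gain(backbone):
    """Return DECOLA's box AP_novel gain over baseline + self-train."""
    row = AP_NOVEL[backbone]
    # Round to one decimal to match the precision reported in the tables.
    return round(row["DECOLA [Phase 2]"] - row["baseline + self-train"], 1)


for backbone in AP_NOVEL:
    print(f"{backbone}: +{novel_gain(backbone)} box AP_novel")
```

The gain comes out to 4.4, 4.9, and 10.4 points for the ResNet-50, Swin-B, and Swin-L backbones respectively.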

Standard LVIS with Deformable DETR

ResNet-50 backbone

| name | box AP_rare | box AP_c | box AP_f | box mAP | model |
|---|---|---|---|---|---|
| baseline | 26.3 | 34.1 | 41.3 | 35.6 | weight |
| baseline + self-train | 30.0 | 35.3 | 41.0 | 36.6 | weight |
| DECOLA [Phase 2] | 34.8 | 38.7 | 42.5 | 39.6 | weight |
| DECOLA [Phase 2 (offline)] | 35.9 | 38.0 | 42.4 | 39.4 | weight |

Swin-B backbone

| name | box AP_rare | box AP_c | box AP_f | box mAP | model |
|---|---|---|---|---|---|
| baseline | 38.3 | 43.4 | 48.6 | 44.5 | weight |
| baseline + self-train | 42.0 | 44.0 | 48.1 | 45.2 | weight |
| DECOLA [Phase 2] | 46.4 | 46.9 | 49.4 | 47.8 | weight |
| DECOLA [Phase 2 (offline)] | 47.4 | 47.4 | 49.6 | 48.3 | weight |

Open-vocabulary LVIS with CenterNet2

For DECOLA training, we use pseudo-labels generated from Phase 1 DECOLA (R50, SwinB) trained on LVIS-base. See here to learn how to generate pseudo-labels.

ResNet-50 backbone

| name | box AP_novel | box mAP | mask AP_novel | mask mAP | model |
|---|---|---|---|---|---|
| Detic-base | 17.6 | 33.8 | 16.4 | 30.2 | weight |
| Detic | 26.7 | 36.3 | 24.6 | 32.4 | weight |
| DECOLA label [config] | 29.0 | 37.6 | 26.8 | 33.6 | weight |
| DECOLA label [config] | 29.5 | 37.7 | 27.0 | 33.7 | weight |

Swin-B backbone

| name | box AP_novel | box mAP | mask AP_novel | mask mAP | model |
|---|---|---|---|---|---|
| Detic-base | 24.6 | 43.0 | 21.9 | 38.4 | weight |
| Detic | 36.6 | 45.7 | 33.8 | 40.7 | weight |
| DECOLA label [config] | 38.4 | 46.7 | 35.3 | 42.0 | weight |

Direct zero-shot transfer to LVIS minival

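| name | backbone | data | AP_r | AP_c | AP_f | mAP_fixed | model |
|---|---|---|---|---|---|---|---|
| DECOLA [Phase 1] | Swin-T | O365 | | | | | |
| DECOLA [Phase 2] | Swin-T | O365, IN21K | 32.8 | 32.0 | 31.8 | 32.0 | weight |
| DECOLA [Phase 1] | Swin-L | O365 | | | | | |
| DECOLA [Phase 2] | Swin-L | O365, OID, IN21K | 41.5 | 38.0 | 34.9 | 36.8 | weight |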

Direct zero-shot transfer to LVIS v1.0

name backbone data AP_r AP_c AP_f mAP_fixed model
DECOLA [Phase 1] Swin-T O365 -
DECOLA [Phase 2] Swin-T O365, IN21K 27.2 24.9 28.0 26.6 weight
DECOLA [Phase 1] Swin-L O365 -
DECOLA [Phase 2] Swin-L O365, OID, IN21K 32.9 29.1 30.3 30.2 weight

Standard LVIS with CenterNet2

For DECOLA training, we use pseudo-labels generated from Phase 1 DECOLA (R50, SwinB) trained on LVIS.

ResNet-50 backbone

| name | box AP_rare | box mAP | mask AP_rare | mask mAP | model |
|---|---|---|---|---|---|
| Detic-base | 28.2 | 35.3 | 25.6 | 31.4 | weight |
| Detic | 31.4 | 36.8 | 29.7 | 33.2 | weight |
| DECOLA label [config] | 35.6 | 38.6 | 32.1 | 34.4 | weight |
| DECOLA label [config] | 35.4 | 38.3 | 32.1 | 34.2 | weight |

Swin-B backbone

| name | box AP_rare | box mAP | mask AP_rare | mask mAP | model |
|---|---|---|---|---|---|
| Detic-base | 39.9 | 45.4 | 35.9 | 40.7 | weight |
| Detic | 45.8 | 46.9 | 41.7 | 41.7 | weight |
| DECOLA label [config] | 46.6 | 48.3 | 42.3 | 43.4 | weight |

DECOLA phase 1 on conditioned-mAP (c-mAP)

Here we provide the DECOLA checkpoints from phase 1 training (language conditioning). The main evaluation metric for these models, as well as for the standard detector (baseline), is c-mAP@k, where k is the per-image detection limit.

To evaluate a baseline model for c-mAP, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth MODEL.DETR.ORACLE_EVALUATION True TEST.DETECTIONS_PER_IMAGE $k

To evaluate a Phase 1 DECOLA model for c-mAP, run

python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth MODEL.DECOLA.ORACLE_EVALUATION True MODEL.DECOLA.TEST_CLASS_CONDITIONED True TEST.DETECTIONS_PER_IMAGE $k

Change k for different per-image detection limits.
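To cover every limit reported in the tables, the two commands above can be generated in a loop. The following is a minimal sketch, assuming the flags behave exactly as written in the commands above; `cmap_eval_command` and `K_VALUES` are hypothetical names introduced here for illustration, and the paths are the same placeholders used above.

```python
import shlex

# k values taken from the AP_r@k columns of the c-mAP tables.
K_VALUES = [10, 20, 50, 100, 300]


def cmap_eval_command(config, weights, k, decola, num_gpus=8):
    """Return the c-mAP evaluation command as an argument list."""
    cmd = [
        "python", "train_net.py",
        "--num-gpus", str(num_gpus),
        "--config-file", config,
        "--eval-only",
        "MODEL.WEIGHTS", weights,
    ]
    if decola:
        # Phase 1 DECOLA: oracle evaluation plus class-conditioned testing.
        cmd += [
            "MODEL.DECOLA.ORACLE_EVALUATION", "True",
            "MODEL.DECOLA.TEST_CLASS_CONDITIONED", "True",
        ]
    else:
        # Baseline detector: oracle evaluation only.
        cmd += ["MODEL.DETR.ORACLE_EVALUATION", "True"]
    # Per-image detection limit k.
    cmd += ["TEST.DETECTIONS_PER_IMAGE", str(k)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace with a real config and checkpoint.
    config = "/path/to/config/name.yaml"
    weights = "/path/to/weight.pth"
    for k in K_VALUES:
        print(shlex.join(cmap_eval_command(config, weights, k, decola=True)))
```

Each printed line can be run as is, or the argument list can be passed to `subprocess.run` to launch the evaluations one after another.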

ResNet-50 backbone

| name | data | AP_r@10 | AP_r@20 | AP_r@50 | AP_r@100 | AP_r@300 | model |
|---|---|---|---|---|---|---|---|
| baseline | LVIS-base | 6.0 | 11.3 | 19.2 | 26.8 | 31.9 | weight |
| DECOLA [Phase 1] | LVIS-base | 19.4 | 28.5 | 34.1 | 38.7 | 40.0 | weight |
| baseline | LVIS | 21.3 | 29.4 | 36.9 | 41.1 | 44.6 | weight |
| DECOLA [Phase 1] | LVIS | 26.6 | 39.1 | 45.2 | 47.1 | 48.8 | weight |

Swin-B backbone

| name | data | AP_r@10 | AP_r@20 | AP_r@50 | AP_r@100 | AP_r@300 | model |
|---|---|---|---|---|---|---|---|
| baseline | LVIS-base | 7.4 | 16.1 | 27.5 | 33.1 | 41.9 | weight |
| DECOLA [Phase 1] | LVIS-base | 21.9 | 32.0 | 40.0 | 44.0 | 47.7 | weight |
| baseline | LVIS | 30.1 | 38.2 | 45.5 | 49.3 | 53.2 | weight |
| DECOLA [Phase 1] | LVIS | 33.5 | 43.9 | 51.4 | 53.8 | 55.8 | weight |