Instance Segmentation Models of BDD100K

The instance segmentation task involves detecting and segmenting each distinct object of interest in the scene.

The BDD100K dataset contains object segmentation annotations for 10K images (7K/1K/2K for train/val/test). Each annotation contains labels for 8 object classes. For details about downloading the data and the annotation format for this task, see the official documentation.

Model Zoo

To train the models listed below, we follow the common settings used by MMDetection (details here) unless otherwise stated. All models are trained on either 8 GeForce RTX 2080 Ti GPUs or 8 TITAN RTX GPUs. The default batch size is 2x8=16; where the batch size varies, it is listed in the Batch Size column of the results table. See the config files for the detailed settings of each model.


Mask R-CNN

Mask R-CNN [ICCV 2017]

Authors: Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

Abstract We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: [this https URL](https://github.com/facebookresearch/detectron2).

Results

| Backbone | Batch Size | Lr schd | MS-train | Mask AP-val | Box AP-val | Scores-val | Mask AP-test | Box AP-test | Scores-test | Config | Weights | Preds | Visuals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R-50-FPN | 16 | 1x | | 16.24 | 22.34 | scores | 14.86 | 19.59 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 16 | 3x | | 19.88 | 25.93 | scores | 17.46 | 22.32 | scores | config | model \| MD5 | preds | visuals |
| R-101-FPN | 16 | 3x | | 20.51 | 26.08 | scores | 17.88 | 22.01 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 32 | 1x | | 16.15 | 21.54 | scores | 14.90 | 19.24 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 32 | 3x | | 20.20 | 26.14 | scores | 17.60 | 22.03 | scores | config | model \| MD5 | preds | visuals |
| R-101-FPN | 32 | 3x | | 20.48 | 25.70 | scores | 17.71 | 21.94 | scores | config | model \| MD5 | preds | visuals |

[Code] [Usage Instructions]


Cascade Mask R-CNN

Cascade R-CNN: High Quality Object Detection and Instance Segmentation [TPAMI 2019]

Authors: Zhaowei Cai, Nuno Vasconcelos

Abstract In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at [this https URL](https://github.com/zhaoweicai/cascade-rcnn) (Caffe) and [this https URL](https://github.com/zhaoweicai/Detectron-Cascade-RCNN) (Detectron).

Results

| Backbone | Batch Size | Lr schd | MS-train | Mask AP-val | Box AP-val | Scores-val | Mask AP-test | Box AP-test | Scores-test | Config | Weights | Preds | Visuals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R-50-FPN | 16 | 1x | | 18.63 | 25.97 | scores | 15.89 | 21.55 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 16 | 3x | | 19.43 | 25.26 | scores | 16.18 | 20.46 | scores | config | model \| MD5 | preds | visuals |
| R-101-FPN | 16 | 3x | | 19.79 | 24.79 | scores | 16.65 | 20.63 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 32 | 1x | | 17.67 | 25.67 | scores | 15.73 | 21.40 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 32 | 3x | | 18.58 | 24.63 | scores | 16.34 | 20.75 | scores | config | model \| MD5 | preds | visuals |

[Code] [Usage Instructions]


Deformable ConvNets v2

Deformable ConvNets v2: More Deformable, Better Results [CVPR 2019]

Authors: Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai

Abstract The superior performance of Deformable Convolutional Networks arises from its ability to adapt to the geometric variations of objects. Through an examination of its adaptive behavior, we observe that while the spatial support for its neural features conforms more closely than regular ConvNets to object structure, this support may nevertheless extend well beyond the region of interest, causing features to be influenced by irrelevant image content. To address this problem, we present a reformulation of Deformable ConvNets that improves its ability to focus on pertinent image regions, through increased modeling power and stronger training. The modeling power is enhanced through a more comprehensive integration of deformable convolution within the network, and by introducing a modulation mechanism that expands the scope of deformation modeling. To effectively harness this enriched modeling capability, we guide network training via a proposed feature mimicking scheme that helps the network to learn features that reflect the object focus and classification power of R-CNN features. With the proposed contributions, this new version of Deformable ConvNets yields significant performance gains over the original model and produces leading results on the COCO benchmark for object detection and instance segmentation.

Results

| Backbone | Batch Size | Lr schd | MS-train | Mask AP-val | Box AP-val | Scores-val | Mask AP-test | Box AP-test | Scores-test | Config | Weights | Preds | Visuals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R-50-FPN | 16 | 1x | | 17.72 | 23.37 | scores | 15.80 | 20.73 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 16 | 3x | | 20.89 | 27.17 | scores | 18.29 | 22.88 | scores | config | model \| MD5 | preds | visuals |
| R-101-FPN | 16 | 3x | | 20.98 | 25.99 | scores | 18.74 | 22.73 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 32 | 1x | | 17.50 | 23.16 | scores | 15.53 | 20.66 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 32 | 3x | | 20.72 | 27.51 | scores | 18.40 | 22.80 | scores | config | model \| MD5 | preds | visuals |
| R-101-FPN | 32 | 3x | | 20.91 | 25.87 | scores | 18.60 | 22.83 | scores | config | model \| MD5 | preds | visuals |

[Code] [Usage Instructions]


GCNet

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond [TPAMI 2020]

Authors: Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, Han Hu

Abstract The Non-Local Network (NLNet) presents a pioneering approach for capturing long-range dependencies, via aggregating query-specific global context to each query position. However, through a rigorous empirical analysis, we have found that the global contexts modeled by non-local network are almost the same for different query positions within an image. In this paper, we take advantage of this finding to create a simplified network based on a query-independent formulation, which maintains the accuracy of NLNet but with significantly less computation. We further observe that this simplified design shares similar structure with Squeeze-Excitation Network (SENet). Hence we unify them into a three-step general framework for global context modeling. Within the general framework, we design a better instantiation, called the global context (GC) block, which is lightweight and can effectively model the global context. The lightweight property allows us to apply it for multiple layers in a backbone network to construct a global context network (GCNet), which generally outperforms both simplified NLNet and SENet on major benchmarks for various recognition tasks. The code and configurations are released at [this https URL](https://github.com/xvjiarui/GCNet).

Results

| Backbone | Lr schd | MS-train | Mask AP-val | Box AP-val | Scores-val | Mask AP-test | Box AP-test | Scores-test | Config | Weights | Preds | Visuals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R-50-FPN | 1x | | 16.12 | 22.41 | scores | 14.96 | 19.43 | scores | config | model \| MD5 | preds | visuals |
| R-50-FPN | 3x | | 20.07 | 26.34 | scores | 17.82 | 22.26 | scores | config | model \| MD5 | preds | visuals |
| R-101-FPN | 3x | | 20.77 | 26.27 | scores | 17.76 | 22.21 | scores | config | model \| MD5 | preds | visuals |

[Code] [Usage Instructions]


HRNet

Deep High-Resolution Representation Learning for Visual Recognition [CVPR 2019 / TPAMI 2020]

Authors: Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao

Abstract High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at [this https URL](https://github.com/HRNet).

Results

| Backbone | Lr schd | MS-train | Mask AP-val | Box AP-val | Scores-val | Mask AP-test | Box AP-test | Scores-test | Config | Weights | Preds | Visuals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HRNet-w18 | 1x | | 16.54 | 22.97 | scores | 14.62 | 19.81 | scores | config | model \| MD5 | preds | visuals |
| HRNet-w18 | 3x | | 21.28 | 27.53 | scores | 17.74 | 22.98 | scores | config | model \| MD5 | preds | visuals |
| HRNet-w32 | 1x | | 18.69 | 24.75 | scores | 16.35 | 21.52 | scores | config | model \| MD5 | preds | visuals |
| HRNet-w32 | 3x | | 22.45 | 28.16 | scores | 18.76 | 23.48 | scores | config | model \| MD5 | preds | visuals |
| HRNet-w40 | 1x | | 19.65 | 25.46 | scores | 16.64 | 21.98 | scores | config | model \| MD5 | preds | visuals |
| HRNet-w40 | 3x | | 22.57 | 28.17 | scores | 19.38 | 24.37 | scores | config | model \| MD5 | preds | visuals |

[Code] [Usage Instructions]


Install

a. Create a conda virtual environment and activate it.

conda create -n bdd100k-mmdet python=3.8
conda activate bdd100k-mmdet

b. Install PyTorch and torchvision following the official instructions, e.g.,

conda install pytorch torchvision -c pytorch

Note: Make sure that your compilation CUDA version and runtime CUDA version match. You can check the supported CUDA version for precompiled packages on the PyTorch website.

c. Install mmcv and mmdetection.

pip install mmcv-full
pip install mmdet

You can also refer to the official instructions.

Note that MMDetection uses its own fork of pycocotools, installed from the GitHub repo rather than PyPI, for better compatibility. If you run into issues, you may need to reinstall cocoapi with:

pip uninstall pycocotools
pip install git+https://github.com/open-mmlab/cocoapi.git#subdirectory=pycocotools
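
After the steps above, it can help to confirm which packages actually ended up in the environment. The following is a minimal sketch, not part of this repo: the helper name `installed_versions` is hypothetical, and the package list assumes the distribution names used in the install commands above (`torch` is the distribution installed by the `pytorch` conda package).

```python
from importlib import metadata

# Distribution names assumed from the install commands above.
PACKAGES = ["torch", "torchvision", "mmcv-full", "mmdet"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

This only reads package metadata, so it does not verify that the CUDA versions match; that check still has to be done against the PyTorch website as noted above.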

Usage

Model Inference

Single GPU inference:

python ./test.py ${CONFIG_FILE} --format-only --format-dir ${OUTPUT_DIR} [--cfg-options]

Multiple GPU inference:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node=4 --master_port=12000 ./test.py ${CONFIG_FILE} \
    --format-only --format-dir ${OUTPUT_DIR} [--cfg-options] \
    --launcher pytorch

Note: This will save 1K bitmasks for the validation set or 2K bitmasks for the test set to ${OUTPUT_DIR}.
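
A quick way to sanity-check an inference run is to count the saved bitmasks against the expected split size. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the bitmasks are `.png` files stored directly inside the directory you pass in (the evaluation command reads them from `${OUTPUT_DIR}/bitmasks`).

```python
from pathlib import Path

# Split sizes stated in this README for the 10K segmentation subset.
EXPECTED_COUNTS = {"val": 1000, "test": 2000}


def count_bitmasks(bitmask_dir, suffix=".png"):
    """Count files with the given suffix directly inside bitmask_dir."""
    return sum(
        1
        for path in Path(bitmask_dir).iterdir()
        if path.is_file() and path.suffix.lower() == suffix
    )


def check_bitmask_count(bitmask_dir, set_name):
    """Return (found, expected, ok) for the given split name ('val' or 'test')."""
    expected = EXPECTED_COUNTS[set_name]
    found = count_bitmasks(bitmask_dir)
    return found, expected, found == expected
```

For example, `check_bitmask_count("./output/bitmasks", "val")` returns a tuple whose last element is `True` only when exactly 1000 bitmask files are present.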

Output Evaluation

Validation Set

To evaluate instance segmentation performance on the BDD100K validation set, use the official evaluation script provided by BDD100K:

python -m bdd100k.eval.run -t ins_seg \
    -g ../data/bdd100k/labels/ins_seg/bitmasks/${SET_NAME} \
    -r ${OUTPUT_DIR}/bitmasks \
    --score-file ${OUTPUT_DIR}/score.json \
    [--out-file ${RESULTS_FILE}] [--nproc ${NUM_PROCESS}]
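
If you run the evaluation from a Python script, the command above can be assembled as an argument list so that the optional flags are only added when set. This is a minimal sketch using only the flags shown above; the function name `build_eval_command` and the `data_root` parameter are hypothetical.

```python
import shlex


def build_eval_command(set_name, output_dir, out_file=None, nproc=None,
                       data_root="../data/bdd100k"):
    """Build the bdd100k.eval.run argument list for instance segmentation."""
    cmd = [
        "python", "-m", "bdd100k.eval.run",
        "-t", "ins_seg",
        "-g", f"{data_root}/labels/ins_seg/bitmasks/{set_name}",
        "-r", f"{output_dir}/bitmasks",
        "--score-file", f"{output_dir}/score.json",
    ]
    # Optional flags are appended only when a value is provided.
    if out_file is not None:
        cmd += ["--out-file", str(out_file)]
    if nproc is not None:
        cmd += ["--nproc", str(nproc)]
    return cmd


if __name__ == "__main__":
    # Print the command as a shell-safe string for inspection.
    print(shlex.join(build_eval_command("val", "./output", nproc=4)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the `bdd100k` package is installed and the paths exist.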

Test Set

You can obtain the performance on the BDD100K test set by submitting your model predictions to our evaluation server hosted on EvalAI.

Colorful Visualization

For visualization, please use the official colorization scripts provided by BDD100K.

To colorize the bitmasks, you can run:

python -m bdd100k.label.to_color -m ins_seg -i ${OUTPUT_DIR}/bitmasks \
    -o ${COLOR_DIR} [--nproc ${NUM_PROCESS}]

Afterwards, you can use our provided visualization script to overlay the colorized bitmasks on the original images:

python vis.py -i ../data/bdd100k/images/10k/${SET_NAME} -c ${COLOR_DIR} \
    -o ${VIS_DIR} [--nproc ${NUM_PROCESS}]
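
To illustrate what overlaying a colorized bitmask on an image means, here is a small pure-Python sketch of alpha blending. It is not the implementation in `vis.py`: the function names, the `alpha` parameter, and the assumption that black `(0, 0, 0)` marks background in the colorized mask are all hypothetical choices made for this example.

```python
def overlay_pixel(image_rgb, color_rgb, alpha=0.5):
    """Blend one colorized-mask pixel over one image pixel."""
    if color_rgb == (0, 0, 0):
        # Assumed background: keep the original image pixel unchanged.
        return image_rgb
    return tuple(
        round((1 - alpha) * i + alpha * c)
        for i, c in zip(image_rgb, color_rgb)
    )


def overlay(image, color_mask, alpha=0.5):
    """Blend a colorized mask over an image; both are nested lists of RGB tuples."""
    return [
        [overlay_pixel(ip, cp, alpha) for ip, cp in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, color_mask)
    ]


if __name__ == "__main__":
    image = [[(100, 100, 100), (100, 100, 100)]]
    mask = [[(200, 0, 0), (0, 0, 0)]]
    print(overlay(image, mask))
```

A higher `alpha` makes the instance colors more prominent, while a lower value keeps more of the underlying image visible.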

Contribution

You can include your models in this repo as well! Please follow the contribution instructions.