The semantic segmentation task involves predicting a segmentation mask for each image indicating a class label for every pixel.
The BDD100K dataset contains fine-grained semantic segmentation annotations for 10K images (7K/1K/2K for train/val/test). Each annotation is a segmentation mask containing labels for 19 diverse object classes. For details about downloading the data and the annotation format for this task, see the official documentation.
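The tables below report mIoU (mean intersection-over-union). As a rough illustration of the metric, the following sketch computes it for one predicted mask against one ground-truth mask. It assumes masks are 2-D grids of integer class IDs in the range 0-18 and that the value 255 marks pixels without a valid label; the ignore value and the function name `mean_iou` are assumptions made for this example, so check the official documentation for the actual annotation format. Benchmark scores are normally accumulated over the whole split rather than per image.

```python
def mean_iou(pred, gt, num_classes=19, ignore_label=255):
    """Mean IoU over the classes that appear in the prediction or ground truth."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g == ignore_label:
                continue  # assumed convention: skip unlabeled pixels
            if p == g:
                intersection[g] += 1
                union[g] += 1
            else:
                # A mismatch enlarges the union of both classes involved.
                union[p] += 1
                union[g] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2 x 3 masks with two classes and one ignored pixel.
gt = [[0, 0, 1],
      [1, 1, 255]]
pred = [[0, 1, 1],
        [1, 1, 0]]
print(mean_iou(pred, gt))  # class 0: 1/2, class 1: 3/4, mean: 0.625
```

The scores in the tables are this quantity expressed as a percentage.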
For training the models listed below, we follow the common settings used by MMSegmentation (details here), unless otherwise stated. All models are trained on either 8 GeForce RTX 2080 Ti GPUs or 8 TITAN RTX GPUs with a total batch size of 16 (2 images per GPU × 8 GPUs).
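To relate the iteration counts in the tables (40K and 80K) to passes over the data, the arithmetic can be sketched as follows. It uses only the numbers stated above (7K training images, 8 GPUs, 2 images per GPU); the helper name `approx_epochs` is made up for this example.

```python
TRAIN_IMAGES = 7_000                 # train split size stated above
GPUS = 8
IMAGES_PER_GPU = 2
BATCH_SIZE = GPUS * IMAGES_PER_GPU   # 16 images per iteration


def approx_epochs(iterations, batch_size=BATCH_SIZE, dataset_size=TRAIN_IMAGES):
    """Approximate number of passes over the training set."""
    return iterations * batch_size / dataset_size


for iters in (40_000, 80_000):
    print(f"{iters} iterations ~ {approx_epochs(iters):.1f} epochs")
```

Under these assumptions the 40K schedule corresponds to roughly 91 epochs and the 80K schedule to roughly 183.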
Fully Convolutional Networks for Semantic Segmentation [CVPR 2015 / TPAMI 2017]
Authors: Jonathan Long, Evan Shelhamer, Trevor Darrell
Abstract
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 769 * 769 | 59.87 | scores | 52.59 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 40K | 512 * 1024 | 59.80 | scores | 53.06 | scores | config | model, MD5 | preds | visuals |
Pyramid Scene Parsing Network [CVPR 2017]
Authors: Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Jiaya Jia
Abstract
Scene parsing is challenging for unrestricted open vocabulary and diverse scenes. In this paper, we exploit the capability of global context information by different-region-based context aggregation through our pyramid pooling module together with the proposed pyramid scene parsing network (PSPNet). Our global prior representation is effective to produce good quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction tasks. The proposed approach achieves state-of-the-art performance on various datasets. It came first in ImageNet scene parsing challenge 2016, PASCAL VOC 2012 benchmark and Cityscapes benchmark. A single PSPNet yields new record of mIoU accuracy 85.4% on PASCAL VOC 2012 and accuracy 80.2% on Cityscapes.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 512 * 1024 | 61.88 | scores | 54.50 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 80K | 512 * 1024 | 62.03 | scores | 54.99 | scores | config | model, MD5 | preds, masks | visuals |
R-101-D8 | 80K | 512 * 1024 | 63.62 | scores | 56.32 | scores | config | model, MD5 | preds, masks | visuals |
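The "different-region-based context aggregation" mentioned in the PSPNet abstract can be illustrated with a small, dependency-free sketch. It average-pools a single-channel feature map into grids of several sizes, stretches each grid back to the input resolution, and stacks the results next to the original map. The bin sizes (1, 2, 3, 6) follow the PSPNet paper; everything else (function names, nearest-neighbour upsampling, a single channel, no convolutions) is a simplification assumed for this example and not the code behind the models above.

```python
def avg_pool_to_grid(feat, bins):
    """Average-pool a 2-D map into a bins x bins grid (sizes must divide evenly)."""
    h, w = len(feat), len(feat[0])
    bh, bw = h // bins, w // bins
    return [[sum(feat[y][x]
                 for y in range(i * bh, (i + 1) * bh)
                 for x in range(j * bw, (j + 1) * bw)) / (bh * bw)
             for j in range(bins)]
            for i in range(bins)]


def upsample_nearest(grid, h, w):
    """Stretch a coarse grid back to h x w by repeating its cells."""
    bins = len(grid)
    return [[grid[y * bins // h][x * bins // w] for x in range(w)]
            for y in range(h)]


def pyramid_pool(feat, bin_sizes=(1, 2, 3, 6)):
    """Return the input map followed by one context map per pyramid level."""
    h, w = len(feat), len(feat[0])
    levels = [upsample_nearest(avg_pool_to_grid(feat, b), h, w)
              for b in bin_sizes]
    return [feat] + levels  # conceptually concatenated along the channel axis


feat = [[float(y * 6 + x) for x in range(6)] for y in range(6)]
channels = pyramid_pool(feat)
print(len(channels))       # 5 maps: the input plus four context levels
print(channels[1][0][0])   # 17.5, the global average of the input map
```

The coarsest level gives every pixel the same global summary, while finer levels summarize progressively smaller regions.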
Rethinking Atrous Convolution for Semantic Image Segmentation [CVPR 2017]
Authors: Liang-Chieh Chen, George Papandreou, Florian Schroff, Hartwig Adam
Abstract
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 769 * 769 | 61.62 | scores | 55.17 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 40K | 512 * 1024 | 62.16 | scores | 55.20 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 80K | 512 * 1024 | 62.55 | scores | 55.19 | scores | config | model, MD5 | preds, masks | visuals |
R-101-D8 | 80K | 512 * 1024 | 63.23 | scores | 56.24 | scores | config | model, MD5 | preds, masks | visuals |
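To make the idea of adjusting a filter's field-of-view concrete, here is a minimal 1-D sketch of atrous (dilated) convolution. The same three-tap kernel covers 3, 5, or 9 input samples depending on the rate, without adding parameters. The function name and the toy signal are illustrative assumptions; as is usual in deep learning code, the operation is a cross-correlation without kernel flipping.

```python
def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D convolution whose taps are spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # effective field of view
    return [sum(k * signal[start + i * rate] for i, k in enumerate(kernel))
            for start in range(len(signal) - span + 1)]


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1]  # difference between the first and last tap
for rate in (1, 2, 4):
    print(rate, atrous_conv1d(signal, kernel, rate))
# rate 1 -> seven outputs of -2, rate 2 -> five outputs of -4, rate 4 -> [-8]
```

Running several rates in parallel and combining the outputs is the basic idea behind the Atrous Spatial Pyramid Pooling module named in the abstract.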
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation [ECCV 2018]
Authors: Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam
Abstract
Spatial pyramid pooling module or encode-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on PASCAL VOC 2012 and Cityscapes datasets, achieving the test set performance of 89.0% and 82.1% without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow at [this https URL](https://github.com/tensorflow/models/tree/master/research/deeplab).
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 769 * 769 | 61.22 | scores | 55.61 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 40K | 512 * 1024 | 62.51 | scores | 55.14 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 80K | 512 * 1024 | 63.96 | scores | 56.08 | scores | config | model, MD5 | preds, masks | visuals |
R-101-D8 | 80K | 512 * 1024 | 64.49 | scores | 57.00 | scores | config | model, MD5 | preds, masks | visuals |
Unified Perceptual Parsing for Scene Understanding [ECCV 2018]
Authors: Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, Jian Sun
Abstract
Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect objects inside, while also identifying the textures and surfaces of the objects along with their different compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires the machine vision systems to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and a training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes. Models are available at [this https URL](https://github.com/CSAILVision/unifiedparsing).
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 769 * 769 | 60.01 | scores | 54.39 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 40K | 512 * 1024 | 61.12 | scores | 53.97 | scores | config | model, MD5 | preds | visuals |
PSANet: Point-wise Spatial Attention Network for Scene Parsing [ECCV 2018]
Authors: Hengshuang Zhao*, Yi Zhang*, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, Jiaya Jia
Abstract
We notice information flow in convolutional neural networks is restricted inside local neighborhood regions due to the physical design of convolutional filters, which limits the overall understanding of complex scenes. In this paper, we propose the point-wise spatial attention network (PSANet) to relax the local neighborhood constraint. Each position on the feature map is connected to all the other ones through a self-adaptively learned attention mask. Moreover, information propagation in bi-direction for scene parsing is enabled. Information at other positions can be collected to help the prediction of the current position and vice versa, information at the current position can be distributed to assist the prediction of other ones. Our proposed approach achieves top performance on various competitive scene parsing datasets, including ADE20K, PASCAL VOC 2012 and Cityscapes, demonstrating its effectiveness and generality.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 512 * 1024 | 61.41 | scores | 54.56 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 80K | 512 * 1024 | 61.99 | scores | 54.59 | scores | config | model, MD5 | preds, masks | visuals |
Non-local Neural Networks [CVPR 2018]
Authors: Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He
Abstract
Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code is available at [this https URL](https://github.com/facebookresearch/video-nonlocal-net).
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 512 * 1024 | 61.38 | scores | 54.11 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 80K | 512 * 1024 | 60.98 | scores | 55.00 | scores | config | model, MD5 | preds | visuals |
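The phrase "the response at a position as a weighted sum of the features at all positions" can be sketched in a few lines. The example below flattens a feature map into a list of per-position vectors and uses softmax-normalized dot products as weights. This is one simple choice of weighting assumed for illustration; the block described in the paper additionally uses learned embeddings and a residual connection, which are omitted here, and the function name `non_local` is hypothetical.

```python
import math


def non_local(features):
    """features: list of N feature vectors, one per spatial position."""
    out = []
    for query in features:
        # Similarity of this position to every position (dot product).
        scores = [sum(q * k for q, k in zip(query, key)) for key in features]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        total = sum(weights)
        # Response: weighted sum of the features at all positions.
        out.append([sum(w * f[d] for w, f in zip(weights, features)) / total
                    for d in range(len(query))])
    return out


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in non_local(features):
    print([round(v, 3) for v in row])
```

Because every position attends to every other position, the cost grows quadratically with the number of positions, which motivates the more economical attention variants later in this list.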
Panoptic Feature Pyramid Networks [CVPR 2019]
Authors: Alexander Kirillov, Ross Girshick, Kaiming He, Piotr Dollár
Abstract
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
Backbone | GN | Deform. Conv. | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|---|---|
R-50-FPN | | | 40K | 512 * 1024 | 59.24 | scores | 52.89 | scores | config | model, MD5 | preds | visuals |
R-50-FPN | | | 80K | 512 * 1024 | 60.36 | scores | 52.92 | scores | config | model, MD5 | preds | visuals |
R-50-FPN | ✓ | | 40K | 512 * 1024 | 59.44 | scores | 53.42 | scores | config | model, MD5 | preds | visuals |
R-50-FPN | ✓ | | 80K | 512 * 1024 | 60.21 | scores | 53.00 | scores | config | model, MD5 | preds | visuals |
R-50-FPN | ✓ | ✓ | 40K | 512 * 1024 | 61.53 | scores | 54.31 | scores | config | model, MD5 | preds | visuals |
R-50-FPN | ✓ | ✓ | 80K | 512 * 1024 | 60.55 | scores | 53.91 | scores | config | model, MD5 | preds | visuals |
Expectation-Maximization Attention Networks for Semantic Segmentation [ICCV 2019]
Authors: Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, Hong Liu
Abstract
Self-attention mechanism has been widely used for various tasks. It is designed to compute the representation of each position by a weighted sum of the features at all positions. Thus, it can capture long-range relations for computer vision tasks. However, it is computationally consuming. Since the attention maps are computed w.r.t all other positions. In this paper, we formulate the attention mechanism into an expectation-maximization manner and iteratively estimate a much more compact set of bases upon which the attention maps are computed. By a weighted summation upon these bases, the resulting representation is low-rank and deprecates noisy information from the input. The proposed Expectation-Maximization Attention (EMA) module is robust to the variance of input and is also friendly in memory and computation. Moreover, we set up the bases maintenance and normalization methods to stabilize its training procedure. We conduct extensive experiments on popular semantic segmentation benchmarks including PASCAL VOC, PASCAL Context and COCO Stuff, on which we set new records.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 769 * 769 | 62.05 | scores | 54.52 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 80K | 769 * 769 | 62.30 | scores | 55.46 | scores | config | model, MD5 | preds, masks | visuals |
Dynamic Multi-scale Filters for Semantic Segmentation [ICCV 2019]
Authors: Junjun He, Zhongying Deng, Yu Qiao
Abstract
Multi-scale representation provides an effective way to address scale variation of objects and stuff in semantic segmentation. Previous works construct multi-scale representation by utilizing different filter sizes, expanding filter sizes with dilated filters or pooling grids, and the parameters of these filters are fixed after training. These methods often suffer from heavy computational cost or have more parameters, and are not adaptive to the input image during inference. To address these problems, this paper proposes a Dynamic Multi-scale Network (DMNet) to adaptively capture multi-scale contents for predicting pixel-level semantic labels. DMNet is composed of multiple Dynamic Convolutional Modules (DCMs) arranged in parallel, each of which exploits context-aware filters to estimate semantic representation for a specific scale. The outputs of multiple DCMs are further integrated for final segmentation. We conduct extensive experiments to evaluate our DMNet on three challenging semantic segmentation and scene parsing datasets, PASCAL VOC 2012, Pascal-Context, and ADE20K. DMNet achieves a new record 84.4% mIoU on PASCAL VOC 2012 test set without MS COCO pre-trained and post-processing, and also obtains state-of-the-art performance on Pascal-Context and ADE20K.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 769 * 769 | 62.12 | scores | 55.15 | scores | config | model, MD5 | preds, masks | visuals |
Adaptive Pyramid Context Network for Semantic Segmentation [CVPR 2019]
Authors: Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, Yu Qiao
Abstract
Recent studies witnessed that context features can significantly improve the performance of deep semantic segmentation networks. Current context based segmentation methods differ with each other in how to construct context features and perform differently in practice. This paper firstly introduces three desirable properties of context features in segmentation task. Specially, we find that Global-guided Local Affinity (GLA) can play a vital role in constructing effective context features, while this property has been largely ignored in previous works. Based on this analysis, this paper proposes Adaptive Pyramid Context Network (APCNet) for semantic segmentation. APCNet adaptively constructs multi-scale contextual representations with multiple well-designed Adaptive Context Modules (ACMs). Specifically, each ACM leverages a global image representation as a guidance to estimate the local affinity coefficients for each sub-region, and then calculates a context vector with these affinities. We empirically evaluate our APCNet on three semantic segmentation and scene parsing datasets, including PASCAL VOC 2012, Pascal-Context, and ADE20K dataset. Experimental results show that APCNet achieves state-of-the-art performance on all three benchmarks, and obtains a new record 84.2% on PASCAL VOC 2012 test set without MS COCO pre-trained and any post-processing.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 512 * 1024 | 60.94 | scores | 54.08 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 80K | 512 * 1024 | 62.30 | scores | 54.82 | scores | config | model, MD5 | preds, masks | visuals |
Deep High-Resolution Representation Learning for Visual Recognition [CVPR 2019 / TPAMI 2020]
Authors: Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao
Abstract
High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at [this https URL](https://github.com/HRNet).
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
HRNet48 | 40K | 512 * 1024 | 63.37 | scores | 56.01 | scores | config | model, MD5 | preds, masks | visuals |
HRNet48 | 80K | 512 * 1024 | 63.93 | scores | 55.89 | scores | config | model, MD5 | preds, masks | visuals |
CCNet: Criss-Cross Attention for Semantic Segmentation [ICCV 2019 / TPAMI 2020]
Authors: Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, Thomas S. Huang
Abstract
Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture the full-image dependencies. Besides, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet is with the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11x less GPU memory usage. 2) High computational efficiency. The recurrent criss-cross attention significantly reduces FLOPs by about 85% of the non-local block. 3) The state-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes, ADE20K, human parsing benchmark LIP, instance segmentation benchmark COCO, video segmentation benchmark CamVid. In particular, our CCNet achieves the mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are the new state-of-the-art results. The source codes are available at [this https URL](https://github.com/speedinghzl/CCNet).
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 512 * 1024 | 62.11 | scores | 54.61 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 80K | 512 * 1024 | 62.52 | scores | 55.10 | scores | config | model, MD5 | preds, masks | visuals |
R-101-D8 | 80K | 512 * 1024 | 60.44 | scores | 55.93 | scores | config | model, MD5 | preds, masks | visuals |
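The two claims in the CCNet abstract, that each pixel attends only to its criss-cross path and that a second pass reaches the full image, can be checked with a small counting sketch. It only enumerates which positions are connected; it does not compute attention and does not reproduce the memory or FLOP figures quoted above. The helper names and the 97 x 97 feature-map size are hypothetical.

```python
def criss_cross_positions(y, x, height, width):
    """Positions sharing a row or a column with (y, x), including itself."""
    row = [(y, j) for j in range(width)]
    col = [(i, x) for i in range(height) if i != y]
    return row + col


def reachable_in_two_steps(y, x, height, width):
    """Positions reached after applying the criss-cross pattern twice."""
    first = criss_cross_positions(y, x, height, width)
    return {p for (i, j) in first
            for p in criss_cross_positions(i, j, height, width)}


H, W = 97, 97  # hypothetical feature-map size
per_pixel_cc = len(criss_cross_positions(0, 0, H, W))  # H + W - 1 connections
per_pixel_nl = H * W                                   # non-local: every position
print(per_pixel_cc, per_pixel_nl)

# On a small 4 x 5 grid, two passes already cover all 20 positions.
print(len(reachable_in_two_steps(1, 2, 4, 5)))
```

Any target (c, d) is reached from (a, b) through the intermediate position (a, d), which is why one recurrence is enough to connect every pair of pixels.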
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond [TPAMI 2020]
Authors: Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, Han Hu
Abstract
The Non-Local Network (NLNet) presents a pioneering approach for capturing long-range dependencies, via aggregating query-specific global context to each query position. However, through a rigorous empirical analysis, we have found that the global contexts modeled by non-local network are almost the same for different query positions within an image. In this paper, we take advantage of this finding to create a simplified network based on a query-independent formulation, which maintains the accuracy of NLNet but with significantly less computation. We further observe that this simplified design shares similar structure with Squeeze-Excitation Network (SENet). Hence we unify them into a three-step general framework for global context modeling. Within the general framework, we design a better instantiation, called the global context (GC) block, which is lightweight and can effectively model the global context. The lightweight property allows us to apply it for multiple layers in a backbone network to construct a global context network (GCNet), which generally outperforms both simplified NLNet and SENet on major benchmarks for various recognition tasks. The code and configurations are released at [this https URL](https://github.com/xvjiarui/GCNet).
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 769 * 769 | 61.20 | scores | 53.96 | scores | config | model, MD5 | preds, masks | visuals |
Disentangled Non-Local Neural Networks [ECCV 2020]
Authors: Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, Han Hu
Abstract
The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network. This paper first studies the non-local block in depth, where we find that its attention computation can be split into two terms, a whitened pairwise term accounting for the relationship between two pixels and a unary term representing the saliency of every pixel. We also observe that the two terms trained alone tend to model different visual clues, e.g. the whitened pairwise term learns within-region relationships while the unary term learns salient boundaries. However, the two terms are tightly coupled in the non-local block, which hinders the learning of each. Based on these findings, we present the disentangled non-local block, where the two terms are decoupled to facilitate learning for both terms. We demonstrate the effectiveness of the decoupled design on various tasks, such as semantic segmentation on Cityscapes, ADE20K and PASCAL Context, object detection on COCO, and action recognition on Kinetics.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-D8 | 40K | 512 * 1024 | 61.93 | scores | 54.35 | scores | config | model, MD5 | preds, masks | visuals |
R-50-D8 | 80K | 512 * 1024 | 62.64 | scores | 54.72 | scores | config | model, MD5 | preds, masks | visuals |
R-101-D8 | 80K | 512 * 1024 | 59.54 | scores | 56.31 | scores | config | model, MD5 | preds, masks | visuals |
PointRend: Image Segmentation as Rendering [CVPR 2020]
Authors: Alexander Kirillov, Yuxin Wu, Kaiming He, Ross Girshick
Abstract
We present a new method for efficient high-quality image segmentation of objects and scenes. By analogizing classical computer graphics methods for efficient rendering with over- and undersampling challenges faced in pixel labeling tasks, we develop a unique perspective of image segmentation as a rendering problem. From this vantage, we present the PointRend (Point-based Rendering) neural network module: a module that performs point-based segmentation predictions at adaptively selected locations based on an iterative subdivision algorithm. PointRend can be flexibly applied to both instance and semantic segmentation tasks by building on top of existing state-of-the-art models. While many concrete implementations of the general idea are possible, we show that a simple design already achieves excellent results. Qualitatively, PointRend outputs crisp object boundaries in regions that are over-smoothed by previous methods. Quantitatively, PointRend yields significant gains on COCO and Cityscapes, for both instance and semantic segmentation. PointRend's efficiency enables output resolutions that are otherwise impractical in terms of memory or computation compared to existing approaches. Code has been made available at [this https URL](https://github.com/facebookresearch/detectron2/tree/main/projects/PointRend).
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
R-50-FPN | 40K | 512 * 1024 | 61.80 | scores | 53.61 | scores | config | model, MD5 | preds, masks | visuals |
R-50-FPN | 80K | 512 * 1024 | 61.02 | scores | 52.53 | scores | config | model, MD5 | preds | visuals |
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [ICLR 2021]
Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Abstract
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
ViT-B | 80K | 512 * 1024 | 62.11 | scores | 53.98 | scores | config | model, MD5 | preds | visuals |
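As an illustration of "a pure transformer applied directly to sequences of image patches", the sketch below cuts an image into non-overlapping 16 x 16 patches and flattens each one into a token. The patch size of 16 is taken from the paper title and is an assumption about the ViT-B model listed above; the single-channel image and the helper name `patchify` are also simplifications for this example. A real model would follow this step with a learned linear projection and position embeddings.

```python
def patchify(image, patch=16):
    """Split an H x W image (list of rows) into a row-major list of flattened patches."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [[image[y][x]
             for y in range(top, top + patch)
             for x in range(left, left + patch)]
            for top in range(0, h, patch)
            for left in range(0, w, patch)]


# Single-channel stand-in for a 512 * 1024 input crop.
image = [[0] * 1024 for _ in range(512)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 2048 patches of 256 values each
```

With three color channels each token would hold 768 values instead of 256; the sequence length of 2048 is unchanged.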
Training data-efficient image transformers & distillation through attention [ICML 2021]
Authors: Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou
Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
DeiT-S | 80K | 512 * 1024 | 61.52 | scores | 53.44 | scores | config | model, MD5 | preds | visuals |
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [ICCV 2021]
Authors: Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo
Abstract
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at [this https URL](https://github.com/microsoft/Swin-Transformer).

Backbone | FP16 | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|---|
Swin-T | | 40K | 512 * 1024 | 62.00 | scores | 54.33 | scores | config | model \| MD5 | preds | visuals |
Swin-T | | 80K | 512 * 1024 | 63.10 | scores | 54.81 | scores | config | model \| MD5 | preds | visuals |
Swin-S | | 80K | 512 * 1024 | 65.76 | scores | 58.00 | scores | config | model \| MD5 | preds | visuals |
Swin-S | ✓ | 80K | 512 * 1024 | 65.51 | scores | 57.67 | scores | config | model \| MD5 | preds | visuals |
Swin-B | ✓ | 80K | 512 * 1024 | 65.98 | scores | 58.33 | scores | config | model \| MD5 | preds | visuals |
Vision Transformers for Dense Prediction [ICCV 2021]
Authors: René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
Abstract
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at [this https URL](https://github.com/isl-org/DPT).

Backbone | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|
ViT-B | 80K | 512 * 1024 | 63.53 | scores | 54.66 | scores | config | model \| MD5 | preds | visuals |
A ConvNet for the 2020s [CVPR 2022]
Authors: Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie
Abstract
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

Backbone | FP16 | Iters | Input | mIoU-val | Scores-val | mIoU-test | Scores-test | Config | Weights | Preds | Visuals |
---|---|---|---|---|---|---|---|---|---|---|---|
ConvNeXt-T | ✓ | 40K | 512 * 1024 | 63.21 | scores | 56.09 | scores | config | model \| MD5 | preds | visuals |
ConvNeXt-T | ✓ | 80K | 512 * 1024 | 64.36 | scores | 57.02 | scores | config | model \| MD5 | preds | visuals |
ConvNeXt-S | ✓ | 80K | 512 * 1024 | 66.13 | scores | 58.15 | scores | config | model \| MD5 | preds | visuals |
ConvNeXt-B | ✓ | 80K | 512 * 1024 | 67.26 | scores | 59.82 | scores | config | model \| MD5 | preds | visuals |
a. Create a conda virtual environment and activate it.
conda create -n bdd100k-mmseg python=3.8
conda activate bdd100k-mmseg
b. Install PyTorch and torchvision following the official instructions, e.g.,
conda install pytorch torchvision -c pytorch
Note: Make sure that your compilation CUDA version and runtime CUDA version match. You can check the supported CUDA version for precompiled packages on the PyTorch website.
c. Install mmcv and mmsegmentation.
pip install mmcv-full
pip install mmsegmentation
You can also refer to the official installation instructions.
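After steps a to c, it can help to confirm that the packages actually landed in the active environment before running inference. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of this repo, and it only reports versions without checking CUDA compatibility.

```python
from importlib import metadata

# Distribution names as installed by the conda/pip commands above
# (the conda "pytorch" package registers itself as "torch").
REQUIRED = ("torch", "torchvision", "mmcv-full", "mmsegmentation")


def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any entry reported as `NOT INSTALLED` points to the step that needs to be repeated.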
Single GPU inference:
python ./test.py ${CONFIG_FILE} --format-only --format-dir ${OUTPUT_DIR} [--options]
Multiple GPU inference:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node=4 --master_port=12000 ./test.py ${CONFIG_FILE} \
--format-only --format-dir ${OUTPUT_DIR} [--options] \
--launcher pytorch
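In the multi-GPU command, the number of launched processes should match the number of visible GPUs. The following sketch builds the same command from a list of GPU ids so the two values cannot drift apart. The function names `build_launch_command` and `as_shell` are hypothetical helpers, not part of this repo; every flag is taken from the command above.

```python
import shlex


def build_launch_command(config_file, output_dir, gpu_ids, master_port=12000):
    """Return (env, argv) for the multi-GPU inference command shown above."""
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in gpu_ids)}
    argv = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(gpu_ids)}",  # one process per visible GPU
        f"--master_port={master_port}",
        "./test.py", config_file,
        "--format-only", "--format-dir", output_dir,
        "--launcher", "pytorch",
    ]
    return env, argv


def as_shell(env, argv):
    """Render the command as a single shell line for copy-pasting."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {' '.join(shlex.quote(arg) for arg in argv)}"


if __name__ == "__main__":
    env, argv = build_launch_command("path/to/config.py", "path/to/output", [0, 1, 2, 3])
    print(as_shell(env, argv))
```

To run the command directly from Python instead of printing it, `subprocess.run(argv, env={**os.environ, **env}, check=True)` is one option.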
To evaluate the semantic segmentation performance on the BDD100K validation set, you can use the official evaluation scripts provided by BDD100K:
python -m bdd100k.eval.run -t sem_seg \
-g ../data/bdd100k/labels/sem_seg_${SET_NAME}.json \
-r ${OUTPUT_DIR}/sem_seg.json \
[--out-file ${RESULTS_FILE}] [--nproc ${NUM_PROCESS}]
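For intuition about the mIoU numbers in the tables, here is a minimal pure-Python sketch of the metric: per-class intersection over union, averaged over the classes that appear. It is not the official `bdd100k.eval` implementation. The ignore label of 255 and the handling of absent classes are assumptions, and the tables presumably report the value multiplied by 100.

```python
def mean_iou(gt, pred, num_classes=19, ignore_label=255):
    """Compute per-class IoU and mIoU from flat sequences of class ids.

    gt and pred are equal-length sequences of integer labels. Pixels whose
    ground truth equals ignore_label are skipped (assumed convention).
    """
    intersection = [0] * num_classes
    gt_count = [0] * num_classes
    pred_count = [0] * num_classes
    for g, p in zip(gt, pred):
        if g == ignore_label:
            continue
        gt_count[g] += 1
        if 0 <= p < num_classes:
            pred_count[p] += 1
            if p == g:
                intersection[g] += 1
    ious = []
    for c in range(num_classes):
        union = gt_count[c] + pred_count[c] - intersection[c]
        # A class absent from both gt and pred has no defined IoU.
        ious.append(intersection[c] / union if union else None)
    valid = [iou for iou in ious if iou is not None]
    miou = sum(valid) / len(valid) if valid else 0.0
    return ious, miou


if __name__ == "__main__":
    ious, miou = mean_iou([0, 0, 1, 1, 255], [0, 1, 1, 1, 0], num_classes=2)
    print(ious, miou)
```

Accumulating the three counters over all images before dividing gives a dataset-level score rather than an average of per-image scores.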
You can obtain the performance on the BDD100K test set by submitting your model predictions to our evaluation server hosted on EvalAI.
For visualization, you can use the visualization tool provided by Scalabel.
Below is an example:
import os

import numpy as np
from PIL import Image

from bdd100k.common.utils import load_bdd100k_config
from scalabel.label.io import load
from scalabel.vis.label import LabelViewer

# $OUTPUT_DIR, $IMG_DIR, and $VIS_DIR are placeholders: replace them with
# real paths, since Python does not expand shell variables in strings.

# load prediction frames
frames = load('$OUTPUT_DIR/sem_seg.json').frames

viewer = LabelViewer(label_cfg=load_bdd100k_config('sem_seg'))
for frame in frames:
    # draw the predicted labels on the image and save the result
    img = np.array(Image.open(os.path.join('$IMG_DIR', frame.name)))
    viewer.draw(img, frame)
    viewer.save(os.path.join('$VIS_DIR', frame.name))
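The `$OUTPUT_DIR`, `$IMG_DIR`, and `$VIS_DIR` strings in the example are placeholders that Python will not expand on its own. One way to fill them in from environment variables is sketched below with the standard library; the helper name `resolve` is hypothetical.

```python
import os
from string import Template


def resolve(path_template, **overrides):
    """Substitute $NAME placeholders from keyword overrides or the environment.

    Raises KeyError if a placeholder has no value, so a forgotten variable
    fails loudly instead of producing a path that contains a literal '$'.
    """
    values = {**os.environ, **overrides}
    return Template(path_template).substitute(values)


if __name__ == "__main__":
    print(resolve("$OUTPUT_DIR/sem_seg.json", OUTPUT_DIR="/tmp/preds"))
```

`os.path.expandvars` would also work, but it silently leaves unknown variables unchanged, which is why this sketch uses `Template.substitute` instead.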
You can include your models in this repo as well! Please follow the contribution instructions.