diff --git a/configs/maskformer/README.md b/configs/maskformer/README.md
index ce1384ae77e..54110004c94 100644
--- a/configs/maskformer/README.md
+++ b/configs/maskformer/README.md
@@ -1,37 +1,18 @@
-# Per-Pixel Classification is Not All You Need for Semantic Segmentation
+# MaskFormer
+
+> [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278)
+
+
## Abstract
-Modern approaches typically formulate semantic segmentation as a per-pixel classification
-task, while instance-level segmentation is handled with an alternative mask
-classification. Our key insight: mask classification is sufficiently general to solve
-both semantic- and instance-level segmentation tasks in a unified manner using
-the exact same model, loss, and training procedure. Following this observation,
-we propose MaskFormer, a simple mask classification model which predicts a
-set of binary masks, each associated with a single global class label prediction.
-Overall, the proposed mask classification-based method simplifies the landscape
-of effective approaches to semantic and panoptic segmentation tasks and shows
-excellent empirical results. In particular, we observe that MaskFormer outperforms
-per-pixel classification baselines when the number of classes is large. Our mask
-classification-based method outperforms both current state-of-the-art semantic
-(55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
+Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
-## Citation
-
-```
-@inproceedings{cheng2021maskformer,
- title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
- author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
- journal={NeurIPS},
- year={2021}
-}
-```
-
-## Dataset
+## Introduction
MaskFormer requires the COCO and [COCO-panoptic](http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip) datasets for training and evaluation. You need to download and extract them into the COCO dataset path.
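For reference, a minimal Python sketch of the download-and-extract step (the `data/coco` destination follows mmdetection's usual layout, and the zip's internal `annotations/` folder is assumed; adjust paths to your setup):

```python
import urllib.request
import zipfile

# Hedged sketch: fetch COCO-panoptic and unpack it so the annotation
# files land under data/coco/annotations.
url = ('http://images.cocodataset.org/annotations/'
       'panoptic_annotations_trainval2017.zip')
zip_path, _ = urllib.request.urlretrieve(
    url, 'panoptic_annotations_trainval2017.zip')
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall('data/coco')
```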
The directory structure should look like this.
@@ -55,6 +36,17 @@ mmdetection
## Results and Models
-| Backbone | style | Lr schd | Mem (GB) | Inf time (fps) | PQ | SQ | RQ | PQ_th | SQ_th | RQ_th | PQ_st | SQ_st | RQ_st | Config | Download | detail |
-| :------: | :-----: | :-----: | :------: | :------------: | :-: | :-: | :-: | :---: | :---: | :---: | :---: | :---: | :---: | :---------------------------------------------------------------------------------------------------------------------: | :----------------------: | :---: |
-| R-50 | pytorch | 75e | | | | | | | | | | | | [config](https://github.com/open-mmlab/mmdetection/tree/master/configs/maskformer/maskformer_r50_mstrain_16x1_75e_coco.py) | | This version was mentioned in Table XI, in paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) |
+| Backbone | style | Lr schd | Mem (GB) | Inf time (fps) | PQ | SQ | RQ | PQ_th | SQ_th | RQ_th | PQ_st | SQ_st | RQ_st | Config | Download | detail |
+|:--------:|:-------:|:-------:|:--------:|:--------------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:--------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------:|
+| R-50 | pytorch | 75e | 16.6 | - | 46.854 | 80.617 | 57.085 | 51.089 | 81.511 | 61.853 | 40.463 | 79.269 | 49.888 | [config](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco_20220221_141956-bc2699cb.pth) \| [log](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco_20220221_141956.log.json) | This version was mentioned in Table XI of the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) |
+
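A minimal usage sketch with the config and checkpoint listed above (`init_detector` and `inference_detector` are mmdet's high-level inference APIs; local file paths are assumed):

```python
from mmdet.apis import inference_detector, init_detector

# Run panoptic inference with the released MaskFormer checkpoint.
config = 'configs/maskformer/maskformer_r50_mstrain_16x1_75e_coco.py'
checkpoint = 'maskformer_r50_mstrain_16x1_75e_coco_20220221_141956-bc2699cb.pth'
model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo/demo.jpg')
```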
+## Citation
+
+```latex
+@inproceedings{cheng2021maskformer,
+ title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
+ author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
+ booktitle={NeurIPS},
+ year={2021}
+}
+```
diff --git a/mmdet/core/bbox/assigners/mask_hungarian_assigner.py b/mmdet/core/bbox/assigners/mask_hungarian_assigner.py
index ef0f35831d6..d10e62edb84 100644
--- a/mmdet/core/bbox/assigners/mask_hungarian_assigner.py
+++ b/mmdet/core/bbox/assigners/mask_hungarian_assigner.py
@@ -29,9 +29,9 @@ class MaskHungarianAssigner(BaseAssigner):
- positive integer: positive sample, index (1-based) of assigned gt
Args:
- cls_cost (obj:`mmcv.ConfigDict` | dict): Classification cost config.
- mask_cost (obj:`mmcv.ConfigDict` | dict): Mask cost config.
- dice_cost (obj:`mmcv.ConfigDict` | dict): Dice cost config.
+ cls_cost (:obj:`mmcv.ConfigDict` | dict): Classification cost config.
+ mask_cost (:obj:`mmcv.ConfigDict` | dict): Mask cost config.
+ dice_cost (:obj:`mmcv.ConfigDict` | dict): Dice cost config.
"""
def __init__(self,
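To make the three cost terms concrete, here is a hedged config sketch for this assigner (weights are illustrative values in the spirit of MaskFormer's matching costs, not verified defaults):

```python
# Hypothetical config fragment: each cost is built from its config
# dict, and the weighted sum forms the Hungarian matching cost.
assigner = dict(
    type='MaskHungarianAssigner',
    cls_cost=dict(type='ClassificationCost', weight=1.0),
    mask_cost=dict(type='FocalLossCost', weight=20.0, binary_input=True),
    dice_cost=dict(type='DiceCost', weight=1.0, pred_act=True, eps=1.0))
```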
diff --git a/mmdet/models/dense_heads/maskformer_head.py b/mmdet/models/dense_heads/maskformer_head.py
index 3cd060e53b6..7d7a644c7e1 100644
--- a/mmdet/models/dense_heads/maskformer_head.py
+++ b/mmdet/models/dense_heads/maskformer_head.py
@@ -28,24 +28,24 @@ class MaskFormerHead(AnchorFreeHead):
num_things_classes (int): Number of 'thing' classes.
num_stuff_classes (int): Number of 'stuff' classes.
num_queries (int): Number of queries in Transformer.
- pixel_decoder (obj:`mmcv.ConfigDict`|dict): Config for pixel decoder.
- Defaults to None.
+ pixel_decoder (:obj:`mmcv.ConfigDict` | dict): Config for pixel
+ decoder. Defaults to None.
enforce_decoder_input_project (bool, optional): Whether to add a layer
to change the embed_dim of the transformer encoder in pixel decoder to
the embed_dim of transformer decoder. Defaults to False.
- transformer_decoder (obj:`mmcv.ConfigDict`|dict): Config for
+ transformer_decoder (:obj:`mmcv.ConfigDict` | dict): Config for
transformer decoder. Defaults to None.
- positional_encoding (obj:`mmcv.ConfigDict`|dict): Config for
+ positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer decoder position encoding. Defaults to None.
- loss_cls (obj:`mmcv.ConfigDict`|dict): Config of the classification
+ loss_cls (:obj:`mmcv.ConfigDict` | dict): Config of the classification
loss. Defaults to `CrossEntropyLoss`.
- loss_mask (obj:`mmcv.ConfigDict`|dict): Config of the mask loss.
+ loss_mask (:obj:`mmcv.ConfigDict` | dict): Config of the mask loss.
Defaults to `FocalLoss`.
- loss_dice (obj:`mmcv.ConfigDict`|dict): Config of the dice loss.
+ loss_dice (:obj:`mmcv.ConfigDict` | dict): Config of the dice loss.
Defaults to `DiceLoss`.
- train_cfg (obj:`mmcv.ConfigDict`|dict): Training config of Maskformer
- head.
- test_cfg (obj:`mmcv.ConfigDict`|dict): Testing config of Maskformer
+ train_cfg (:obj:`mmcv.ConfigDict` | dict): Training config of
+ Maskformer head.
+ test_cfg (:obj:`mmcv.ConfigDict` | dict): Testing config of Maskformer
head.
init_cfg (dict or list[dict], optional): Initialization config dict.
Defaults to None.
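A hedged sketch of how these arguments fit together in a head config (the class counts match COCO panoptic; the loss settings are abbreviated assumptions rather than the repo defaults):

```python
# Illustrative head config: 80 'thing' + 53 'stuff' classes, 100 queries.
panoptic_head = dict(
    type='MaskFormerHead',
    num_things_classes=80,
    num_stuff_classes=53,
    num_queries=100,
    loss_cls=dict(type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
    loss_mask=dict(type='FocalLoss', use_sigmoid=True, loss_weight=20.0),
    loss_dice=dict(type='DiceLoss', use_sigmoid=True, loss_weight=1.0))
```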
@@ -177,12 +177,11 @@ def preprocess_gt(self, gt_labels_list, gt_masks_list, gt_semantic_segs):
Returns:
tuple: a tuple containing the following targets.
-
- - labels (list[Tensor]): Ground truth class indices for all\
- images. Each with shape (n, ), n is the sum of number\
- of stuff type and number of instance in a image.
- - masks (list[Tensor]): Ground truth mask for each image, each\
- with shape (n, h, w).
+ - labels (list[Tensor]): Ground truth class indices\
+ for all images. Each with shape (n, ), where n is the sum of\
+ the number of stuff types and the number of instances in an image.
+ - masks (list[Tensor]): Ground truth mask for each\
+ image, each with shape (n, h, w).
"""
num_things_list = [self.num_things_classes] * len(gt_labels_list)
num_stuff_list = [self.num_stuff_classes] * len(gt_labels_list)
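For intuition, a hedged example of the preprocessed targets for a single image (all numbers purely illustrative):

```python
import torch

# An image with 2 instances and 3 stuff regions yields n = 5 targets;
# thing classes occupy indices 0-79 and stuff classes 80-132 here.
labels = torch.tensor([0, 17, 80, 92, 115])  # (n,) class indices
masks = torch.zeros(5, 480, 640)             # (n, h, w) binary masks
assert labels.shape[0] == masks.shape[0]
```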
@@ -213,19 +212,18 @@ def get_targets(self, cls_scores_list, mask_preds_list, gt_labels_list,
Returns:
tuple[list[Tensor]]: a tuple containing the following targets.
-
- labels_list (list[Tensor]): Labels of all images.\
Each with shape (num_queries, ).
- - label_weights_list (list[Tensor]): Label weights of all\
- images. Each with shape (num_queries, ).
- - mask_targets_list (list[Tensor]): Mask targets of all\
- images. Each with shape (num_queries, h, w).
- - mask_weights_list (list[Tensor]): Mask weights of all\
- images. Each with shape (num_queries, ).
- - num_total_pos (int): Number of positive samples in all\
- images.
- - num_total_neg (int): Number of negative samples in all\
- images.
+ - label_weights_list (list[Tensor]): Label weights\
+ of all images. Each with shape (num_queries, ).
+ - mask_targets_list (list[Tensor]): Mask targets of\
+ all images. Each with shape (num_queries, h, w).
+ - mask_weights_list (list[Tensor]): Mask weights of\
+ all images. Each with shape (num_queries, ).
+ - num_total_pos (int): Number of positive samples in\
+ all images.
+ - num_total_neg (int): Number of negative samples in\
+ all images.
"""
(labels_list, label_weights_list, mask_targets_list, mask_weights_list,
pos_inds_list,
@@ -256,7 +254,6 @@ def _get_target_single(self, cls_score, mask_pred, gt_labels, gt_masks,
Returns:
tuple[Tensor]: a tuple containing the following for one image.
-
- labels (Tensor): Labels of each image.
shape (num_queries, ).
- label_weights (Tensor): Label weights of each image.
@@ -444,13 +441,14 @@ def forward(self, feats, img_metas):
img_metas (list[dict]): List of image information.
Returns:
- all_cls_scores (Tensor): Classification scores for each\
- scale level. Each is a 4D-tensor with shape\
- (num_decoder, batch_size, num_queries, cls_out_channels).\
- Note `cls_out_channels` should includes background.
- all_mask_preds (Tensor): Mask scores for each decoder\
- layer. Each with shape (num_decoder, batch_size,\
- num_queries, h, w).
+ tuple: a tuple containing two elements.
+ - all_cls_scores (Tensor): Classification scores for each\
+ scale level. Each is a 4D-tensor with shape\
+ (num_decoder, batch_size, num_queries, cls_out_channels).\
+ Note `cls_out_channels` should include background.
+ - all_mask_preds (Tensor): Mask scores for each decoder\
+ layer. Each with shape (num_decoder, batch_size,\
+ num_queries, h, w).
"""
batch_size = len(img_metas)
input_img_h, input_img_w = img_metas[0]['batch_input_shape']
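A hedged shape illustration of the two outputs described above (decoder depth, batch size, and mask resolution are assumed values):

```python
import torch

# 6 decoder layers, batch of 2, 100 queries, 133 classes + background.
all_cls_scores = torch.randn(6, 2, 100, 134)       # cls_out_channels = 134
all_mask_preds = torch.randn(6, 2, 100, 200, 304)  # (num_decoder, b, q, h, w)
assert all_cls_scores.shape[-1] == 133 + 1         # includes background
```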
@@ -528,7 +526,7 @@ def forward_train(self,
ignored. Defaults to None.
Returns:
- losses (dict[str, Tensor]): a dictionary of loss components
+ dict[str, Tensor]: a dictionary of loss components
"""
# ignoring bboxes is not supported
assert gt_bboxes_ignore is None
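Illustratively, the returned loss dict might look like the sketch below (key names follow the per-decoder-layer convention of DETR-style heads in mmdet and are assumptions here; values are made up):

```python
import torch

# Hypothetical loss dict: final-layer terms plus auxiliary terms for
# intermediate decoder layers (prefixed 'd0.', 'd1.', ...).
losses = {
    'loss_cls': torch.tensor(0.92),
    'loss_mask': torch.tensor(2.10),
    'loss_dice': torch.tensor(1.31),
    'd0.loss_cls': torch.tensor(1.24),
}
total_loss = sum(losses.values())
```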
@@ -607,8 +605,8 @@ def simple_test(self, feats, img_metas, rescale=False):
def post_process(self, mask_cls, mask_pred):
"""Panoptic segmengation inference.
- This implementation is modified from\
- https://github.com/facebookresearch/MaskFormer
+ This implementation is modified from `MaskFormer
+ <https://github.com/facebookresearch/MaskFormer>`_.
Args:
mask_cls (Tensor): Classification outputs for an image.
@@ -617,7 +615,7 @@ def post_process(self, mask_cls, mask_pred):
shape = (num_queries, h, w).
Returns:
- panoptic_seg (Tensor): panoptic segment result of shape (h, w),\
+ Tensor: Panoptic segmentation result of shape (h, w),\
where each element in the Tensor encodes:
segment_id = _cls + instance_id * INSTANCE_OFFSET.
"""
diff --git a/mmdet/models/detectors/maskformer.py b/mmdet/models/detectors/maskformer.py
index 17c5d6c895c..73676bcdf50 100644
--- a/mmdet/models/detectors/maskformer.py
+++ b/mmdet/models/detectors/maskformer.py
@@ -7,7 +7,7 @@
class MaskFormer(SingleStageDetector):
r"""Implementation of `Per-Pixel Classification is
NOT All You Need for Semantic Segmentation
- <https://arxiv.org/abs/2107.06278>`_"""
+ <https://arxiv.org/abs/2107.06278>`_."""
def __init__(self,
backbone,
diff --git a/mmdet/models/plugins/pixel_decoder.py b/mmdet/models/plugins/pixel_decoder.py
index f69daf46f9a..d1193551ddd 100644
--- a/mmdet/models/plugins/pixel_decoder.py
+++ b/mmdet/models/plugins/pixel_decoder.py
@@ -17,17 +17,17 @@ class PixelDecoder(BaseModule):
input feature maps.
feat_channels (int): Number of channels for features.
out_channels (int): Number of channels for output.
- norm_cfg (obj:`mmcv.ConfigDict`|dict): Config for normalization.
+ norm_cfg (:obj:`mmcv.ConfigDict` | dict): Config for normalization.
Defaults to dict(type='GN', num_groups=32).
- act_cfg (obj:`mmcv.ConfigDict`|dict): Config for activation.
+ act_cfg (:obj:`mmcv.ConfigDict` | dict): Config for activation.
Defaults to dict(type='ReLU').
- encoder (obj:`mmcv.ConfigDict`|dict): Config for transorformer
+ encoder (:obj:`mmcv.ConfigDict` | dict): Config for transformer
encoder. Defaults to None.
- positional_encoding (obj:`mmcv.ConfigDict`|dict): Config for
+ positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer encoder position encoding. Defaults to
dict(type='SinePositionalEncoding', num_feats=128,
normalize=True).
- init_cfg (obj:`mmcv.ConfigDict`|dict): Initialization config dict.
+ init_cfg (:obj:`mmcv.ConfigDict` | dict): Initialization config dict.
Default: None
"""
@@ -95,10 +95,9 @@ def forward(self, feats, img_metas):
Returns:
tuple: a tuple containing the following:
-
- mask_feature (Tensor): Shape (batch_size, c, h, w).
- memory (Tensor): Output of last stage of backbone.\
- Shape (batch_size, c, h, w).
+ Shape (batch_size, c, h, w).
"""
y = self.last_feat_conv(feats[-1])
for i in range(self.num_inputs - 2, -1, -1):
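The loop entered above walks the feature pyramid top-down; a self-contained sketch of that fusion pattern (the `lateral_convs`/`output_convs` arguments are hypothetical stand-ins for the decoder's per-level convs):

```python
import torch.nn.functional as F

def top_down_fuse(feats, lateral_convs, output_convs, y):
    # Walk from the second-coarsest level down to the finest one,
    # adding each upsampled map to the lateral projection below it.
    for i in range(len(feats) - 2, -1, -1):
        cur = lateral_convs[i](feats[i])
        y = cur + F.interpolate(y, size=cur.shape[-2:], mode='nearest')
        y = output_convs[i](y)
    return y
```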
@@ -122,17 +121,17 @@ class TransformerEncoderPixelDecoder(PixelDecoder):
input feature maps.
feat_channels (int): Number of channels for features.
out_channels (int): Number of channels for output.
- norm_cfg (obj:`mmcv.ConfigDict`|dict): Config for normalization.
+ norm_cfg (:obj:`mmcv.ConfigDict` | dict): Config for normalization.
Defaults to dict(type='GN', num_groups=32).
- act_cfg (obj:`mmcv.ConfigDict`|dict): Config for activation.
+ act_cfg (:obj:`mmcv.ConfigDict` | dict): Config for activation.
Defaults to dict(type='ReLU').
- encoder (obj:`mmcv.ConfigDict`|dict): Config for transorformer
+ encoder (:obj:`mmcv.ConfigDict` | dict): Config for transformer
encoder. Defaults to None.
- positional_encoding (obj:`mmcv.ConfigDict`|dict): Config for
+ positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer encoder position encoding. Defaults to
dict(type='SinePositionalEncoding', num_feats=128,
normalize=True).
- init_cfg (obj:`mmcv.ConfigDict`|dict): Initialization config dict.
+ init_cfg (:obj:`mmcv.ConfigDict` | dict): Initialization config dict.
Default: None
"""
@@ -200,7 +199,6 @@ def forward(self, feats, img_metas):
Returns:
tuple: a tuple containing the following:
-
- mask_feature (Tensor): shape (batch_size, c, h, w).
- memory (Tensor): shape (batch_size, c, h, w).
"""
diff --git a/tools/deployment/test.py b/tools/deployment/test.py
index afbad176841..2daf8866e58 100644
--- a/tools/deployment/test.py
+++ b/tools/deployment/test.py
@@ -141,7 +141,7 @@ def main():
if __name__ == '__main__':
- # main()
+ main()
# Following strings of text style are from colorama package
bright_style, reset_style = '\x1b[1m', '\x1b[0m'