update MaskFormer readme and docs (#7241)
* update docs for maskformer

* update readme

* update readme format

* update link

* update json link

* update format of ConfigDict

* update format of function returns

* uncomment main in deployment/test.py
chhluo authored Feb 24, 2022
1 parent d3703c6 commit cda1e21
Showing 6 changed files with 74 additions and 86 deletions.
50 changes: 21 additions & 29 deletions configs/maskformer/README.md
@@ -1,37 +1,18 @@
# Per-Pixel Classification is Not All You Need for Semantic Segmentation
# MaskFormer

> [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278)
<!-- [ALGORITHM] -->

## Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification
task, while instance-level segmentation is handled with an alternative mask
classification. Our key insight: mask classification is sufficiently general to solve
both semantic- and instance-level segmentation tasks in a unified manner using
the exact same model, loss, and training procedure. Following this observation,
we propose MaskFormer, a simple mask classification model which predicts a
set of binary masks, each associated with a single global class label prediction.
Overall, the proposed mask classification-based method simplifies the landscape
of effective approaches to semantic and panoptic segmentation tasks and shows
excellent empirical results. In particular, we observe that MaskFormer outperforms
per-pixel classification baselines when the number of classes is large. Our mask
classification-based method outperforms both current state-of-the-art semantic
(55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.
Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

<div align=center>
<img src="https://camo.githubusercontent.com/29fb22298d506ce176caad3006a7b05ef2603ca12cece6c788b7e73c046e8bc9/68747470733a2f2f626f77656e63303232312e6769746875622e696f2f696d616765732f6d61736b666f726d65722e706e67" height="300"/>
</div>
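
The mask-classification idea described in the abstract can be sketched in a few lines. This is an illustrative toy example, not the mmdet implementation; the shapes, class counts, and "no object" handling are assumptions. Each query predicts a class distribution and a binary mask, and a semantic map is obtained by weighting the masks with the class probabilities.

```python
import torch

# Toy shapes: 100 queries, 133 classes plus one "no object" class, a 200x320 output.
num_queries, num_classes, h, w = 100, 133, 200, 320
cls_logits = torch.randn(num_queries, num_classes + 1)
mask_logits = torch.randn(num_queries, h, w)

cls_probs = cls_logits.softmax(dim=-1)[:, :-1]   # drop the "no object" column
mask_probs = mask_logits.sigmoid()               # per-query binary mask probabilities
# Aggregate the per-query masks, weighted by their class probabilities.
semantic_map = torch.einsum('qc,qhw->chw', cls_probs, mask_probs)
pred = semantic_map.argmax(dim=0)                # (h, w) per-pixel class indices
```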

## Citation

```
@inproceedings{cheng2021maskformer,
title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
journal={NeurIPS},
year={2021}
}
```

## Dataset
## Introduction

MaskFormer requires the COCO and [COCO-panoptic](http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip) datasets for training and evaluation. You need to download and extract them into the COCO dataset path.
The directory should be like this.
@@ -55,6 +36,17 @@ mmdetection
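
A minimal sketch (not part of this commit; the target path is an assumption) of fetching and unpacking the COCO-panoptic annotations linked above into an existing COCO dataset directory:

```python
import os
import urllib.request
import zipfile

coco_root = 'data/coco'  # assumed COCO dataset path
url = ('http://images.cocodataset.org/annotations/'
       'panoptic_annotations_trainval2017.zip')
zip_path = os.path.join(coco_root, 'panoptic_annotations_trainval2017.zip')

os.makedirs(coco_root, exist_ok=True)
urllib.request.urlretrieve(url, zip_path)   # download the panoptic annotations
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(coco_root)                # unpack; arrange files per the tree above
```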

## Results and Models

| Backbone | style | Lr schd | Mem (GB) | Inf time (fps) | PQ | SQ | RQ | PQ_th | SQ_th | RQ_th | PQ_st | SQ_st | RQ_st | Config | Download | detail |
| :------: | :-----: | :-----: | :------: | :------------: | :-: | :-: | :-: | :---: | :---: | :---: | :---: | :---: | :---: | :---------------------------------------------------------------------------------------------------------------------: | :----------------------: | :---: |
| R-50 | pytorch | 75e | | | | | | | | | | | | [config](https://github.com/open-mmlab/mmdetection/tree/master/configs/maskformer/maskformer_r50_mstrain_16x1_75e_coco.py) | | This version was mentioned in Table XI of the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) |
| Backbone | style | Lr schd | Mem (GB) | Inf time (fps) | PQ | SQ | RQ | PQ_th | SQ_th | RQ_th | PQ_st | SQ_st | RQ_st | Config | Download | detail |
|:--------:|:-------:|:-------:|:--------:|:--------------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:--------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------:|
| R-50 | pytorch | 75e | 16.6 | - | 46.854 | 80.617 | 57.085 | 51.089 | 81.511 | 61.853 | 40.463 | 79.269 | 49.888 | [config](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco.py) | [model](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco_20220221_141956-bc2699cb.pth) &#124; [log](https://download.openmmlab.com/mmdetection/v2.0/maskformer/maskformer_r50_mstrain_16x1_75e_coco/maskformer_r50_mstrain_16x1_75e_coco_20220221_141956.log.json) | This version was mentioned in Table XI of the paper [Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527) |

## Citation

```latex
@inproceedings{cheng2021maskformer,
title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
journal={NeurIPS},
year={2021}
}
```
6 changes: 3 additions & 3 deletions mmdet/core/bbox/assigners/mask_hungarian_assigner.py
@@ -29,9 +29,9 @@ class MaskHungarianAssigner(BaseAssigner):
- positive integer: positive sample, index (1-based) of assigned gt
Args:
cls_cost (obj:`mmcv.ConfigDict` | dict): Classification cost config.
mask_cost (obj:`mmcv.ConfigDict` | dict): Mask cost config.
dice_cost (obj:`mmcv.ConfigDict` | dict): Dice cost config.
cls_cost (:obj:`mmcv.ConfigDict` | dict): Classification cost config.
mask_cost (:obj:`mmcv.ConfigDict` | dict): Mask cost config.
dice_cost (:obj:`mmcv.ConfigDict` | dict): Dice cost config.
"""

def __init__(self,
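
For reference, a hedged sketch of how these three cost configs are typically wired together; the cost types and weights below are illustrative assumptions, not taken from this commit:

```python
# Illustrative only: cost types and weights are assumptions.
assigner = dict(
    type='MaskHungarianAssigner',
    cls_cost=dict(type='ClassificationCost', weight=1.0),
    mask_cost=dict(type='FocalLossCost', weight=20.0, binary_input=True),
    dice_cost=dict(type='DiceCost', weight=1.0, pred_act=True, eps=1.0))
```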
76 changes: 37 additions & 39 deletions mmdet/models/dense_heads/maskformer_head.py
@@ -28,24 +28,24 @@ class MaskFormerHead(AnchorFreeHead):
num_things_classes (int): Number of things.
num_stuff_classes (int): Number of stuff.
num_queries (int): Number of queries in Transformer.
pixel_decoder (obj:`mmcv.ConfigDict`|dict): Config for pixel decoder.
Defaults to None.
pixel_decoder (:obj:`mmcv.ConfigDict` | dict): Config for pixel
decoder. Defaults to None.
enforce_decoder_input_project (bool, optional): Whether to add a layer
to change the embed_dim of tranformer encoder in pixel decoder to
the embed_dim of transformer decoder. Defaults to False.
transformer_decoder (obj:`mmcv.ConfigDict`|dict): Config for
transformer_decoder (:obj:`mmcv.ConfigDict` | dict): Config for
transformer decoder. Defaults to None.
positional_encoding (obj:`mmcv.ConfigDict`|dict): Config for
positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer decoder position encoding. Defaults to None.
loss_cls (obj:`mmcv.ConfigDict`|dict): Config of the classification
loss_cls (:obj:`mmcv.ConfigDict` | dict): Config of the classification
loss. Defaults to `CrossEntropyLoss`.
loss_mask (obj:`mmcv.ConfigDict`|dict): Config of the mask loss.
loss_mask (:obj:`mmcv.ConfigDict` | dict): Config of the mask loss.
Defaults to `FocalLoss`.
loss_dice (obj:`mmcv.ConfigDict`|dict): Config of the dice loss.
loss_dice (:obj:`mmcv.ConfigDict` | dict): Config of the dice loss.
Defaults to `DiceLoss`.
train_cfg (obj:`mmcv.ConfigDict`|dict): Training config of Maskformer
head.
test_cfg (obj:`mmcv.ConfigDict`|dict): Testing config of Maskformer
train_cfg (:obj:`mmcv.ConfigDict` | dict): Training config of
Maskformer head.
test_cfg (:obj:`mmcv.ConfigDict` | dict): Testing config of Maskformer
head.
init_cfg (dict or list[dict], optional): Initialization config dict.
Defaults to None.
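
A hypothetical config sketch showing how the arguments documented above fit together; the class counts, module types, and loss settings are placeholders, not the reference MaskFormer configuration:

```python
# Placeholder values for illustration only.
panoptic_head = dict(
    type='MaskFormerHead',
    num_things_classes=80,
    num_stuff_classes=53,
    num_queries=100,
    pixel_decoder=dict(type='TransformerEncoderPixelDecoder'),
    enforce_decoder_input_project=False,
    transformer_decoder=dict(type='DetrTransformerDecoder'),
    positional_encoding=dict(
        type='SinePositionalEncoding', num_feats=128, normalize=True),
    loss_cls=dict(type='CrossEntropyLoss'),
    loss_mask=dict(type='FocalLoss'),
    loss_dice=dict(type='DiceLoss'),
    train_cfg=None,
    test_cfg=None)
```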
@@ -177,12 +176,11 @@ def preprocess_gt(self, gt_labels_list, gt_masks_list, gt_semantic_segs):
Returns:
tuple: a tuple containing the following targets.
- labels (list[Tensor]): Ground truth class indices for all\
images. Each with shape (n, ), n is the sum of number\
of stuff type and number of instance in a image.
- masks (list[Tensor]): Ground truth mask for each image, each\
with shape (n, h, w).
- labels (list[Tensor]): Ground truth class indices\
for all images. Each with shape (n, ), n is the sum of\
number of stuff type and number of instance in a image.
- masks (list[Tensor]): Ground truth mask for each\
image, each with shape (n, h, w).
"""
num_things_list = [self.num_things_classes] * len(gt_labels_list)
num_stuff_list = [self.num_stuff_classes] * len(gt_labels_list)
@@ -213,19 +212,18 @@ def get_targets(self, cls_scores_list, mask_preds_list, gt_labels_list,
Returns:
tuple[list[Tensor]]: a tuple containing the following targets.
- labels_list (list[Tensor]): Labels of all images.\
Each with shape (num_queries, ).
- label_weights_list (list[Tensor]): Label weights of all\
images. Each with shape (num_queries, ).
- mask_targets_list (list[Tensor]): Mask targets of all\
images. Each with shape (num_queries, h, w).
- mask_weights_list (list[Tensor]): Mask weights of all\
images. Each with shape (num_queries, ).
- num_total_pos (int): Number of positive samples in all\
images.
- num_total_neg (int): Number of negative samples in all\
images.
- label_weights_list (list[Tensor]): Label weights\
of all images. Each with shape (num_queries, ).
- mask_targets_list (list[Tensor]): Mask targets of\
all images. Each with shape (num_queries, h, w).
- mask_weights_list (list[Tensor]): Mask weights of\
all images. Each with shape (num_queries, ).
- num_total_pos (int): Number of positive samples in\
all images.
- num_total_neg (int): Number of negative samples in\
all images.
"""
(labels_list, label_weights_list, mask_targets_list, mask_weights_list,
pos_inds_list,
@@ -256,7 +254,6 @@ def _get_target_single(self, cls_score, mask_pred, gt_labels, gt_masks,
Returns:
tuple[Tensor]: a tuple containing the following for one image.
- labels (Tensor): Labels of each image.
shape (num_queries, ).
- label_weights (Tensor): Label weights of each image.
@@ -444,13 +441,14 @@ def forward(self, feats, img_metas):
img_metas (list[dict]): List of image information.
Returns:
all_cls_scores (Tensor): Classification scores for each\
scale level. Each is a 4D-tensor with shape\
(num_decoder, batch_size, num_queries, cls_out_channels).\
Note `cls_out_channels` should includes background.
all_mask_preds (Tensor): Mask scores for each decoder\
layer. Each with shape (num_decoder, batch_size,\
num_queries, h, w).
tuple: a tuple contains two elements.
- all_cls_scores (Tensor): Classification scores for each\
scale level. Each is a 4D-tensor with shape\
(num_decoder, batch_size, num_queries, cls_out_channels).\
Note `cls_out_channels` should includes background.
- all_mask_preds (Tensor): Mask scores for each decoder\
layer. Each with shape (num_decoder, batch_size,\
num_queries, h, w).
"""
batch_size = len(img_metas)
input_img_h, input_img_w = img_metas[0]['batch_input_shape']
@@ -528,7 +526,7 @@ def forward_train(self,
ignored. Defaults to None.
Returns:
losses (dict[str, Tensor]): a dictionary of loss components
dict[str, Tensor]: a dictionary of loss components
"""
# not consider ignoring bboxes
assert gt_bboxes_ignore is None
@@ -607,8 +605,8 @@ def simple_test(self, feats, img_metas, rescale=False):
def post_process(self, mask_cls, mask_pred):
"""Panoptic segmentation inference.
This implementation is modified from\
https://github.com/facebookresearch/MaskFormer
This implementation is modified from `MaskFormer
<https://github.com/facebookresearch/MaskFormer>`_.
Args:
mask_cls (Tensor): Classification outputs for an image.
@@ -617,7 +615,7 @@ def post_process(self, mask_cls, mask_pred):
shape = (num_queries, h, w).
Returns:
panoptic_seg (Tensor): panoptic segment result of shape (h, w),\
Tensor: panoptic segment result of shape (h, w),\
each element in Tensor means:
segment_id = _cls + instance_id * INSTANCE_OFFSET.
"""
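
The segment_id encoding described in the post_process docstring can be unpacked as follows; this standalone sketch assumes INSTANCE_OFFSET = 1000 for illustration (mmdet defines the actual constant):

```python
import torch

INSTANCE_OFFSET = 1000  # assumed value for this sketch
# Toy panoptic result: a stuff region (class 72) and two instances of class 5.
panoptic_seg = torch.tensor([[72, 5 + 1 * INSTANCE_OFFSET],
                             [72, 5 + 2 * INSTANCE_OFFSET]])

class_ids = panoptic_seg % INSTANCE_OFFSET      # per-pixel class, since segment_id = cls + id * offset
instance_ids = panoptic_seg // INSTANCE_OFFSET  # per-pixel instance index (0 for stuff)
```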
2 changes: 1 addition & 1 deletion mmdet/models/detectors/maskformer.py
@@ -7,7 +7,7 @@
class MaskFormer(SingleStageDetector):
r"""Implementation of `Per-Pixel Classification is
NOT All You Need for Semantic Segmentation
<https://arxiv.org/pdf/2107.06278>`_"""
<https://arxiv.org/pdf/2107.06278>`_."""

def __init__(self,
backbone,
24 changes: 11 additions & 13 deletions mmdet/models/plugins/pixel_decoder.py
@@ -17,17 +17,17 @@ class PixelDecoder(BaseModule):
input feature maps.
feat_channels (int): Number of channels for features.
out_channels (int): Number of channels for output.
norm_cfg (obj:`mmcv.ConfigDict`|dict): Config for normalization.
norm_cfg (:obj:`mmcv.ConfigDict` | dict): Config for normalization.
Defaults to dict(type='GN', num_groups=32).
act_cfg (obj:`mmcv.ConfigDict`|dict): Config for activation.
act_cfg (:obj:`mmcv.ConfigDict` | dict): Config for activation.
Defaults to dict(type='ReLU').
encoder (obj:`mmcv.ConfigDict`|dict): Config for transformer
encoder (:obj:`mmcv.ConfigDict` | dict): Config for transformer
encoder. Defaults to None.
positional_encoding (obj:`mmcv.ConfigDict`|dict): Config for
positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer encoder position encoding. Defaults to
dict(type='SinePositionalEncoding', num_feats=128,
normalize=True).
init_cfg (obj:`mmcv.ConfigDict`|dict): Initialization config dict.
init_cfg (:obj:`mmcv.ConfigDict` | dict): Initialization config dict.
Default: None
"""

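For illustration, a hedged sketch of a pixel decoder config built from the arguments documented above; the in_channels values and other numbers are placeholders:

```python
# Placeholder values for illustration only; in_channels is assumed to list
# the backbone feature-map channels.
pixel_decoder = dict(
    type='PixelDecoder',
    in_channels=[256, 512, 1024, 2048],
    feat_channels=256,
    out_channels=256,
    norm_cfg=dict(type='GN', num_groups=32),
    act_cfg=dict(type='ReLU'))
```
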
@@ -95,10 +95,9 @@ def forward(self, feats, img_metas):
Returns:
tuple: a tuple containing the following:
- mask_feature (Tensor): Shape (batch_size, c, h, w).
- memory (Tensor): Output of last stage of backbone.\
Shape (batch_size, c, h, w).
Shape (batch_size, c, h, w).
"""
y = self.last_feat_conv(feats[-1])
for i in range(self.num_inputs - 2, -1, -1):
@@ -122,17 +121,17 @@ class TransformerEncoderPixelDecoder(PixelDecoder):
input feature maps.
feat_channels (int): Number of channels for features.
out_channels (int): Number of channels for output.
norm_cfg (obj:`mmcv.ConfigDict`|dict): Config for normalization.
norm_cfg (:obj:`mmcv.ConfigDict` | dict): Config for normalization.
Defaults to dict(type='GN', num_groups=32).
act_cfg (obj:`mmcv.ConfigDict`|dict): Config for activation.
act_cfg (:obj:`mmcv.ConfigDict` | dict): Config for activation.
Defaults to dict(type='ReLU').
encoder (obj:`mmcv.ConfigDict`|dict): Config for transformer
encoder (:obj:`mmcv.ConfigDict` | dict): Config for transformer
encoder. Defaults to None.
positional_encoding (obj:`mmcv.ConfigDict`|dict): Config for
positional_encoding (:obj:`mmcv.ConfigDict` | dict): Config for
transformer encoder position encoding. Defaults to
dict(type='SinePositionalEncoding', num_feats=128,
normalize=True).
init_cfg (obj:`mmcv.ConfigDict`|dict): Initialization config dict.
init_cfg (:obj:`mmcv.ConfigDict` | dict): Initialization config dict.
Default: None
"""

@@ -200,7 +199,6 @@ def forward(self, feats, img_metas):
Returns:
tuple: a tuple containing the following:
- mask_feature (Tensor): shape (batch_size, c, h, w).
- memory (Tensor): shape (batch_size, c, h, w).
"""
2 changes: 1 addition & 1 deletion tools/deployment/test.py
@@ -141,7 +141,7 @@ def main():


if __name__ == '__main__':
# main()
main()

# Following strings of text style are from colorama package
bright_style, reset_style = '\x1b[1m', '\x1b[0m'
