* [Fix]: Fix outdated problem
* [Fix]: Update MoCov3 bibtex
* [Fix]: Use abs path in README
* [Fix]: Reformat MAE bibtex
* [Fix]: Reformat MoCov3 bibtex
1 parent 6dbed8e · commit 7860475 · 8 changed files with 201 additions and 8 deletions.
# MAE

> [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)

<!-- [ALGORITHM] -->

## Abstract

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

<div align="center">
<img src="https://user-images.githubusercontent.com/30762564/150733959-2959852a-c7bd-4d3f-911f-3e8d8839fe67.png" width="40%"/>
</div>
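
The core mechanism described above is per-sample random masking: only about 25% of the patch tokens are kept for the encoder, and the permutation is remembered so the decoder can put mask tokens back in the right positions. Below is a rough PyTorch sketch of that step (illustrative only, not this repository's implementation; the `random_masking` helper, tensor shapes, and patch counts are assumptions):

```python
# Rough sketch of MAE-style random masking (illustrative, not the official code):
# shuffle patch indices per sample, keep the first 25%, and remember the
# permutation so the decoder can later restore the original patch order.
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (batch, num_patches, dim) patch-token embeddings."""
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    noise = torch.rand(batch, num_patches)            # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    # Binary mask in the original patch order: 0 = kept, 1 = masked.
    mask = torch.ones(batch, num_patches)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore

tokens = torch.randn(2, 196, 768)                     # 14x14 patches of a 224px image
visible, mask, ids_restore = random_masking(tokens)
print(visible.shape)  # torch.Size([2, 49, 768]): only ~25% of tokens reach the encoder
```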

## Models and Benchmarks

Here, we report the results of the model pre-trained on ImageNet-1K for 400 epochs. The details are below:

| Backbone | Pre-train epoch | Fine-tuning Top-1 | Pre-train Config | Fine-tuning Config | Download |
| :------: | :-------------: | :---------------: | :--------------: | :----------------: | :------: |
| ViT-B/16 | 400 | 83.1 | [config](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/mae/mae_vit-b-p16_8xb512-coslr-400e_in1k.py) | [config](https://github.com/open-mmlab/mmselfsup/blob/master/configs/benchmarks/classification/imagenet/vit-b-p16_ft-8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/mae/mae_vit-base-p16_8xb512-coslr-400e_in1k-224_20220223-85be947b.pth) \| [log](https://download.openmmlab.com/mmselfsup/mae/mae_vit-base-p16_8xb512-coslr-300e_in1k-224_20220210_140925.log.json) |
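
As a quick sanity check, one might download the released checkpoint and list a few of its weights. This is only a rough sketch, not the repository's workflow; the `state_dict` key is an assumption based on the usual OpenMMLab checkpoint layout:

```python
# Rough sketch: fetch the released MAE checkpoint and inspect its weights.
import torch

ckpt_url = (
    "https://download.openmmlab.com/mmselfsup/mae/"
    "mae_vit-base-p16_8xb512-coslr-400e_in1k-224_20220223-85be947b.pth"
)
ckpt = torch.hub.load_state_dict_from_url(ckpt_url, map_location="cpu")

# OpenMMLab checkpoints usually wrap the weights in a 'state_dict' entry (assumption).
state_dict = ckpt.get("state_dict", ckpt)
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```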

## Citation

```bibtex
@article{He2021MaskedAA,
  title={Masked Autoencoders Are Scalable Vision Learners},
  author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross B. Girshick},
  journal={arXiv preprint arXiv:2111.06377},
  year={2021}
}
```
# MoCo v3

> [An Empirical Study of Training Self-Supervised Vision Transformers](https://arxiv.org/abs/2104.02057)

<!-- [ALGORITHM] -->

## Abstract

This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failures, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.

<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/151305362-e6e8ea35-b3b8-45f6-8819-634e67083218.png" width="500" />
</div>

## Results and Models

**Back to [model_zoo.md](https://github.com/open-mmlab/mmselfsup/blob/master/docs/en/model_zoo.md) to download models.**

On this page, we provide as many benchmarks as possible to evaluate our pre-trained models. Unless otherwise noted, all models were trained on the ImageNet-1K dataset.

### Classification

The classification benchmarks include four downstream-task datasets: **VOC**, **ImageNet**, **iNaturalist2018**, and **Places205**. If not specified, the results are Top-1 accuracy (%).

#### ImageNet Linear Evaluation

The **Linear Evaluation** result is obtained by training a linear head on top of the pre-trained backbone. Please refer to [vit-small-p16_8xb128-coslr-90e_in1k](https://github.com/open-mmlab/mmselfsup/blob/master/configs/benchmarks/classification/imagenet/vit-small-p16_8xb128-coslr-90e_in1k.py) for the config details.

| Self-Supervised Config | Linear Evaluation |
| :--------------------- | :---------------: |
| [vit-small-p16_32xb128-fp16-coslr-300e_in1k-224](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/mocov3/mocov3_vit-small-p16_32xb128-fp16-coslr-300e_in1k-224.py) | 73.19 |
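
Below is a rough PyTorch sketch of the linear-evaluation setup described above (illustrative only, not the config linked in this section): the backbone is frozen and only a newly added linear head is optimized. The `build_linear_eval` helper, the 384-dim feature size, and the dummy backbone are assumptions for illustration:

```python
# Rough sketch of linear evaluation: freeze the pre-trained backbone and
# train only a linear classification head on top of its features.
import torch
import torch.nn as nn

def build_linear_eval(backbone: nn.Module, feat_dim: int, num_classes: int = 1000):
    for param in backbone.parameters():
        param.requires_grad = False       # backbone stays fixed
    head = nn.Linear(feat_dim, num_classes)
    model = nn.Sequential(backbone, head)
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
    return model, optimizer

# Dummy backbone standing in for a pre-trained ViT-Small (384-dim features is an assumption).
dummy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 384))
model, optimizer = build_linear_eval(dummy_backbone, feat_dim=384)

images = torch.randn(2, 3, 224, 224)
logits = model(images)
print(logits.shape)  # torch.Size([2, 1000])
```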

## Citation

```bibtex
@InProceedings{Chen_2021_ICCV,
  title     = {An Empirical Study of Training Self-Supervised Vision Transformers},
  author    = {Chen, Xinlei and Xie, Saining and He, Kaiming},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2021}
}
```