
mae_visualize models vs mae_pretrain_full models #12

Closed
amirhfarzaneh opened this issue Jan 13, 2022 · 2 comments

Comments


amirhfarzaneh commented Jan 13, 2022

Hello,

Thank you for the great work and the great repo. I have been experimenting with different pre-trained models for visualization. When I use the mae_visualize_vit_base.pth checkpoint, I get reconstruction results like those in the demo and the paper, as below:

[Screenshot: reconstructions matching the demo and the paper]

However, when I use the mae_pretrain_vit_base_full.pth checkpoint, the results look like this:

[Screenshot: reconstructions that look much worse]

mask_ratio=0.75 for both results.
So here are my questions:

  1. Can you please clarify the difference between the visualize and full checkpoints, and why the results look worse with the full checkpoint?
  2. If I want to finetune an MAE model (both encoder and decoder) for reconstruction on a custom dataset, which checkpoint is recommended?

I would appreciate it if you could help me with these questions.

@KaimingHe
Contributor

As noted in the issue where you found this checkpoint (#8), mae_pretrain_vit_base_full.pth is trained with normalized pixels (see Table 1d in the paper), so its reconstructions are normalized per patch. What you see is the correct reconstruction. If you apply the same normalization to the ground-truth image (https://github.com/facebookresearch/mae/blob/main/models_mae.py#L205), you can see what the model is expected to reconstruct.

mae_visualize_vit_base.pth is trained with unnormalized pixels. That is the default setting for all results in Table 1 (except 1d). It is slightly worse in terms of representation quality (e.g., classification results).

If your goal is to reconstruct a good-looking image, use unnormalized pixels. If your goal is to finetune for a downstream recognition task, use normalized pixels.
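To make the distinction concrete, here is a minimal numpy sketch of the per-patch target normalization that norm_pix_loss enables (the function name, array shapes, and epsilon value are illustrative assumptions, not the repo's actual code; the real implementation operates on PyTorch tensors in models_mae.py):

```python
import numpy as np

def normalize_patch_targets(patches, eps=1e-6):
    """Normalize each flattened pixel patch to zero mean and unit variance.

    patches: (num_patches, patch_dim) array of flattened pixel patches.
    This mirrors the idea behind MAE's norm_pix_loss target: the model
    regresses these normalized values, not raw pixel intensities.
    """
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)

# A model trained this way reconstructs zero-mean, unit-variance patches,
# so visualizing its output directly looks "wrong" unless the ground truth
# is normalized the same way before comparison.
rng = np.random.default_rng(0)
patches = rng.uniform(0.0, 1.0, size=(4, 16 * 16 * 3))  # 4 patches of 16x16x3
targets = normalize_patch_targets(patches)
print(targets.mean(axis=-1))  # each entry close to 0
print(targets.var(axis=-1))   # each entry close to 1
```

This is why the "full" checkpoint's raw outputs look worse when rendered as images: they live in a per-patch normalized space rather than in pixel space.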

@amirhfarzaneh
Author

That makes total sense. I had missed the norm_pix_loss option. Thanks for the clarification @KaimingHe
