GPT2 dropout from huggingface is 0.1 #205
Conversation
In the config file one can see that the dropout value for all 4 dropouts is the same: 0.1, not 0.0.
Scaled (inverted) dropout, which is the default, scales the activations during training in addition to dropping them. During inference, Dropout can then be a no-op.
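For reference, a minimal sketch of that behavior with torch.nn.Dropout (the toy tensor below is just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.1)
x = torch.ones(8)

# Training mode: survivors are scaled by 1/(1-p) so the expected value matches the input.
drop.train()
print(drop(x))  # mix of 0.0 and 1/0.9 ≈ 1.1111

# Eval mode: dropout is an identity function, no rescaling is needed at inference.
drop.eval()
print(drop(x))  # tensor of ones, identical to the input
```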
My bad, didn't pay enough attention (no pun intended 😄 ).
So that explains why changing the dropout value for sampling doesn't affect the output. But here's another question: why provide a dropout value for sampling at all if it has no impact? From what I see in the documentation, it shouldn't matter. @karpathy, could you confirm it? If that is the case, I'll close this PR.

P.S. There is another approach to deal with it (from the docs):

```python
class CausalSelfAttention(nn.Module):
    ...
    def forward(self, x):
        ...
        if self.training:
            dropout = self.dropout
            is_causal = self.is_causal
        else:
            dropout = 0.0
            is_causal = False
        y = F.scaled_dot_product_attention(..., dropout_p=dropout, is_causal=is_causal)
        ...
```

The author of this code stated that it was inspired by the nanoGPT repository, so you can make changes inspired by the code above ➿
Yes, you are right. The last code snippet from https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html is wrong, though.
Hello @YassineYousfi, I provided the wrong link above the code snippet; here is the correct one. So basically I copy-pasted the code like a typical 🐒. I mean, if we set is_causal=False at inference (as in the snippet above), then the attention in Lines 204 to 217 in a82b33b would no longer be causally masked during sampling. Is this the reason, or is there something else?
Correct. There are cases where is_causal can be set to False (e.g. when giving one query at a time with a kv-cache), but this is not one of them.
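For reference, a minimal sketch of the variant discussed here, keeping causal masking on at inference and only zeroing out dropout. This is a simplified illustration (projection layers and flags from the actual CausalSelfAttention are omitted), not the repo's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    # Simplified sketch: only the attention call itself is fleshed out.
    def __init__(self, n_embd, n_head, dropout):
        super().__init__()
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        self.n_head = n_head
        self.dropout = dropout

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Dropout is switched off at eval time, but causal masking stays on:
        # without a kv-cache every forward pass still sees the whole sequence,
        # so future positions must remain masked during sampling too.
        y = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=True,
        )
        return y.transpose(1, 2).contiguous().view(B, T, C)
```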
Thanks for the response.
During sampling, when the model is loaded from pretrained GPT2 weights, the dropout value is set to 0.0. I assume that was done because previously, with pytorch 2.0, flash attention didn't work with a dropout value other than 0.0. From this merged PR it seems that this is already fixed in the latest build of pytorch, so that limitation no longer holds.
If you take a look at the config of all versions of GPT2 from huggingface, it appears that the dropout value for all 4 types of dropout is 0.1.
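A quick way to check this, assuming the transformers package is installed (resid_pdrop, embd_pdrop, attn_pdrop and summary_first_dropout are the four dropout fields of GPT2Config I'm referring to):

```python
from transformers import GPT2Config

for name in ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]:
    cfg = GPT2Config.from_pretrained(name)
    # Each model prints 0.1 for all four dropout settings.
    print(name, cfg.resid_pdrop, cfg.embd_pdrop, cfg.attn_pdrop, cfg.summary_first_dropout)
```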
But there is something that I don't fully understand.
When I change the value of dropout, the output from sampling doesn't change. This is true for torch 1.13.1.
This kinda makes sense to me (yet not fully). At inference time, instead of randomly zeroing out nodes, dropout is supposed to scale the output accordingly, as stated in the torch docs. But this constant factor can be dropped because of normalization.
And some layers in the model are followed by LayerNorm, but some aren't. That's why I am not 100% sure that this is the case.
So the question: if it's true that normalization layers effectively make the dropout scaling at inference unimportant, do we need to set it at all (for sampling)?
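For what it's worth, here is a toy check of the scale-invariance part of that argument (not code from the repo): LayerNorm removes any constant positive scaling of its input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(16)
x = torch.randn(4, 16)

# Scaling the input by a constant (e.g. the 1/(1-p) dropout factor for p=0.1)
# changes nothing after LayerNorm, because mean and variance are rescaled too.
diff = (ln(x) - ln(x * (1.0 / 0.9))).abs().max()
print(diff)  # tiny (limited only by eps), i.e. the constant factor is normalized away
```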
I also tried different dropout values for sampling on Google Colab with torch==2.1.0.dev20230313+cu117 (to test flash attention).
In this case different dropout values do affect the sampling output.
This confuses me. Either my assumption is wrong (about the dropout value not mattering when normalization layers follow), or something is wrong with flash attention. Maybe scaled_dot_product_attention doesn't respect the model's state (train vs. eval) and thus works incorrectly? At least from the code snippet I don't see what might be the issue.
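For context on that last point: F.scaled_dot_product_attention is a plain functional call, so it doesn't know about model.train() / model.eval(); dropout_p is applied whenever it is non-zero. A toy illustration (not the repo's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = k = v = torch.randn(1, 4, 8, 16)  # (batch, heads, seq, head_dim)

# With dropout_p > 0 the output is stochastic, even though no module is in training mode.
a = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1, is_causal=True)
b = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1, is_causal=True)
print(torch.equal(a, b))  # False: attention dropout was applied both times

# With dropout_p = 0.0 the call is deterministic, which is what sampling needs.
c = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)
d = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)
print(torch.equal(c, d))  # True
```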