GPT2 dropout from huggingface is 0.1 #205

Conversation

Andrei-Aksionov

During sampling, when the model is loaded from pretrained GPT2 weights, the dropout value is set to 0.0. I assume that was done because flash attention in PyTorch 2.0 previously didn't work with a non-zero dropout value. Judging from this merged PR, that is already fixed in the latest build of PyTorch, so the limitation no longer applies.

Looking at the configs of all GPT2 variants from huggingface, the value for all four types of dropout is 0.1.

from transformers import GPT2Config
dropout_set = set()
for model_size in ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]:
    config = GPT2Config.get_config_dict(model_size)[0]
    dropout_set.update(
        (key, value)
        for key, value in config.items()
        if "drop" in key
    )
print(dropout_set)
>> {('attn_pdrop', 0.1),
    ('embd_pdrop', 0.1),
    ('resid_pdrop', 0.1),
    ('summary_first_dropout', 0.1)}

But there is something I don't fully understand.
When I change the dropout value, the output from sampling doesn't change. This is true for torch 1.13.1.
This mostly makes sense to me (yet not fully). At inference time, instead of randomly zeroing out nodes, dropout is supposed to scale the output accordingly, as stated in the torch docs. But this constant factor can be dropped because of normalization.

import numpy as np
normalize = lambda x: (x - x.mean()) / x.std()
x = np.random.rand(10)
print(np.allclose(normalize(x), normalize(x * 0.9)))
>>> True

Some layers in the model are followed by LayerNorm, but some aren't. That's why I'm not 100% sure this is the case.

So the question: if it's true that normalization layers make the dropout scaling during inference effectively unimportant, do we need to set it at all (for sampling)?


I also tried different dropout values for sampling on Google Colab with torch==2.1.0.dev20230313+cu117 (to test flash attention).
In this case different dropout values do affect the sampling output.
This confuses me. Either my assumption is wrong (about the dropout value not mattering when normalization layers follow), or something is wrong with flash attention. Maybe scaled_dot_product_attention doesn't respect the model's state (train vs. eval) and thus works incorrectly?
At least from the code snippet I don't see what the issue might be.

In the config file one can find that the dropout value for all 4 dropouts is identical and equal to 0.1, not 0.0.
@karpathy
Owner

Scaled dropout, which is the default, scales the activations during training in addition to dropping. During inference, Dropout can then be a noop.
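
A minimal sketch of this inverted-dropout behaviour with nn.Dropout (illustration only, not part of the original comment):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(8)
drop = nn.Dropout(p=0.1)

drop.train()
print(drop(x))  # surviving elements are scaled by 1/(1-0.1) ≈ 1.1111, the rest are zeroed

drop.eval()
print(drop(x))  # identity: all ones, regardless of p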

@Andrei-Aksionov
Author

My bad, I didn't pay enough attention (no pun intended 😄).
That's literally what's said in the documentation:

... This means that during evaluation the module simply computes an identity function.

So that explains why changing the dropout value for sampling doesn't affect the output.

But here's another question: why provide a dropout value for sampling at all if it has no impact?

From what I see in the documentation of scaled_dot_product_attention, it applies dropout no matter what mode the model is in; it only cares whether the value is greater than 0.0.
My assumption is that by setting dropout to 0.0 you disabled dropout (and its output scaling) inside scaled_dot_product_attention, and everything works fine because dropout in the other places doesn't care about the dropout value during sampling.
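
A minimal sketch of that behaviour (my own check, assuming PyTorch 2.x): scaled_dot_product_attention is a plain function, so it has no notion of train/eval mode and drops whenever dropout_p > 0.0.

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 4, 8, 16)  # (batch, heads, seq_len, head_dim)

# With dropout_p > 0 the function drops on every call, whatever mode the surrounding module is in.
out1 = F.scaled_dot_product_attention(q, k, v, dropout_p=0.5, is_causal=True)
out2 = F.scaled_dot_product_attention(q, k, v, dropout_p=0.5, is_causal=True)
print(torch.allclose(out1, out2))  # False: different dropout masks on each call

# With dropout_p=0.0 the call is deterministic.
out3 = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)
out4 = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)
print(torch.allclose(out3, out4))  # True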

@karpathy Could you confirm this? If that's the case, I'll close this PR.

P.S. There is another approach to dealing with this (from the docs):

class CausalSelfAttention(nn.Module):
    ...
    def forward(self, x):
        ...
        if self.training:
            dropout = self.dropout
            is_causal = self.is_causal
        else:
            dropout = 0.0
            is_causal = False

        y = F.scaled_dot_product_attention(..., dropout_p=dropout, is_causal=is_causal)
        ...

The author of that code stated that it was inspired by the nanoGPT repository, so you can make changes inspired by the code above ➿

@YassineYousfi
Contributor

Could you confirm this? If that's the case, I'll close this PR.

Yes, you are right.

The last code snippet from https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html is wrong though: is_causal should not be set to False.

@Andrei-Aksionov
Author

Hello @YassineYousfi
Thanks for the reply.

I provided the wrong link above the code snippet; here is the correct one.

So basically I copy-pasted the code like a typical 🐒 .
In my defense, the combination of pytorch.org and the words "... inspired by Andrej Karpathy’s NanoGPT repository ..." did some magic to me, so I wasn't attentive enough. There is always something to learn ...

I mean, if we set is_causal to False, attention in general should still work fine (that's why it's shown that way on pytorch.org), but it will ruin the calculation of the validation loss:

nanoGPT/train.py

Lines 204 to 217 in a82b33b

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

Is this the reason, or is there something else?
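
A minimal sketch of the effect (illustration only): with a full sequence as input, as in estimate_loss(), dropping the causal mask changes the attention output, so the eval-time loss would be computed with future-token leakage.

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 4, 8, 16)  # full sequence forward pass

causal = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)
no_mask = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)
print(torch.allclose(causal, no_mask))  # False: without the mask every position attends to future tokens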

@karpathy karpathy closed this Apr 13, 2023
@YassineYousfi
Contributor

Correct. There are cases where is_causal can be set to False (e.g. when feeding one query at a time with a kv cache), but this is not one of them.
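
A minimal sketch of that kv-cache case (illustration only; tensor names are made up): the single new query is allowed to attend to every cached position, since they are all in the past, so no causal mask is needed.

import torch
import torch.nn.functional as F

# 7 tokens already generated and cached; one new query token being decoded.
k_cache = torch.randn(1, 4, 7, 16)
v_cache = torch.randn(1, 4, 7, 16)
q_new = torch.randn(1, 4, 1, 16)

y = F.scaled_dot_product_attention(q_new, k_cache, v_cache, dropout_p=0.0, is_causal=False)
print(y.shape)  # torch.Size([1, 4, 1, 16])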

@Andrei-Aksionov
Author

Thanks for the response.
