GPT2 dropout from huggingface is 0.1 #205

Conversation

Andrei-Aksionov

During sampling, when the model is loaded from pretrained GPT2 weights, the dropout value is set to 0.0. I assume that was done because flash attention in PyTorch 2.0 previously didn't work with a non-zero dropout value. Judging from this merged PR, that is already fixed in the latest build of PyTorch, so the limitation no longer applies.

Looking at the configs of all GPT2 variants from huggingface, the value for all four types of dropout is 0.1.

from transformers import GPT2Config
dropout_set = set()
for model_size in ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl"]:
    config = GPT2Config.get_config_dict(model_size)[0]
    dropout_set.update(
        (key, value)
        for key, value in config.items()
        if "drop" in key
    )
print(dropout_set)
>> {('attn_pdrop', 0.1),
    ('embd_pdrop', 0.1),
    ('resid_pdrop', 0.1),
    ('summary_first_dropout', 0.1)}

But there is something I don't fully understand.
When I change the dropout value, the output from sampling doesn't change. This is true for torch 1.13.1.
This mostly makes sense to me (yet not fully). At inference time, instead of randomly zeroing out nodes, dropout is supposed to scale the output accordingly, as stated in the torch docs. But this constant factor can be dropped because of normalization.

import numpy as np
normalize = lambda x: (x - x.mean()) / x.std()
x = np.random.rand(10)
print(np.allclose(normalize(x), normalize(x * 0.9)))
>>> True

Some layers in the model are followed by LayerNorm, but some aren't. That's why I'm not 100% sure this is the case.

So the question: if it's true that normalization layers make the dropout scaling during inference effectively unimportant, do we need to set it at all (for sampling)?


I also tried different dropout values for sampling on Google Colab with torch==2.1.0.dev20230313+cu117 (to test flash attention).
In this case different dropout values do affect the sampling output.
This confuses me. Either my assumption is wrong (about the dropout value not mattering when normalization layers follow), or something is wrong with flash attention. Maybe scaled_dot_product_attention doesn't respect the model's state (train vs. eval) and thus works incorrectly?
At least from the code snippet I don't see what the issue might be.

In the config file one can find that the dropout value for all 4 dropouts is identical and equal to 0.1, not 0.0.
@karpathy
Owner

Scaled dropout, which is the default, scales the activations during training in addition to dropping. During inference, Dropout can then be a noop.
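
A minimal sketch of this inverted-dropout behaviour with nn.Dropout (illustration only, not part of the original comment):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(8)
drop = nn.Dropout(p=0.1)

drop.train()
print(drop(x))  # surviving elements are scaled by 1/(1-0.1) ≈ 1.1111, the rest are zeroed

drop.eval()
print(drop(x))  # identity: all ones, regardless of p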

@Andrei-Aksionov
Author

My bad, I didn't pay enough attention (no pun intended 😄).
That's literally what's said in the documentation:

... This means that during evaluation the module simply computes an identity function.

So that explains why changing the dropout value for sampling doesn't affect the output.

But here's another question: why provide a dropout value for sampling at all if it has no impact?

From what I see in the documentation of scaled_dot_product_attention, it applies dropout no matter what mode the model is in; it only cares whether the value is greater than 0.0.
My assumption is that by setting dropout to 0.0 you disabled dropout (and its output scaling) inside scaled_dot_product_attention, and everything works fine because dropout in the other places doesn't care about the dropout value during sampling.
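
A minimal sketch of that behaviour (my own check, assuming PyTorch 2.x): scaled_dot_product_attention is a plain function, so it has no notion of train/eval mode and drops whenever dropout_p > 0.0.

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 4, 8, 16)  # (batch, heads, seq_len, head_dim)

# With dropout_p > 0 the function drops on every call, whatever mode the surrounding module is in.
out1 = F.scaled_dot_product_attention(q, k, v, dropout_p=0.5, is_causal=True)
out2 = F.scaled_dot_product_attention(q, k, v, dropout_p=0.5, is_causal=True)
print(torch.allclose(out1, out2))  # False: different dropout masks on each call

# With dropout_p=0.0 the call is deterministic.
out3 = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)
out4 = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)
print(torch.allclose(out3, out4))  # True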

@karpathy Could you confirm this? If that's the case, I'll close this PR.

P.S. There is another approach to dealing with this (from the docs):

class CausalSelfAttention(nn.Module):
    ...
    def forward(self, x):
        ...
        if self.training:
            dropout = self.dropout
            is_causal = self.is_causal
        else:
            dropout = 0.0
            is_causal = False

        y = F.scaled_dot_product_attention(..., dropout_p=dropout, is_causal=is_causal)
        ...

The author of that code stated that it was inspired by the nanoGPT repository, so you can make changes inspired by the code above ➿

@YassineYousfi
Contributor

Could you confirm this? If that's the case, I'll close this PR.

Yes, you are right.

The last code snippet from https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html is wrong though: is_causal should not be set to False.

@Andrei-Aksionov
Author

Hello @YassineYousfi
Thanks for the reply.

I provided the wrong link above the code snippet; here is the correct one.

So basically I copy-pasted the code like a typical 🐒 .
In my defense, the combination of pytorch.org and the words "... inspired by Andrej Karpathy’s NanoGPT repository ..." did some magic to me, so I wasn't attentive enough. There is always something to learn ...

I mean, if we set is_causal to False, attention in general should still work fine (that's why it's shown that way on pytorch.org), but it will ruin the calculation of the validation loss:

nanoGPT/train.py

Lines 204 to 217 in a82b33b

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

Is this the reason, or is there something else?
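
A minimal sketch of the effect (illustration only): with a full sequence as input, as in estimate_loss(), dropping the causal mask changes the attention output, so the eval-time loss would be computed with future-token leakage.

import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 4, 8, 16)  # full sequence forward pass

causal = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)
no_mask = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)
print(torch.allclose(causal, no_mask))  # False: without the mask every position attends to future tokens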

@karpathy karpathy closed this Apr 13, 2023
@YassineYousfi
Contributor

Correct. There are cases where is_causal can be set to False (e.g. when feeding one query at a time with a kv cache), but this is not one of them.
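
A minimal sketch of that kv-cache case (illustration only; tensor names are made up): the single new query is allowed to attend to every cached position, since they are all in the past, so no causal mask is needed.

import torch
import torch.nn.functional as F

# 7 tokens already generated and cached; one new query token being decoded.
k_cache = torch.randn(1, 4, 7, 16)
v_cache = torch.randn(1, 4, 7, 16)
q_new = torch.randn(1, 4, 1, 16)

y = F.scaled_dot_product_attention(q_new, k_cache, v_cache, dropout_p=0.0, is_causal=False)
print(y.shape)  # torch.Size([1, 4, 1, 16])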

@Andrei-Aksionov
Author

Thanks for the response.
