Zero Level 3 Offload SOMETIMES FAILS on 8 GPUs, ALWAYS WORKS on 4 GPUs #940
Upon further investigation, this error only happens for Zero level 3. Zero level 2 works just fine.
This is the tensor that causes the issue when I print it: `tensor([1.], device='cuda:3', dtype=torch.float16, requires_grad=True)`. This makes sense: this tensor has 1 element (`tens.numel() == 1`), so the partition size is 1, and for any rank beyond the first the partition offset `partition_size * rank` is already past the end of the tensor, so the bug happens!
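To make that arithmetic concrete, here is a rough, illustrative sketch of how equal per-rank partitions behave for a 1-element parameter on 8 ranks (this mirrors the reasoning above, not DeepSpeed's actual partitioning code):

```python
# Illustrative only: how a 1-element parameter interacts with per-rank partitioning.
def naive_partition_bounds(numel, world_size):
    """Split `numel` elements into `world_size` equal chunks (rounded up)."""
    partition_size = (numel + world_size - 1) // world_size  # ceil division
    return [(rank, partition_size * rank, min(partition_size * (rank + 1), numel))
            for rank in range(world_size)]

# A bias with a single element, as in the tensor printed above, on 8 GPUs.
for rank, start, end in naive_partition_bounds(numel=1, world_size=8):
    # Every rank past the first gets a start offset beyond the tensor's end.
    print(f"rank {rank}: start={start}, end={end}, empty={start >= end}")
```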
Even weirder:
- WORKS with 4 GPUs (batch sizes 1, 2, 4, 8, 16 all work)
- FAILS with 8 GPUs (batch sizes 1, 2, 4, 8, 16 ALL FAIL)

What could be going on here?
SOLVED: when there is a parameter in the network with numel < num_gpus, the system FAILS. E.g. if num_gpus = 8 but a parameter in the network only has 6 elements, the system will fail as above.
@jeffra Not sure if this is intended behaviour? If so, it would definitely be good to warn people. The reason this is important is that regression problems often have a linear layer with a bias that has very few parameters, e.g. `self.linear = nn.Linear(256, 1, bias=True)`. The bias in this layer has only 1 parameter, so the system will fail with more than 1 GPU.
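For concreteness, a small sketch of how one might scan a model for such tiny parameters before launching (the `num_gpus` value and the single-layer model here are illustrative assumptions):

```python
import torch.nn as nn

num_gpus = 8  # the planned world size (assumed for illustration)
model = nn.Linear(256, 1, bias=True)  # the regression head described above

for name, p in model.named_parameters():
    # The bias of Linear(256, 1) has numel() == 1, which is what trips the failure here.
    if p.numel() < num_gpus:
        print(f"tiny parameter: {name} with numel={p.numel()} < num_gpus={num_gpus}")
```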
@aced125, thanks for reporting and investigating this corner case. This is not intended behavior: our approach is not to partition or offload tiny parameters to CPU, and that should handle this case. Based on your new findings, can you please clarify under what conditions the error is triggered?
Sorry - actually I think I was wrong... The error is still happening... |
More findings:
Okay - I'm now actually finding that sometimes it works, and sometimes it doesn't. This is getting really weird. I'll run it once with some settings and it works; then I run it again and boom, I get this error. It could be because of the dataloader. Let me turn shuffle off and drop the last batch.
No luck on the dataloader. @tjruwase could it be because of low CPU RAM? And if so, how do I debug that?
Could you try disabling CPU offloading of params in your ZeRO config?
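For reference, a hedged sketch of what such a config might look like, written as a Python dict. The key names follow current DeepSpeed docs (`offload_param`); the 0.3.x releases discussed in this thread spelled this option differently, so treat the exact keys as assumptions and check the docs for your version:

```python
# Sketch of a ZeRO stage-3 config with parameter CPU-offloading disabled.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,      # placeholder value
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "none"},  # keep parameters on GPU
    },
}

# Hypothetical usage:
# import deepspeed
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```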
@tjruwase Just tried turning off CPU offloading. Works on 4 GPUs.
Btw - level 2 works on everything. It's level 3 that's the issue.
More info - it fails on 5 GPUs.
Another weird quirk on stage 2: sometimes it says it cannot allocate memory, sometimes it runs just fine... Dataloader shuffle is off.
Can you share logs of stage 2 failing to allocate memory?
@aced125, thanks for the hard work in creating a stable repro with ZeRO stage 3. Is the failure on 5 and 8 GPUs repeatable?
@tjruwase @jeffra I have FINALLY spotted the error! In my network I am outputting some tensors for classification (where there are N classes). When N = 36, the whole thing works on 8 GPUs. When N = 35, it FAILS on 8 GPUs with the above error, but WORKS on 4 GPUs!!!

```python
import torch.nn.functional as F
import torch as th

labels = th.randint(low=0, high=36, size=(32,))
predictions = model(**inputs)  # shape (32, 36). Or do th.randn(32, 36)
loss = F.cross_entropy(predictions, labels)
model.backward(loss)
```

Any idea why this is the case?
@aced125, are you still seeing issues?
Yes - but I solved it in the following hacky way:

```python
import torch.nn as nn

output_dim = 15
# Pad the layer to 64 outputs so no parameter is tiny, then slice back down.
lin = nn.Linear(in_dim, 64)  # in_dim and x come from the surrounding model code
y = lin(x)
y = y[:, :output_dim]
```
It seems that when
So can you please provide steps to repro the failure so we can continue investigation?
I am having the same issue, trying to get the Bing SQuAD example to work with 4 GPUs. How were you able to print the exact tensor that was causing the issue? I wish to do the same to figure out where the issue is happening.
@SantoshGuptaML, can you clarify the exact error you are seeing, since multiple issues were involved here? To your question about printing actual tensor values, you need to use the Gather API as follows:

```python
import deepspeed

# `print0` and `tag` are assumed helpers from the surrounding script
# (e.g. a rank-0-only print function and a label string).
for n, p in model.named_parameters():
    with deepspeed.zero.GatheredParameters(p):
        val = p.detach().to('cpu').data.float()
        print0("{} {}: {} {}".format(tag, n, val.shape, val))
```
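As a side note, `GatheredParameters` also accepts an iterable of parameters, so a whole (sub)module can be gathered in one context. A sketch follows; note that this materializes every gathered parameter on each rank, so it is only sensible for small modules:

```python
import deepspeed

# Gather all parameters of `model` at once (fine for small models; for large
# ones, gather one parameter at a time as in the loop above).
with deepspeed.zero.GatheredParameters(list(model.parameters())):
    total = sum(p.numel() for p in model.parameters())
    print(f"total elements across gathered parameters: {total}")
```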
Hi - I'm getting a new error while trying to train a model on an 8 x V100 box. I'm using PyTorch Lightning, but I don't think that should make much of a difference.
Sys config:
PyTorch 1.8
CUDA 10.2
Ubuntu 18.04
DeepSpeed 0.3.14
Triton 0.2.3
Apex master branch
PyTorch Lightning 1.3.0rc1
Error trace: