
Zero Level 3 Offload SOMETIMES FAILS on 8 GPUs, ALWAYS WORKS on 4 GPUs #940

Closed
aced125 opened this issue Apr 10, 2021 · 24 comments · Fixed by #968
aced125 commented Apr 10, 2021

Hi - I'm getting a new error while trying to train a model on an 8 x V100 box. I'm using PyTorch Lightning, but I don't think that should make much of a difference.

System config:

PyTorch 1.8
CUDA 10.2
Ubuntu 18.04
DeepSpeed 0.3.14
Triton 0.2.3
Apex master branch
PyTorch Lightning 1.3.0rc1

Error trace:

Epoch 0:   0%|                                                                                | 0/564 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 488, in fit
    self.dispatch()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in dispatch
    self.accelerator.start_training(self)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 95, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 142, in start_training
    self._results = trainer.run_stage()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in run_stage
    self.run_train()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
    self.train_loop.run_training_epoch()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 422, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 575, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1414, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 301, in optimizer_step
    self.lightning_module, optimizer, opt_idx, lambda_closure, **kwargs
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 47, in pre_optimizer_step
    lambda_closure()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 570, in train_step_and_backward_closure
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 673, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 709, in backward
    result.closure_loss, optimizer, opt_idx, should_accumulate, *args, **kwargs
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 284, in backward
    self.lightning_module, closure_loss, optimizer, optimizer_idx, should_accumulate, *args, **kwargs
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 73, in backward
    deepspeed_engine.backward(closure_loss, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1020, in backward
    self.allreduce_gradients()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 940, in allreduce_gradients
    self.optimizer.overlapping_partition_gradients_reduce_epilogue()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1393, in overlapping_partition_gradients_reduce_epilogue
    self.independent_gradient_partition_epilogue()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1295, in independent_gradient_partition_epilogue
    self.partition_previous_reduced_grads()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1657, in partition_previous_reduced_grads
    param.partition_gradients(partition_buffers=self.temp_grad_gpu_buffer)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 460, in partition_gradients
    accumulate=accumulate)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 794, in _partition_gradients
    accumulate=accumulate)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 862, in _partition_gradient
    param.grad.data = dest_tensor_full_buffer.data
UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment
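
For context, an UnboundLocalError of this kind means a local variable was assigned only on some code path and then read unconditionally. A minimal, purely hypothetical sketch of that pattern (illustrative names only, not DeepSpeed's actual code):

# Hypothetical sketch, not DeepSpeed's code: the buffer is created only when this
# rank owns a non-empty slice of the gradient, but it is read afterwards regardless.
def partition_gradient(grad_numel, partition_size, rank):
    start = partition_size * rank
    if start < grad_numel:                  # this rank owns part of the gradient
        dest_tensor_full_buffer = object()  # stands in for the real buffer
    return dest_tensor_full_buffer          # UnboundLocalError when start >= grad_numel

partition_gradient(grad_numel=1, partition_size=1, rank=0)  # fine
partition_gradient(grad_numel=1, partition_size=1, rank=3)  # raises UnboundLocalError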
@aced125 aced125 changed the title UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment Zero Level 2 Works but Level 3 Fails: UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment Apr 13, 2021

aced125 commented Apr 13, 2021

Upon further investigation, this error only happens for Zero level 3. Zero level 2 works just fine.


aced125 commented Apr 13, 2021

tensor([1.], device='cuda:3', dtype=torch.float16, requires_grad=True)

This is the tensor I see when I print the parameter that triggers the issue.

This makes sense: the tensor has a single element (tensor.numel() == 1), so the partition size is 1, and any rank whose partition offset (partition_size * rank) falls at or beyond the end of the tensor will hit the bug.
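
A quick back-of-the-envelope check of that arithmetic in plain Python (partition_size and the per-rank offsets are illustrative names, not DeepSpeed internals):

import math

numel, world_size = 1, 8                        # the 1-element tensor above, 8 GPUs
partition_size = math.ceil(numel / world_size)  # = 1

for rank in range(world_size):
    start = partition_size * rank               # where this rank's slice would begin
    print(rank, "ok" if start < numel else "slice starts past the end of the tensor")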


aced125 commented Apr 13, 2021

Even weirder:

WORKS with 4 GPUs (batch sizes 1, 2, 4, 8, 16 all work)

FAILS with 8 GPUs (batch sizes 1, 2, 4, 8, 16 ALL FAIL)

What could be going on here?

@aced125 aced125 changed the title Zero Level 2 Works but Level 3 Fails: UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs Apr 13, 2021
@aced125 aced125 changed the title Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs SOLVED: Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs, because a parameter had numel = 4 Apr 13, 2021

aced125 commented Apr 13, 2021

SOLVED: When there is a parameter in the network with numel < num_gpus, the system FAILS.

E.g. if num_gpus = 8 but a parameter in the network only has 6 elements, the system will fail as above.


aced125 commented Apr 13, 2021

@jeffra Not sure if this is intended behaviour? If so, it would definitely be good to warn people.

This matters because regression problems often end with a linear layer whose bias has very few parameters, e.g.

self.linear = nn.Linear(256, 1, bias=True)

The bias in this layer has only 1 parameter, so the system will fail on anything more than 1 GPU.
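
A small helper along these lines (my own sketch, not part of DeepSpeed or this thread) can be used to check whether a model contains any parameter with fewer elements than the number of GPUs:

import torch.nn as nn

def tiny_params(model: nn.Module, num_gpus: int):
    """Return (name, numel) for every parameter smaller than num_gpus."""
    return [(n, p.numel()) for n, p in model.named_parameters() if p.numel() < num_gpus]

head = nn.Linear(256, 1, bias=True)
print(tiny_params(head, num_gpus=8))  # [('bias', 1)] -- the 1-element bias described above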

tjruwase (Contributor) commented:

@aced125, thanks for reporting and investigating this corner case. This is not intended behavior; our approach is not to partition or offload tiny parameters to CPU, which should handle this case. Based on your new findings, can you please clarify under what conditions the error is triggered?
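
The "not partitioning tiny parameters" behaviour referred to here is presumably governed by a ZeRO-3 threshold in the DeepSpeed config; the sketch below shows what that might look like, written as a Python dict. The stage3_param_persistence_threshold key name and its value are assumptions and should be checked against the ZeRO-3 config documentation for the installed version:

# Sketch only: parameters smaller than the threshold are kept whole on every rank
# instead of being partitioned. Verify the key name against your DeepSpeed version.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_param_persistence_threshold": 100000,  # keep tiny params unpartitioned
    },
}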


aced125 commented Apr 13, 2021

Sorry - actually I think I was wrong... The error is still happening...

@aced125 aced125 changed the title SOLVED: Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs, because a parameter had numel = 4 Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs Apr 13, 2021

aced125 commented Apr 13, 2021

More findings:

  • Works on 4 and 6 GPUs
  • Using normal Attention works on 8 GPUs but using DeepSpeed's SparseAttention fails!


aced125 commented Apr 13, 2021

Okay - I'm now actually finding that sometimes it works, and sometimes it doesn't work. This is getting really weird.

I'll run it once with some settings. It works. Then run it again and boom I get this error.

It could be because of the dataloader. Let me turn shuffle off and drop the last batch.

@aced125 aced125 changed the title Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs Zero Level 3 Offload SOMETIMES FAILS on 8 GPUs, ALWAYS WORKS on 4 GPUs Apr 13, 2021

aced125 commented Apr 13, 2021

No luck with the dataloader.

@tjruwase could it be because of low CPU RAM? And if so, how would I debug that?

tjruwase (Contributor) commented:

Could you try disabling CPU offloading of params by setting cpu_offload_params to false?
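
Concretely, that suggestion corresponds to a zero_optimization block along the following lines, written here as a Python dict (key names follow the wording above and the 0.3.x-era config; treat the exact spelling as an assumption and check the DeepSpeed docs for the installed version):

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "cpu_offload": True,          # optimizer-state offload left on (assumed key name)
        "cpu_offload_params": False,  # disable parameter offload, as suggested above
    },
}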


aced125 commented Apr 13, 2021

@tjruwase Just tried turning off CPU offloading.

Works on 4 GPUs
Fails on 8 GPUs, same issue


aced125 commented Apr 13, 2021

Btw - level 2 works on everything. It's level 3 that's the issue.


aced125 commented Apr 13, 2021

More info: it also fails on 5 GPUs.


aced125 commented Apr 13, 2021

Another weird quirk with stage 2: sometimes it says it cannot allocate memory, and sometimes it runs just fine... Dataloader shuffle is off.

tjruwase (Contributor) commented:

Can you share logs of stage 2 failing to allocate memory?

tjruwase (Contributor) commented:

@aced125, thanks for the hard work in creating a stable repro with zero stage 3. Is the failure on 5 and 8 GPUs repeatable?


aced125 commented Apr 13, 2021

@tjruwase @jeffra I have FINALLY spotted the error!

In my network I am outputting some tensors for classification (where there are N classes).

When N = 36, the whole thing works on 8 GPUs.

When N = 35, it FAILS on 8 GPUs with the above error, but WORKS on 4 GPUs!!!

import torch.nn.functional as F
import torch as th

# `model` is the DeepSpeed engine (hence model.backward) and `inputs` the batch, both defined elsewhere
labels = th.randint(low=0, high=36, size=(32,))
predictions = model(**inputs)  # shape (32, 36); or use th.randn(32, 36) to simulate

loss = F.cross_entropy(predictions, labels)
model.backward(loss)

Any idea why this is the case?

@aced125 aced125 closed this as completed Apr 13, 2021
@aced125 aced125 reopened this Apr 13, 2021
tjruwase (Contributor) commented:

@aced125, are you still seeing issues?


aced125 commented Apr 15, 2021

Yes - but I solved it in the following hacky way:

import torch.nn as nn

output_dim = 15  # the width I actually need

# Hack: make the layer wider (64 outputs) so no parameter ends up tiny,
# then slice back down to output_dim. `in_dim` and `x` are defined elsewhere.
lin = nn.Linear(in_dim, 64)

y = lin(x)
y = y[:, :output_dim]


aced125 commented Apr 15, 2021

It seems that when output_dim >= 36 things work; otherwise it fails.

tjruwase (Contributor) commented:

So can you please provide steps to repro the failure so we can continue the investigation?

SantoshGuptaML commented:
tensor([1.], device='cuda:3', dtype=torch.float16, requires_grad=True)

This is the tensor I see when I print the parameter that triggers the issue.

This makes sense: the tensor has a single element (tensor.numel() == 1), so the partition size is 1, and any rank whose partition offset (partition_size * rank) falls at or beyond the end of the tensor will hit the bug.

I am having the same issue trying to get the Bing SQuAD example to work with 4 GPUs. How were you able to print the exact tensor that was causing the issue? I would like to do the same to figure out where the issue is happening.

tjruwase (Contributor) commented:

@SantoshGuptaML, can you clarify the exact error you are seeing, since multiple issues were involved here?

To your question about printing actual tensor values, you need to use the GatheredParameters API, as follows:

import deepspeed

# gather each (ZeRO-3 partitioned) parameter so its full value is visible on this rank;
# the original snippet used a rank-0-only print helper (print0) and a `tag` label
for n, p in model.named_parameters():
    with deepspeed.zero.GatheredParameters(p):
        val = p.detach().to('cpu').data.float()
        print("{}: {} {}".format(n, val.shape, val))
