
Zero Level 3 Offload SOMETIMES FAILS on 8 GPUs, ALWAYS WORKS on 4 GPUs #940

Closed
aced125 opened this issue Apr 10, 2021 · 24 comments · Fixed by #968
aced125 commented Apr 10, 2021

Hi - I'm getting a new error while trying to train a model on an 8 x V100 box. I'm using PyTorch Lightning, but I don't think that should make much of a difference.

System config:

PyTorch 1.8
CUDA 10.2
Ubuntu 18.04
DeepSpeed 0.3.14
Triton 0.2.3
Apex master branch
PyTorch Lightning 1.3.0rc1

Error trace:

Epoch 0:   0%|                                                                                | 0/564 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 488, in fit
    self.dispatch()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in dispatch
    self.accelerator.start_training(self)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 95, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 142, in start_training
    self._results = trainer.run_stage()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in run_stage
    self.run_train()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
    self.train_loop.run_training_epoch()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 422, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 575, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 370, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1414, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 301, in optimizer_step
    self.lightning_module, optimizer, opt_idx, lambda_closure, **kwargs
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 47, in pre_optimizer_step
    lambda_closure()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 570, in train_step_and_backward_closure
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 673, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 709, in backward
    result.closure_loss, optimizer, opt_idx, should_accumulate, *args, **kwargs
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 284, in backward
    self.lightning_module, closure_loss, optimizer, optimizer_idx, should_accumulate, *args, **kwargs
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/deepspeed_precision.py", line 73, in backward
    deepspeed_engine.backward(closure_loss, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1020, in backward
    self.allreduce_gradients()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 940, in allreduce_gradients
    self.optimizer.overlapping_partition_gradients_reduce_epilogue()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1393, in overlapping_partition_gradients_reduce_epilogue
    self.independent_gradient_partition_epilogue()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1295, in independent_gradient_partition_epilogue
    self.partition_previous_reduced_grads()
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 1657, in partition_previous_reduced_grads
    param.partition_gradients(partition_buffers=self.temp_grad_gpu_buffer)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 460, in partition_gradients
    accumulate=accumulate)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 794, in _partition_gradients
    accumulate=accumulate)
  File "/home/ubuntu/anaconda3/envs/torch/lib/python3.7/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 862, in _partition_gradient
    param.grad.data = dest_tensor_full_buffer.data
UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment
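
For context, an UnboundLocalError of this kind means a local variable was assigned only on some code path and then read unconditionally. A minimal, purely hypothetical sketch of that pattern (illustrative names only, not DeepSpeed's actual code):

# Hypothetical sketch, not DeepSpeed's code: the buffer is created only when this
# rank owns a non-empty slice of the gradient, but it is read afterwards regardless.
def partition_gradient(grad_numel, partition_size, rank):
    start = partition_size * rank
    if start < grad_numel:                  # this rank owns part of the gradient
        dest_tensor_full_buffer = object()  # stands in for the real buffer
    return dest_tensor_full_buffer          # UnboundLocalError when start >= grad_numel

partition_gradient(grad_numel=1, partition_size=1, rank=0)  # fine
partition_gradient(grad_numel=1, partition_size=1, rank=3)  # raises UnboundLocalError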
@aced125 aced125 changed the title UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment Zero Level 2 Works but Level 3 Fails: UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment Apr 13, 2021

aced125 commented Apr 13, 2021

Upon further investigation, this error only happens for Zero level 3. Zero level 2 works just fine.


aced125 commented Apr 13, 2021

tensor([1.], device='cuda:3', dtype=torch.float16, requires_grad=True)

This is the tensor I see when I print the parameter that triggers the issue.

This makes sense: the tensor has a single element (tensor.numel() == 1), so the partition size is 1, and any rank whose partition offset (partition_size * rank) falls at or beyond the end of the tensor will hit the bug.
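
A quick back-of-the-envelope check of that arithmetic in plain Python (partition_size and the per-rank offsets are illustrative names, not DeepSpeed internals):

import math

numel, world_size = 1, 8                        # the 1-element tensor above, 8 GPUs
partition_size = math.ceil(numel / world_size)  # = 1

for rank in range(world_size):
    start = partition_size * rank               # where this rank's slice would begin
    print(rank, "ok" if start < numel else "slice starts past the end of the tensor")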


aced125 commented Apr 13, 2021

Even weirder:

WORKS with 4 GPUs (batch sizes 1, 2, 4, 8, 16 all work)

FAILS with 8 GPUs (batch sizes 1, 2, 4, 8, 16 ALL FAIL)

What could be going on here?

@aced125 aced125 changed the title Zero Level 2 Works but Level 3 Fails: UnboundLocalError: local variable 'dest_tensor_full_buffer' referenced before assignment Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs Apr 13, 2021
@aced125 aced125 changed the title Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs SOLVED: Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs, because a parameter had numel = 4 Apr 13, 2021

aced125 commented Apr 13, 2021

SOLVED: When there is a parameter in the network with numel < num_gpus, the system FAILS.

E.g. if num_gpus = 8 but a parameter in the network only has 6 elements, the system will fail as above.


aced125 commented Apr 13, 2021

@jeffra Not sure if this is intended behaviour? If so, it would definitely be good to warn people.

This matters because regression problems often end with a linear layer whose bias has very few parameters, e.g.

self.linear = nn.Linear(256, 1, bias=True)

The bias in this layer has only 1 parameter, so the system will fail on anything more than 1 GPU.
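
A small helper along these lines (my own sketch, not part of DeepSpeed or this thread) can be used to check whether a model contains any parameter with fewer elements than the number of GPUs:

import torch.nn as nn

def tiny_params(model: nn.Module, num_gpus: int):
    """Return (name, numel) for every parameter smaller than num_gpus."""
    return [(n, p.numel()) for n, p in model.named_parameters() if p.numel() < num_gpus]

head = nn.Linear(256, 1, bias=True)
print(tiny_params(head, num_gpus=8))  # [('bias', 1)] -- the 1-element bias described above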

tjruwase (Contributor) commented:

@aced125, thanks for reporting and investigating this corner case. This is not intended behavior; our approach is not to partition or offload tiny parameters to CPU, which should handle this case. Based on your new findings, can you please clarify under what conditions the error is triggered?
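
The "not partitioning tiny parameters" behaviour referred to here is presumably governed by a ZeRO-3 threshold in the DeepSpeed config; the sketch below shows what that might look like, written as a Python dict. The stage3_param_persistence_threshold key name and its value are assumptions and should be checked against the ZeRO-3 config documentation for the installed version:

# Sketch only: parameters smaller than the threshold are kept whole on every rank
# instead of being partitioned. Verify the key name against your DeepSpeed version.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_param_persistence_threshold": 100000,  # keep tiny params unpartitioned
    },
}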


aced125 commented Apr 13, 2021

Sorry - actually I think I was wrong... The error is still happening...

@aced125 aced125 changed the title SOLVED: Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs, because a parameter had numel = 4 Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs Apr 13, 2021

aced125 commented Apr 13, 2021

More findings:

  • Works on 4 and 6 GPUs
  • Using normal Attention works on 8 GPUs but using DeepSpeed's SparseAttention fails!


aced125 commented Apr 13, 2021

Okay - I'm now actually finding that sometimes it works, and sometimes it doesn't work. This is getting really weird.

I'll run it once with some settings. It works. Then run it again and boom I get this error.

It could be because of the dataloader. Let me turn shuffle off and drop the last batch.

@aced125 aced125 changed the title Zero Level 3 Offload FAILS on 8 GPUs, WORKS on 4 GPUs Zero Level 3 Offload SOMETIMES FAILS on 8 GPUs, ALWAYS WORKS on 4 GPUs Apr 13, 2021

aced125 commented Apr 13, 2021

No luck with the dataloader.

@tjruwase could it be because of low CPU RAM? And if so, how would I debug that?

tjruwase (Contributor) commented:

Could you try disabling CPU offloading of params by setting cpu_offload_params to false?
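
Concretely, that suggestion corresponds to a zero_optimization block along the following lines, written here as a Python dict (key names follow the wording above and the 0.3.x-era config; treat the exact spelling as an assumption and check the DeepSpeed docs for the installed version):

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "cpu_offload": True,          # optimizer-state offload left on (assumed key name)
        "cpu_offload_params": False,  # disable parameter offload, as suggested above
    },
}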


aced125 commented Apr 13, 2021

@tjruwase Just tried turning off CPU offloading.

Works on 4 GPUs
Fails on 8 GPUs, same issue


aced125 commented Apr 13, 2021

Btw - level 2 works on everything. It's level 3 that's the issue.


aced125 commented Apr 13, 2021

More info: it also fails on 5 GPUs.


aced125 commented Apr 13, 2021

Another weird quirk with stage 2: sometimes it says it cannot allocate memory, and sometimes it runs just fine... Dataloader shuffle is off.

tjruwase (Contributor) commented:

Can you share logs of stage 2 failing to allocate memory?

tjruwase (Contributor) commented:

@aced125, thanks for the hard work in creating a stable repro with zero stage 3. Is the failure on 5 and 8 GPUs repeatable?


aced125 commented Apr 13, 2021

@tjruwase @jeffra I have FINALLY spotted the error!

In my network I am outputting some tensors for classification (where there are N classes).

When N = 36, the whole thing works on 8 GPUs.

When N = 35, it FAILS on 8 GPUs with the above error, but WORKS on 4 GPUs!!!

import torch.nn.functional as F
import torch as th

# `model` is the DeepSpeed engine (hence model.backward) and `inputs` the batch, both defined elsewhere
labels = th.randint(low=0, high=36, size=(32,))
predictions = model(**inputs)  # shape (32, 36); or use th.randn(32, 36) to simulate

loss = F.cross_entropy(predictions, labels)
model.backward(loss)

Any idea why this is the case?

@aced125 aced125 closed this as completed Apr 13, 2021
@aced125 aced125 reopened this Apr 13, 2021
tjruwase (Contributor) commented:

@aced125, are you still seeing issues?


aced125 commented Apr 15, 2021

Yes - but I solved it in the following hacky way:

import torch.nn as nn

output_dim = 15  # the width I actually need

# Hack: make the layer wider (64 outputs) so no parameter ends up tiny,
# then slice back down to output_dim. `in_dim` and `x` are defined elsewhere.
lin = nn.Linear(in_dim, 64)

y = lin(x)
y = y[:, :output_dim]


aced125 commented Apr 15, 2021

It seems that when output_dim >= 36 things work; otherwise it fails.

tjruwase (Contributor) commented:

So can you please provide steps to repro the failure so we can continue the investigation?

SantoshGuptaML commented:
tensor([1.], device='cuda:3', dtype=torch.float16, requires_grad=True)

This is the tensor I see when I print the parameter that triggers the issue.

This makes sense: the tensor has a single element (tensor.numel() == 1), so the partition size is 1, and any rank whose partition offset (partition_size * rank) falls at or beyond the end of the tensor will hit the bug.

I am having the same issue trying to get the Bing SQuAD example to work with 4 GPUs. How were you able to print the exact tensor that was causing the issue? I would like to do the same to figure out where the issue is happening.

tjruwase (Contributor) commented:

@SantoshGuptaML, can you clarify the exact error you are seeing, since multiple issues were involved here?

To your question about printing actual tensor values, you need to use the GatheredParameters API, as follows:

import deepspeed

# gather each (ZeRO-3 partitioned) parameter so its full value is visible on this rank;
# the original snippet used a rank-0-only print helper (print0) and a `tag` label
for n, p in model.named_parameters():
    with deepspeed.zero.GatheredParameters(p):
        val = p.detach().to('cpu').data.float()
        print("{}: {} {}".format(n, val.shape, val))
