RuntimeError: No backend type associated with device type cpu #28

Open
minienglish1 opened this issue Sep 18, 2024 · 3 comments

minienglish1 commented Sep 18, 2024

Training a Stable Diffusion XL UNet using the accelerate library with FSDP: fsdp_offload_params: true; fsdp_sharding_strategy: SHARD_GRAD_OP.
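For reference, the relevant part of the accelerate config looks roughly like this (a sketch showing only the keys mentioned here; the rest of the generated YAML is omitted):

distributed_type: FSDP
fsdp_config:
  fsdp_offload_params: true
  fsdp_sharding_strategy: SHARD_GRAD_OP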

Environment:
accelerate-0.34.2
torch-2.4.1
CUDA Version: 12.4
adam_mini-1.0.3 (pip install)

Full Error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1618, in
[rank1]: main()
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1296, in main
[rank1]: optimizer.step()
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 159, in step
[rank1]: self.scaler.step(self.optimizer, closure)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 454, in step
[rank1]: retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 352, in _maybe_opt_step
[rank1]: retval = optimizer.step(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 214, in patched_step
[rank1]: return method(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank1]: return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank1]: out = func(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/adam_mini/adam_mini.py", line 317, in step
[rank1]: dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank1]: work = group.allreduce([tensor], opts)
[rank1]: RuntimeError: No backend type associated with device type cpu
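For anyone who finds this later: the failure is easy to reproduce outside the trainer. A minimal sketch (not my actual training code) that hits the same error, run with torchrun --nproc_per_node=2:

import torch
import torch.distributed as dist

def main():
    # NCCL only registers a backend for CUDA tensors
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())  # assumes a single-node launch

    # With fsdp_offload_params: true, optimizer state such as tmp_lr lives on the CPU
    cpu_tensor = torch.ones(1)

    # Raises: RuntimeError: No backend type associated with device type cpu
    dist.all_reduce(cpu_tensor, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()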

I passed the code & error to ChatGPT o1 with the requirement "forces the communication in GPUs when GPU is available" based on your response to issue "Adam mini can't offload to CPU #3". It's response was a modifying the code as:

@torch.no_grad()
def step(self, closure=None):
    # ... your existing code ...

    if (state["reduced"]):
        # Force communication over GPUs when GPUs are available
        if tmp_lr.device.type == 'cpu':
            # Move the tensor to the current GPU device
            tmp_lr_gpu = tmp_lr.to(torch.cuda.current_device())
            # Perform the all-reduce operation on the GPU tensor
            dist.all_reduce(tmp_lr_gpu, op=dist.ReduceOp.SUM)
            # Move the result back to the CPU tensor
            tmp_lr.copy_(tmp_lr_gpu.cpu())
        else:
            # Tensor is already on GPU, use the NCCL backend
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)

Using this modification allowed the script to train.

Compared to AdamW, the loss was similar for the first 25 steps.
ChatGPT o1 also suggested batching the tensor transfers, or using GLOO for the tensors on CPU. But I trust you know your code better than ChatGPT does.

Further, based on "Adam mini can't save when using with FSDP in Huggingface Trainer #5", setting fsdp_use_orig_params: false allowed the training state to be saved.

Really excited that Adam-mini can be used with FSDP with CPU offload. Thanks for all your hard work on this!

zyushun (Owner) commented Sep 18, 2024

@minienglish1 Thanks for the kind words and support!

Your suggestion seems to be a great fix. We will test it on our side and update the package on PyPI.

We will keep you updated here.

minienglish1 (Author) commented

Thanks, but the code is not mine; I copied it directly from ChatGPT o1. I only verified that it worked with my training script, so you should test it thoroughly.

I also tested the following modification suggested by ChatGPT o1.
It also appears to work fine, at a similar speed to the fix in the post above.
Perhaps it will benefit someone who finds this issue thread.
Again, the code is copied directly from ChatGPT o1.

Use the GLOO Backend for CPU Tensors:

Initialize a separate process group with the GLOO backend, which supports CPU tensors, and use it for CPU-based collective operations.

Modify the Optimizer Initialization:

Add a GLOO process group in the __init__ method of your optimizer:

import torch.distributed as dist

class Adam_mini(torch.optim.Optimizer):
    def __init__(self, named_parameters, **kwargs):
        # ... your existing code ...

        # Initialize the default backend and a GLOO group if using NCCL
        if not dist.is_initialized():
            dist.init_process_group(backend='nccl' if torch.cuda.is_available() else 'gloo')
        self.default_backend = dist.get_backend()
        if self.default_backend == 'nccl':
            self.gloo_group = dist.new_group(backend='gloo')

Modify the step Method:

In your step method, use the GLOO group for CPU tensors:

@torch.no_grad()
def step(self, closure=None):
    # ... your existing code ...

    if (state["reduced"]):
        # Use the GLOO group if the tensor is on the CPU
        if tmp_lr.device.type == 'cpu' and self.default_backend == 'nccl':
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM, group=self.gloo_group)
        else:
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)

Explanation:

GLOO Backend: GLOO supports both CPU and GPU tensors, making it suitable for CPU operations.
Separate Process Group: By creating a new process group with GLOO, you avoid interfering with the existing NCCL-based group used for GPU operations.
Conditional All-Reduce: The code checks the device type of tmp_lr and uses the appropriate backend.
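For completeness, the same pattern as a standalone sketch outside the optimizer (assumes a single-node launch, e.g. torchrun --nproc_per_node=2):

import torch
import torch.distributed as dist

def main():
    # Default group uses NCCL for GPU tensors
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())  # single-node assumption

    # Extra GLOO group used only for CPU tensors
    gloo_group = dist.new_group(backend="gloo")

    cpu_tensor = torch.ones(1)
    dist.all_reduce(cpu_tensor, op=dist.ReduceOp.SUM, group=gloo_group)  # GLOO handles CPU tensors

    gpu_tensor = torch.ones(1, device="cuda")
    dist.all_reduce(gpu_tensor, op=dist.ReduceOp.SUM)  # default NCCL group for GPU tensors

    if dist.get_rank() == 0:
        # Both values should equal the world size after the all-reduce
        print(cpu_tensor.item(), gpu_tensor.item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()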

zyushun (Owner) commented Sep 18, 2024

@minienglish1 Hi, I think your change works in general FSDP offload cases. We have merged your changes (with some minor modifications) into Adam-mini version 1.0.4.

The update is also on PyPI; you can run pip install -U adam-mini to get the latest version.
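For anyone updating, a rough usage sketch with accelerate (not taken from this issue; the model and hyperparameter values below are placeholders):

import torch
from accelerate import Accelerator
from adam_mini import Adam_mini

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)  # stand-in for the real model, e.g. an SDXL UNet

optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=1e-5,            # placeholder learning rate
    weight_decay=0.01,  # placeholder weight decay
)

# Let accelerate wrap the model and optimizer (FSDP config comes from accelerate config)
model, optimizer = accelerator.prepare(model, optimizer)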

Thanks for your great suggestions! We have expressed our gratitude to you in the acknowledgments. :D
