RuntimeError: No backend type associated with device type cpu #28

Open
minienglish1 opened this issue Sep 18, 2024 · 3 comments

minienglish1 commented Sep 18, 2024

Training a Stable Diffusion XL UNet using the accelerate library with FSDP: fsdp_offload_params: true; fsdp_sharding_strategy: SHARD_GRAD_OP.
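For reference, the relevant part of the accelerate config looks roughly like this (a sketch showing only the keys mentioned here; the rest of the generated YAML is omitted):

distributed_type: FSDP
fsdp_config:
  fsdp_offload_params: true
  fsdp_sharding_strategy: SHARD_GRAD_OP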

Environment:
accelerate-0.34.2
torch-2.4.1
CUDA Version: 12.4
adam_mini-1.0.3 (pip install)

Full Error:
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1618, in
[rank1]: main()
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1296, in main
[rank1]: optimizer.step()
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 159, in step
[rank1]: self.scaler.step(self.optimizer, closure)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 454, in step
[rank1]: retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 352, in _maybe_opt_step
[rank1]: retval = optimizer.step(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 214, in patched_step
[rank1]: return method(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank1]: return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank1]: out = func(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/adam_mini/adam_mini.py", line 317, in step
[rank1]: dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank1]: work = group.allreduce([tensor], opts)
[rank1]: RuntimeError: No backend type associated with device type cpu
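For anyone who finds this later: the failure is easy to reproduce outside the trainer. A minimal sketch (not my actual training code) that hits the same error, run with torchrun --nproc_per_node=2:

import torch
import torch.distributed as dist

def main():
    # NCCL only registers a backend for CUDA tensors
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())  # assumes a single-node launch

    # With fsdp_offload_params: true, optimizer state such as tmp_lr lives on the CPU
    cpu_tensor = torch.ones(1)

    # Raises: RuntimeError: No backend type associated with device type cpu
    dist.all_reduce(cpu_tensor, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()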

I passed the code & error to ChatGPT o1 with the requirement "forces the communication in GPUs when GPU is available" based on your response to issue "Adam mini can't offload to CPU #3". It's response was a modifying the code as:

@torch.no_grad()
def step(self, closure=None):
    # ... your existing code ...

    if (state["reduced"]):
        # Force communication over GPUs when GPUs are available
        if tmp_lr.device.type == 'cpu':
            # Move the tensor to the current GPU device
            tmp_lr_gpu = tmp_lr.to(torch.cuda.current_device())
            # Perform the all-reduce operation on the GPU tensor
            dist.all_reduce(tmp_lr_gpu, op=dist.ReduceOp.SUM)
            # Move the result back to the CPU tensor
            tmp_lr.copy_(tmp_lr_gpu.cpu())
        else:
            # Tensor is already on GPU, use the NCCL backend
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)

Using this modification allowed the script to train.

Compared to AdamW, the loss was similar for the first 25 steps.
ChatGPT o1 also suggested batching the tensor transfers, or using GLOO for the tensors on CPU. But I trust you know your code better than ChatGPT does.

Further, based on "Adam mini can't save when using with FSDP in Huggingface Trainer #5", setting fsdp_use_orig_params: false allowed the training state to be saved.

Really excited that Adam-mini can be used with FSDP with CPU offload. Thanks for all your hard work on this!

zyushun (Owner) commented Sep 18, 2024

@minienglish1 Thanks for the kind words and support!

Your suggestion seems to be a great fix. We will test it on our side and update the package on PyPI.

We will keep you updated here.

minienglish1 (Author) commented

Thanks, but the code is not mine; I copied it directly from ChatGPT o1. I only verified that it worked with my training script, so you should test it thoroughly.

I also tested the following modification suggested by ChatGPT o1.
It also appears to work fine, at a similar speed to the fix in the post above.
Perhaps it will benefit someone who finds this issue thread.
Again, the code is copied directly from ChatGPT o1.

Use the GLOO Backend for CPU Tensors:

Initialize a separate process group with the GLOO backend, which supports CPU tensors, and use it for CPU-based collective operations.

Modify the Optimizer Initialization:

Add a GLOO process group in the __init__ method of your optimizer:

import torch.distributed as dist

class Adam_mini(torch.optim.Optimizer):
    def __init__(self, named_parameters, **kwargs):
        # ... your existing code ...

        # Initialize the default backend and a GLOO group if using NCCL
        if not dist.is_initialized():
            dist.init_process_group(backend='nccl' if torch.cuda.is_available() else 'gloo')
        self.default_backend = dist.get_backend()
        if self.default_backend == 'nccl':
            self.gloo_group = dist.new_group(backend='gloo')

Modify the step Method:

In your step method, use the GLOO group for CPU tensors:

@torch.no_grad()
def step(self, closure=None):
    # ... your existing code ...

    if (state["reduced"]):
        # Use the GLOO group if the tensor is on the CPU
        if tmp_lr.device.type == 'cpu' and self.default_backend == 'nccl':
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM, group=self.gloo_group)
        else:
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)

Explanation:

GLOO Backend: GLOO supports both CPU and GPU tensors, making it suitable for CPU operations.
Separate Process Group: By creating a new process group with GLOO, you avoid interfering with the existing NCCL-based group used for GPU operations.
Conditional All-Reduce: The code checks the device type of tmp_lr and uses the appropriate backend.
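For completeness, the same pattern as a standalone sketch outside the optimizer (assumes a single-node launch, e.g. torchrun --nproc_per_node=2):

import torch
import torch.distributed as dist

def main():
    # Default group uses NCCL for GPU tensors
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank())  # single-node assumption

    # Extra GLOO group used only for CPU tensors
    gloo_group = dist.new_group(backend="gloo")

    cpu_tensor = torch.ones(1)
    dist.all_reduce(cpu_tensor, op=dist.ReduceOp.SUM, group=gloo_group)  # GLOO handles CPU tensors

    gpu_tensor = torch.ones(1, device="cuda")
    dist.all_reduce(gpu_tensor, op=dist.ReduceOp.SUM)  # default NCCL group for GPU tensors

    if dist.get_rank() == 0:
        # Both values should equal the world size after the all-reduce
        print(cpu_tensor.item(), gpu_tensor.item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()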

zyushun (Owner) commented Sep 18, 2024

@minienglish1 Hi, I think your change works in general FSDP offload cases. We have merged your changes (with some minor modifications) into Adam-mini version 1.0.4.

The update is also on PyPI; you can run pip install -U adam-mini to get the latest version.
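For anyone updating, a rough usage sketch with accelerate (not taken from this issue; the model and hyperparameter values below are placeholders):

import torch
from accelerate import Accelerator
from adam_mini import Adam_mini

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)  # stand-in for the real model, e.g. an SDXL UNet

optimizer = Adam_mini(
    named_parameters=model.named_parameters(),
    lr=1e-5,            # placeholder learning rate
    weight_decay=0.01,  # placeholder weight decay
)

# Let accelerate wrap the model and optimizer (FSDP config comes from accelerate config)
model, optimizer = accelerator.prepare(model, optimizer)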

Thanks for your great suggestions! We have expressed our gratitude to you in the acknowledgments. :D
