RuntimeError: No backend type associated with device type cpu #28
Comments
@minienglish1 Thanks for the kind words and support! Your suggestion seems to be a great fix. We will test it on our side and update the package on PyPI. We will keep you posted here.
Thanks, but the code is not mine; I copied it directly from ChatGPT o1 and only verified that it worked with my training script, so you should test it thoroughly. I also tested the following modification suggested by ChatGPT o1.

Use the GLOO backend for CPU tensors: initialize a separate process group with the GLOO backend, which supports CPU tensors, and use it for CPU-based collective operations.

Modify the optimizer initialization: add a GLOO process group in the `__init__` method of your optimizer:

```python
import torch.distributed as dist

class Adam_mini(torch.optim.Optimizer):
    ...
```

Modify the `step` method: in your `step` method, use the GLOO group for CPU tensors:

```python
@torch.no_grad()
...
```

Explanation: the default NCCL backend only supports CUDA tensors, so collectives on offloaded CPU tensors need a backend such as GLOO that can handle them.
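The two snippets above were truncated when this thread was archived. A minimal sketch of the described approach, assuming the reduced tensor is called `tmp_lr` as in the traceback below (the `gloo_group` attribute name is illustrative, not necessarily what ChatGPT o1 suggested):

```python
import torch
import torch.distributed as dist

class Adam_mini(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3):
        defaults = dict(lr=lr)
        super().__init__(params, defaults)
        # Extra process group using GLOO, which supports CPU tensors.
        # NCCL (the default backend on GPU) does not, hence the RuntimeError.
        self.gloo_group = dist.new_group(backend="gloo")

    @torch.no_grad()
    def step(self, closure=None):
        # ... existing Adam-mini update logic produces tmp_lr ...
        # Route the collective by device: the GLOO group for CPU
        # (offloaded) tensors, the default NCCL group otherwise.
        if tmp_lr.device.type == "cpu":
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM, group=self.gloo_group)
        else:
            dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
```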
@minienglish1 Hi, I think your change works in general FSDP offload cases. We have merged your changes (with some minor edits) into Adam-mini version 1.0.4, also updated on PyPI. You can run pip install adam-mini again to get the latest version. Thanks for your great suggestions! We expressed our gratitude to you in the acknowledgments. :D
Training a Stable Diffusion XL UNet using the accelerate library with FSDP: `fsdp_offload_params: true`; `fsdp_sharding_strategy: SHARD_GRAD_OP`.
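For reference, a rough equivalent of those two accelerate settings expressed directly with PyTorch's FSDP API (the `unet` variable is a placeholder for the SDXL UNet module):

```python
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    CPUOffload,
    ShardingStrategy,
)

# unet = ...  # the SDXL UNet module, already built
model = FSDP(
    unet,
    cpu_offload=CPUOffload(offload_params=True),       # fsdp_offload_params: true
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,  # fsdp_sharding_strategy: SHARD_GRAD_OP
)
```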
Environment:
- accelerate 0.34.2
- torch 2.4.1
- CUDA Version: 12.4
- adam_mini 1.0.3 (pip install)
Full Error:

```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1618, in <module>
[rank1]:     main()
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_12.py", line 1296, in main
[rank1]:     optimizer.step()
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 159, in step
[rank1]:     self.scaler.step(self.optimizer, closure)
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 454, in step
[rank1]:     retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/amp/grad_scaler.py", line 352, in _maybe_opt_step
[rank1]:     retval = optimizer.step(*args, **kwargs)
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/accelerate/optimizer.py", line 214, in patched_step
[rank1]:     return method(*args, **kwargs)
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank1]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/adam_mini/adam_mini.py", line 317, in step
[rank1]:     dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2288, in all_reduce
[rank1]:     work = group.allreduce([tensor], opts)
[rank1]: RuntimeError: No backend type associated with device type cpu
```
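The root cause is that with `fsdp_offload_params: true`, the tensor handed to `dist.all_reduce` lives on CPU, while the default process group only has the NCCL backend, which handles CUDA tensors exclusively. A minimal reproduction sketch, assuming a process group initialized with only the NCCL backend:

```python
import torch
import torch.distributed as dist

# Assume accelerate/FSDP already ran dist.init_process_group("nccl", ...)
t = torch.zeros(1)  # CPU tensor, like an offloaded optimizer quantity
dist.all_reduce(t, op=dist.ReduceOp.SUM)
# -> RuntimeError: No backend type associated with device type cpu
```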
I passed the code and the error to ChatGPT o1 with the requirement that it "forces the communication in GPUs when GPU is available", based on your response to issue "Adam mini can't offload to CPU #3". Its response was to modify the code as follows:
```python
@torch.no_grad()
def step(self, closure=None):
    # ... your existing code ...
```
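The suggested body was truncated above. A minimal sketch of the "force the communication onto GPU" idea around the failing all_reduce (the exact patch merged into 1.0.4 may differ):

```python
import torch
import torch.distributed as dist

# Inside step(), where tmp_lr may be a CPU tensor due to FSDP param offload:
if torch.cuda.is_available() and tmp_lr.device.type == "cpu":
    # Run the collective on GPU, since NCCL cannot reduce CPU tensors,
    # then copy the result back into the offloaded CPU tensor.
    tmp_lr_gpu = tmp_lr.to("cuda")
    dist.all_reduce(tmp_lr_gpu, op=dist.ReduceOp.SUM)
    tmp_lr.copy_(tmp_lr_gpu.cpu())
else:
    dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
```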
Using this modification allowed the script to train.
Compared to AdamW, the loss was similar for the first 25 steps.
ChatGPT o1 also suggested batching the tensor transfers, or using GLOO for the tensors on CPU, but I trust you know your code better than ChatGPT.
Further, based on "Adam mini can't save when using with FSDP in Huggingface Trainer #5", setting `fsdp_use_orig_params: false` allowed the training state to be saved.
Really excited that Adam-mini can be used with FSDP with CPU offload. Thanks for all your hard work on this!