
Adam mini can't offload to CPU #3

Closed
hahuyhoang411 opened this issue Jun 27, 2024 · 2 comments

Comments

hahuyhoang411 commented Jun 27, 2024

I'm using accelerate launch to run FSDP with Adam-mini from the latest update, but it doesn't seem to support CPU offload. Any help? Thank you.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/train.py", line 222, in <module>
[rank1]:     trainer_stats = trainer.train()
[rank1]:                     ^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py", line 440, in train
[rank1]:     output = super().train(*args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 1885, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]:     self.optimizer.step()
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/optimizer.py", line 170, in step
[rank1]:     self.optimizer.step(closure)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank1]:     return wrapped(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/Adam_mini.py", line 228, in step
[rank1]:     dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank1]:     work = group.allreduce([tensor], opts)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: No backend type associated with device type cpu
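
For context, the failure at the bottom of the traceback can be reproduced independently of Adam-mini: with an NCCL process group, an all_reduce on a CPU-resident tensor (which is where CPU offload keeps the optimizer state) raises exactly this error, because NCCL only implements CUDA collectives. A minimal sketch, assuming a multi-GPU node launched with torchrun; the script and setup are illustrative, not the trainer's actual code:

import os
import torch
import torch.distributed as dist

# Illustrative repro: NCCL has no CPU backend, so reducing a CPU tensor fails
# with "No backend type associated with device type cpu".
dist.init_process_group(backend="nccl")              # same backend the GPU training run uses
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

cpu_scalar = torch.zeros(1)                           # lives on CPU, like CPU-offloaded optimizer state
dist.all_reduce(cpu_scalar, op=dist.ReduceOp.SUM)     # RuntimeError: No backend type associated with device type cpu

dist.destroy_process_group()

Run with, e.g., torchrun --nproc_per_node=2 repro.py (the filename is hypothetical).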
chcoliang (Collaborator) commented

Hi,
Thank you for mentioning it. It may be caused by the NCCL backend not supporting CPU communication. We have updated Adam_mini.py to force the communication onto GPUs when a GPU is available. We hope this solves the issue.
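
The shape of such a fix, as a sketch only (not the repository's exact code; the helper name is made up, and tmp_lr follows the variable in the traceback above), is to stage the value on a CUDA tensor before the collective whenever a GPU is present, since NCCL has no CPU backend:

import torch
import torch.distributed as dist

def reduce_block_lr(tmp_lr: torch.Tensor) -> torch.Tensor:
    """Sum a per-block quantity across ranks, staging it on GPU when one is available."""
    if torch.cuda.is_available() and tmp_lr.device.type == "cpu":
        device = torch.device("cuda", torch.cuda.current_device())
        staged = tmp_lr.to(device)                        # NCCL collectives require CUDA tensors
        dist.all_reduce(staged, op=dist.ReduceOp.SUM)
        tmp_lr.copy_(staged.to("cpu"))                    # write the result back to the offloaded state
    else:
        dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)     # already on GPU, or a CPU-capable backend (e.g. gloo)
    return tmp_lr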

hahuyhoang411 (Author) commented

Great, the fix works well. Thank you!
