
Adam mini can't offload to CPU #3

Closed
hahuyhoang411 opened this issue Jun 27, 2024 · 2 comments

Comments

hahuyhoang411 commented Jun 27, 2024

I'm using accelerate launch to run FSDP with Adam-mini from the latest update, but it doesn't seem to support CPU offload. Any help? Thank you.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/train.py", line 222, in <module>
[rank1]:     trainer_stats = trainer.train()
[rank1]:                     ^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/trl/trainer/sft_trainer.py", line 440, in train
[rank1]:     output = super().train(*args, **kwargs)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 1885, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]:     self.optimizer.step()
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/accelerate/optimizer.py", line 170, in step
[rank1]:     self.optimizer.step(closure)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank1]:     return wrapped(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/Adam_mini.py", line 228, in step
[rank1]:     dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank1]:     work = group.allreduce([tensor], opts)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: No backend type associated with device type cpu
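
For context, the failure at the bottom of the traceback can be reproduced independently of Adam-mini: with an NCCL process group, an all_reduce on a CPU-resident tensor (which is where CPU offload keeps the optimizer state) raises exactly this error, because NCCL only implements CUDA collectives. A minimal sketch, assuming a multi-GPU node launched with torchrun; the script and setup are illustrative, not the trainer's actual code:

import os
import torch
import torch.distributed as dist

# Illustrative repro: NCCL has no CPU backend, so reducing a CPU tensor fails
# with "No backend type associated with device type cpu".
dist.init_process_group(backend="nccl")              # same backend the GPU training run uses
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

cpu_scalar = torch.zeros(1)                           # lives on CPU, like CPU-offloaded optimizer state
dist.all_reduce(cpu_scalar, op=dist.ReduceOp.SUM)     # RuntimeError: No backend type associated with device type cpu

dist.destroy_process_group()

Run with, e.g., torchrun --nproc_per_node=2 repro.py (the filename is hypothetical).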
chcoliang (Collaborator) commented

Hi,
Thank you for mentioning it. It may be caused by the NCCL backend not supporting CPU communication. We have updated Adam_mini.py to force the communication onto GPUs when a GPU is available. We hope this solves the issue.
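
The shape of such a fix, as a sketch only (not the repository's exact code; the helper name is made up, and tmp_lr follows the variable in the traceback above), is to stage the value on a CUDA tensor before the collective whenever a GPU is present, since NCCL has no CPU backend:

import torch
import torch.distributed as dist

def reduce_block_lr(tmp_lr: torch.Tensor) -> torch.Tensor:
    """Sum a per-block quantity across ranks, staging it on GPU when one is available."""
    if torch.cuda.is_available() and tmp_lr.device.type == "cpu":
        device = torch.device("cuda", torch.cuda.current_device())
        staged = tmp_lr.to(device)                        # NCCL collectives require CUDA tensors
        dist.all_reduce(staged, op=dist.ReduceOp.SUM)
        tmp_lr.copy_(staged.to("cpu"))                    # write the result back to the offloaded state
    else:
        dist.all_reduce(tmp_lr, op=dist.ReduceOp.SUM)     # already on GPU, or a CPU-capable backend (e.g. gloo)
    return tmp_lr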

hahuyhoang411 (Author) commented

Great, the fix works well. Thank you!
