[BUG] Compressed Adam optimizers - RuntimeError: Bool type is not supported by dlpack #1859
Comments
Hi @jhoareau thanks for reporting this issue. It seems like there is an ongoing discussion at dlpack about whether or not to support the bool type: dmlc/dlpack#75. As for your comment,
I assume you added the casting in DeepSpeed/deepspeed/runtime/comm/nccl.py. Overall, the only way to solve this problem currently is to use an older PyTorch (on our side we verified that torch 1.8 works). @awan-10 and I need to have some internal discussion about whether there is any solution that works without the bool type, but it might take some time. We also need to see dlpack's decision.
Hi @conglongli, just adding that dlpack's issue has been open for a while, so I think going for a workaround like this is best. The reason it works in Torch up to 1.9 is that they used to do the casting internally: pytorch/pytorch#67081 (comment). I could recommend casting to uint8 only if you detect the PyTorch version to be 1.10 or over: earlier versions take no performance hit, PyTorch 1.10 and over at least stays functional, and performance on those versions is what it is.
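The version gate I have in mind could look something like this (just a sketch, not DeepSpeed code; `needs_uint8_cast` and `_parse_version` are hypothetical helper names):

```python
def _parse_version(v: str):
    # "1.11.0+cu115" -> (1, 11, 0); strip any local-build suffix first
    return tuple(int(x) for x in v.split("+")[0].split(".")[:3])

def needs_uint8_cast(torch_version: str, dtype_is_bool: bool) -> bool:
    # torch < 1.10 casted bool internally when exporting via dlpack,
    # so the explicit uint8 cast is only needed on 1.10 and newer.
    return dtype_is_bool and _parse_version(torch_version) >= (1, 10)
```

On older versions this returns False, so the existing (faster) path is untouched.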
I see. Yes, we will investigate this on our side. But please understand that because we need to test both performance and convergence, and because of bandwidth limitations, this will take some time. Until then I would recommend using an older PyTorch if possible.
Hi @conglongli, the PR indeed fixes the issue. Thanks a lot for the quick PR turnaround!
Thanks for confirming @jhoareau, will merge the PR then.
Describe the bug
When using the OneBitAdam and ZeroOneAdam implementations, the error from the title appears when comm_backend_name is set to nccl.
This is linked to the operations:
https://github.com/microsoft/DeepSpeed/blob/208d45bbf7cbde2abfb233e7d10803553fbcf126/deepspeed/runtime/comm/nccl.py#L72
https://github.com/microsoft/DeepSpeed/blob/208d45bbf7cbde2abfb233e7d10803553fbcf126/deepspeed/runtime/comm/nccl.py#L129
And the fact that bool is not supported by dlpack export since PyTorch 1.10: see pytorch/pytorch#67081
Google's JAX repo recommends casting to uint8 instead of bool: jax-ml/jax#4719
Beware that when I tried to implement the casting locally, I got terrible performance with ZeroOneAdam.
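For reference, the cast I tried locally was along these lines (a sketch of the uint8 workaround, not the actual nccl.py patch; the wrapper names are mine):

```python
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

def bool_to_dlpack(t: torch.Tensor):
    # On torch >= 1.10, exporting a bool tensor via dlpack raises
    # "RuntimeError: Bool type is not supported by dlpack",
    # so ship the mask as uint8 instead.
    if t.dtype == torch.bool:
        t = t.to(torch.uint8)
    return to_dlpack(t)

def dlpack_to_bool(capsule) -> torch.Tensor:
    # Receiving side: reinterpret the uint8 payload back as bool.
    return from_dlpack(capsule).to(torch.bool)
```

The extra cast allocates a new uint8 tensor per call, which may be part of why I saw the performance hit with ZeroOneAdam.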
Expected behavior
The ZeroOneAdam optimizer should work with nccl and the latest PyTorch version.
ds_report output
DeepSpeed general environment info:
torch version .................... 1.11.0+cu115
torch cuda version ............... 11.5
torch hip version ................ None
nvcc version ..................... 11.4
deepspeed info ................... 0.6.1+208d45b, 208d45b, master
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5, hip 0.0
Launcher context
Pytorch-Lightning DeepSpeedPlugin, Python 3.8