Does ncclAllGather have problems with recvBuffer size above 2GB? #184
Comments
Some more info: this happens on 8x Volta 16 GB with NVLink (i.e. 8 ranks), and also on 4x Volta. |
I could not reproduce the issue. I tried on 8 GPUs, sending 1GB / receiving 8GB, and it works (same type of machine: 8x Volta 16 GB with NVLink). I also tried on two GPUs, sending 4GB / receiving 8GB, and it worked as well. |
Thank you for testing this. There is a lot of other stuff happening around that code, so I will investigate some more. I will report back. |
OK, thanks again! Verifying that it's not NCCL's fault sent me on the right path. It is indeed an integer overflow at 2G bytes, but in my own code. |
Is the test you used available somewhere? |
A collection of NCCL perf/sanity tests is available at: https://github.com/NVIDIA/nccl-tests |
I am trying to produce a minimal example for this issue. You test your operations in isolation, while I seem to be seeing problems when allGather is used after scatterReduce and a couple of stream synchronizations. I will see if I can send you a piece of code that isolates this. |
Interesting. I cannot reproduce this in a minimal example, but downgrading to 2.3.5 solves the problem in my large code. So this is definitely NCCL related. |
Do you have the sequence of operations that leads up to the failure, or your minimal example, for us to look at? |
Sorry for reviving this thread after such a long time. I was happy with NCCL 2.3.5 working for me and have now gone back to test 2.5.6. I still have the same issue, but |
Hm, no, still hangs when reloading. I will try to take another look at that with fresh eyes. |
That is weird. NCCL 2.5.6 should not be affected by |
It's the correct version and NCCL_LL_THRESHOLD does not affect it. That was a misinterpretation on my part. It seems that now I have two different sequences of events where one results in hanging, the other doesn't. That's new. Will see if I can isolate this now. |
OK, some more observations:
Are there any drawbacks to using Any version works fine with buffers smaller than 2GB. |
Oh, so Simple hangs but LL works? I thought it was the opposite. And while it is an interesting observation, I think we should not settle for that (LL is much, much slower than Simple). One easy way to try to reproduce the issue outside of the program would be to dump the NCCL sequence and replay it. Can you set
Hi,
I am seeing weird behavior for ncclAllGather when the recvBuffer size exceeds almost exactly 2 GB, regardless of the type used. Everything below 2GB works perfectly fine with float32 or float16. As soon as 2GB is exceeded, I get random hanging threads during or after the ncclAllGather operation, or random memory access errors. Are there any known limits here? Integer size overflow maybe?
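For anyone who wants to check their own size arithmetic for this kind of overflow, here is a minimal sketch. It does not call NCCL at all. It only shows how a receive-buffer byte count (per-rank element count times number of ranks times element size) crosses the signed 32-bit limit. The helper names `recv_bytes`, `fits_in_int32` and `wrapped_as_int32` are hypothetical and made up for illustration. The rank count and element size in the usage note are assumptions chosen to match the 8-rank float32 setup described in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: total size in bytes of an AllGather receive buffer,
// computed entirely in 64-bit arithmetic so it cannot wrap at 2 GB.
std::uint64_t recv_bytes(std::uint64_t sendcount,
                         std::uint64_t nranks,
                         std::uint64_t elem_size) {
    assert(nranks > 0 && elem_size > 0);
    return sendcount * nranks * elem_size;
}

// Hypothetical helper: true if the byte count can be stored in a signed
// 32-bit int (the usual type behind an accidental 2 GB limit).
bool fits_in_int32(std::uint64_t bytes) {
    return bytes <= static_cast<std::uint64_t>(
                        std::numeric_limits<std::int32_t>::max());
}

// Hypothetical helper: the value a two's-complement 32-bit signed variable
// would end up holding if the byte count were squeezed into it.
// Returned as int64_t so the computation itself stays well defined.
std::int64_t wrapped_as_int32(std::uint64_t bytes) {
    const std::int64_t low = static_cast<std::int64_t>(bytes & 0xFFFFFFFFULL);
    return low >= 0x80000000LL ? low - 0x100000000LL : low;
}
```

With 8 ranks and float32 (4 bytes), a per-rank count of 67108864 elements gives exactly 2147483648 bytes: `fits_in_int32` reports false, and `wrapped_as_int32` shows the negative value an `int` would hold. One element fewer per rank still fits. Keeping every count and offset that feeds the collective in `size_t` or another 64-bit type avoids this class of bug in user code; it says nothing about whether NCCL itself has such a limit.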