Does ncclAllGather have problems with recvBuffer size above 2GB? #184

Open
emjotde opened this issue Feb 26, 2019 · 15 comments

Comments

@emjotde

emjotde commented Feb 26, 2019

Hi,
I am seeing weird behavior from ncclAllGather when the recvBuffer size exceeds almost exactly 2 GB, regardless of the type used. Everything below 2 GB works perfectly fine with float32 or float16. As soon as 2 GB is exceeded, I get random hanging threads during or after the ncclAllGather operation, or random memory access errors. Are there any known limits here? Integer size overflow maybe?

@emjotde
Author

emjotde commented Feb 26, 2019

Some more info: this happens on 8x Volta 16 GB with NVLink (i.e. 8 ranks), and also on 4x Volta.

@sjeaugey
Member

I could not reproduce the issue. I tried on 8 GPUs, sending 1 GB / receiving 8 GB, and it works (same type of machine: 8x Volta 16 GB with NVLink).

I also tried on two GPUs, sending 4 GB / receiving 8 GB, and it worked as well.

@emjotde
Author

emjotde commented Feb 26, 2019

Thank you for testing this. There is a lot of other stuff happening around that code, so I will investigate some more. I will report back.

@emjotde
Author

emjotde commented Feb 26, 2019

OK, thanks again! Verifying that it's not NCCL's fault sent me on the right path. It is indeed an integer overflow at 2 GB, but in my own code.

@emjotde emjotde closed this as completed Feb 26, 2019
@emjotde
Author

emjotde commented Feb 26, 2019

Is the test you used available somewhere?

@emjotde emjotde reopened this Feb 26, 2019
@AddyLaddy
Collaborator

A collection of NCCL perf/sanity tests is available at: https://github.com/NVIDIA/nccl-tests

@emjotde
Author

emjotde commented Feb 26, 2019

I am trying to produce a minimal example for that issue. You test your operations in isolation, while I seem to be seeing problems when allGather is used after scatterReduce and a couple of stream synchronizations. Will see if I can send you a piece of code that isolates this.

@emjotde
Author

emjotde commented Mar 2, 2019

Interesting. I cannot reproduce this in a minimal example, but downgrading to 2.3.5 solves the problem in my large code. So this is definitely NCCL related.

@AddyLaddy
Collaborator

Do you have the sequence of operations that leads up to the failure, or your minimal example, for us to look at please?
Can you still observe the failure if you set NCCL_LL_THRESHOLD=0 or NCCL_LL_THRESHOLD=$((8192*1024*1024))?

@emjotde
Author

emjotde commented Jan 7, 2020

Sorry for reviving this thread after such a long time. I was happy with NCCL 2.3.5 working for me and have now gone back to test 2.5.6. I still have the same issue, but NCCL_LL_THRESHOLD=0 does indeed fix it. Can I set that value somewhere in code?

@emjotde
Author

emjotde commented Jan 7, 2020

Hm, no, still hangs when reloading. I will try to take another look at that with fresh eyes.

@sjeaugey
Member

sjeaugey commented Jan 7, 2020

That is weird. NCCL 2.5.6 should not be affected by NCCL_LL_THRESHOLD (the equivalent would be NCCL_PROTO=^LL). Are you sure your binary didn't get linked statically with 2.4.2? Setting NCCL_DEBUG=VERSION prints a line with the version in the output.

@emjotde
Author

emjotde commented Jan 7, 2020

It's the correct version, and NCCL_LL_THRESHOLD does not affect it. That was a misinterpretation on my part. It seems that I now have two different sequences of events, one of which results in a hang while the other doesn't. That's new. I will see if I can isolate this now.

@emjotde
Author

emjotde commented Jan 7, 2020

OK, some more observations:

  • 2.5.6 hangs or randomly gives me a memory access error with default settings. Works with NCCL_PROTO=LL.
  • 2.4.8 hangs with default settings, memory access error with NCCL_LL_THRESHOLD=0, but works fine with NCCL_LL_THRESHOLD=$((8192*1024*1024)).
  • 2.3.7 works fine out of the box.

Are there any drawbacks to using NCCL_PROTO=LL? Can I tell NCCL to use that by default via code instead of via an environment variable?

Any version works fine with buffers smaller than 2GB.

@sjeaugey
Member

sjeaugey commented Jan 7, 2020

Oh, so Simple hangs but LL works? I thought it was the opposite. And while it is an interesting observation, I think we should not settle for that (LL is much, much slower than Simple).

One easy way to try to reproduce the issue outside of the program would be to dump the NCCL sequence and replay it.

Can you set NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL and provide the log? Or at least the end of the log, if it is too long? That might help us reproduce the issue.
