NCCL with cuda-memcheck #265
Is NCCL supposed to work with cuda-memcheck? This test code runs fine on its own but crashes under cuda-memcheck: https://github.com/naoyam/nccl_error. I'm using 2 P100 GPUs on a Power8 system with CUDA 10.1.243 and NCCL 2.4.8. The code can be run with `make run`; the error only happens when it is run under cuda-memcheck. The test code just allreduces one int value over 2 GPUs. Let me know if I have any misunderstanding here.
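For reference, a single-process allreduce of one int over 2 GPUs, in the spirit of the linked test, might look like the sketch below. This is a plausible reconstruction following the standard one-process/multi-GPU NCCL pattern, not the actual code from the repo; running it under cuda-memcheck rather than directly is what exposes the problem discussed in the comments.

```c
// repro.cu -- hedged sketch of a one-int ncclAllReduce across 2 GPUs
// (a reconstruction, not the code from the linked repository).
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CUDACHECK(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
  fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)
#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
  fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); exit(1); } } while (0)

int main(void) {
  const int nDev = 2;
  int devs[2] = {0, 1};
  ncclComm_t comms[2];
  cudaStream_t streams[2];
  int *sendbuf[2], *recvbuf[2];

  // Allocate exactly one int per device -- the allocation whose
  // 16-byte rounded read cuda-memcheck flags (see the comments below).
  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(devs[i]));
    CUDACHECK(cudaMalloc((void **)&sendbuf[i], sizeof(int)));
    CUDACHECK(cudaMalloc((void **)&recvbuf[i], sizeof(int)));
    CUDACHECK(cudaMemset(sendbuf[i], 0, sizeof(int)));
    CUDACHECK(cudaStreamCreate(&streams[i]));
  }

  // One communicator per GPU within a single process.
  NCCLCHECK(ncclCommInitAll(comms, nDev, devs));

  // Allreduce a single int (count = 1) over the 2 GPUs.
  NCCLCHECK(ncclGroupStart());
  for (int i = 0; i < nDev; ++i)
    NCCLCHECK(ncclAllReduce(sendbuf[i], recvbuf[i], 1, ncclInt, ncclSum,
                            comms[i], streams[i]));
  NCCLCHECK(ncclGroupEnd());

  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(devs[i]));
    CUDACHECK(cudaStreamSynchronize(streams[i]));
  }

  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaFree(sendbuf[i]));
    CUDACHECK(cudaFree(recvbuf[i]));
    NCCLCHECK(ncclCommDestroy(comms[i]));
  }
  puts("done");
  return 0;
}
```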
Comments

Indeed: to optimize and simplify the code, when reading user data NCCL always rounds the read size up to a multiple of 16 bytes (though it makes sure never to write a single byte too many). That should probably be fixed, since it triggers cuda-memcheck. I could also imagine a case where the user buffer is badly aligned (which is a bad idea for performance anyway) and where reading 16 bytes would cross the page boundary of the allocation and cause a crash. Using 4 integers in your test instead of one makes the test pass.
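On the user side, one way to sidestep the report until this is fixed is to round the allocation up to a multiple of 16 bytes, so that NCCL's wider loads stay inside the buffer; that is the effect of using 4 ints instead of 1. A minimal sketch, assuming a hypothetical `round_up_16` helper (not part of any NCCL or CUDA API):

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical helper: round a byte count up to the next multiple of 16
 * so that a read rounded up to 16-byte granularity cannot run past the
 * end of the allocation. */
static size_t round_up_16(size_t bytes) {
  return (bytes + 15) & ~(size_t)15;
}

int main(void) {
  int *buf = NULL;
  /* Allocate 16 bytes even though only one 4-byte int is allreduced;
   * equivalent in effect to the "use 4 integers" workaround above. */
  cudaError_t err = cudaMalloc((void **)&buf, round_up_16(sizeof(int)));
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("allocated %zu bytes for one int\n", round_up_16(sizeof(int)));
  cudaFree(buf);
  return 0;
}
```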
Thanks. I confirm it doesn't crash with 4 integers. Are you suggesting that cuda-memcheck crashing, instead of just reporting the buffer overrun, is likely due to the way NCCL reads a 16-byte block, and that it is not a bug in cuda-memcheck per se?
Actually, I tried on x86 and got a clean report from cuda-memcheck mentioning that NCCL was reading out of bounds. Your output should look like:
Well, I tried several options of cuda-memcheck, such as …