NCCL with cuda-memcheck #265

Open
naoyam opened this issue Oct 29, 2019 · 4 comments

naoyam commented Oct 29, 2019

Is NCCL supposed to work with cuda-memcheck? This test code runs fine normally but crashes under cuda-memcheck: https://github.com/naoyam/nccl_error. I'm using two P100 GPUs on a Power8 system with CUDA 10.1.243 and NCCL 2.4.8. The code can be run with make run; under cuda-memcheck the error looks like this:

$ make cuda-memcheck
MPI initialized
NCCL initialized
ncclAllreduce issued
Async error checked
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)

The test code just allreduces a single int value over 2 GPUs. Let me know if I'm misunderstanding something here.

sjeaugey (Member) commented Oct 29, 2019

Indeed: to optimize and simplify the code, NCCL always rounds reads of user data up to a multiple of 16 bytes (though it makes sure never to write a single byte too many).

That should probably be fixed, since it triggers cuda-memcheck. Also, I could imagine a case where the user buffer is badly aligned (already a bad idea for performance) and reading 16 bytes would cross the page boundary of the allocation and cause a crash.

Using 4 integers in your test instead of one makes the test pass.

naoyam (Author) commented Oct 29, 2019

Thanks, I can confirm it doesn't crash with 4 integers. Are you suggesting that cuda-memcheck crashing, rather than reporting the buffer overrun, is likely due to the way NCCL reads data in 16-byte blocks, and that it is not a bug in cuda-memcheck per se?

sjeaugey (Member) commented
Actually, I tried on x86 and got a clean report from cuda-memcheck saying NCCL was reading out of bounds.
It looks like your test is only capturing stderr (hence the abort message and your test's output) but not stdout (where cuda-memcheck writes its report).

Your output should look like:

========= CUDA-MEMCHECK
========= CUDA-MEMCHECK
MPI initialized
NCCL initialized
ncclAllreduce issued
Async error checked
========= Invalid __global__ read of size 1
=========     at 0x00002bc8 in /tmp/nccl/src/collectives/device/primitives.h:438:ncclBroadcastRingLLKernel_copy_i8(ncclColl)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f0635400805 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x27d0d5]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 [0x57459]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 [0x574e7]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 [0x8d835]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 [0x1f96c]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 (ncclBroadcast + 0x73) [0x357d3]
=========     Host Frame:./nccl_test.exe [0x12b92]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./nccl_test.exe [0x11b5a]
=========
CUDA error: unspecified launch failure
Error at nccl_test.cpp:========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x38c7b3]
=========     Host Frame:./nccl_test.exe [0x42ec6]
=========     Host Frame:./nccl_test.exe [0x12d79]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./nccl_test.exe [0x11b5a]
=========
107
[gc01:25946] *** Process received signal ***
[gc01:25946] Signal: Aborted (6)
[gc01:25946] Signal code:  (-6)
[gc01:25946] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f066eafb890]
[gc01:25946] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f066da9de97]
[gc01:25946] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f066da9f801]
[gc01:25946] [ 3] ./nccl_test.exe(+0x12e1b)[0x55af47420e1b]
[gc01:25946] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f066da80b97]
[gc01:25946] [ 5] ./nccl_test.exe(+0x11b5a)[0x55af4741fb5a]
[gc01:25946] *** End of error message ***
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

naoyam (Author) commented Oct 29, 2019

Well, I tried several cuda-memcheck options such as --save, --log-file and --flush-to-disk, but none of them changes the output in my environment. Maybe because it's on Power?
