NCCL with cuda-memcheck #265

Open
naoyam opened this issue Oct 29, 2019 · 4 comments

naoyam commented Oct 29, 2019

Is NCCL supposed to work with cuda-memcheck? This test code runs fine normally but crashes under cuda-memcheck: https://github.com/naoyam/nccl_error. I'm using two P100 GPUs on a Power8 system with CUDA 10.1.243 and NCCL 2.4.8. The code can be run with make run; under cuda-memcheck the error looks like this:

$ make cuda-memcheck
MPI initialized
NCCL initialized
ncclAllreduce issued
Async error checked
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)

The test code just allreduces a single int value over 2 GPUs. Let me know if I'm misunderstanding something here.

sjeaugey (Member) commented Oct 29, 2019

Indeed: to optimize and simplify the code, NCCL always rounds reads of user data up to a multiple of 16 bytes (though it makes sure never to write a single byte too many).

That should probably be fixed, since it triggers cuda-memcheck. Also, I could imagine a case where the user buffer is badly aligned (already a bad idea for performance) and reading 16 bytes would cross the page boundary of the allocation and cause a crash.

Using 4 integers in your test instead of one makes the test pass.

naoyam (Author) commented Oct 29, 2019

Thanks, I can confirm it doesn't crash with 4 integers. Are you suggesting that cuda-memcheck crashing, rather than reporting the buffer overrun, is likely due to the way NCCL reads data in 16-byte blocks, and that it is not a bug in cuda-memcheck per se?

sjeaugey (Member) commented
Actually, I tried on x86 and got a clean report from cuda-memcheck saying NCCL was reading out of bounds.
It looks like your test is only capturing stderr (hence the abort message and your test's output) but not stdout (where cuda-memcheck writes its report).

Your output should look like:

========= CUDA-MEMCHECK
========= CUDA-MEMCHECK
MPI initialized
NCCL initialized
ncclAllreduce issued
Async error checked
========= Invalid __global__ read of size 1
=========     at 0x00002bc8 in /tmp/nccl/src/collectives/device/primitives.h:438:ncclBroadcastRingLLKernel_copy_i8(ncclColl)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f0635400805 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x27d0d5]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 [0x57459]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 [0x574e7]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 [0x8d835]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 [0x1f96c]
=========     Host Frame:/tmp/nccl/build/lib/libnccl.so.2 (ncclBroadcast + 0x73) [0x357d3]
=========     Host Frame:./nccl_test.exe [0x12b92]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./nccl_test.exe [0x11b5a]
=========
CUDA error: unspecified launch failure
Error at nccl_test.cpp:========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x38c7b3]
=========     Host Frame:./nccl_test.exe [0x42ec6]
=========     Host Frame:./nccl_test.exe [0x12d79]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:./nccl_test.exe [0x11b5a]
=========
107
[gc01:25946] *** Process received signal ***
[gc01:25946] Signal: Aborted (6)
[gc01:25946] Signal code:  (-6)
[gc01:25946] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f066eafb890]
[gc01:25946] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f066da9de97]
[gc01:25946] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f066da9f801]
[gc01:25946] [ 3] ./nccl_test.exe(+0x12e1b)[0x55af47420e1b]
[gc01:25946] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f066da80b97]
[gc01:25946] [ 5] ./nccl_test.exe(+0x11b5a)[0x55af4741fb5a]
[gc01:25946] *** End of error message ***
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

naoyam (Author) commented Oct 29, 2019

Well, I tried several cuda-memcheck options such as --save, --log-file and --flush-to-disk, but none of them changes the output in my environment. Maybe because it's on Power?
