Tensorflow processes with horovod(NCCL) get stuck during the training. #306
Comments
We've had similar problems in the past with LL, so trying to disable it could help. Beyond that, you could also try NCCL 2.6 (https://github.com/NVIDIA/nccl/tree/v2.6). On a more general note, looking at your startup script, it would be good to avoid setting NCCL environment variables unless you really need them. However, setting some of them has a significant potential for causing issues like crashes, hangs, and data corruption (because we don't heavily test all the possible settings, only the default settings).
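For reference, a stripped-down launch that keeps only NCCL_DEBUG could look like the sketch below; the host list, process count, and train.py are placeholders, and OpenMPI is assumed as the launcher.

```bash
# Minimal sketch of a Horovod launch without NCCL tuning variables.
# host1/host2 and train.py are placeholders; OpenMPI is assumed.
mpirun -np 16 -H host1:8,host2:8 \
    --bind-to none --map-by slot \
    -x NCCL_DEBUG=INFO \
    -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```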
Hi, @sjeaugey
I restarted my job using this new script, with NCCL v2.6.2 compiled from https://github.com/NVIDIA/nccl/tree/v2.6 and Horovod 0.19.1 installed.
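For anyone reproducing this setup, a rough sketch of building NCCL from the v2.6 branch and reinstalling Horovod against it might look like the following; the paths and CUDA location are assumptions, not the exact commands used above.

```bash
# Sketch only: build NCCL from the v2.6 branch (NVIDIA/nccl) and rebuild
# Horovod 0.19.1 against it. CUDA_HOME and all paths are assumptions.
git clone -b v2.6 https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build CUDA_HOME=/usr/local/cuda

# Reinstall Horovod so it links against the freshly built NCCL.
HOROVOD_NCCL_HOME=$PWD/build \
HOROVOD_GPU_ALLREDUCE=NCCL \
pip install --no-cache-dir --force-reinstall horovod==0.19.1
```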
Sorry, we realized yesterday that there seemed to be a bug in 2.6 that prevented GPU Direct RDMA from being used for reads. We'll try to fix the 2.6 branch soon.
I pushed NCCL 2.6.4 to the v2.6 branch. Let us know if you still see a significant performance degradation. Thanks!
Hi, @sjeaugey
@jyhengcoder can you try setting the env var NCCL_LAUNCH_MODE? We also hit the training hang and solved it this way.
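The exact value suggested here isn't visible in the thread; NCCL documents PARALLEL and GROUP as values for NCCL_LAUNCH_MODE, so a hedged example of trying it with a Horovod-style launch would be:

```bash
# Sketch: pass NCCL_LAUNCH_MODE through mpirun so every rank sees it.
# PARALLEL is one of the documented values; the value used by the commenter
# above is not shown, so treat this as an assumption.
mpirun -np 16 -H host1:8,host2:8 \
    -x NCCL_LAUNCH_MODE=PARALLEL \
    -x NCCL_DEBUG=INFO \
    python train.py
```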
@qianzhang613 Thanks for your suggestion. I tried NCCL v2.6.4 last weekend and the training hang didn't appear. @sjeaugey
@jyhengcoder thanks for the feedback. If you'd like to investigate the speed degradation, I would suggest running the NCCL performance tests (https://github.com/nvidia/nccl-tests) with both 2.5 and 2.6 and seeing whether the performance degradation is visible with the tests.
@sjeaugey I succeeded in running nccl-tests with v2.4.7 and v2.5.7, but hit an error with v2.6.4.
Hmm, this is weird. Can you set NCCL_DEBUG=INFO to get the full log and backtrace of the WARN? Also, I would suggest compiling the tests with MPI support (make ... MPI=1) and launching with mpirun, exactly like with Horovod.
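A hedged sketch of that workflow, where MPI_HOME, NCCL_HOME, and the host list are placeholders:

```bash
# Build nccl-tests with MPI support and launch it the same way as the
# Horovod job, with NCCL_DEBUG=INFO to capture the full log of the WARN.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/local/openmpi NCCL_HOME=/path/to/nccl/build

mpirun -np 16 -H host1:8,host2:8 \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```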
Hi @sjeaugey, I ran the NCCL performance test on a cluster of 16 nodes with 8 NVIDIA V100 GPUs each.
Startup script
Results
Do you know which virtual lane is being used? Setting it explicitly might be worth checking.
@jyhengcoder I was hoping you could post the full output for each version, i.e. performance from 8B to 128MB. The average confirms there is something clear here, but I can't tell from just the average what to look for next. @kwen2501 Adaptive routing is an Infiniband-only feature; it does not apply to RoCE. But the code change for AR could still impact RoCE performance, so it is still worth checking here.
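One way to produce that full 8B-to-128MB sweep for both versions is to run the same test binary against each NCCL build; the library paths and hostfile below are placeholders.

```bash
# Sketch: run the identical sweep against each NCCL version by pointing
# LD_LIBRARY_PATH at the corresponding libnccl; paths are placeholders.
for NCCL_LIB in /opt/nccl-2.5.7/lib /opt/nccl-2.6.4/lib; do
    echo "=== Testing with $NCCL_LIB ==="
    mpirun -np 128 -hostfile hosts \
        -x LD_LIBRARY_PATH="$NCCL_LIB:$LD_LIBRARY_PATH" \
        -x NCCL_DEBUG=WARN \
        ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
done
```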
Results
v2.5.7
v2.6.4
Thanks. So we're aiming at 10 GB/s here, but with 2.6 we see hiccups. This is typical of a RoCE fabric dropping packets. Did you configure your RoCE switch in lossless / PFC mode? Otherwise your RoCE performance will be subject to this kind of sharp degradation every time we make even tiny changes in NCCL which cause timing changes in the packets. It would be helpful if you could confirm whether the adaptive routing support change is causing that difference, and whether reverting it restores the 2.5 performance.
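As a rough host-side check of the lossless/PFC question (the switch configuration still has to be verified on the switch itself), the sketch below assumes a Mellanox NIC with the MLNX_OFED tools installed; eth1 is a placeholder for the RoCE interface.

```bash
# Sketch: inspect PFC configuration and drop/pause counters on the RoCE NIC.
# Assumes a Mellanox NIC with MLNX_OFED tools; eth1 is a placeholder.
mlnx_qos -i eth1                                  # per-priority PFC state
ethtool -S eth1 | grep -iE 'discard|drop|pause'   # drop and pause counters
```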
@qianzhang613 @sjeaugey @kwen2501
I found that the TensorFlow processes get stuck after training runs for almost 30 hours, and this problem can be reproduced every time. The utilization of all GPUs stays at 100% when the processes hang.
I have raised an issue at horovod/horovod#1799 (comment), but I am not sure this is a Horovod problem, so I think this issue is also necessary.
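When the hang reproduces, per-rank backtraces usually show whether the processes are blocked inside an NCCL collective. A hedged sketch of collecting them is below; the pids are placeholders, and py-spy is an extra tool, not part of Horovod.

```bash
# Sketch: capture Python and native stacks from a hung training process.
# <pid> is a placeholder for each stuck rank; run this on every node.
pip install py-spy                                   # if not already available
py-spy dump --pid <pid> --native                     # Python + native frames
gdb -p <pid> -batch -ex 'thread apply all bt'        # C++/CUDA-side threads
```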