
nccl-tests cannot run multi-machine over RDMA inside Docker containers #281

Eevan-zq opened this issue Jan 15, 2025 · 6 comments


@Eevan-zq

Hello,
I have recently run into a problem. I started Docker containers (official NVIDIA images) on two servers and want to run multi-node nccl-tests between the containers, with inter-machine communication over RDMA, but I have not been able to get it working. Here is the basic information about the two containers:
1. The mlx5_x devices are visible inside each container via `ibv_devices`.


2. Each container can run nccl-tests successfully on its own machine (single-node).


3. RDMA communication tests with perftest between the two containers also succeed.


4. mpirun can also communicate between the two containers when it is not launching nccl-tests.


5. When the two containers use the ordinary TCP/Ethernet network instead of RDMA, the test runs normally, just slowly, at about 10 Gb/s (a sketch of how such a TCP-only run can be forced is shown right after this list).

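For reference, this is a sketch of how a TCP-only comparison run like the one in step 5 might be forced. The exact flags used for that run are not shown in this thread; `NCCL_IB_DISABLE` and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables, and the interface name and mpirun options are carried over from the commands below.

```sh
# Force NCCL off the IB/RoCE transport and onto the socket (TCP) transport,
# pinned to the bonded Ethernet interface used elsewhere in this thread.
# Assumes the same hostfile, SSH port, and test binary as the commands below.
mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_DISABLE=1 \
  -x NCCL_SOCKET_IFNAME=bond0.2460 \
  -mca btl_tcp_if_include bond0.2460 \
  -mca plm_rsh_args "-p 2102" \
  /home/nccl-tool/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8
```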

Finally, here is what happens when I run mpirun + nccl-tests.

The command is: `mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 -mca btl_base_verbose 30 -mca btl_tcp_if_include bond0.2460 -mca plm_rsh_args "-p 2102" /usr/local/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8 > nccl-log`.
The result is:

nccl-log.txt

Although a result is produced, it is exactly the same as the single-machine result, and a genuine multi-machine run should not achieve numbers that good.

@kiskra-nvidia
Member

It appears that you used different all_reduce_perf binaries in your runs. The one from the last experiment (/usr/local/bin/all_reduce_perf) does not appear to have been compiled with MPI support. What I see in the attached log file is two separate single-node runs (notice that "Rank 0" repeats and that "nranks" in the debug output is 8). The one from step 5 is definitely MPI-enabled; for the one from step 2 it's impossible to say, as it was a single-process run.
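A quick way to confirm which binary has MPI support (a rough check, assuming the binary is dynamically linked) is to look for an MPI library in the ldd output:

```sh
# An all_reduce_perf built with MPI=1 should link against an MPI library
# such as libmpi.so (Open MPI). No output here usually means the binary
# was built without MPI support.
ldd /usr/local/bin/all_reduce_perf | grep -i libmpi
```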

Don't feel embarrassed about it though; I catch myself doing the same thing every so often 😉.

@Eevan-zq
Author

Yes, you've spotted my oversight.
/usr/local/bin/all_reduce_perf is the all_reduce_perf binary that ships with the NVIDIA Docker image, and /home/nccl-tool/bin/all_reduce_perf is the binary I cloned from GitHub and built myself (version v2.17.1-1). MPI was enabled when compiling nccl-tests; the command was `make -j128 MPI=1 MPI_HOME=/opt/hpcx/ompi/ CUDA_HOME=/usr/local/cuda NCCL_HOME=/home/nccl-tool/dependency/nccl BUILDDIR=/home/nccl-tool/bin`.

I will use /home/nccl-tool/bin/all_reduce_perf consistently for all further testing and will post its results. However, it still cannot communicate via RDMA, and now even the 10 Gb/s TCP communication is not working.
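As an extra sanity check (my own assumption of what is worth verifying, not something taken from the logs), it may be worth confirming that the mpirun used to launch is the same Open MPI the tests were built against (MPI_HOME=/opt/hpcx/ompi/), and that the rebuilt binary picked up both MPI and the intended NCCL:

```sh
# Compare the launcher on PATH with the build-time MPI installation, and
# check what the rebuilt binary links against. Paths follow the build
# command above; adjust if the hpcx layout differs in your image.
which mpirun
mpirun --version
/opt/hpcx/ompi/bin/mpirun --version
ldd /home/nccl-tool/bin/all_reduce_perf | grep -Ei 'mpi|nccl'
```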

@Eevan-zq
Author

Eevan-zq commented Jan 16, 2025

The command is `mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 -mca btl_base_verbose 30 -mca btl_tcp_if_include bond0.2460 -mca plm_rsh_args "-p 2102" /home/nccl-tool/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8 > nccl-log2.txt`.


nccl-log2.txt

It still cannot use RDMA.
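One way to dig further (a standard NCCL debugging knob, not something used in this thread) would be to narrow the debug output to the transport-selection subsystems and re-run the same command; the INIT and NET lines show whether the IB plugin finds mlx5_0 and, if not, why NCCL falls back to the socket transport:

```sh
# Same launch as above, with NCCL_DEBUG_SUBSYS restricting the INFO output
# to initialization and network transport selection.
mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile \
  -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET \
  -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 \
  -mca btl_base_verbose 30 -mca btl_tcp_if_include bond0.2460 \
  -mca plm_rsh_args "-p 2102" \
  /home/nccl-tool/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8
```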

@kiskra-nvidia
Member

Can you try with NCCL_IB_GID_INDEX=3? See, e.g., NVIDIA/nccl#426
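If you want to check which GID index corresponds to RoCE v2 on your setup, a rough sketch is below (the sysfs paths and the show_gids helper may vary by kernel, driver, and container image; on many RoCE setups the IPv4 RoCE v2 entry sits at index 3, hence NCCL_IB_GID_INDEX=3):

```sh
# Dump the GID table and the type of each entry for mlx5_0, port 1.
grep . /sys/class/infiniband/mlx5_0/ports/1/gids/* 2>/dev/null
grep . /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/* 2>/dev/null

# If the Mellanox OFED 'show_gids' script is present in the image,
# it prints the same information in a single table.
show_gids mlx5_0
```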

@Eevan-zq
Author

Yes! It works, thanks a lot.
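For reference, the working launch was presumably the previous command with the GID index added (a reconstruction from the commands above, not copied verbatim from the output):

```sh
# Previous multi-node launch, plus NCCL_IB_GID_INDEX=3 so NCCL uses the
# RoCE v2 GID entry on mlx5_0.
mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile \
  -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 \
  -x NCCL_IB_GID_INDEX=3 \
  -mca btl_base_verbose 30 -mca btl_tcp_if_include bond0.2460 \
  -mca plm_rsh_args "-p 2102" \
  /home/nccl-tool/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8
```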


@kiskra-nvidia
Member

Great! We've been working to eliminate the need to provide NCCL_IB_GID_INDEX. Since 2.21 it should no longer be needed when running on bare metal, but we recently discovered containerized cases where it still didn't work -- see NVIDIA/nccl#1573.
