
nccl-tests cannot run multi-machine over RDMA inside Docker containers #281

Eevan-zq opened this issue Jan 15, 2025 · 6 comments


@Eevan-zq

Hello,
I have recently run into a problem. I started Docker containers (official NVIDIA images) on two servers and want to run multi-node nccl-tests between the containers, with inter-machine communication over RDMA, but I have not been able to get it working. Here is the basic information about the two containers:
1. The mlx5_x devices are visible inside each container via `ibv_devices`.


2. Each container can run nccl-tests successfully on its own machine (single-node).


3. RDMA communication tests with perftest between the two containers also succeed.


4. mpirun can also communicate between the two containers when it is not launching nccl-tests.


5. When the two containers use the ordinary TCP/Ethernet network instead of RDMA, the test runs normally, just slowly, at about 10 Gb/s (a sketch of how such a TCP-only run can be forced is shown right after this list).

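For reference, this is a sketch of how a TCP-only comparison run like the one in step 5 might be forced. The exact flags used for that run are not shown in this thread; `NCCL_IB_DISABLE` and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables, and the interface name and mpirun options are carried over from the commands below.

```sh
# Force NCCL off the IB/RoCE transport and onto the socket (TCP) transport,
# pinned to the bonded Ethernet interface used elsewhere in this thread.
# Assumes the same hostfile, SSH port, and test binary as the commands below.
mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_DISABLE=1 \
  -x NCCL_SOCKET_IFNAME=bond0.2460 \
  -mca btl_tcp_if_include bond0.2460 \
  -mca plm_rsh_args "-p 2102" \
  /home/nccl-tool/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8
```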

Finally, here is what happens when I run mpirun + nccl-tests.

The command is: `mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 -mca btl_base_verbose 30 -mca btl_tcp_if_include bond0.2460 -mca plm_rsh_args "-p 2102" /usr/local/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8 > nccl-log`.
The result is:

nccl-log.txt

Although a result is produced, it is exactly the same as the single-machine result, and a genuine multi-machine run should not achieve numbers that good.

@kiskra-nvidia
Member

It appears that you used different all_reduce_perf binaries in your runs. The one from the last experiment (/usr/local/bin/all_reduce_perf) does not appear to have been compiled with MPI support. What I see in the attached log file is two separate single-node runs (notice that "Rank 0" repeats and that "nranks" in the debug output is 8). The one from step 5 is definitely MPI-enabled; for the one from step 2 it's impossible to say, as it was a single-process run.
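A quick way to confirm which binary has MPI support (a rough check, assuming the binary is dynamically linked) is to look for an MPI library in the ldd output:

```sh
# An all_reduce_perf built with MPI=1 should link against an MPI library
# such as libmpi.so (Open MPI). No output here usually means the binary
# was built without MPI support.
ldd /usr/local/bin/all_reduce_perf | grep -i libmpi
```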

Don't feel embarrassed about it though; I catch myself doing the same thing every so often 😉.

@Eevan-zq
Author

Yes, you've spotted my oversight.
/usr/local/bin/all_reduce_perf is the all_reduce_perf binary that ships with the NVIDIA Docker image, and /home/nccl-tool/bin/all_reduce_perf is the binary I cloned from GitHub and built myself (version v2.17.1-1). MPI was enabled when compiling nccl-tests; the command was `make -j128 MPI=1 MPI_HOME=/opt/hpcx/ompi/ CUDA_HOME=/usr/local/cuda NCCL_HOME=/home/nccl-tool/dependency/nccl BUILDDIR=/home/nccl-tool/bin`.

I will use /home/nccl-tool/bin/all_reduce_perf consistently for all further testing and will post its results. However, it still cannot communicate via RDMA, and now even the 10 Gb/s TCP communication is not working.
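As an extra sanity check (my own assumption of what is worth verifying, not something taken from the logs), it may be worth confirming that the mpirun used to launch is the same Open MPI the tests were built against (MPI_HOME=/opt/hpcx/ompi/), and that the rebuilt binary picked up both MPI and the intended NCCL:

```sh
# Compare the launcher on PATH with the build-time MPI installation, and
# check what the rebuilt binary links against. Paths follow the build
# command above; adjust if the hpcx layout differs in your image.
which mpirun
mpirun --version
/opt/hpcx/ompi/bin/mpirun --version
ldd /home/nccl-tool/bin/all_reduce_perf | grep -Ei 'mpi|nccl'
```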

@Eevan-zq
Author

Eevan-zq commented Jan 16, 2025

The command is `mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 -mca btl_base_verbose 30 -mca btl_tcp_if_include bond0.2460 -mca plm_rsh_args "-p 2102" /home/nccl-tool/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8 > nccl-log2.txt`.


nccl-log2.txt

It still cannot use RDMA.
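One way to dig further (a standard NCCL debugging knob, not something used in this thread) would be to narrow the debug output to the transport-selection subsystems and re-run the same command; the INIT and NET lines show whether the IB plugin finds mlx5_0 and, if not, why NCCL falls back to the socket transport:

```sh
# Same launch as above, with NCCL_DEBUG_SUBSYS restricting the INFO output
# to initialization and network transport selection.
mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile \
  -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET \
  -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 \
  -mca btl_base_verbose 30 -mca btl_tcp_if_include bond0.2460 \
  -mca plm_rsh_args "-p 2102" \
  /home/nccl-tool/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8
```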

@kiskra-nvidia
Member

Can you try with NCCL_IB_GID_INDEX=3? See, e.g., NVIDIA/nccl#426
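If you want to check which GID index corresponds to RoCE v2 on your setup, a rough sketch is below (the sysfs paths and the show_gids helper may vary by kernel, driver, and container image; on many RoCE setups the IPv4 RoCE v2 entry sits at index 3, hence NCCL_IB_GID_INDEX=3):

```sh
# Dump the GID table and the type of each entry for mlx5_0, port 1.
grep . /sys/class/infiniband/mlx5_0/ports/1/gids/* 2>/dev/null
grep . /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/* 2>/dev/null

# If the Mellanox OFED 'show_gids' script is present in the image,
# it prints the same information in a single table.
show_gids mlx5_0
```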

@Eevan-zq
Author

Yes! It works, thanks a lot.
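For reference, the working launch was presumably the previous command with the GID index added (a reconstruction from the commands above, not copied verbatim from the output):

```sh
# Previous multi-node launch, plus NCCL_IB_GID_INDEX=3 so NCCL uses the
# RoCE v2 GID entry on mlx5_0.
mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile \
  -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 \
  -x NCCL_IB_GID_INDEX=3 \
  -mca btl_base_verbose 30 -mca btl_tcp_if_include bond0.2460 \
  -mca plm_rsh_args "-p 2102" \
  /home/nccl-tool/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8
```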


@kiskra-nvidia
Member

Great! We've been working to eliminate the need to provide NCCL_IB_GID_INDEX. Since 2.21 it should no longer be needed when running on bare metal, but we recently discovered containerized cases where it still didn't work -- see NVIDIA/nccl#1573.
