nccl-tests cannot perform multi-machine interconnection through RDMA in the docker container. #281
Comments
It appears that you used different Don't feel embarrassed about it though; I catch myself doing the same thing every so often 😉.
Yes, you have discovered this oversight. I will uniformly use
The command is cannot use rdma yet.
Can you try with
Great! We've been working to eliminate the need to provide
Hello,
I recently ran into a problem. I am running Docker containers (official NVIDIA images) on two servers and want to run nccl-tests across both machines from inside the containers, with inter-machine communication over RDMA, but I have not been able to get it working. Here is the basic information for the two containers:
1. The mlx5_x devices are visible in the output of ibv_devices.
The command is: mpirun --allow-run-as-root -pernode -np 2 --hostfile /home/hostfile -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 -mca btl_base_verbose 30 -mca btl_tcp_if_include bond0.2460 -mca plm_rsh_args "-p 2102" /usr/local/bin/all_reduce_perf -b 1 -e 2GB -f 2 -g 8 > nccl-log.
The result is:
nccl-log.txt
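Since the command sets NCCL_DEBUG=INFO, the log should say which transport each channel ended up on. Below is a minimal sketch of a checker for that, assuming the channel-setup lines end in "via NET/IB/0", "via NET/Socket/0", "via P2P/..." or "via SHM/...", which is what I believe recent NCCL versions print (the exact wording may differ between versions). The function name and the sample lines are made up for illustration and are not taken from the attached log.

```python
import re
from collections import Counter

# Assumed format: NCCL channel-setup lines end with the transport name,
# e.g. "... [send] via NET/IB/0" or "... via P2P/IPC".
_TRANSPORT_RE = re.compile(r"\bvia\s+(NET/[A-Za-z]+|P2P|SHM)")


def summarize_transports(log_text: str) -> Counter:
    """Count how many channel lines mention each transport."""
    counts = Counter()
    for line in log_text.splitlines():
        match = _TRANSPORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, only to show the expected shape of the input.
SAMPLE_LOG = """\
host-a:101:140 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
host-a:101:140 [0] NCCL INFO Channel 00/0 : 7[7] -> 8[0] [send] via NET/Socket/0
host-b:202:240 [0] NCCL INFO Channel 00/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
"""

if __name__ == "__main__":
    counts = summarize_transports(SAMPLE_LOG)
    print(dict(counts))
    if counts["NET/IB"] == 0:
        print("No NET/IB channels found in this log.")
```

To run it on the real log, read the file and pass its contents in, for example `summarize_transports(open("nccl-log.txt").read())`. If only P2P or SHM entries show up, no channel in the log crosses the network at all.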
The test does produce a result, but it is exactly the same as the single-machine result, and a run across multiple machines should not perform this well.
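One way to check whether the run really spanned both machines is to look at the rank list that nccl-tests prints at the top of its output. Here is a rough sketch, assuming header lines of the form "# Rank N ... Pid N on HOSTNAME device N ..." (my understanding of the nccl-tests header; the spacing and extra fields vary by version). The helper name and the sample text are hypothetical.

```python
import re

# Assumed header format, e.g.:
# "#  Rank  0 Group  0 Pid   4242 on     host-a device  0 [0x1b] NVIDIA A100"
_RANK_RE = re.compile(
    r"^#\s+Rank\s+(\d+)\b.*?\bon\s+(\S+)\s+device\s+\d+",
    re.MULTILINE,
)


def ranks_by_host(output_text: str) -> dict[str, list[int]]:
    """Group the rank numbers listed in the header by hostname."""
    hosts: dict[str, list[int]] = {}
    for match in _RANK_RE.finditer(output_text):
        rank, host = int(match.group(1)), match.group(2)
        hosts.setdefault(host, []).append(rank)
    return hosts


# Hypothetical header, only to show the expected shape of the input.
SAMPLE_HEADER = """\
# nThread 1 nGpus 2 minBytes 1 maxBytes 2147483648 step: 2(factor)
#  Rank  0 Group  0 Pid   4242 on     host-a device  0 [0x1b] NVIDIA A100
#  Rank  1 Group  0 Pid   4242 on     host-a device  1 [0x1c] NVIDIA A100
"""

if __name__ == "__main__":
    hosts = ranks_by_host(SAMPLE_HEADER)
    for host, ranks in hosts.items():
        print(f"{host}: ranks {ranks}")
    if len(hosts) < 2:
        print("All ranks are on one host; this was not a multi-machine run.")
```

With 2 MPI processes and -g 8, I would expect 16 ranks split across two hostnames. If every rank reports the same hostname, or only 8 ranks are listed, the result is effectively a single-machine test.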