NCCL Ignores Specified SOCKET_IFNAME Configuration on Worker Nodes in Multi-Node Setup #1581
Is this a duplicate of #1580? As I wrote in the other bug, this is not unexpected if your network interfaces are on the same subnet. I am curious though why you see a different behavior on one node vs. the other -- can you check if there are any critical differences in the output of ip route between the two nodes?

As to why that is -- NCCL does not force any particular routing, but relies on the kernel TCP/IP stack to choose the appropriate NIC based on the provided destination IP address. This is the "classic" way of doing it, but it relies on the NICs being distinguishable via the routing table. I believe there may be an alternative way of doing it that wouldn't require separate subnets (the so-called "bind-before-connect" technique), but it's not currently implemented in NCCL.
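For reference, a minimal sketch of the bind-before-connect technique mentioned above -- illustration only, not something NCCL currently does, and the addresses and port are assumed, not taken from this issue:

    import socket

    LOCAL_IP = "192.168.5.14"  # assumed: address owned by enp37s0f0 on this node
    PEER_IP = "192.168.5.13"   # assumed: peer's address on the same subnet
    PORT = 12345               # assumed example port (a listener must exist there)

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Binding to a specific local address *before* connecting pins the source
    # address of the connection, giving the kernel the information it needs to
    # keep the traffic on the NIC that owns that address, even when several
    # routes to the destination look equally good.
    sock.bind((LOCAL_IP, 0))   # port 0: let the kernel choose a source port
    sock.connect((PEER_IP, PORT))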
Here is the output from ip route.

The main node (cx-13):

The worker node (cx-14):
I think there should be some way to "force" a particular network interface, or to ban/exclude other network interfaces (see the syntax sketch below).
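For reference, NCCL_SOCKET_IFNAME already accepts exclusion and exact-match prefixes, although these only control which interfaces NCCL binds to, not which route the kernel chooses for outgoing connections:

    # Exclude any interface whose name starts with "eno":
    export NCCL_SOCKET_IFNAME=^eno
    # Require an exact name match instead of the default prefix match:
    export NCCL_SOCKET_IFNAME==enp37s0f0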
Is a topology file required? The NCCL documentation says to "use rails" and put each NIC on its own switch, and that NCCL was designed for this configuration. This is a "rails" configuration. The documentation also says that the rails can have a slow interconnect. This means that all rails can be within the same subnet range, but that the rail should still be used.
I am not sure if this is a duplicate bug. It is similar but different. There are three separate bugs/issues here that may need their own tickets.
Thank you -- as I suspected. You've got two entries in the routing table with an identical destination address/mask, so the kernel cannot tell the two NICs apart when routing. You can get things to work with the NCCL version you currently have by adjusting the network configuration on your nodes so that the two interfaces become distinguishable in the routing table.
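One possible adjustment along those lines, for illustration only (the interface names are from this issue; the addresses and the duplicate route are assumed):

    # Inspect the routing table on each node:
    ip route
    # Suppose both NICs carry a route to the same destination/mask, e.g.:
    #   192.168.5.0/24 dev enp37s0f0 proto kernel scope link src 192.168.5.14
    #   192.168.5.0/24 dev eno1 proto kernel scope link src 192.168.5.24
    # Delete the route through the unwanted NIC:
    sudo ip route del 192.168.5.0/24 dev eno1
    # Or, more permanently, renumber eno1 onto its own subnet so the kernel
    # can always tell the two NICs apart.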
I am trying to use the Deepspeed framework for multi-node distributed parallel training on 2 Debian 12 servers, each with 3 RTX 3090s installed. Deepspeed uses NCCL as the backend for inter-node communication, and it is ignoring the NCCL_SOCKET_IFNAME configuration.
Each server has 2 NICs: a 1 Gb/s port (eno1) assigned to the 192.168.3.0/24 network segment, and a 10 Gb/s port (enp37s0f0) assigned to the 192.168.5.0/24 network segment.
When running my training script I encountered a bug where NCCL selects inconsistent network interfaces for sending and receiving traffic during distributed training. While the main node consistently uses the correct interface (enp37s0f0), the worker node exhibits unexpected behavior: NCCL uses enp37s0f0 for receiving traffic but eno1 for sending traffic, despite NCCL_SOCKET_IFNAME being explicitly set to enp37s0f0.

This degrades performance across the cluster. The issue seems to stem from NCCL's interface selection mechanism. Why is NCCL_SOCKET_IFNAME not enforcing the use of the interface defined in the configuration on all nodes, and how should I solve it?
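One thing worth double-checking with the stock deepspeed launcher is that the variable actually reaches every node: DeepSpeed exports the variables listed in a .deepspeed_env file on all workers. A minimal sketch, using the value from this report (the NCCL_DEBUG line is an assumption based on the debug logs attached below):

    # ~/.deepspeed_env -- one VAR=VALUE per line, exported on every node
    NCCL_SOCKET_IFNAME=enp37s0f0
    NCCL_DEBUG=INFO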
Environment
Observed Behavior
On the worker node (cx-14), significant traffic is observed on the secondary interface (eno1) despite NCCL_SOCKET_IFNAME=enp37s0f0. The main node (cx-13) does not exhibit this issue.
Bandwidth Logs
Worker Node (cx-14)
Main Node (cx-13)
Debug Logs
The logs confirm that NCCL_SOCKET_IFNAME is set correctly on both nodes:
NCCL configuration file:
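NCCL reads such a file (e.g. /etc/nccl.conf or ~/.nccl.conf) as plain KEY=VALUE lines; with the settings reported here it would look roughly like the following (the NCCL_DEBUG line is an assumption based on the attached debug logs):

    NCCL_SOCKET_IFNAME=enp37s0f0
    NCCL_DEBUG=INFO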
I attached below the debug logs from the Deepspeed training script and the bandwidth usage logs for both interfaces (eno1 and enp37s0f0) on both the main node and the worker node.
debug logs
bandwidth usage log on the main node
bandwidth usage log on the worker node