
NCCL Ignores Specified SOCKET_IFNAME Configuration on Worker Nodes in Multi-Node Setup #1581

Open
rachid2198 opened this issue Jan 18, 2025 · 5 comments



rachid2198 commented Jan 18, 2025

I am trying to use the DeepSpeed framework for multi-node distributed parallel training on two Debian 12 servers, each with three RTX 3090s installed. DeepSpeed uses NCCL as the backend for inter-node communication, and NCCL is ignoring the NCCL_SOCKET_IFNAME configuration.

Each server has two NICs: a 1 Gb/s port (eno1) assigned to the 192.168.3.* network segment, and a 10 Gb/s port (enp37s0f0) assigned to the 192.168.5.* network segment.

When running my training script, I encountered a bug where NCCL selects inconsistent network interfaces for sending and receiving traffic during distributed training. While the main node consistently uses the correct interface (enp37s0f0), the worker nodes exhibit unexpected behavior: they use enp37s0f0 for receiving traffic but eno1 for sending traffic, despite NCCL_SOCKET_IFNAME being explicitly set to enp37s0f0.

This degrades performance across the cluster. The problem seems to stem from NCCL's interface selection mechanism. Why is NCCL_SOCKET_IFNAME not enforcing the use of the specified interface on all nodes, and how can I fix this?

Environment

  • NCCL Version: 2.24.3+cuda12.4
  • DeepSpeed Version: 0.16.2
  • CUDA Version: 12.4
  • Operating System: Debian 12
  • Hardware:
    • Main node (cx-13) and worker node (cx-14) are both equipped with multiple GPUs and dual-network interfaces.
    • Interfaces: enp37s0f0 (primary, high-bandwidth) and eno1 (secondary, fallback).

Observed Behavior

On the worker node (cx-14), significant traffic is observed on the secondary interface (eno1) despite NCCL_SOCKET_IFNAME=enp37s0f0. The main node (cx-13) does not exhibit this issue.

Bandwidth Logs

Worker Node (cx-14)

    enp37s0f0              eno1       
 KB/s in  KB/s out   KB/s in  KB/s out
105278.2      0.00    713.63  34077.51
150569.1      0.00    563.03  120109.0
122634.4      0.00    326.39  120042.5
...

Main Node (cx-13)

    enp37s0f0              eno1       
 KB/s in  KB/s out   KB/s in  KB/s out
93655.95  97137.91      4.01      1.97
79190.90  58117.28      2.37      1.24
75349.71  90591.92      1.30      1.24
65793.54  62356.21      2.31      1.26
...

Debug Logs

The logs confirm that NCCL_SOCKET_IFNAME is set correctly on both nodes:

192.168.5.13: cx-13:2724533:2724533 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp37s0f0
192.168.5.14: cx-14:1536318:1536318 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp37s0f0

NCCL configuration file:

NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=enp37s0f0
NCCL_SOCKET_NTHREADS=4
NCCL_NSOCKS_PERTHREAD=8

Attached below are the debug logs from the DeepSpeed training script and the bandwidth usage for both interfaces (eno1 and enp37s0f0) on both the main node and the worker node.

debug logs
bandwidth usage log on the main node
bandwidth usage log on the worker node

@kiskra-nvidia
Member

Is this a duplicate of #1580?

As I wrote in the other bug, this is not unexpected if your network interfaces are on the same subnet. I am curious though why you see a different behavior on one node vs the other -- can you check if there are any critical differences in the output of ip route show from both nodes?

As to why that is -- NCCL does not force any particular routing, but relies on the kernel TCP/IP stack to choose the appropriate NIC based on the provided destination IP address. This is the "classic" way of doing it, but it relies on the NICs being distinguishable via the routing table. I believe there may be an alternative way of doing it that wouldn't require separate subnets (the so-called "bind-before-connect" technique), but it's not currently implemented in NCCL.
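
For illustration, the bind-before-connect technique would look roughly like this with plain Python sockets (this is not NCCL code; the addresses and interface name are placeholders taken from this issue, and SO_BINDTODEVICE usually requires root/CAP_NET_RAW on Linux):

import socket

# Placeholder endpoints for illustration only.
LOCAL_IP = "192.168.5.14"         # source address of the desired NIC on cx-14
PEER = ("192.168.5.13", 12345)    # destination on cx-13

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Pin the egress device explicitly.
s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"enp37s0f0")
# Bind before connect: fixes the source address (port 0 = ephemeral port).
s.bind((LOCAL_IP, 0))
s.connect(PEER)                   # traffic now leaves via enp37s0f0
s.close()

Binding alone only fixes the source address; on Linux it is SO_BINDTODEVICE (or a source-based ip rule) that actually pins the egress device when several routes cover the same destination prefix.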

@rachid2198
Author

@kiskra-nvidia

can you check if there are any critical differences in the output of ip route show from both nodes?

Here is the output from ip route show on both nodes:

The main node (cx-13):

default via 192.168.1.1 dev eno1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/20 dev enp37s0f0 proto kernel scope link src 192.168.5.13
192.168.0.0/20 dev eno1 proto kernel scope link src 192.168.3.13

The worker node (cx-14):

default via 192.168.1.1 dev eno1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/20 dev eno1 proto kernel scope link src 192.168.3.14
192.168.0.0/20 dev enp37s0f0 proto kernel scope link src 192.168.5.14

@haltingstate

As to why that is -- NCCL does not force any particular routing, but relies on the kernel TCP/IP stack to choose the appropriate NIC based on the provided destination IP address. This is the "classic" way of doing it, but it relies on the NICs being distinguishable via the routing table. I believe there may be an alternative way of doing it that wouldn't require separate subnets (the so-called "bind-before-connect" technique), but it's not currently implemented in NCCL.

I think there should be some way to "force" a particular network interface, or to ban/exclude other network interfaces.

For instance,

  • If I have two or four rails (2 or 4 NICs), how do I enable transmission on each NIC?
  • Can 2 or 4 NICs be used if they have separate IP addresses?
  • Can I control which IP address ranges connect to which IP address ranges?
  • How can I prohibit traffic on a specific NIC/IP address? (See the NCCL_SOCKET_IFNAME sketch at the end of this comment.)

Is a topology file required?

The NCCL documentation says to "use rails" and put each NIC on its own switch, and that NCCL was designed for this configuration. This is a "rails" configuration.

The documentation also says that the rails can have a slow interconnect. This implies that all rails could be within the same subnet range, but that each rail should still be used.
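
For reference, the NCCL documentation describes NCCL_SOCKET_IFNAME as taking a comma-separated list of interface-name prefixes, with a leading ^ to exclude interfaces and a leading = to match names exactly instead of by prefix. Each line below is a separate alternative, not a combined config (enp37s0f1 is a hypothetical second port; this is my reading of the docs and untested with a multi-rail setup):

NCCL_SOCKET_IFNAME=enp37s0f0,enp37s0f1
NCCL_SOCKET_IFNAME=^eno1,docker0
NCCL_SOCKET_IFNAME==enp37s0f0

This only controls which interfaces NCCL considers, though; it does not appear to control how the kernel routes traffic between interfaces that share a subnet, which is the behavior reported in this issue.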


haltingstate commented Jan 19, 2025

@kiskra-nvidia

I am not sure if this is a duplicate bug. It is similar but different.

There are several separate bugs/issues here, which may need their own tickets:

  1. 192.168.3.8 is not used anywhere in the configuration, but NCCL is connecting to it.
  2. Why would NCCL connect to an IP address that is not in the NCCL configuration?
  3. Why is NCCL sending traffic to this IP address when the address does not appear in the logs?
  • All connections to an IP address should appear in the logs, but these do not.
  4. Why is NCCL listening on a port/IP address that was not specified anywhere and that does not appear in the logs?
  • Is NCCL failing to log certain connections, or the use of certain IP addresses, network adapters, or listening interfaces?

@kiskra-nvidia
Member

@rachid2198

The main node (cx-13):

default via 192.168.1.1 dev eno1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/20 dev enp37s0f0 proto kernel scope link src 192.168.5.13
192.168.0.0/20 dev eno1 proto kernel scope link src 192.168.3.13

The worker node (cx-14):

default via 192.168.1.1 dev eno1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/20 dev eno1 proto kernel scope link src 192.168.3.14
192.168.0.0/20 dev enp37s0f0 proto kernel scope link src 192.168.5.14

Thank you -- as I suspected. You've got two entries in the routing table with an identical destination address/mask (192.168.0.0/20) and the order of these entries differs from node to node (possibly depending on minute timing differences regarding which interface ended up being initialized first?).
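
A quick way to confirm which interface the kernel will pick for the peer's address is ip route get, run on each node:

ip route get 192.168.5.13    # on cx-14
ip route get 192.168.5.14    # on cx-13

The dev field in the output is the interface the kernel will choose for outgoing connections to that address, which is what NCCL relies on.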

You can get things to work with the NCCL version you currently have by adjusting the network configuration on your nodes:

  • The most robust fix is to change the netmasks of the eno1 and enp37s0f0 interfaces from /20 to /24, so that each interface ends up on its own subnet (see the sketch after this list).
  • Adjusting the route metrics so that the enp37s0f0 entry is preferred over the eno1 entry may also work (untested).
  • Finally, if you can guarantee the order of the routing-table entries (as on host cx-13 above) by bringing the two interfaces up in the right order, that will probably also work, but personally I would consider it too fragile for anything but a quick experiment.
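
For the first option, the runtime change on the worker node could look roughly as follows (addresses taken from the routes above; cx-13 needs the same treatment with the .13 addresses, and the persistent configuration in /etc/network/interfaces, or whatever manages these NICs, must be updated as well). Run it from the console or over the other interface, since it briefly drops connectivity on the NIC being reconfigured:

# Shrink each NIC to its own /24 so the two overlapping routes become unambiguous.
ip addr del 192.168.5.14/20 dev enp37s0f0
ip addr add 192.168.5.14/24 dev enp37s0f0
ip addr del 192.168.3.14/20 dev eno1
ip addr add 192.168.3.14/24 dev eno1
# The gateway 192.168.1.1 is no longer covered by 192.168.3.0/24, so
# restore the default route via an explicit on-link gateway route.
ip route add 192.168.1.1 dev eno1
ip route add default via 192.168.1.1 dev eno1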
