
NCCL Ignores Specified SOCKET_IFNAME Configuration on Worker Nodes in Multi-Node Setup #1581

Open
rachid2198 opened this issue Jan 18, 2025 · 5 comments



rachid2198 commented Jan 18, 2025

I am trying to use the DeepSpeed framework for multi-node distributed parallel training on two Debian 12 servers, each with three RTX 3090s installed. DeepSpeed uses NCCL as the backend for inter-node communication, and NCCL is ignoring the NCCL_SOCKET_IFNAME configuration.

Each server has two NICs: a 1 Gb/s port (eno1) assigned to the 192.168.3.* network segment, and a 10 Gb/s port (enp37s0f0) assigned to the 192.168.5.* network segment.

When running my training script, I encountered a bug where NCCL selects inconsistent network interfaces for sending and receiving traffic during distributed training. While the main node consistently uses the correct interface (enp37s0f0), the worker nodes exhibit unexpected behavior: they use enp37s0f0 for receiving traffic but eno1 for sending traffic, despite NCCL_SOCKET_IFNAME being explicitly set to enp37s0f0.

This degrades performance across the cluster. The problem seems to stem from NCCL's interface selection mechanism. Why is NCCL_SOCKET_IFNAME not enforcing the use of the specified interface on all nodes, and how can I fix this?

Environment

  • NCCL Version: 2.24.3+cuda12.4
  • DeepSpeed Version: 0.16.2
  • CUDA Version: 12.4
  • Operating System: Debian 12
  • Hardware:
    • Main node (cx-13) and worker node (cx-14) are both equipped with multiple GPUs and dual-network interfaces.
    • Interfaces: enp37s0f0 (primary, high-bandwidth) and eno1 (secondary, fallback).

Observed Behavior

On the worker node (cx-14), significant traffic is observed on the secondary interface (eno1) despite NCCL_SOCKET_IFNAME=enp37s0f0. The main node (cx-13) does not exhibit this issue.

Bandwidth Logs

Worker Node (cx-14)

    enp37s0f0              eno1       
 KB/s in  KB/s out   KB/s in  KB/s out
105278.2      0.00    713.63  34077.51
150569.1      0.00    563.03  120109.0
122634.4      0.00    326.39  120042.5
...

Main Node (cx-13)

    enp37s0f0              eno1       
 KB/s in  KB/s out   KB/s in  KB/s out
93655.95  97137.91      4.01      1.97
79190.90  58117.28      2.37      1.24
75349.71  90591.92      1.30      1.24
65793.54  62356.21      2.31      1.26
...

Debug Logs

The logs confirm that NCCL_SOCKET_IFNAME is set correctly on both nodes:

192.168.5.13: cx-13:2724533:2724533 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp37s0f0
192.168.5.14: cx-14:1536318:1536318 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp37s0f0

NCCL configuration file:

NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=enp37s0f0
NCCL_SOCKET_NTHREADS=4
NCCL_NSOCKS_PERTHREAD=8

Attached below are the debug logs from the DeepSpeed training script and the bandwidth usage for both interfaces (eno1 and enp37s0f0) on both the main node and the worker node.

debug logs
bandwidth usage log on the main node
bandwidth usage log on the worker node

@kiskra-nvidia
Member

Is this a duplicate of #1580?

As I wrote in the other bug, this is not unexpected if your network interfaces are on the same subnet. I am curious though why you see a different behavior on one node vs the other -- can you check if there are any critical differences in the output of ip route show from both nodes?

As to why that is -- NCCL does not force any particular routing, but relies on the kernel TCP/IP stack to choose the appropriate NIC based on the provided destination IP address. This is the "classic" way of doing it, but it relies on the NICs being distinguishable via the routing table. I believe there may be an alternative way of doing it that wouldn't require separate subnets (the so-called "bind-before-connect" technique), but it's not currently implemented in NCCL.
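
For illustration, the bind-before-connect technique would look roughly like this with plain Python sockets (this is not NCCL code; the addresses and interface name are placeholders taken from this issue, and SO_BINDTODEVICE usually requires root/CAP_NET_RAW on Linux):

import socket

# Placeholder endpoints for illustration only.
LOCAL_IP = "192.168.5.14"         # source address of the desired NIC on cx-14
PEER = ("192.168.5.13", 12345)    # destination on cx-13

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Pin the egress device explicitly.
s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"enp37s0f0")
# Bind before connect: fixes the source address (port 0 = ephemeral port).
s.bind((LOCAL_IP, 0))
s.connect(PEER)                   # traffic now leaves via enp37s0f0
s.close()

Binding alone only fixes the source address; on Linux it is SO_BINDTODEVICE (or a source-based ip rule) that actually pins the egress device when several routes cover the same destination prefix.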

@rachid2198
Author

@kiskra-nvidia

can you check if there are any critical differences in the output of ip route show from both nodes?

Here is the output from ip route show on both nodes:

The main node (cx-13):

default via 192.168.1.1 dev eno1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/20 dev enp37s0f0 proto kernel scope link src 192.168.5.13
192.168.0.0/20 dev eno1 proto kernel scope link src 192.168.3.13

The worker node (cx-14):

default via 192.168.1.1 dev eno1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/20 dev eno1 proto kernel scope link src 192.168.3.14
192.168.0.0/20 dev enp37s0f0 proto kernel scope link src 192.168.5.14

@haltingstate

As to why that is -- NCCL does not force any particular routing, but relies on the kernel TCP/IP stack to choose the appropriate NIC based on the provided destination IP address. This is the "classic" way of doing it, but it relies on the NICs being distinguishable via the routing table. I believe there may be an alternative way of doing it that wouldn't require separate subnets (the so-called "bind-before-connect" technique), but it's not currently implemented in NCCL.

I think there should be some way to "force" a particular network interface, or to ban/exclude other network interfaces.

For instance,

  • If I have two or four rails (2 or 4 NICs), how do I enable transmission on each NIC?
  • Can 2 or 4 NICs be used if they have separate IP addresses?
  • Can I control which IP address ranges connect to which IP address ranges?
  • How can I prohibit traffic on a specific NIC/IP address? (See the NCCL_SOCKET_IFNAME sketch at the end of this comment.)

Is a topology file required?

The NCCL documentation says to "use rails" and put each NIC on its own switch, and that NCCL was designed for this configuration. This is a "rails" configuration.

The documentation also says that the rails can have a slow interconnect. This implies that all rails could be within the same subnet range, but that each rail should still be used.
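
For reference, the NCCL documentation describes NCCL_SOCKET_IFNAME as taking a comma-separated list of interface-name prefixes, with a leading ^ to exclude interfaces and a leading = to match names exactly instead of by prefix. Each line below is a separate alternative, not a combined config (enp37s0f1 is a hypothetical second port; this is my reading of the docs and untested with a multi-rail setup):

NCCL_SOCKET_IFNAME=enp37s0f0,enp37s0f1
NCCL_SOCKET_IFNAME=^eno1,docker0
NCCL_SOCKET_IFNAME==enp37s0f0

This only controls which interfaces NCCL considers, though; it does not appear to control how the kernel routes traffic between interfaces that share a subnet, which is the behavior reported in this issue.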


haltingstate commented Jan 19, 2025

@kiskra-nvidia

I am not sure if this is a duplicate bug. It is similar but different.

There are several separate bugs/issues here, which may need their own tickets:

  1. 192.168.3.8 is not used anywhere in the configuration, but NCCL is connecting to it.
  2. Why would NCCL connect to an IP address that is not in the NCCL configuration?
  3. Why is NCCL sending traffic to this IP address when the address does not appear in the logs?
  • All connections to an IP address should appear in the logs, but these do not.
  4. Why is NCCL listening on a port/IP address that was not specified anywhere and that does not appear in the logs?
  • Is NCCL failing to log certain connections, or the use of certain IP addresses, network adapters, or listening interfaces?

@kiskra-nvidia
Member

@rachid2198

The main node (cx-13):

default via 192.168.1.1 dev eno1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/20 dev enp37s0f0 proto kernel scope link src 192.168.5.13
192.168.0.0/20 dev eno1 proto kernel scope link src 192.168.3.13

The worker node (cx-14):

default via 192.168.1.1 dev eno1
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.0.0/20 dev eno1 proto kernel scope link src 192.168.3.14
192.168.0.0/20 dev enp37s0f0 proto kernel scope link src 192.168.5.14

Thank you -- as I suspected. You've got two entries in the routing table with an identical destination address/mask (192.168.0.0/20) and the order of these entries differs from node to node (possibly depending on minute timing differences regarding which interface ended up being initialized first?).
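
A quick way to confirm which interface the kernel will pick for the peer's address is ip route get, run on each node:

ip route get 192.168.5.13    # on cx-14
ip route get 192.168.5.14    # on cx-13

The dev field in the output is the interface the kernel will choose for outgoing connections to that address, which is what NCCL relies on.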

You can get things to work with the NCCL version you currently have by adjusting the network configuration on your nodes:

  • The most robust fix is to change the netmasks of the eno1 and enp37s0f0 interfaces from /20 to /24, so that each interface ends up on its own subnet (see the sketch after this list).
  • Adjusting the route metrics so that the enp37s0f0 entry is preferred over the eno1 entry may also work (untested).
  • Finally, if you can guarantee the order of the routing-table entries (as on host cx-13 above) by bringing the two interfaces up in the right order, that will probably also work, but personally I would consider it too fragile for anything but a quick experiment.
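
For the first option, the runtime change on the worker node could look roughly as follows (addresses taken from the routes above; cx-13 needs the same treatment with the .13 addresses, and the persistent configuration in /etc/network/interfaces, or whatever manages these NICs, must be updated as well). Run it from the console or over the other interface, since it briefly drops connectivity on the NIC being reconfigured:

# Shrink each NIC to its own /24 so the two overlapping routes become unambiguous.
ip addr del 192.168.5.14/20 dev enp37s0f0
ip addr add 192.168.5.14/24 dev enp37s0f0
ip addr del 192.168.3.14/20 dev eno1
ip addr add 192.168.3.14/24 dev eno1
# The gateway 192.168.1.1 is no longer covered by 192.168.3.0/24, so
# restore the default route via an explicit on-link gateway route.
ip route add 192.168.1.1 dev eno1
ip route add default via 192.168.1.1 dev eno1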
