
Point-to-point operations preview #316

Closed
wants to merge 2 commits into from

Conversation

sjeaugey
Member

This is a PR for people to review and provide feedback on the p2p branch (issue #212).

Add p2p primitives (ncclSend/ncclRecv).
Rework buffer management to only allocate the buffers that are needed.
@sjeaugey
Member Author

sjeaugey commented Mar 31, 2020

To test the feature, please use the alltoall_perf test from the NCCL tests ("p2p" branch): https://github.com/NVIDIA/nccl-tests/tree/p2p. Please ignore "in-place" numbers and errors, as in-place operation is not yet supported (and might never be).

Documentation is not yet available, so I'll be using this PR as temporary/moving documentation.

Usage

Two new verbs are added:

ncclResult_t  ncclSend(const void* sendbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, cudaStream_t stream);
ncclResult_t  ncclRecv(void* recvbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, cudaStream_t stream);

To implement more complex communication patterns, like concurrent send/receive operations (e.g. sendrecv, gather[v], scatter[v], alltoall[v,w]), multiple calls to ncclSend and ncclRecv can be fused together using group calls (ncclGroupStart/ncclGroupEnd). Whether ncclSend or ncclRecv is called first makes no difference; no communication will happen until the call to ncclGroupEnd.

Send to self is permitted, provided there is a matching receive from self, all within a group call (see the send-to-self example below).

Examples

Sendrecv

  ncclGroupStart();
  ncclSend(sendbuff, sendcount, sendtype, peer, comm, stream);
  ncclRecv(recvbuff, recvcount, recvtype, peer, comm, stream);
  ncclGroupEnd();

Alltoall

  ncclGroupStart();
  for (int r=0; r<nranks; r++) {
    ncclSend(sendbuff + r*sendcount, sendcount, sendtype, r, comm, stream);
    ncclRecv(recvbuff + r*recvcount, recvcount, recvtype, r, comm, stream);
  }
  ncclGroupEnd();

Scatter

  ncclGroupStart();
  if (rank == root) {
    for (int r=0; r<nranks; r++)
      ncclSend(sendbuff+r*size, size, type, r, comm, stream);
  }
  ncclRecv(recvbuff, size, type, root, comm, stream);
  ncclGroupEnd();

Gather

  ncclGroupStart();
  if (rank == root) {
  for (int r=0; r<nranks; r++)
    ncclRecv(recvbuff+r*size, size, type, r, comm, stream);
  }
  ncclSend(sendbuff, size, type, root, comm, stream);
  ncclGroupEnd();
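
Send to self

As a minimal sketch of the send-to-self rule above: the send and its matching receive from self are fused in a single group call (the buffers, count and type here are illustrative, and rank is this rank's own index in comm).

  ncclGroupStart();
  ncclSend(sendbuff, count, type, rank, comm, stream);
  ncclRecv(recvbuff, count, type, rank, comm, stream);
  ncclGroupEnd();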

Performance and other considerations

NCCL establishes point-to-point connections dynamically, i.e. on the first call that needs to communicate with a given peer. This can make that first call much slower.

Each point-to-point connection uses resources, which include a 4MB buffer, usually in GPU memory. Performing large all-to-all operations can result in a significant amount of memory being consumed by NCCL buffers. Setting NCCL_BUFFSIZE to a lower value can reduce that footprint, but can also affect the performance of collective operations.
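
Since connections are created lazily, one way to keep the first measured iteration representative is to pay the setup cost up front. The sketch below is one possible warm-up (not part of the API): it exchanges a single element with every peer once before the timed phase, assuming warmupsend and warmuprecv are illustrative device float buffers of at least nranks elements each.

  ncclGroupStart();
  for (int r=0; r<nranks; r++) {
    ncclSend(warmupsend + r, 1, ncclFloat, r, comm, stream);
    ncclRecv(warmuprecv + r, 1, ncclFloat, r, comm, stream);
  }
  ncclGroupEnd();
  cudaStreamSynchronize(stream); // connections to all peers now exist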

Limitations / Known issues

This preview version has a few limitations:

  • There can be only one send and one recv per peer.
  • Group launch (when using more than one GPU per process) is not supported for asymmetric patterns and results in a hang.
  • When using GPU Direct RDMA, each point-to-point connection uses resources in the GPU PCI address space. Tesla products should not have any limitation in that respect, but other lines like Quadro might run out of space. If that happens, consider disabling GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) or reducing the per-peer buffer size (NCCL_BUFFSIZE).

@sjeaugey sjeaugey linked an issue Mar 31, 2020 that may be closed by this pull request
@sjeaugey
Member Author

FYI @ktnyt @Tixxx @nevion @victoryang00 @2sin18 @akorzh @kuenishi @maxhgerlach @TE-KazukiYoshiyama @yzs981130 since you expressed interest in #212, so that you get updates from this PR.

@2sin18

2sin18 commented Apr 3, 2020

I used ncclSend/ncclRecv to implement AllToAllv and found it to be 10x slower than an ncclReduce-based implementation on an 8x V100 GPU system, while the performance on 4 GPUs is as expected.

Will ncclSend or ncclRecv always find the fastest way to deliver the message? For example, P2P communication between 2 GPUs without a direct NVLink connection might use another GPU that has NVLink connections to both GPUs as a bridge.

Also, it seems that data might be sent to the wrong peer in real-world model training, but the problem cannot be reproduced in unit tests or benchmarks.

Can ncclSend/ncclRecv be used in multiple NCCL communicators in different threads?

@Kur0x

Kur0x commented Apr 3, 2020

@sjeaugey
Will there be an ncclAlltoAll API?

@sjeaugey
Member Author

sjeaugey commented Apr 3, 2020

@2sin18 Communication between GPUs without a direct NVLink connection will go through system shared memory, which will indeed be much slower. Better supporting topologies like DGX-1, i.e. using NVLink through another GPU, would be a significant amount of work, which is why it didn't make it into this first preview. Besides, you'd still be limited to a single NVLink per GPU going to the other side (so, to serve the 4 other GPUs), hence alltoall bandwidth would drop to ~11 GB/s. For best performance you should use a system like DGX-2, which can reach more than 100 GB/s.
Can you explain in more detail what you mean by "Can ncclSend/ncclRecv be used in multiple NCCL communicators in different threads?" Do you mean different ncclComm_t (from the same group, but on different GPUs), or the same GPU on different groups?

@Kur0x we don't intend to have an alltoall API at this point, given the current API can serve that communication pattern and any variant of it (on sizes, buffers, ...).

@2sin18

2sin18 commented Apr 4, 2020

Do you mean different ncclComm_t (from the same group, but on different GPUs), or the same GPU on different groups?

the same GPU on different groups

@sjeaugey
Member Author

sjeaugey commented Apr 6, 2020

the same GPU on different groups

Like with collective operations, using different communicators concurrently is unsafe, as CUDA gives no guarantee on how operations will progress (one might block the other, causing a deadlock). Things may work in certain conditions, but this is not something we advise. Using a single communicator and merging all send/recv operations into "alltoallv" phases is safer, and might also be much faster as NCCL would use resources more efficiently.

@2sin18

2sin18 commented Apr 9, 2020


Thanks for the explanation.

Another question: how can I get topology information from NCCL to implement a topology-aware alltoallv algorithm using ncclSend/ncclRecv? According to the 2.6.4 release notes, it seems NCCL now detects speeds for PCI, IB and NICs.

@sjeaugey
Member Author

sjeaugey commented Apr 9, 2020

NCCL does not expose the topology through an API. That's because NCCL is supposed to handle the task of optimizing communication paths internally, so when you need to perform an alltoallv, you simply describe the alltoallv (with grouped send/recv to/from all the other peers) and NCCL should take care of the rest.

Also note that topology detection is not new in 2.6.4; what's new is just the speed detection.

@2sin18

2sin18 commented Apr 18, 2020

When will this change be merged to master or be part of a release?

@kwen2501
Contributor

We are currently testing NCCL 2.7.0, which has the point-to-point feature. If everything goes fine, we may release it (and merge the source code into master) in about 2-3 weeks.


@2sin18 2sin18 left a comment


If topology detection cannot read /sys/devices/.../max_link_width (e.g. with Linux kernel 3.10), a crash will happen. The generated topology.xml is as below:

<system version="1">
  <cpu numaid="0" affinity="00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <pci busid="0000:18:00.0" class="0x060400" link_speed="" link_width="0">
      <pci busid="0000:4f:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="0" sm="70" rank="0" gdr="1">
          <nvlink target="0000:50:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:60:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:5f:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:50:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="1" sm="70" rank="1" gdr="1">
          <nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:5f:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:4f:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:60:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:5d:00.0" class="0x060400" link_speed="" link_width="0">
      <pci busid="0000:5f:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="2" sm="70" rank="2" gdr="1">
          <nvlink target="0000:60:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:50:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:4f:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:60:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="3" sm="70" rank="3" gdr="1">
          <nvlink target="0000:4f:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:50:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:5f:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
  </cpu>
  <cpu numaid="-1" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <pci busid="0000:af:00.0" class="0x060400" link_speed="" link_width="0">
      <pci busid="0000:b1:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="4" sm="70" rank="4" gdr="1">
          <nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:5f:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:b2:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="5" sm="70" rank="5" gdr="1">
          <nvlink target="0000:60:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:d8:00.0" class="0x060400" link_speed="" link_width="0">
      <pci busid="0000:db:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="6" sm="70" rank="6" gdr="1">
          <nvlink target="0000:4f:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:dc:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="7" sm="70" rank="7" gdr="1">
          <nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:50:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:da:00.0" class="0x020000" link_speed="" link_width="0">
        <nic>
          <net name="mlx5_bond_0" dev="0" speed="25000" port="1" guid="0x6676a500039b0398" maxconn="262144" gdr="1"/>
        </nic>
      </pci>
    </pci>
  </cpu>
</system>

}
if (path->width == maxWidth && path->type == minType) nets[count++] = system->nodes[NET].nodes[n].id;
}
*id = nets[rr % count];

@2sin18 2sin18 Apr 21, 2020


If path->width is 0, count stays 0 and rr % count will hit a floating point exception (division by zero) here.

Suggested change:

-  *id = nets[rr % count];
+  if (count > 0) {
+    *id = nets[rr % count];
+  } else {
+    *id = system->nodes[NET].nodes[0].id;
+  }

@sjeaugey
Member Author

This is weird. Even when those are not defined, we should use default values (PCI Gen3 x16), and the width should never be 0. Thanks for posting the topology; that should help me reproduce the problem on my side.

@sjeaugey
Member Author

@2sin18 I pushed a fix, please try again. maxWidth was an integer instead of a float, so when you have a non-integer bandwidth (in your case, 25 Gb/s = 3.125 GB/s), maxWidth is equal to 3, hence the test path->width == maxWidth is never true, and you end up with 0 NICs.

@2sin18

2sin18 commented Apr 28, 2020

Hi @sjeaugey, is it possible to use a pre-defined transfer plan for ncclSend/ncclRecv, just like Funatiq/gossip? I see that NVIDIA/HugeCTR uses this library to do the all-to-all.

@sjeaugey
Member Author

NCCL relies on its topology detection to optimize transfers. It does not need hardcoded schedules based on the platform but builds that schedule automatically based on the topology.

That said, this first version is not optimized for all cases yet; in particular, intra-node DGX-1. That will come later.

@2sin18

2sin18 commented Apr 29, 2020

Thanks for the explanation.

I think the globally optimal transfer schedule depends not only on the topology but also on the algorithm. All-to-all and gather can both use ncclSend/ncclRecv, but they need different transfer plans to minimize cost. gossip uses an ILP solver to compute the optimal transfer plan for all-to-all or gather, which does not seem easy to reproduce in a ncclSend/ncclRecv setup.

@jiayisuse

Hey team, following up to check on the plan for merging the point-to-point API to master. We are going to implement NCCL-based alltoall in the PyTorch distributed library, and it will depend on the new NCCL release. Thanks!

@sjeaugey
Member Author

Sorry, we were delayed a bit, but things are looking good now. We're targeting a release before the end of the month.

@weberxie

Great job, looking forward to the new release!

I have built the new branch and run the tests.

Now I have a question: how does the new API handle the situation where different ranks may have to send data of different sizes, and some send counts may even be 0?

Thanks!

@sjeaugey
Member Author

@weberxie This should be handled correctly, except in the single process, multi-GPU case, as mentioned in the "Known issues" above:
« Group launch (when using more than one GPU per process) is not supported for asymmetric patterns and results in a hang. »

API-wise, everything is based solely on ncclSend and ncclRecv, so there is no problem sending different sizes, datatypes, etc. to different peers.
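
For illustration, an alltoallv built from these verbs might look like the sketch below, where the per-peer element counts and offsets (sendcounts, senddispls, recvcounts, recvdispls) are hypothetical arrays; skipping zero-count operations is only safe outside the single process, multi-GPU group-launch case mentioned above.

  ncclGroupStart();
  for (int r=0; r<nranks; r++) {
    if (sendcounts[r]) ncclSend(sendbuff + senddispls[r], sendcounts[r], type, r, comm, stream);
    if (recvcounts[r]) ncclRecv(recvbuff + recvdispls[r], recvcounts[r], type, r, comm, stream);
  }
  ncclGroupEnd();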

@nevion

nevion commented Jun 7, 2020

@sjeaugey Finally getting around to using this branch. While piecing things out and applying typical optimization patterns to halo exchanges, I noticed that starting an asynchronous receive, so that peers may start transferring while "this" rank is still computing (but otherwise not touching the receive buffer), seems hard to make happen. Typically this is a dangling/optimistic receive, or the receive is started explicitly as soon as the previous receive's results are consumed and all details of what to receive are known (if they vary). At the timescales of a copy over InfiniBand, and likely even NVLink, being serialized with computation, this could be a significant amount of time. Without being able to set up and start transfers as soon as possible, uneven computation/arrival times across the domain get amplified. In other words, this fights overlapping computation with IO.

I had a few ideas, but with the group semantics this might not be doable.

Is one allowed to start multiple receives in an ncclGroup, and at a later time start the corresponding sends on that rank? This says no to me:

This operation is blocking for the GPU. If multiple ncclSend() and ncclRecv() operations need to progress concurrently to complete, they must be fused within a ncclGroupStart/End.

Do threads offer a solution to the problem? For example, can recvs be grouped in NCCL on one thread, and can the sends then be issued at a different point in time? This still seems unsupported; it just gets the recv blocking out of the computation thread... ultimately I don't see support for the sends being in a different group.

Do you have any suggestions or thoughts on how to asynchronously let the receives begin and start the sends at a later time? If not solutions, could you say a few words about where NCCL may be going to address this capability, and on what timeline?

Related question: if an early rank does its ncclSend/recv group, and at a much later time a last rank does its send/recv group, when in the timeline does the send from the early rank start transferring data, and how does it "start"? Does it wait for some sort of announcement? Assume IB is involved, if it matters. Does the timing of the last rank affect other ranks?

And I might as well also ask: how does NCCL deal with multiple IB cards on the same NUMA node as many GPUs?

@sjeaugey sjeaugey closed this Jun 8, 2020
@sjeaugey
Member Author

sjeaugey commented Jun 8, 2020

Closing this pull request as it's been merged.

@nevion Indeed, overlapping computation and NCCL communication is a complex subject, because NCCL communication happens on the GPU SMs and hence uses the same resources as the compute kernels. So communication can block computation and vice versa, which means that dynamic/opportunistic scheduling systems won't scale well (if some computation is done before NCCL on one GPU and after NCCL on another, you'll pay the computation time twice). This was true for allreduce, and it's still true for send/receive.

The solution we advise is therefore to make sure the scheduling is consistent across GPUs by making it static, or at least deterministic based on factors that are not GPU dependent. That's the only way to ensure good scaling. It also happens to be a classic MPI rule: noise is the enemy of scaling, and adds up linearly as you scale.

As for pre-posting receives or something equivalent, all communication needs the receiver to be active. The sender can still send some data before the receiver arrives, but that's limited to the intermediate buffer size and it will need to wait for the receiver to absorb that data before posting more.

Regarding IB cards (or network cards in general), each GPU should use all cards that are at the minimum distance in a round-robin manner.

Successfully merging this pull request may close these issues.

Point-to-point communications in NCCL?