
Point-to-point operations preview #316

Closed
wants to merge 2 commits into from

Conversation

sjeaugey
Member

This is a PR for people to review and provide feedback on the p2p branch (issue #212).

Add p2p primitives (ncclSend/ncclRecv).
Rework buffer management to only allocate the buffers that are needed.
@sjeaugey
Member Author

sjeaugey commented Mar 31, 2020

To test the feature, please use the alltoall_perf test from the NCCL tests ("p2p" branch): https://github.com/NVIDIA/nccl-tests/tree/p2p. Please ignore "in-place" numbers and errors, as in-place operation is not yet supported (and might never be).

Documentation is not yet available, so I'll be using this PR as temporary/moving documentation.

Usage

Two new verbs are added:

ncclResult_t  ncclSend(const void* sendbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, cudaStream_t stream);
ncclResult_t  ncclRecv(void* recvbuff, size_t count, ncclDataType_t datatype, int peer, ncclComm_t comm, cudaStream_t stream);

To implement more complex communication patterns, like concurrent send/receive operations (e.g. sendrecv, gather[v], scatter[v], alltoall[v,w]), multiple calls to ncclSend and ncclRecv can be fused together using group calls (ncclGroupStart/ncclGroupEnd). Whether ncclSend or ncclRecv is called first makes no difference; no communication will happen until the call to ncclGroupEnd.

Send to self is permitted, provided there is a matching receive from self, all within a group call (see the send-to-self example below).

Examples

Sendrecv

  ncclGroupStart();
  ncclSend(sendbuff, sendcount, sendtype, peer, comm, stream);
  ncclRecv(recvbuff, recvcount, recvtype, peer, comm, stream);
  ncclGroupEnd();

Alltoall

  ncclGroupStart();
  for (int r=0; r<nranks; r++) {
    ncclSend(sendbuff + r*sendcount, sendcount, sendtype, r, comm, stream);
    ncclRecv(recvbuff + r*recvcount, recvcount, recvtype, r, comm, stream);
  }
  ncclGroupEnd();

Scatter

  ncclGroupStart();
  if (rank == root) {
    for (int r=0; r<nranks; r++)
      ncclSend(sendbuff+r*size, size, type, r, comm, stream);
  }
  ncclRecv(recvbuff, size, type, root, comm, stream);
  ncclGroupEnd();

Gather

  ncclGroupStart();
  if (rank == root) {
  for (int r=0; r<nranks; r++)
    ncclRecv(recvbuff+r*size, size, type, r, comm, stream);
  }
  ncclSend(sendbuff, size, type, root, comm, stream);
  ncclGroupEnd();
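
Send to self

As a minimal sketch of the send-to-self rule above: the send and its matching receive from self are fused in a single group call (the buffers, count and type here are illustrative, and rank is this rank's own index in comm).

  ncclGroupStart();
  ncclSend(sendbuff, count, type, rank, comm, stream);
  ncclRecv(recvbuff, count, type, rank, comm, stream);
  ncclGroupEnd();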

Performance and other considerations

NCCL establishes point-to-point connections dynamically, i.e. on the first call that needs to communicate with a given peer. This can make that first call much slower.

Each point-to-point connection uses resources, which include a 4MB buffer, usually in GPU memory. Performing large all-to-all operations can result in a significant amount of memory being consumed by NCCL buffers. Setting NCCL_BUFFSIZE to a lower value can reduce that footprint, but can also affect the performance of collective operations.
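
Since connections are created lazily, one way to keep the first measured iteration representative is to pay the setup cost up front. The sketch below is one possible warm-up (not part of the API): it exchanges a single element with every peer once before the timed phase, assuming warmupsend and warmuprecv are illustrative device float buffers of at least nranks elements each.

  ncclGroupStart();
  for (int r=0; r<nranks; r++) {
    ncclSend(warmupsend + r, 1, ncclFloat, r, comm, stream);
    ncclRecv(warmuprecv + r, 1, ncclFloat, r, comm, stream);
  }
  ncclGroupEnd();
  cudaStreamSynchronize(stream); // connections to all peers now exist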

Limitations / Known issues

This preview version has a few limitations:

  • There can be only one send and one recv per peer.
  • Group launch (when using more than one GPU per process) is not supported for asymmetric patterns and results in a hang.
  • When using GPU Direct RDMA, each point-to-point connection uses resources in the GPU PCI address space. Tesla products should not have any limitation in that respect, but other lines like Quadro might run out of space. If that happens, consider disabling GPU Direct RDMA (NCCL_NET_GDR_LEVEL=0) or reducing the per-peer buffer size (NCCL_BUFFSIZE).

@sjeaugey sjeaugey linked an issue Mar 31, 2020 that may be closed by this pull request
@sjeaugey
Member Author

FYI @ktnyt @Tixxx @nevion @victoryang00 @2sin18 @akorzh @kuenishi @maxhgerlach @TE-KazukiYoshiyama @yzs981130 since you expressed interest in #212, so that you get updates from this PR.

@2sin18

2sin18 commented Apr 3, 2020

I used ncclSend/ncclRecv to implement AllToAllv and found it to be 10x slower than an ncclReduce-based implementation on an 8x V100 GPU system, while the performance on 4 GPUs is as expected.

Will ncclSend or ncclRecv always find the fastest way to deliver the message? For example, P2P communication between 2 GPUs without a direct NVLink connection might use another GPU that has NVLink connections to both GPUs as a bridge.

Also, it seems that data might be sent to the wrong peer in real-world model training, but the problem cannot be reproduced in unit tests or benchmarks.

Can ncclSend/ncclRecv be used in multiple NCCL communicators in different threads?

@Kur0x

Kur0x commented Apr 3, 2020

@sjeaugey
Will there be an ncclAlltoAll API?

@sjeaugey
Member Author

sjeaugey commented Apr 3, 2020

@2sin18 Communication between GPUs without a direct NVLink connection will go through system shared memory, which will indeed be much slower. Better supporting topologies like DGX-1, i.e. using NVLink through another GPU, would be a significant amount of work, which is why it didn't make it into this first preview. Besides, you'd still be limited to a single NVLink per GPU going to the other side (so, to serve the 4 other GPUs), hence alltoall bandwidth would drop to ~11 GB/s. For best performance you should use a system like DGX-2, which can reach more than 100 GB/s.
Can you explain in more detail what you mean by "Can ncclSend/ncclRecv be used in multiple NCCL communicators in different threads?" Do you mean different ncclComm_t (from the same group, but on different GPUs), or the same GPU on different groups?

@Kur0x we don't intend to have an alltoall API at this point, given the current API can serve that communication pattern and any variant of it (on sizes, buffers, ...).

@2sin18

2sin18 commented Apr 4, 2020

Do you mean different ncclComm_t (from the same group, but on different GPUs), or the same GPU on different groups?

the same GPU on different groups

@sjeaugey
Member Author

sjeaugey commented Apr 6, 2020

the same GPU on different groups

Like with collective operations, using different communicators concurrently is unsafe, as CUDA gives no guarantee on how operations will progress (one might block the other, causing a deadlock). Things may work in certain conditions, but this is not something we advise. Using a single communicator and merging all send/recv operations into "alltoallv" phases is safer, and might also be much faster as NCCL would use resources more efficiently.

@2sin18

2sin18 commented Apr 9, 2020


Thanks for the explanation.

Another question: how can I get topology information from NCCL to implement a topology-aware alltoallv algorithm using ncclSend/ncclRecv? According to the 2.6.4 release notes, it seems NCCL now detects speeds for PCI, IB and NICs.

@sjeaugey
Member Author

sjeaugey commented Apr 9, 2020

NCCL does not expose the topology through an API. That's because NCCL is supposed to handle the task of optimizing communication paths internally, so when you need to perform an alltoallv, you simply describe the alltoallv (with grouped send/recv to/from all the other peers) and NCCL should take care of the rest.

Also note that topology detection is not new in 2.6.4; what's new is just the speed detection.

@2sin18

2sin18 commented Apr 18, 2020

When will this change be merged to master or be part of a release?

@kwen2501
Contributor

We are currently testing NCCL 2.7.0, which has the point-to-point feature. If everything goes fine, we may release it (and merge the source code into master) in about 2-3 weeks.


@2sin18 2sin18 left a comment


If topology detection cannot read /sys/devices/.../max_link_width (e.g. with Linux kernel 3.10), a crash will happen. The generated topology.xml is as below:

<system version="1">
  <cpu numaid="0" affinity="00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <pci busid="0000:18:00.0" class="0x060400" link_speed="" link_width="0">
      <pci busid="0000:4f:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="0" sm="70" rank="0" gdr="1">
          <nvlink target="0000:50:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:60:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:5f:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:50:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="1" sm="70" rank="1" gdr="1">
          <nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:5f:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:4f:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:60:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:5d:00.0" class="0x060400" link_speed="" link_width="0">
      <pci busid="0000:5f:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="2" sm="70" rank="2" gdr="1">
          <nvlink target="0000:60:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:50:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:4f:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:60:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="3" sm="70" rank="3" gdr="1">
          <nvlink target="0000:4f:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:50:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:5f:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
  </cpu>
  <cpu numaid="-1" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <pci busid="0000:af:00.0" class="0x060400" link_speed="" link_width="0">
      <pci busid="0000:b1:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="4" sm="70" rank="4" gdr="1">
          <nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:5f:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:b2:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="5" sm="70" rank="5" gdr="1">
          <nvlink target="0000:60:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:d8:00.0" class="0x060400" link_speed="" link_width="0">
      <pci busid="0000:db:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="6" sm="70" rank="6" gdr="1">
          <nvlink target="0000:4f:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:dc:00.0" class="0x030200" link_speed="" link_width="0">
        <gpu dev="7" sm="70" rank="7" gdr="1">
          <nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:50:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:da:00.0" class="0x020000" link_speed="" link_width="0">
        <nic>
          <net name="mlx5_bond_0" dev="0" speed="25000" port="1" guid="0x6676a500039b0398" maxconn="262144" gdr="1"/>
        </nic>
      </pci>
    </pci>
  </cpu>
</system>

}
if (path->width == maxWidth && path->type == minType) nets[count++] = system->nodes[NET].nodes[n].id;
}
*id = nets[rr % count];

@2sin18 2sin18 Apr 21, 2020


If path->width is 0, count stays 0 and rr % count will hit a floating point exception (division by zero) here.

Suggested change:

-  *id = nets[rr % count];
+  if (count > 0) {
+    *id = nets[rr % count];
+  } else {
+    *id = system->nodes[NET].nodes[0].id;
+  }

@sjeaugey
Member Author

This is weird. Even when those are not defined, we should use default values (PCI Gen3 x16), and the width should never be 0. Thanks for posting the topology; that should help me reproduce the problem on my side.

@sjeaugey
Member Author

@2sin18 I pushed a fix, please try again. maxWidth was an integer instead of a float, so when you have a non-integer bandwidth (in your case, 25 Gb/s = 3.125 GB/s), maxWidth is equal to 3, hence the test path->width == maxWidth is never true, and you end up with 0 NICs.

@2sin18

2sin18 commented Apr 28, 2020

Hi @sjeaugey, is it possible to use a pre-defined transfer plan for ncclSend/ncclRecv, just like Funatiq/gossip? I see that NVIDIA/HugeCTR uses this library to do the all-to-all.

@sjeaugey
Member Author

NCCL relies on its topology detection to optimize transfers. It does not need hardcoded schedules based on the platform but builds that schedule automatically based on the topology.

That said, this first version is not optimized for all cases yet; in particular, intra-node DGX-1. That will come later.

@2sin18

2sin18 commented Apr 29, 2020

Thanks for the explanation.

I think the globally optimal transfer schedule depends not only on the topology but also on the algorithm. All-to-all and gather can both use ncclSend/ncclRecv, but they need different transfer plans to minimize cost. gossip uses an ILP solver to compute the optimal transfer plan for all-to-all or gather, which does not seem easy to reproduce in a ncclSend/ncclRecv setup.

@jiayisuse

Hey team, following up to check on the plan for merging the point-to-point API to master. We are going to implement NCCL-based alltoall in the PyTorch distributed library, and it will depend on the new NCCL release. Thanks!

@sjeaugey
Member Author

Sorry, we were delayed a bit, but things are looking good now. We're targeting a release before the end of the month.

@weberxie

Great job, looking forward to the new release!

I have built the new branch and run the tests.

Now I have a question: how does the new API handle the situation where different ranks may have to send data of different sizes, and some send counts may even be 0?

Thanks!

@sjeaugey
Member Author

@weberxie This should be handled correctly, except in the single process, multi-GPU case, as mentioned in the "Known issues" above:
« Group launch (when using more than one GPU per process) is not supported for asymmetric patterns and results in a hang. »

API-wise, everything is based solely on ncclSend and ncclRecv, so there is no problem sending different sizes, datatypes, etc. to different peers.
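
For illustration, an alltoallv built from these verbs might look like the sketch below, where the per-peer element counts and offsets (sendcounts, senddispls, recvcounts, recvdispls) are hypothetical arrays; skipping zero-count operations is only safe outside the single process, multi-GPU group-launch case mentioned above.

  ncclGroupStart();
  for (int r=0; r<nranks; r++) {
    if (sendcounts[r]) ncclSend(sendbuff + senddispls[r], sendcounts[r], type, r, comm, stream);
    if (recvcounts[r]) ncclRecv(recvbuff + recvdispls[r], recvcounts[r], type, r, comm, stream);
  }
  ncclGroupEnd();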

@nevion

nevion commented Jun 7, 2020

@sjeaugey Finally getting around to using this branch. While piecing things out and applying typical optimization patterns to halo exchanges, I noticed that starting an asynchronous receive, so that peers may start transferring while "this" rank is still computing (but otherwise not touching the receive buffer), seems hard to make happen. Typically this is a dangling/optimistic receive, or the receive is started explicitly as soon as the previous receive's results are consumed and all details of what to receive are known (if they vary). At the timescales of a copy over InfiniBand, and likely even NVLink, being serialized with computation, this could be a significant amount of time. Without being able to set up and start transfers as soon as possible, uneven computation/arrival times across the domain get amplified. In other words, this fights overlapping computation with IO.

I had a few ideas, but with the group semantics this might not be doable.

Is one allowed to start multiple receives in an ncclGroup, and at a later time start the corresponding sends on that rank? This says no to me:

This operation is blocking for the GPU. If multiple ncclSend() and ncclRecv() operations need to progress concurrently to complete, they must be fused within a ncclGroupStart/End.

Do threads offer a solution to the problem? For example, can recvs be grouped in NCCL on one thread, and can the sends then be issued at a different point in time? This still seems unsupported; it just gets the recv blocking out of the computation thread... ultimately I don't see support for the sends being in a different group.

Do you have any suggestions or thoughts on how to asynchronously let the receives begin and start the sends at a later time? If not solutions, could you say a few words about where NCCL may be going to address this capability, and on what timeline?

Related question: if an early rank does its ncclSend/recv group, and at a much later time a last rank does its send/recv group, when in the timeline does the send from the early rank start transferring data, and how does it "start"? Does it wait for some sort of announcement? Assume IB is involved, if it matters. Does the timing of the last rank affect other ranks?

And I might as well also ask: how does NCCL deal with multiple IB cards on the same NUMA node as many GPUs?

@sjeaugey sjeaugey closed this Jun 8, 2020
@sjeaugey
Member Author

sjeaugey commented Jun 8, 2020

Closing this pull request as it's been merged.

@nevion Indeed, overlapping computation and NCCL communication is a complex subject, because NCCL communication happens on the GPU SMs and hence uses the same resources as the compute kernels. So communication can block computation and vice versa, which means that dynamic/opportunistic scheduling systems won't scale well (if some computation is done before NCCL on one GPU and after NCCL on another, you'll pay the computation time twice). This was true for allreduce, and it's still true for send/receive.

The solution we advise is therefore to make sure the scheduling is consistent across GPUs by making it static, or at least deterministic based on factors that are not GPU dependent. That's the only way to ensure good scaling. It also happens to be a classic MPI rule: noise is the enemy of scaling, and adds up linearly as you scale.

As for pre-posting receives or something equivalent, all communication needs the receiver to be active. The sender can still send some data before the receiver arrives, but that's limited to the intermediate buffer size and it will need to wait for the receiver to absorb that data before posting more.

Regarding IB cards (or network cards in general), each GPU should use all cards that are at the minimum distance in a round-robin manner.

Successfully merging this pull request may close these issues.

Point-to-point communications in NCCL?