Point-to-point operations preview #316
Conversation
To test the feature, please use the p2p branch. Documentation is not yet available, so I'll be using this PR as temporary/moving documentation.
Usage
Two new verbs are added: ncclSend and ncclRecv.
To implement more complex communication patterns, like concurrent send/receive operations (e.g. sendrecv, gather[v], scatter[v], alltoall[v,w]), multiple calls to ncclSend and ncclRecv can be merged within a single group call (ncclGroupStart/ncclGroupEnd). Send to self is permitted, provided there is a matching receive from self, all within a group call.
Examples
Sendrecv
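A minimal sketch of what a sendrecv could look like with the new verbs, assuming ncclSend/ncclRecv take (buffer, count, datatype, peer, comm, stream) like the existing collectives; sendbuff, recvbuff, count, peer, comm and stream are assumed to already exist:

ncclGroupStart();
ncclSend(sendbuff, count, ncclFloat, peer, comm, stream);   // send our chunk to 'peer'
ncclRecv(recvbuff, count, ncclFloat, peer, comm, stream);   // receive the peer's chunk
ncclGroupEnd();   // both operations are issued together, so neither can block the other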
Alltoall
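Under the same assumptions, an alltoall can be expressed as one send and one receive per peer inside a single group; nranks is the number of ranks and each per-peer chunk holds chunkcount floats:

ncclGroupStart();
for (int r = 0; r < nranks; r++) {
  ncclSend((float*)sendbuff + r * chunkcount, chunkcount, ncclFloat, r, comm, stream);  // chunk r goes to rank r
  ncclRecv((float*)recvbuff + r * chunkcount, chunkcount, ncclFloat, r, comm, stream);  // chunk from rank r lands in slot r
}
ncclGroupEnd();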
Scatter
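A scatter sketch, again hedged on the same assumed signatures: the root posts one send per rank (including a send to itself, which is permitted within the group), and every rank posts a single matching receive from the root:

ncclGroupStart();
if (rank == root) {
  for (int r = 0; r < nranks; r++)
    ncclSend((float*)sendbuff + r * chunkcount, chunkcount, ncclFloat, r, comm, stream);  // one chunk per destination
}
ncclRecv(recvbuff, chunkcount, ncclFloat, root, comm, stream);  // every rank, root included, receives its chunk
ncclGroupEnd();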
Gather
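Gather is the mirror image, with the root posting one receive per rank and every rank posting a single send to the root (same assumed variables as above):

ncclGroupStart();
if (rank == root) {
  for (int r = 0; r < nranks; r++)
    ncclRecv((float*)recvbuff + r * chunkcount, chunkcount, ncclFloat, r, comm, stream);  // chunk from rank r lands in slot r
}
ncclSend(sendbuff, chunkcount, ncclFloat, root, comm, stream);  // every rank sends its contribution to the root
ncclGroupEnd();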
Performance and other considerations
NCCL will establish point-to-point connections dynamically, i.e. on the first call that needs to communicate with a given peer. This can make the first call much slower (one possible way to absorb that cost up front is sketched after this section). Each point-to-point connection uses resources, including a 4MB buffer, usually in GPU memory, so performing large all-to-all operations can result in a significant amount of memory being consumed for NCCL buffers. Setting the buffer size to a lower value limits that memory usage.
Limitations / Known issues
This preview version has a few limitations, including issues with the single-process / multi-GPU case discussed later in this thread.
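The warm-up pass referenced above is a hypothetical workaround, not part of this PR: it exchanges one element with every peer at startup so that connection setup is not paid inside the first real iteration. nranks, rank, comm and stream are assumed to be set up already; the data exchanged is meaningless.

float *src, *dst;                                  // throwaway scratch buffers
cudaMalloc((void**)&src, sizeof(float));
cudaMalloc((void**)&dst, nranks * sizeof(float));
ncclGroupStart();
for (int r = 0; r < nranks; r++) {
  ncclSend(src, 1, ncclFloat, r, comm, stream);      // same single element to each peer
  ncclRecv(dst + r, 1, ncclFloat, r, comm, stream);  // one slot per peer
}
ncclGroupEnd();
cudaStreamSynchronize(stream);                     // wait until all connections are established
cudaFree(src);
cudaFree(dst);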
FYI @ktnyt @Tixxx @nevion @victoryang00 @2sin18 @akorzh @kuenishi @maxhgerlach @TE-KazukiYoshiyama @yzs981130: tagging you since you expressed interest in #212, so that you get updates from this PR.
I use grouped ncclSend/ncclRecv to implement alltoall. Will communication between GPUs that are not directly connected by NVLink be much slower? Also, it seems like data might be sent to the wrong peer in real-world model training, but the problem cannot be reproduced in unit tests or benchmarks. Can you advise?
@sjeaugey
@2sin18 Communication between GPUs without a direct NVLink connection will go through system shared memory, which will indeed be much slower. Better supporting topologies like DGX-1, i.e. using NVLink through another GPU, would be a significant amount of work, which is why it didn't make it into this first preview. Besides, you'd still be limited to a single NVLink per GPU going to the other side (so, serving the 4 other GPUs), hence alltoall bandwidth would drop to ~11 GB/s. For best performance you should use a system like DGX-2, which can reach more than 100 GB/s. @Kur0x we don't intend to have an alltoall API at this point, given the current API can serve that communication pattern and any variant of it (on sizes, buffers, ...).
What about using the same GPU in different groups (i.e. with different communicators) concurrently?
Like with collective operations, using different communicators concurrently is unsafe, as CUDA gives no guarantee on how operations will progress (one might block the other, causing a deadlock). Things may work in certain conditions, but this is not something we advise. Using a single communicator and merging all send/recv operations in "alltoallv" phases is safer and might also be much faster as NCCL would use resources in a more efficient way.
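As an illustration of that advice, a hedged sketch of a merged "alltoallv" phase on a single communicator; sendcounts/sdispls/recvcounts/rdispls are assumed to be per-peer element counts and offsets computed by the application (zero counts are discussed further down in this thread):

ncclGroupStart();
for (int r = 0; r < nranks; r++) {
  // variable amount of data per peer; offsets are in elements
  ncclSend((float*)sendbuff + sdispls[r], sendcounts[r], ncclFloat, r, comm, stream);
  ncclRecv((float*)recvbuff + rdispls[r], recvcounts[r], ncclFloat, r, comm, stream);
}
ncclGroupEnd();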
Thanks for the explanation. Another question: how can I get topology information from NCCL to implement a topology-aware alltoallv algorithm using ncclSend/ncclRecv?
NCCL does not expose the topology through an API. That's because NCCL is supposed to handle the task of optimizing communication paths internally: when you need to perform an alltoallv, you simply describe it (with grouped send/recv to/from all the other peers) and NCCL should take care of the rest. Also note that topology detection is not new in 2.6.4; what's new is just the speed detection.
When will this change be merged into master or be part of a release?
We are currently testing NCCL 2.7.0, which has the point-to-point feature. If everything goes fine, we may release it (and merge the source code into master) in about 2-3 weeks.
If topology detection cannot read /sys/devices/.../max_link_width
(e.g. with Linux kernel 3.10), a crash happens. The generated topology.xml is shown below:
<system version="1">
<cpu numaid="0" affinity="00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
<pci busid="0000:18:00.0" class="0x060400" link_speed="" link_width="0">
<pci busid="0000:4f:00.0" class="0x030200" link_speed="" link_width="0">
<gpu dev="0" sm="70" rank="0" gdr="1">
<nvlink target="0000:50:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:60:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:5f:00.0" count="1" tclass="0x030200"/>
</gpu>
</pci>
<pci busid="0000:50:00.0" class="0x030200" link_speed="" link_width="0">
<gpu dev="1" sm="70" rank="1" gdr="1">
<nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:5f:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:4f:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:60:00.0" count="1" tclass="0x030200"/>
</gpu>
</pci>
</pci>
<pci busid="0000:5d:00.0" class="0x060400" link_speed="" link_width="0">
<pci busid="0000:5f:00.0" class="0x030200" link_speed="" link_width="0">
<gpu dev="2" sm="70" rank="2" gdr="1">
<nvlink target="0000:60:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:50:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:4f:00.0" count="1" tclass="0x030200"/>
</gpu>
</pci>
<pci busid="0000:60:00.0" class="0x030200" link_speed="" link_width="0">
<gpu dev="3" sm="70" rank="3" gdr="1">
<nvlink target="0000:4f:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:50:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:5f:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
</gpu>
</pci>
</pci>
</cpu>
<cpu numaid="-1" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
<pci busid="0000:af:00.0" class="0x060400" link_speed="" link_width="0">
<pci busid="0000:b1:00.0" class="0x030200" link_speed="" link_width="0">
<gpu dev="4" sm="70" rank="4" gdr="1">
<nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:5f:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
</gpu>
</pci>
<pci busid="0000:b2:00.0" class="0x030200" link_speed="" link_width="0">
<gpu dev="5" sm="70" rank="5" gdr="1">
<nvlink target="0000:60:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
</gpu>
</pci>
</pci>
<pci busid="0000:d8:00.0" class="0x060400" link_speed="" link_width="0">
<pci busid="0000:db:00.0" class="0x030200" link_speed="" link_width="0">
<gpu dev="6" sm="70" rank="6" gdr="1">
<nvlink target="0000:4f:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
</gpu>
</pci>
<pci busid="0000:dc:00.0" class="0x030200" link_speed="" link_width="0">
<gpu dev="7" sm="70" rank="7" gdr="1">
<nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
<nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
<nvlink target="0000:50:00.0" count="2" tclass="0x030200"/>
</gpu>
</pci>
<pci busid="0000:da:00.0" class="0x020000" link_speed="" link_width="0">
<nic>
<net name="mlx5_bond_0" dev="0" speed="25000" port="1" guid="0x6676a500039b0398" maxconn="262144" gdr="1"/>
</nic>
</pci>
</pci>
</cpu>
</system>
}
if (path->width == maxWidth && path->type == minType) nets[count++] = system->nodes[NET].nodes[n].id;
}
*id = nets[rr % count];
If path->width is 0, count stays 0, so rr % count will hit a floating point exception (integer division by zero) here.
Original line:
*id = nets[rr % count];
Suggested change:
if (count > 0) {
  *id = nets[rr % count];
} else {
  *id = system->nodes[NET].nodes[0].id;
}
This is weird. Even when those are not defined, we should use default values (PCI Gen3 x16), and width should never be 0. Thanks for posting the topology; that should help me reproduce the problem on my side.
@2sin18 I pushed a fix, please try again. maxWidth was an integer instead of being a float, so when you have a non-integer bandwidth (in your case 25 Gb/s = 3.125 GB/s) maxWidth was equal to 3, hence the test path->width == maxWidth could never match.
Hi @sjeaugey, is it possible to use a pre-defined transfer plan for ncclSend/ncclRecv, just like Funatiq/gossip? I found that NVIDIA/HugeCTR uses this library to do the all-to-all.
NCCL relies on its topology detection to optimize transfers. It does not need hardcoded schedules based on the platform; it builds that schedule automatically based on the topology. That said, this first version is not yet optimized for all cases, in particular intra-node DGX-1. That will come later.
Thanks for the explanation. I think the globally optimal transfer schedule depends not only on the topology but also on the algorithm: all-to-all and gather can both use ncclSend/ncclRecv but need different transfer plans to minimize cost. gossip uses an ILP solver to compute the optimal transfer plan for all-to-all or gather, which does not seem easy to reproduce in a ncclSend/ncclRecv setup.
Hey team, following up to check on the plan for merging the point-to-point API into master. We are going to implement NCCL-based alltoall in the PyTorch distributed library and depend on the new NCCL release. Thanks!
Sorry, we were delayed a bit, but things are looking good now. We're targeting a release before the end of the month.
Great job, looking forward to the new release! I have built the new branch and run the tests. Now I have a question: how does the new API handle the situation where different ranks may have to send data of different sizes, and some send counts may even be 0? Thanks!
@weberxie This should be handled correctly, except in the single-process / multi-GPU case mentioned in the "Known issues" above. API-wise, the API is solely based on individual ncclSend/ncclRecv calls, so each rank can pass its own count for each peer, including 0.
@sjeaugey Finally getting around to using this branch. While piecing things out and applying typical optimization patterns to halo exchanges, I noticed that starting an asynchronous receive, so peers may start transferring while "this" rank is still in computation (but otherwise not touching a receive buffer), seems to be hard to make happen. Typically this is a dangling/optimistic receive, or the receive is started explicitly as soon as the previous receive's results are consumed and all details of what to receive are known (if varying). At the timescales of a copy over InfiniBand, and likely even NVLink, being serialized with computation, this could be a significant amount of time. Without being able to set up/start transfers as soon as possible, uneven computation/arrival times across the domain get amplified. In other words, this is fighting overlapping computation with I/O. I had a few ideas, but it seems that with the group semantics this might not be doable. Is one allowed to start multiple receives in a ncclGroup, and at a later time start the corresponding sends on that rank? The usage description above seems to say no.
Do threads offer a solution to the problem? For example, can recvs be NCCL-grouped on one thread and the sends then issued at a different point in time? This still seems unsupported; it just moves the blocking recv out of the computation thread... ultimately I don't see support for the sends being in a different group. Do you have any suggestions/thoughts on how to asynchronously let the receives begin and start the sends at a later time? If not solutions, could you say a few words on where NCCL may be going to address this capability, and on timelines? Related question: if there's an early domain (rank) that does its ncclSend/recv group, and at a much later time a last rank does its send/recv group, when in the timeline does the send from the early rank start transferring data, and how does it "start"? Does it wait for some sort of announcement? Assume IB is involved, if it matters. Does the timing of the last rank affect other ranks? And might as well also ask: how does NCCL deal with multiple IB cards on the same NUMA node as many GPUs?
Closing this pull request as it's been merged. @nevion Indeed, overlapping computation and NCCL communication is a complex subject, because NCCL communication happens on the GPU SMs, hence uses the same resources as the compute kernels. So communication can block computation and vice versa, which means that dynamic/opportunistic scheduling systems won't scale well (if some computation is done before NCCL on one GPU and after NCCL on another, you'll pay the computation time twice). It was true for allreduce, and it's still true for send/receive. The solution we advise is therefore to make sure the scheduling is consistent across GPUs by making the scheduling static, or at least deterministic on factors that are not GPU dependent. That's the only way to ensure good scaling. Which happens to also be a classic MPI rule: noise is the enemy of scaling, and adds up linearly as you scale. As for pre-posting receives or something equivalent, all communication needs the receiver to be active. The sender can still send some data before the receiver arrives, but that's limited to the intermediate buffer size, and it will need to wait for the receiver to absorb that data before posting more. Regarding IB cards (or network cards in general), each GPU should use all cards that are at the minimum distance, in a round-robin manner.
This is a PR for people to review and provide feedback on the p2p branch (issue #212).