No algorithm/protocol available when NCCL_ALGO is set to Tree #317
Comments
Hi, thanks for reporting this. You are seeing this error because there is no tree-based implementation of broadcast in NCCL, so I guess the problem boils down to whether NCCL should fall back to another algorithm when users set it to something that does not exist for certain collectives. This could sometimes be useful when users mix collectives and only want to apply [...] Btw, NCCL should be able to auto-tune the algorithm and protocol based on the number of nodes and ranks, message size, platform, etc., so hard-setting those two environment variables is often not necessary.
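To make the fall-back idea concrete, here is a minimal sketch of what such a policy could look like. It is not NCCL code: the `SUPPORTED` table, the `resolve_algorithm` helper and the "first entry wins" rule standing in for auto-tuning are all assumptions for illustration. The only fact taken from this thread is that Broadcast has no Tree implementation.

```python
# Hypothetical support table: which algorithms each collective implements.
# Only "Broadcast has no Tree" comes from the thread; the rest is illustrative.
SUPPORTED = {
    "AllReduce": ["Tree", "Ring"],
    "Broadcast": ["Ring"],
    "Reduce": ["Ring"],
}


def resolve_algorithm(coll, requested=None):
    """Return (algorithm, fell_back) for a collective.

    `requested` plays the role of a user-forced algorithm. When it is not
    implemented for `coll`, fall back instead of failing.
    """
    available = SUPPORTED[coll]
    if requested is None:
        # Placeholder for auto-tuning: just take the first supported entry.
        return available[0], False
    if requested in available:
        return requested, False
    # The forced algorithm does not exist for this collective: fall back.
    return available[0], True
```

With this policy, `resolve_algorithm("Broadcast", "Tree")` returns `("Ring", True)` instead of raising an error, while `resolve_algorithm("AllReduce", "Tree")` still honors the forced choice.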
@kwen2501 Thanks for your reply. I think I will use some fall-back algorithms for those without an available algorithm/protocol.
@joapolarbear can you explain why you want to force [...]
@sjeaugey I am working on a program that needs to know some details about the communication operations in NCCL, including the topology and dependency information. We have done the "ring" algorithm part and want to cover the "tree" algorithm. Basically, there is no real need to use the "tree" algorithm; we just want to design a system that can cover both algorithms.
Is it still true that there is no tree-based implementation for broadcast? Could you explain why this is the case? If I am not mistaken, the 2-tree algorithm for AllReduce is a 2-tree Reduce followed by (maybe pipelined with?) a 2-tree Broadcast. Can we not implement Reduce as the first phase of AllReduce, and Broadcast as the second phase of AllReduce? Thank you.
I do think we can still do 2-tree for reduction/broadcast, as long as for reduce we gather the data at the end, and for broadcast we send the target device's data to the other tree root at the beginning.
Yes, we could use it for broadcast/reduce, but we would have to recompute the trees for each root and connect the ranks, which we don't do at the moment.
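The decomposition discussed above (AllReduce as a tree Reduce followed by a tree Broadcast) can be sketched with a small simulation. This is only an illustration under simplifying assumptions: it uses a single heap-shaped binary tree rooted at rank 0 rather than NCCL's double tree, runs sequentially with no pipelining, and all function names are made up for the example.

```python
def parent(rank):
    # Binary tree rooted at rank 0, laid out like a heap.
    return (rank - 1) // 2


def children(rank, nranks):
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < nranks]


def tree_reduce(values):
    """Sum values up the tree; returns the total held by the root (rank 0)."""
    partial = list(values)
    # Visit higher ranks first, so each rank has already absorbed its
    # children's partial sums before it sends to its own parent.
    for rank in range(len(values) - 1, 0, -1):
        partial[parent(rank)] += partial[rank]
    return partial[0]


def tree_broadcast(root_value, nranks):
    """Push the root's value down the tree; returns one value per rank."""
    out = [None] * nranks
    out[0] = root_value
    # Parents have lower ranks than their children, so they are filled first.
    for rank in range(nranks):
        for child in children(rank, nranks):
            out[child] = out[rank]
    return out


def tree_allreduce(values):
    # Phase 1: Reduce to the root. Phase 2: Broadcast from the root.
    return tree_broadcast(tree_reduce(values), len(values))
```

Note that the root is hard-wired to rank 0 here. A standalone Broadcast or Reduce with an arbitrary root would need a tree rooted at that rank, which is the per-root tree recomputation mentioned in the reply above.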
Hi,
I set NCCL_ALGO=Tree and ran the MNIST example of Horovod on 2 host machines, but it raised the error that no algorithm/protocol is available. I found the problem is related to the following code
This shows that when the operation is not AllReduce and the algorithm is Tree, the bandwidth is set to 0. In my case, Horovod performs a Broadcast operation before training. If NCCL_ALGO is set to Tree, the corresponding bandwidth is 0, so the estimated transmission time exceeds the threshold (1 hour) and no algorithm is chosen.
I wonder why NCCL does not set the bandwidth for non-AllReduce operations with the Tree algorithm. Thanks.