[BUG] Use the Correct WG Communicator #4548
LGTM (disclaimer: I'm not familiar with the behavior, side-effects, and intent, so feel free to also pull in others if necessary)
The tl;dr is that the local communicator is used for intra-node communication and the global communicator is used for inter-node communication. If there's only one node, the two are effectively the same communicator. But if there are multiple nodes, the local communicator won't include all workers, so we could potentially hang if we use it here. There usually isn't any reason to use the local communicator in this context unless we're doing something that we know only involves the workers on the current node.
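To make the distinction concrete, here is a minimal pure-Python sketch of which ranks each kind of communicator would cover. The helpers `local_group` and `global_group` are hypothetical and are not part of WholeGraph or cuGraph; the sketch also assumes ranks are assigned contiguously per node, which may not match a real deployment.

```python
# Illustrative only: models communicator membership as plain lists of ranks.
# `local_group` and `global_group` are hypothetical helpers, not library APIs.

def local_group(rank: int, ranks_per_node: int) -> list[int]:
    """Ranks on the same node as `rank` (assumes contiguous rank layout)."""
    node = rank // ranks_per_node
    start = node * ranks_per_node
    return list(range(start, start + ranks_per_node))


def global_group(world_size: int) -> list[int]:
    """Every rank in the job, across all nodes."""
    return list(range(world_size))


# Single node with 4 workers: both groups contain the same ranks.
single_node_same = local_group(2, ranks_per_node=4) == global_group(4)

# Two nodes with 4 workers each: rank 5's local group misses ranks 0-3,
# so a collective over it would never reach half of the workers.
missing = sorted(set(global_group(8)) - set(local_group(5, ranks_per_node=4)))
```

In the two-node case, `missing` holds the ranks that a local-only collective would leave out, which is the situation that could cause a hang.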
…ghi-nv/cugraph into use-correct-communicator
/merge
ac35be3 into rapidsai:branch-24.08
[BUG] Use the Correct WG Communicator (rapidsai/cugraph#4548)
cuGraph-PyG's WholeFeatureStore currently uses the local communicator when it should be using the global communicator, as was originally intended. This PR modifies the feature store so it correctly calls get_global_node_communicator().
This also fixes another bug where torch.int32 was used to store the number of edges in the graph, which resulted in an overflow error when the number of edges exceeded that datatype's maximum value. The datatype is now correctly set to int64.
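The integer-width issue can be illustrated with a small standard-library sketch. It uses `ctypes` fixed-width integers as a stand-in for tensor dtypes, so it shows silent wraparound rather than the error reported in the PR, and the edge count is a made-up value chosen only to exceed the 32-bit range.

```python
import ctypes

# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1

# Hypothetical edge count, chosen only because it exceeds INT32_MAX.
num_edges = 3_000_000_000

# ctypes does no overflow checking, so the value wraps around in 32 bits.
as_int32 = ctypes.c_int32(num_edges).value

# A signed 64-bit integer holds the same count without loss.
as_int64 = ctypes.c_int64(num_edges).value

print(num_edges > INT32_MAX)  # True
print(as_int32)               # -1294967296
print(as_int64)               # 3000000000
```

Any graph with more than 2,147,483,647 edges needs a 64-bit count, which is why the change to int64 matters for large graphs.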