Add DistributedNeighborLoader [3/6] #8080
Conversation
**[1/3] Distributed Loaders PRs**

This PR includes the base class `DistributedLoader`, which handles the RPC connection and serves requests from `DistributedNeighborSampler` processes. It also includes the basic `DistNeighborSampler` functions used by the loader (see the RPC sketch below).

1. #8079
2. #8080
3. #8085

Other PRs related to this module:
DistSampler: #7974
GraphStore\FeatureStore: #8083

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: rusty1s <[email protected]>
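For readers new to the setup, here is a minimal sketch of the kind of per-worker RPC bootstrap such a base loader class typically performs. The function names, the default port, and the use of the TensorPipe backend are assumptions for illustration, not the actual `DistLoader` API:

```python
# Hypothetical sketch of per-worker RPC setup, NOT the actual DistLoader code.
import torch.distributed.rpc as rpc


def init_worker_rpc(worker_name: str, rank: int, world_size: int,
                    master_addr: str = '127.0.0.1',
                    master_port: int = 29500) -> None:
    # All sampler/loader processes rendezvous at the same address so that they
    # can later exchange remote sampling requests via RPC.
    options = rpc.TensorPipeRpcBackendOptions(
        init_method=f'tcp://{master_addr}:{master_port}')
    rpc.init_rpc(name=worker_name, rank=rank, world_size=world_size,
                 rpc_backend_options=options)


def shutdown_worker_rpc() -> None:
    # Blocks until outstanding RPC work (e.g. pending sampling requests) is done.
    rpc.shutdown()
```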
This code is part of the overall distributed training support for PyG. `DistNeighborSampler` leverages the `NeighborSampler` class from `pytorch_geometric` and the `neighbor_sample` function from `pyg-lib`. However, because distributed training requires synchronizing the results between machines after each layer, the part of the code responsible for sampling was implemented in Python (a rough sketch of this layer-wise flow follows below).

Added support for the following sampling methods:
- node, edge, negative, disjoint, temporal

**TODOs:**
- [x] finish hetero part
- [x] subgraph sampling

**This PR should be merged together with other distributed PRs:**
pyg-lib: [#246](pyg-team/pyg-lib#246), [#252](pyg-team/pyg-lib#252)
GraphStore\FeatureStore: #8083
DistLoaders:
1. #8079
2. #8080
3. #8085

---------

Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ZhengHongming888 <[email protected]>
Co-authored-by: Jakub Pietrak <[email protected]>
Co-authored-by: Matthias Fey <[email protected]>
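To make the "synchronize after each layer" point concrete, here is a rough, hypothetical sketch of a layer-wise sampling loop; `sample_local_hop` and `fetch_remote_hop` are placeholder callables, not functions from this PR:

```python
# Illustrative only: why layer-wise sampling is done in Python. After each
# hop, the local frontier must be merged with neighbors sampled on remote
# partitions before the next hop can start.
from typing import Callable, List

import torch


def sample_layerwise(seed: torch.Tensor, num_neighbors: List[int],
                     sample_local_hop: Callable,
                     fetch_remote_hop: Callable) -> torch.Tensor:
    sampled = [seed]
    frontier = seed
    for fanout in num_neighbors:
        local_nbrs = sample_local_hop(frontier, fanout)   # local partition
        remote_nbrs = fetch_remote_hop(frontier, fanout)  # via RPC, blocking
        # Synchronization point: merge the results from all machines before
        # the next layer is sampled.
        frontier = torch.unique(torch.cat([local_nbrs, remote_nbrs]))
        sampled.append(frontier)
    return torch.cat(sampled)
```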
edge_index = part_data[1]._edge_index[(None, "coo")]

assert "DistNeighborLoader()" in str(loader)
assert "DistNeighborLoader" in str(loader)
please add DistNeighborLoader to the distributed/init.py file |
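A sketch of that export, assuming the module lives under `torch_geometric/distributed/` (the file and class paths here are guesses based on this thread):

```python
# torch_geometric/distributed/__init__.py (sketch)
from .dist_neighbor_loader import DistNeighborLoader

__all__ = [
    'DistNeighborLoader',
]
```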
This code is part of the overall distributed training support for PyG. Please be aware that this PR should be merged before the Loaders package! - @JakubPietrakIntel

Loaders:
1. #8079
2. #8080
3. #8085

Other PRs related to this module:
DistSampler: #7974

---------

Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <[email protected]>
**Changes made:**
- added support for temporal sampling
- use torch.Tensors instead of numpy arrays
- move _sample_one_hop() from NeighborSampler to DistNeighborSampler
- do not go with the disjoint flow in the _sample() function - this is not needed because the batch is calculated afterwards (see the sketch below)
- added tests for node sampling and disjoint sampling (works without DistNeighborLoader)
- added tests for node temporal sampling (works without DistNeighborLoader)
- some minor changes such as renaming variables, etc.

This PR is based on #8083, so both must be combined to pass the tests.

Other distributed PRs: #8083 #8080 #8085

---------

Co-authored-by: Matthias Fey <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
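The "batch is calculated afterwards" point can be illustrated with a small, hypothetical helper: once the number of sampled nodes per seed is known, the disjoint batch vector can be reconstructed without a separate disjoint code path inside `_sample()`:

```python
# Hedged illustration only; not code from this PR.
import torch


def batch_from_num_sampled_nodes(num_sampled_per_seed: torch.Tensor) -> torch.Tensor:
    # e.g. tensor([3, 2, 4]) -> tensor([0, 0, 0, 1, 1, 2, 2, 2, 2]):
    # node i of the output belongs to the subgraph of seed batch[i].
    return torch.repeat_interleave(
        torch.arange(num_sampled_per_seed.numel()), num_sampled_per_seed)
```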
@pytest.mark.parametrize('num_parts', [2])
@pytest.mark.parametrize('num_workers', [0])
@pytest.mark.parametrize('async_sampling', [True])
@pytest.mark.skip(reason="Breaks with no attribute 'num_hops'")
@JakubPietrakIntel This test currently breaks for me. I am disabling it for now.
Codecov Report
@@ Coverage Diff @@
## master #8080 +/- ##
==========================================
+ Coverage 87.17% 88.42% +1.24%
==========================================
Files 473 474 +1
Lines 28757 28804 +47
==========================================
+ Hits 25068 25469 +401
+ Misses 3689 3335 -354
... and 8 files with indirect coverage changes
**[3/3] Distributed Loaders PRs**

This PR includes `DistributedLinkNeighborLoader`, used for processing edge sampler output in a distributed training setup.

1. #8079
2. #8080
3. #8085

Other PRs related to this module:
DistSampler: #7974
GraphStore\FeatureStore: #8083

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <[email protected]>
else:  # Tuple[FeatureStore, GraphStore]
    data = Data(
@rusty1s I think this code was meant to add the features to the data.
Where are metadata[-3], metadata[-2] and metadata[-1] populated? The sampler does not have access to node features, so I got confused by this part and dropped it.
This info is added to the SamplerOutput metadata after sampling and collecting features from all machines, in the DistSampler _collate_fn().
The output of _collate_fn() includes nfeats, nlabels, efeats, which are our metadata[-3], metadata[-2] and metadata[-1]. Here: L717C1-L718C22

I will add a small PR to revert these lines.
@rusty1s It is different from the single-node case. For distributed training, after node sampling there needs to be one extra step that puts the labels, nfeats, and efeats gathered from the different machines into the PyG data format for the loader.
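A hedged sketch of that extra step, assuming (as stated above) that the last three metadata entries are nfeats, nlabels, and efeats; the helper name and the exact fields of the sampler output are illustrative:

```python
# Illustrative only: convert distributed sampler output plus gathered features
# into a PyG Data object for the loader. Assumes out.metadata[-3:] holds
# (nfeats, nlabels, efeats), as described in this thread.
import torch
from torch_geometric.data import Data


def build_data(out):
    nfeats, nlabels, efeats = out.metadata[-3], out.metadata[-2], out.metadata[-1]
    return Data(
        x=nfeats,                                            # node features
        y=nlabels,                                           # node labels
        edge_index=torch.stack([out.row, out.col], dim=0),   # sampled edges
        edge_attr=efeats,                                    # edge features
        num_nodes=out.node.numel(),
    )
```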
**[2/3] Distributed Loaders PRs**

This PR includes `DistributedNeighborLoader`, used for processing node sampler output in a distributed training setup.

1. DistLoader #8079
2. DistributedNeighborLoader [3/6] #8080
3. DistributedLinkNeighborLoader [4/6] #8085

Other PRs related to this module:
DistSampler: #7974
GraphStore\FeatureStore: #8083