
Add DistributedNeighborLoader [3/6] #8080

Merged
merged 28 commits from intel/dist-neighbor-loader into master on Nov 10, 2023
Conversation

@JakubPietrakIntel (Contributor) commented Sep 27, 2023

**[2/3] Distributed Loaders PRs**
This PR includes `DistributedNeighborLoader`, used for processing node sampler output in a distributed training setup.

  1. Add base class DistLoader #8079
  2. Add DistributedNeighborLoader [3/6] #8080
  3. Add DistributedLinkNeighborLoader [4/6] #8085

Other PRs related to this module:
DistSampler: #7974
GraphStore/FeatureStore: #8083
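As a rough, hypothetical model of what such a loader does (the class name and structure below are illustrative, not the actual PyG API): the loader iterates over seed-node batches, hands each batch to a sampler, and yields the sampler's output.

```python
# Hypothetical, simplified model of a distributed neighbor loader;
# names and structure are illustrative, not the actual PyG API.
class ToyNeighborLoader:
    def __init__(self, seeds, batch_size, sampler):
        self.seeds = seeds            # input (seed) node ids
        self.batch_size = batch_size
        self.sampler = sampler        # callable: seed batch -> sampler output

    def __iter__(self):
        # Split the seeds into batches and process each node-sampler output.
        for i in range(0, len(self.seeds), self.batch_size):
            batch = self.seeds[i:i + self.batch_size]
            yield self.sampler(batch)

loader = ToyNeighborLoader(list(range(6)), batch_size=4,
                           sampler=lambda b: {"seeds": b})
print([out["seeds"] for out in loader])  # [[0, 1, 2, 3], [4, 5]]
```

In the real loader the sampler call goes over RPC to `DistNeighborSampler` processes; here it is just a local callable.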

rusty1s added a commit that referenced this pull request Oct 2, 2023
**[1/3] Distributed Loaders PRs** 
This PR includes the base class `DistributedLoader`, which handles the RPC
connection and requests from `DistributedNeighborSampler` processes.

It includes basic `DistNeighborSampler` functions used by the loader.

1.  #8079
2.  #8080
3.  #8085

Other PRs related to this module:
DistSampler: #7974
GraphStore/FeatureStore: #8083

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: rusty1s <[email protected]>
rusty1s added a commit that referenced this pull request Oct 9, 2023
This code is part of the overall distributed training support for PyG.

`DistNeighborSampler` leverages the `NeighborSampler` class from
`pytorch_geometric` and the `neighbor_sample` function from `pyg-lib`.
However, because distributed training requires synchronising the results
between machines after each layer, the part of the code responsible for
sampling was implemented in Python.
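The layer-by-layer loop with a merge after each hop can be sketched in plain Python. Everything here (the parity-based partitioning, the helper names, the toy adjacency) is an assumed illustration, not the actual `DistNeighborSampler` code:

```python
import random

# Toy adjacency, partitioned across two "machines" by node-id parity.
# All names and data here are illustrative, not the PyG implementation.
partitions = [
    {0: [1, 2], 2: [3, 4], 4: [0]},   # machine 0 owns even nodes
    {1: [2, 3], 3: [0, 5], 5: [1]},   # machine 1 owns odd nodes
]

def owner(node):
    return node % 2

def sample_one_hop(nodes, fanout, rng):
    """Each machine samples neighbors for the seeds it owns; the
    per-machine results are then merged -- the synchronisation step
    that forces this layer loop to live in Python."""
    per_machine = [[] for _ in partitions]
    for n in nodes:
        nbrs = partitions[owner(n)].get(n, [])
        per_machine[owner(n)].extend(rng.sample(nbrs, min(fanout, len(nbrs))))
    merged = []
    for part in per_machine:  # merge results after every hop
        merged.extend(part)
    return merged

def sample(seeds, fanouts, seed=0):
    rng = random.Random(seed)
    layers, frontier = [list(seeds)], list(seeds)
    for fanout in fanouts:
        frontier = sample_one_hop(frontier, fanout, rng)
        layers.append(frontier)
    return layers

layers = sample([0, 1], fanouts=[2, 1])
print(len(layers))  # 3 entries: the seeds plus one layer per hop
```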

Added support for the following sampling methods:
- node, edge, negative, disjoint, temporal

**TODOs:**

- [x] finish hetero part
- [x] subgraph sampling

**This PR should be merged together with other distributed PRs:**
pyg-lib: [#246](pyg-team/pyg-lib#246),
[#252](pyg-team/pyg-lib#252)
GraphStore/FeatureStore: #8083
DistLoaders:
1.  #8079
2.  #8080
3.  #8085

---------

Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ZhengHongming888 <[email protected]>
Co-authored-by: Jakub Pietrak <[email protected]>
Co-authored-by: Matthias Fey <[email protected]>
@rusty1s rusty1s changed the title Add DistributedNeighborLoader Add DistributedNeighborLoader [3/6] Oct 30, 2023

edge_index = part_data[1]._edge_index[(None, "coo")]

assert "DistNeighborLoader()" in str(loader)
Suggested change:
assert "DistNeighborLoader" in str(loader)
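A minimal sketch of why the looser check is preferable, using a stand-in class (not the real loader): the assertion then depends only on the class name, not on how the repr formats its arguments.

```python
# Stand-in class for illustration only; not the real DistNeighborLoader.
class DistNeighborLoader:
    def __repr__(self):
        # A repr of the form 'ClassName()'; argument formatting may change.
        return f'{self.__class__.__name__}()'

loader = DistNeighborLoader()
# Checking for the bare class name keeps passing even if the repr
# later gains arguments inside the parentheses.
assert 'DistNeighborLoader' in str(loader)
```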

@kgajdamo (Contributor):

Please add `DistNeighborLoader` to the `distributed/__init__.py` file.

rusty1s added a commit that referenced this pull request Nov 6, 2023
This code is part of the overall distributed training support for PyG.

Please be aware that this PR should be merged before the Loaders package! -
@JakubPietrakIntel
Loaders:
1.  #8079
2.  #8080
3.  #8085

Other PRs related to this module:
DistSampler: #7974

---------

Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <[email protected]>
rusty1s added a commit that referenced this pull request Nov 10, 2023
**Changes made:**
- added support for temporal sampling
- use `torch.Tensor`s instead of numpy arrays
- moved `_sample_one_hop()` from `NeighborSampler` to `DistNeighborSampler`
- do not follow the disjoint flow in the `_sample()` function - not needed,
because the batch is calculated afterwards
- added tests for node sampling and disjoint sampling (works without
`DistNeighborLoader`)
- added tests for temporal node sampling (works without
`DistNeighborLoader`)
- minor changes such as renaming variables
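The temporal-sampling constraint added above can be illustrated with a toy snippet (hypothetical data and helper name, not the PR's implementation): a neighbor is only a valid candidate if its edge timestamp does not exceed the seed's time.

```python
# Illustrative-only sketch of the temporal constraint; the data and
# the helper name are made up, not taken from the PR.
edges = {  # node -> [(neighbor, edge_time), ...]
    0: [(1, 5), (2, 12), (3, 7)],
    1: [(0, 3)],
}

def temporal_neighbors(node, seed_time):
    """Return only the neighbors whose edge time does not exceed the
    seed time, so no batch can 'see into the future'."""
    return [nbr for nbr, t in edges.get(node, []) if t <= seed_time]

print(temporal_neighbors(0, seed_time=8))  # [1, 3]
```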

This PR is based on #8083, so both must be combined to pass the tests.

Other distributed PRs:
#8083 
#8080 
#8085

---------

Co-authored-by: Matthias Fey <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@pytest.mark.parametrize('num_parts', [2])
@pytest.mark.parametrize('num_workers', [0])
@pytest.mark.parametrize('async_sampling', [True])
@pytest.mark.skip(reason="Breaks with no attribute 'num_hops'")
@JakubPietrakIntel This test currently breaks for me. I am disabling it for now.


codecov bot commented Nov 10, 2023

Codecov Report

Merging #8080 (ed73dcf) into master (9ea2233) will increase coverage by 1.24%.
The report is 1 commit behind head on master.
The diff coverage is 78.76%.

@@            Coverage Diff             @@
##           master    #8080      +/-   ##
==========================================
+ Coverage   87.17%   88.42%   +1.24%     
==========================================
  Files         473      474       +1     
  Lines       28757    28804      +47     
==========================================
+ Hits        25068    25469     +401     
+ Misses       3689     3335     -354     
Files Coverage Δ
torch_geometric/distributed/__init__.py 100.00% <100.00%> (ø)
torch_geometric/distributed/local_graph_store.py 96.93% <100.00%> (+1.06%) ⬆️
torch_geometric/sampler/neighbor_sampler.py 92.02% <100.00%> (+5.55%) ⬆️
...orch_geometric/distributed/dist_neighbor_loader.py 95.45% <95.45%> (ø)
torch_geometric/loader/node_loader.py 95.45% <77.77%> (-0.89%) ⬇️
...rch_geometric/distributed/dist_neighbor_sampler.py 64.09% <70.83%> (+64.09%) ⬆️

... and 8 files with indirect coverage changes


@rusty1s rusty1s merged commit 40cc3b1 into master Nov 10, 2023
@rusty1s rusty1s deleted the intel/dist-neighbor-loader branch November 10, 2023 12:19
rusty1s added a commit that referenced this pull request Nov 10, 2023
**[3/3] Distributed Loaders PRs**
This PR includes `DistributedLinkNeighborLoader`, used for processing
edge sampler output in a distributed training setup.


1.  #8079
2.  #8080
3.  #8085

Other PRs related to this module:
DistSampler: #7974
GraphStore/FeatureStore: #8083

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <[email protected]>
else: # Tuple[FeatureStore, GraphStore]
data = Data(
@rusty1s I think this code was to add the features to the data.

@rusty1s (Member) commented Nov 13, 2023:

Where are `metadata[-3]`, `metadata[-2]` and `metadata[-1]` populated? The sampler does not have access to node features, so I got confused by this part and dropped it.

@kgajdamo (Contributor) commented Nov 13, 2023:

This info is added to the `SamplerOutput` metadata after sampling and collecting features from all machines, in the `DistSampler` `_collate_fn()`.

@JakubPietrakIntel (Contributor, Author) commented Nov 14, 2023:

The output of `_collate_fn()` includes `nfeats`, `nlabels`, `efeats`, which are our `metadata[-3]`, `metadata[-2]` and `metadata[-1]`. Here: L717C1-L718C22
I will add a small PR to revert these lines.


@rusty1s
PR with the fix: #8377


@rusty1s It is different from the single-node case: in the distributed case, after node sampling there needs to be an extra step that puts the labels, nfeats, and efeats gathered from different machines into the PyG data format for the loader.
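A toy sketch of that extra assembly step, with made-up data and names (not PyG's `_collate_fn`): features gathered from each machine's shard are stitched back together in sampled-node order before being handed to the loader.

```python
# Illustrative-only sketch (not PyG's _collate_fn): after sampling,
# node features gathered from each machine are stitched back together
# in the order of the sampled node ids.
feature_parts = [            # machine -> {node: feature}; toy data
    {0: [1.0], 2: [3.0]},    # machine 0 owns even nodes
    {1: [2.0], 3: [4.0]},    # machine 1 owns odd nodes
]

def collate(sampled_nodes, labels):
    """Assemble per-machine feature shards into one batch record,
    mirroring the extra step needed in the distributed case."""
    nfeats = [feature_parts[n % 2][n] for n in sampled_nodes]
    return {"node": sampled_nodes, "x": nfeats, "y": labels}

batch = collate([0, 3, 1], labels=[0, 1, 0])
print(batch["x"])  # [[1.0], [4.0], [2.0]]
```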
