
Add support for distributed sampling #246

Merged: 8 commits, Sep 5, 2023

Conversation

kgajdamo (Contributor)

This code is part of the overall distributed training support for PyG.

Description

Neighbor sampling for distributed training differs from the sampling currently implemented in pyg-lib. During distributed training, nodes from one batch can be sampled by different machines (and therefore by different samplers), which results in incorrect subtree/subgraph node indexing.
To obtain correct results it is necessary to sample one hop at a time and then synchronise the outputs between machines.

Proposed algorithm:

  1. In neighbor_sample, first sample only the global node IDs (sampled_nodes), keeping duplicates.
  2. Do not build rows and cols yet; instead record how many neighbors were sampled for each node (cumm_sum_sampled_nbrs_per_node).
  3. After each layer, synchronise and merge the outputs from the different machines and take the new seed nodes (without duplicates) from sampled_nodes.
  4. Sample the next layer and repeat steps 1-3 until all layers are sampled.
  5. Perform global-to-local mappings using the mapper and create (row, col) based on sampled_nodes_with_duplicates and sampled_nbrs_per_node.

Step 3 is implemented in pytorch_geometric. A minimal sketch of the per-layer flow is given below.
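For illustration, here is a minimal pure-Python sketch of steps 1-4. It is not the pyg-lib C++ implementation; the tensor layouts and helper names are assumptions made for this sketch only.

```python
import torch

# Sketch of steps 1-2: sample one hop of global node IDs (with duplicates)
# and record the cumulative number of sampled neighbors per seed node.
def sample_one_hop(rowptr, col, seed, num_neighbors):
    sampled_nodes, cumsum_nbrs_per_node = [], [0]
    for n in seed.tolist():
        nbrs = col[rowptr[n]:rowptr[n + 1]]
        if 0 <= num_neighbors < nbrs.numel():
            nbrs = nbrs[torch.randperm(nbrs.numel())[:num_neighbors]]
        sampled_nodes.append(nbrs)
        cumsum_nbrs_per_node.append(cumsum_nbrs_per_node[-1] + nbrs.numel())
    return torch.cat(sampled_nodes), torch.tensor(cumsum_nbrs_per_node)

# Step 3 (implemented in pytorch_geometric): merge the per-machine outputs and
# take the deduplicated sampled nodes as the seeds of the next layer, e.g.
#   merged = torch.cat(outputs_from_all_machines)
#   next_seed = merged.unique()
# Step 4: repeat until all layers are sampled; step 5 is sketched further below.
```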

Added

  • a new argument distributed for the neighbor_sample function that enables the algorithm described above.

  • a new argument batch for the neighbor_sample function that allows specifying the initial subgraph indices for the seed nodes (used with disjoint).

  • a new return value cumm_sum_sampled_nbrs_per_node of the neighbor_sample function that returns the cumulative sum of the sampled neighbors per node.

  • a new function relabel_neighborhood, used after all layers have been sampled, which relabels the global indices of the sampled nodes to the local subtree/subgraph indices (row, col); see the sketch after this list.

  • a new function hetero_relabel_neighborhood (the same as relabel_neighborhood, but for heterogeneous graphs); returns (row_dict, col_dict).

  • unit tests
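To illustrate step 5 and the purpose of relabel_neighborhood, here is a toy sketch of the relabeling idea. It is not the C++ implementation; the exact (row, col) convention and the argument layout are assumptions for illustration only.

```python
import torch

def relabel_sketch(seed, sampled_nodes_with_duplicates, sampled_nbrs_per_node):
    # Global -> local mapping: a node's local index is the position of its
    # first occurrence in (seed, then sampled nodes) order.
    mapper = {}
    for n in torch.cat([seed, sampled_nodes_with_duplicates]).tolist():
        mapper.setdefault(n, len(mapper))

    # Each sampled neighbor contributes one edge. Here `col` holds the local
    # index of the node whose neighborhood was sampled and `row` the local
    # index of the sampled neighbor (assumed convention for this sketch).
    row, col, offset = [], [], 0
    for src_local, num in enumerate(sampled_nbrs_per_node):
        for dst in sampled_nodes_with_duplicates[offset:offset + num].tolist():
            row.append(mapper[dst])
            col.append(src_local)
        offset += num
    return torch.tensor(row), torch.tensor(col)

# Toy example (assumed inputs): seeds [10, 20], neighbors of 10 -> [20, 30],
# neighbors of 20 -> [30].
row, col = relabel_sketch(torch.tensor([10, 20]),
                          torch.tensor([20, 30, 30]), [2, 1])
# row = tensor([1, 2, 2]), col = tensor([0, 0, 1])
```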

@codecov (bot) commented on Aug 21, 2023

Codecov Report

Merging #246 (b45cc0e) into master (888238c) will decrease coverage by 3.22%.
The diff coverage is 28.33%.

@@            Coverage Diff             @@
##           master     #246      +/-   ##
==========================================
- Coverage   82.83%   79.61%   -3.22%     
==========================================
  Files          28       28              
  Lines         938      996      +58     
==========================================
+ Hits          777      793      +16     
- Misses        161      203      +42     
Files changed:
  pyg_lib/csrc/sampler/neighbor.cpp: 54.16% <5.71%> (-45.84%) ⬇️
  pyg_lib/csrc/sampler/cpu/neighbor_kernel.cpp: 80.31% <60.00%> (-2.64%) ⬇️


kgajdamo force-pushed the dist-sampler branch 2 times, most recently from 557f959 to e183f1d on August 21, 2023, 14:28
rusty1s (Member) left a comment:

Thanks for the PR. A few minor high-level comments about the neighbor sampling part.

Overall, I think it would be easier to get this PR in if we added merge_sampler_outputs and relabel_neighborhood separately.

std::vector<int64_t>>
sample(const at::Tensor& rowptr,
const at::Tensor& col,
const at::Tensor& seed,
const std::vector<int64_t>& num_neighbors,
const c10::optional<at::Tensor>& time,
const c10::optional<at::Tensor>& seed_time,
const c10::optional<at::Tensor>& batch,
rusty1s (Member) commented:

How does batch behave in case of disjoint=False? If distributed sampling requires disjoint=True anyway, I am not totally sure I understand why we need this new argument here.

kgajdamo (Contributor, Author) replied:

Distributed sampling does not require disjoint=true; it can also work with disjoint=false. batch is used only when disjoint=true, otherwise it is not relevant.
Why batch is needed: during distributed sampling we sample one hop at a time in C++ and then leave the sample() function. If we sample more than one layer, the information about which subgraph a given node belonged to would otherwise be lost. The batch variable lets us assign the initial subgraph values.

rusty1s (Member) replied:

Yes, this is clear now, but I am still not sure why we need it. In the end, we can just do batch[out.batch] outside of sampling to reconstruct the correct batch information.
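A small, hedged illustration of this suggestion (the values and the exact shape of out.batch are assumed example data, not actual sampler output):

```python
import torch

# `batch` holds the user-provided initial subgraph id of each seed node;
# `out_batch` (the sampler's disjoint output) maps every sampled node to the
# index of the seed it was reached from.
batch = torch.tensor([7, 7, 9])           # assumed example values
out_batch = torch.tensor([0, 0, 1, 2, 2])
subgraph_id = batch[out_batch]            # tensor([7, 7, 7, 9, 9])
```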

kgajdamo (Contributor, Author) commented on Sep 5, 2023:

> Thanks for the PR. A few minor high-level comments about the neighbor sampling part.
>
> Overall, I think it would be easier to get this PR in if we added merge_sampler_outputs and relabel_neighborhood separately.

Thank you for the comments. As you suggested, I opened a new PR for merge_sampler_outputs: #252

rusty1s changed the title from "Add support for distributed sampler" to "Add support for distributed sampling" on Sep 5, 2023
rusty1s (Member) left a comment:

Thanks @kgajdamo. To decrease the size of the PR, I removed the relabel function. Can you check it in again in a separate PR? Please be patient with me :(

In this PR, I made the following changes:

  • I added dist_neighbor_sample and dist_hetero_neighbor_sample. Please call these on the PyG side, since I don't want to pollute the general neighbor_sample interface. For dist_neighbor_sample we could also consider cleaning up the interface, e.g., num_neighbors should only be an int rather than a list, but it's not a must.
  • For now, I removed the batch argument, since I still don't see why we need it. Please bring it back if you see no other way to resolve this.

rusty1s (Member) commented on Sep 5, 2023:

Can we also add a test for these new functions?

rusty1s enabled auto-merge (squash) on September 5, 2023, 14:39
rusty1s merged commit 6af62de into pyg-team:master on Sep 5, 2023
kgajdamo (Contributor, Author) commented on Sep 6, 2023:

> Thanks @kgajdamo. To decrease the size of the PR, I removed the relabel function. Can you check it in again in a separate PR? Please be patient with me :(
>
> In this PR, I made the following changes:
>
>   • I added dist_neighbor_sample and dist_hetero_neighbor_sample. Please call these on the PyG side, since I don't want to pollute the general neighbor_sample interface. For dist_neighbor_sample we could also consider cleaning up the interface, e.g., num_neighbors should only be an int rather than a list, but it's not a must.
>   • For now, I removed the batch argument, since I still don't see why we need it. Please bring it back if you see no other way to resolve this.

Thanks for the updates. It is a good idea to have a dist_neighbor_sample function instead of reusing neighbor_sample. Here are some additional comments from me:

  • Because distributed sampling loops over the layers in Python, in the hetero case there is only one edge type at the moment we call neighbor_sample. The call effectively becomes homogeneous, so we do not need dist_hetero_neighbor_sample and can use dist_neighbor_sample instead. I therefore removed the dist_hetero_neighbor_sample function (see the sketch after this list).
  • I also removed all unused outputs and kept only node, edge_ids, and cummsum_sampled_nbrs_per_node, which makes the interface clearer.
  • I changed the std::vector<int64_t> num_neighbors input list into a single int64_t one_hop_num, as you suggested.
  • Here is the new PR with the above-mentioned changes and new unit tests (without the relabel function, which I will add in the next one): #253
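A rough sketch of why the hetero case reduces to a homogeneous per-edge-type call in the Python-side loop. Function and argument names below are illustrative assumptions, not the final pyg-lib API.

```python
def sample_all_layers(csc_per_edge_type, seed_per_node_type,
                      num_neighbors_per_layer, one_hop_sampler):
    """`one_hop_sampler` stands in for a one-hop call such as dist_neighbor_sample."""
    for one_hop_num in num_neighbors_per_layer:  # the layer loop lives in Python
        for (src_type, rel, dst_type), (rowptr, col) in csc_per_edge_type.items():
            seed = seed_per_node_type[dst_type]
            # Only a single edge type is involved here, so the call is
            # effectively homogeneous -- no dist_hetero_neighbor_sample needed.
            node, edge_ids, cumsum_nbrs_per_node = one_hop_sampler(
                rowptr, col, seed, one_hop_num)
            # ...synchronise/merge the results across machines (on the
            # pytorch_geometric side), deduplicate, and update the seeds of
            # src_type for the next layer.
```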

rusty1s added a commit that referenced this pull request Sep 6, 2023
This code is part of the overall distributed training support for PyG.

This PR is complementary to [#246](#246) and introduces some updates.

What has been changed:
* Removed the unneeded `dist_hetero_neighbor_sample` function (because distributed sampling loops over the layers in Python, in the hetero case there is only one edge type at the moment `neighbor_sample` is called, so the call is effectively homogeneous and `dist_neighbor_sample` can be used instead).
* Removed all unused outputs and kept only the following: `node`, `edge_ids`, `cummsum_sampled_nbrs_per_node`.
* Changed the `std::vector<int64_t> num_neighbors` input list into a single `int64_t one_hop_num`.

Added:
* Unit tests

---------

Co-authored-by: rusty1s <[email protected]>
rusty1s added a commit that referenced this pull request Sep 15, 2023
This code is part of the overall distributed training support for PyG.

## Description
Distributed training requires merging the results between machines after each layer. For the subsequent steps of the algorithm, the results must be sorted according to the sampling order. This PR introduces a function whose purpose is to handle the merge and sort operations in parallel (a toy sketch of this idea follows this commit note).

**Other distributed PRs:**
pytorch_geometric DistLoader:
[#7869](pyg-team/pytorch_geometric#7869)
pytorch_geometric DistSampler:
[#7974](pyg-team/pytorch_geometric#7974)
pyg-lib: [#246](#246)

---------

Co-authored-by: rusty1s <[email protected]>
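A toy sketch of the merge-and-sort idea described in the commit note above. This is illustration only: the real pyg-lib function operates on the full sampler outputs and performs the merge in parallel in C++, and the `global_order` input here is an assumption of this sketch.

```python
import torch

def merge_and_sort(nodes_per_machine, global_order_per_machine):
    # Concatenate the partial outputs from all machines and restore the
    # global sampling order by sorting on each node's order index.
    nodes = torch.cat(nodes_per_machine)
    order = torch.cat(global_order_per_machine)
    return nodes[torch.argsort(order)]
```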
rusty1s added a commit that referenced this pull request Sep 15, 2023
(#254)

This code is part of the overall distributed training support for PyG.

This PR is complementary to [#246](#246).

## Description
Perform global-to-local mappings using the mapper and create (row, col)
based on sampled_nodes_with_duplicates and sampled_nbrs_per_node.

**Other distributed PRs:**
pytorch_geometric DistLoader:
[#7869](pyg-team/pytorch_geometric#7869)
pytorch_geometric DistSampler:
[#7974](pyg-team/pytorch_geometric#7974)
pyg-lib [MERGED]: [#246](#246)
pyg-lib: [#252](#252)
pyg-lib: [#253](#253)

---------

Co-authored-by: Matthias Fey <[email protected]>
rusty1s added a commit to pyg-team/pytorch_geometric that referenced this pull request Oct 9, 2023
This code is part of the overall distributed training support for PyG.

`DistNeighborSampler` leverages the `NeighborSampler` class from
`pytorch_geometric` and the `neighbor_sample` function from `pyg-lib`.
However, because distributed training requires synchronising the results
between machines after each layer, the part of the code responsible for
sampling was implemented in Python.

Added support for the following sampling methods:
- node, edge, negative, disjoint, temporal

**TODOs:**

- [x] finish hetero part
- [x] subgraph sampling

**This PR should be merged together with other distributed PRs:**
pyg-lib: [#246](pyg-team/pyg-lib#246),
[#252](pyg-team/pyg-lib#252)
GraphStore/FeatureStore:
#8083
DistLoaders:
1.  #8079
2.  #8080
3.  #8085

---------

Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ZhengHongming888 <[email protected]>
Co-authored-by: Jakub Pietrak <[email protected]>
Co-authored-by: Matthias Fey <[email protected]>