
Add support for distributed sampling #246

Merged: 8 commits, Sep 5, 2023

Conversation

kgajdamo (Contributor)

This code is part of the overall distributed training support for PyG.

Description

Neighbor sampling for distributed training differs from the sampling currently implemented in pyg-lib. During distributed training, nodes from one batch can be sampled by different machines (and therefore by different samplers), which results in incorrect subtree/subgraph node indexing.
To obtain correct results it is necessary to sample one hop at a time and then synchronise the outputs between machines.

Proposed algorithm:

  1. In neighbor_sample, first sample only the global node IDs (sampled_nodes), keeping duplicates.
  2. Do not build rows and cols yet; instead record how many neighbors were sampled for each node (cumm_sum_sampled_nbrs_per_node).
  3. After each layer, synchronise and merge the outputs from the different machines and take the new seed nodes (without duplicates) from sampled_nodes.
  4. Sample the next layer and repeat steps 1-3 until all layers are sampled.
  5. Perform global-to-local mappings using the mapper and create (row, col) based on sampled_nodes_with_duplicates and sampled_nbrs_per_node.

Step 3 is implemented in pytorch_geometric. A minimal sketch of the per-layer flow is given below.
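For illustration, here is a minimal pure-Python sketch of steps 1-4. It is not the pyg-lib C++ implementation; the tensor layouts and helper names are assumptions made for this sketch only.

```python
import torch

# Sketch of steps 1-2: sample one hop of global node IDs (with duplicates)
# and record the cumulative number of sampled neighbors per seed node.
def sample_one_hop(rowptr, col, seed, num_neighbors):
    sampled_nodes, cumsum_nbrs_per_node = [], [0]
    for n in seed.tolist():
        nbrs = col[rowptr[n]:rowptr[n + 1]]
        if 0 <= num_neighbors < nbrs.numel():
            nbrs = nbrs[torch.randperm(nbrs.numel())[:num_neighbors]]
        sampled_nodes.append(nbrs)
        cumsum_nbrs_per_node.append(cumsum_nbrs_per_node[-1] + nbrs.numel())
    return torch.cat(sampled_nodes), torch.tensor(cumsum_nbrs_per_node)

# Step 3 (implemented in pytorch_geometric): merge the per-machine outputs and
# take the deduplicated sampled nodes as the seeds of the next layer, e.g.
#   merged = torch.cat(outputs_from_all_machines)
#   next_seed = merged.unique()
# Step 4: repeat until all layers are sampled; step 5 is sketched further below.
```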

Added

  • a new argument distributed for the neighbor_sample function that enables the algorithm described above.

  • a new argument batch for the neighbor_sample function that allows specifying the initial subgraph indices for the seed nodes (used with disjoint).

  • a new return value cumm_sum_sampled_nbrs_per_node of the neighbor_sample function that returns the cumulative sum of the sampled neighbors per node.

  • a new function relabel_neighborhood, used after all layers have been sampled, which relabels the global indices of the sampled nodes to the local subtree/subgraph indices (row, col); see the sketch after this list.

  • a new function hetero_relabel_neighborhood (the same as relabel_neighborhood, but for heterogeneous graphs); returns (row_dict, col_dict).

  • unit tests
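To illustrate step 5 and the purpose of relabel_neighborhood, here is a toy sketch of the relabeling idea. It is not the C++ implementation; the exact (row, col) convention and the argument layout are assumptions for illustration only.

```python
import torch

def relabel_sketch(seed, sampled_nodes_with_duplicates, sampled_nbrs_per_node):
    # Global -> local mapping: a node's local index is the position of its
    # first occurrence in (seed, then sampled nodes) order.
    mapper = {}
    for n in torch.cat([seed, sampled_nodes_with_duplicates]).tolist():
        mapper.setdefault(n, len(mapper))

    # Each sampled neighbor contributes one edge. Here `col` holds the local
    # index of the node whose neighborhood was sampled and `row` the local
    # index of the sampled neighbor (assumed convention for this sketch).
    row, col, offset = [], [], 0
    for src_local, num in enumerate(sampled_nbrs_per_node):
        for dst in sampled_nodes_with_duplicates[offset:offset + num].tolist():
            row.append(mapper[dst])
            col.append(src_local)
        offset += num
    return torch.tensor(row), torch.tensor(col)

# Toy example (assumed inputs): seeds [10, 20], neighbors of 10 -> [20, 30],
# neighbors of 20 -> [30].
row, col = relabel_sketch(torch.tensor([10, 20]),
                          torch.tensor([20, 30, 30]), [2, 1])
# row = tensor([1, 2, 2]), col = tensor([0, 0, 1])
```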

@codecov (bot) commented on Aug 21, 2023

Codecov Report

Merging #246 (b45cc0e) into master (888238c) will decrease coverage by 3.22%.
The diff coverage is 28.33%.

@@            Coverage Diff             @@
##           master     #246      +/-   ##
==========================================
- Coverage   82.83%   79.61%   -3.22%     
==========================================
  Files          28       28              
  Lines         938      996      +58     
==========================================
+ Hits          777      793      +16     
- Misses        161      203      +42     
Files changed:
  pyg_lib/csrc/sampler/neighbor.cpp: 54.16% <5.71%> (-45.84%) ⬇️
  pyg_lib/csrc/sampler/cpu/neighbor_kernel.cpp: 80.31% <60.00%> (-2.64%) ⬇️


kgajdamo force-pushed the dist-sampler branch 2 times, most recently from 557f959 to e183f1d on August 21, 2023, 14:28
rusty1s (Member) left a comment:

Thanks for the PR. A few minor high-level comments about the neighbor sampling part.

Overall, I think it would be easier to get this PR in if we added merge_sampler_outputs and relabel_neighborhood separately.

std::vector<int64_t>>
sample(const at::Tensor& rowptr,
const at::Tensor& col,
const at::Tensor& seed,
const std::vector<int64_t>& num_neighbors,
const c10::optional<at::Tensor>& time,
const c10::optional<at::Tensor>& seed_time,
const c10::optional<at::Tensor>& batch,
rusty1s (Member) commented:

How does batch behave in case of disjoint=False? If distributed sampling requires disjoint=True anyway, I am not totally sure I understand why we need this new argument here.

kgajdamo (Contributor, Author) replied:

Distributed sampling does not require disjoint=true; it can also work with disjoint=false. batch is used only when disjoint=true, otherwise it is not relevant.
Why batch is needed: during distributed sampling we sample one hop at a time in C++ and then leave the sample() function. If we sample more than one layer, the information about which subgraph a given node belonged to would otherwise be lost. The batch variable lets us assign the initial subgraph values.

rusty1s (Member) replied:

Yes, this is clear now, but I am still not sure why we need it. In the end, we can just do batch[out.batch] outside of sampling to reconstruct the correct batch information.
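A small, hedged illustration of this suggestion (the values and the exact shape of out.batch are assumed example data, not actual sampler output):

```python
import torch

# `batch` holds the user-provided initial subgraph id of each seed node;
# `out_batch` (the sampler's disjoint output) maps every sampled node to the
# index of the seed it was reached from.
batch = torch.tensor([7, 7, 9])           # assumed example values
out_batch = torch.tensor([0, 0, 1, 2, 2])
subgraph_id = batch[out_batch]            # tensor([7, 7, 7, 9, 9])
```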

kgajdamo (Contributor, Author) commented on Sep 5, 2023:

> Thanks for the PR. A few minor high-level comments about the neighbor sampling part.
>
> Overall, I think it would be easier to get this PR in if we added merge_sampler_outputs and relabel_neighborhood separately.

Thank you for the comments. As you suggested, I opened a new PR for merge_sampler_outputs: #252

rusty1s changed the title from "Add support for distributed sampler" to "Add support for distributed sampling" on Sep 5, 2023
rusty1s (Member) left a comment:

Thanks @kgajdamo. To decrease the size of the PR, I removed the relabel function. Can you check it in again in a separate PR? Please be patient with me :(

In this PR, I made the following changes:

  • I added dist_neighbor_sample and dist_hetero_neighbor_sample. Please call these on the PyG side, since I don't want to pollute the general neighbor_sample interface. For dist_neighbor_sample we could also consider cleaning up the interface, e.g., num_neighbors should only be an int rather than a list, but it's not a must.
  • For now, I removed the batch argument, since I still don't see why we need it. Please bring it back if you see no other way to resolve this.

rusty1s (Member) commented on Sep 5, 2023:

Can we also add a test for these new functions?

rusty1s enabled auto-merge (squash) on September 5, 2023, 14:39
rusty1s merged commit 6af62de into pyg-team:master on Sep 5, 2023
kgajdamo (Contributor, Author) commented on Sep 6, 2023:

> Thanks @kgajdamo. To decrease the size of the PR, I removed the relabel function. Can you check it in again in a separate PR? Please be patient with me :(
>
> In this PR, I made the following changes:
>
>   • I added dist_neighbor_sample and dist_hetero_neighbor_sample. Please call these on the PyG side, since I don't want to pollute the general neighbor_sample interface. For dist_neighbor_sample we could also consider cleaning up the interface, e.g., num_neighbors should only be an int rather than a list, but it's not a must.
>   • For now, I removed the batch argument, since I still don't see why we need it. Please bring it back if you see no other way to resolve this.

Thanks for the updates. It is a good idea to have a dist_neighbor_sample function instead of reusing neighbor_sample. Here are some additional comments from me:

  • Because distributed sampling loops over the layers in Python, in the hetero case there is only one edge type at the moment we call neighbor_sample. The call effectively becomes homogeneous, so we do not need dist_hetero_neighbor_sample and can use dist_neighbor_sample instead. I therefore removed the dist_hetero_neighbor_sample function (see the sketch after this list).
  • I also removed all unused outputs and kept only node, edge_ids, and cummsum_sampled_nbrs_per_node, which makes the interface clearer.
  • I changed the std::vector<int64_t> num_neighbors input list into a single int64_t one_hop_num, as you suggested.
  • Here is the new PR with the above-mentioned changes and new unit tests (without the relabel function, which I will add in the next one): #253
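A rough sketch of why the hetero case reduces to a homogeneous per-edge-type call in the Python-side loop. Function and argument names below are illustrative assumptions, not the final pyg-lib API.

```python
def sample_all_layers(csc_per_edge_type, seed_per_node_type,
                      num_neighbors_per_layer, one_hop_sampler):
    """`one_hop_sampler` stands in for a one-hop call such as dist_neighbor_sample."""
    for one_hop_num in num_neighbors_per_layer:  # the layer loop lives in Python
        for (src_type, rel, dst_type), (rowptr, col) in csc_per_edge_type.items():
            seed = seed_per_node_type[dst_type]
            # Only a single edge type is involved here, so the call is
            # effectively homogeneous -- no dist_hetero_neighbor_sample needed.
            node, edge_ids, cumsum_nbrs_per_node = one_hop_sampler(
                rowptr, col, seed, one_hop_num)
            # ...synchronise/merge the results across machines (on the
            # pytorch_geometric side), deduplicate, and update the seeds of
            # src_type for the next layer.
```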

rusty1s added a commit that referenced this pull request Sep 6, 2023
This code is part of the overall distributed training support for PyG.

This PR is complementary to [#246](#246) and introduces some updates.

What has been changed:
* Removed the unneeded `dist_hetero_neighbor_sample` function (because distributed sampling loops over the layers in Python, in the hetero case there is only one edge type at the moment `neighbor_sample` is called, so the call is effectively homogeneous and `dist_neighbor_sample` can be used instead).
* Removed all unused outputs and kept only the following: `node`, `edge_ids`, `cummsum_sampled_nbrs_per_node`.
* Changed the `std::vector<int64_t> num_neighbors` input list into a single `int64_t one_hop_num`.

Added:
* Unit tests

---------

Co-authored-by: rusty1s <[email protected]>
rusty1s added a commit that referenced this pull request Sep 15, 2023
This code is part of the overall distributed training support for PyG.

## Description
Distributed training requires merging the results between machines after each layer. For the subsequent steps of the algorithm, the results must be sorted according to the sampling order. This PR introduces a function whose purpose is to handle the merge and sort operations in parallel (a toy sketch of this idea follows this commit note).

**Other distributed PRs:**
pytorch_geometric DistLoader:
[#7869](pyg-team/pytorch_geometric#7869)
pytorch_geometric DistSampler:
[#7974](pyg-team/pytorch_geometric#7974)
pyg-lib: [#246](#246)

---------

Co-authored-by: rusty1s <[email protected]>
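A toy sketch of the merge-and-sort idea described in the commit note above. This is illustration only: the real pyg-lib function operates on the full sampler outputs and performs the merge in parallel in C++, and the `global_order` input here is an assumption of this sketch.

```python
import torch

def merge_and_sort(nodes_per_machine, global_order_per_machine):
    # Concatenate the partial outputs from all machines and restore the
    # global sampling order by sorting on each node's order index.
    nodes = torch.cat(nodes_per_machine)
    order = torch.cat(global_order_per_machine)
    return nodes[torch.argsort(order)]
```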
rusty1s added a commit that referenced this pull request Sep 15, 2023
(#254)

This code is part of the overall distributed training support for PyG.

This PR is complementary to [#246](#246).

## Description
Perform global-to-local mappings using the mapper and create (row, col)
based on sampled_nodes_with_duplicates and sampled_nbrs_per_node.

**Other distributed PRs:**
pytorch_geometric DistLoader:
[#7869](pyg-team/pytorch_geometric#7869)
pytorch_geometric DistSampler:
[#7974](pyg-team/pytorch_geometric#7974)
pyg-lib [MERGED]: [#246](#246)
pyg-lib: [#252](#252)
pyg-lib: [#253](#253)

---------

Co-authored-by: Matthias Fey <[email protected]>
rusty1s added a commit to pyg-team/pytorch_geometric that referenced this pull request Oct 9, 2023
This code is part of the overall distributed training support for PyG.

`DistNeighborSampler` leverages the `NeighborSampler` class from
`pytorch_geometric` and the `neighbor_sample` function from `pyg-lib`.
However, because distributed training requires synchronising the results
between machines after each layer, the part of the code responsible for
sampling was implemented in Python.

Added support for the following sampling methods:
- node, edge, negative, disjoint, temporal

**TODOs:**

- [x] finish hetero part
- [x] subgraph sampling

**This PR should be merged together with other distributed PRs:**
pyg-lib: [#246](pyg-team/pyg-lib#246),
[#252](pyg-team/pyg-lib#252)
GraphStore/FeatureStore:
#8083
DistLoaders:
1.  #8079
2.  #8080
3.  #8085

---------

Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ZhengHongming888 <[email protected]>
Co-authored-by: Jakub Pietrak <[email protected]>
Co-authored-by: Matthias Fey <[email protected]>