
Add DistributedNeighborLoader [3/6] #8080

Merged
merged 28 commits from intel/dist-neighbor-loader into master on Nov 10, 2023
Conversation

@JakubPietrakIntel (Contributor) commented Sep 27, 2023

**[2/3] Distributed Loaders PRs**
This PR includes `DistributedNeighborLoader`, used for processing node sampler output in a distributed training setup.

  1. Add base class DistLoader #8079
  2. Add DistributedNeighborLoader [3/6] #8080
  3. Add DistributedLinkNeighborLoader [4/6] #8085

Other PRs related to this module:
DistSampler: #7974
GraphStore/FeatureStore: #8083
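As a rough, hypothetical model of what such a loader does (the class name and structure below are illustrative, not the actual PyG API): the loader iterates over seed-node batches, hands each batch to a sampler, and yields the sampler's output.

```python
# Hypothetical, simplified model of a distributed neighbor loader;
# names and structure are illustrative, not the actual PyG API.
class ToyNeighborLoader:
    def __init__(self, seeds, batch_size, sampler):
        self.seeds = seeds            # input (seed) node ids
        self.batch_size = batch_size
        self.sampler = sampler        # callable: seed batch -> sampler output

    def __iter__(self):
        # Split the seeds into batches and process each node-sampler output.
        for i in range(0, len(self.seeds), self.batch_size):
            batch = self.seeds[i:i + self.batch_size]
            yield self.sampler(batch)

loader = ToyNeighborLoader(list(range(6)), batch_size=4,
                           sampler=lambda b: {"seeds": b})
print([out["seeds"] for out in loader])  # [[0, 1, 2, 3], [4, 5]]
```

In the real loader the sampler call goes over RPC to `DistNeighborSampler` processes; here it is just a local callable.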

rusty1s added a commit that referenced this pull request Oct 2, 2023
**[1/3] Distributed Loaders PRs** 
This PR includes the base class `DistributedLoader`, which handles the RPC
connection and requests from `DistributedNeighborSampler` processes.

It includes basic `DistNeighborSampler` functions used by the loader.

1.  #8079
2.  #8080
3.  #8085

Other PRs related to this module:
DistSampler: #7974
GraphStore/FeatureStore: #8083

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: rusty1s <[email protected]>
rusty1s added a commit that referenced this pull request Oct 9, 2023
This code is part of the overall distributed training support for PyG.

`DistNeighborSampler` leverages the `NeighborSampler` class from
`pytorch_geometric` and the `neighbor_sample` function from `pyg-lib`.
However, because distributed training requires synchronising the results
between machines after each layer, the part of the code responsible for
sampling was implemented in Python.
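The layer-by-layer loop with a merge after each hop can be sketched in plain Python. Everything here (the parity-based partitioning, the helper names, the toy adjacency) is an assumed illustration, not the actual `DistNeighborSampler` code:

```python
import random

# Toy adjacency, partitioned across two "machines" by node-id parity.
# All names and data here are illustrative, not the PyG implementation.
partitions = [
    {0: [1, 2], 2: [3, 4], 4: [0]},   # machine 0 owns even nodes
    {1: [2, 3], 3: [0, 5], 5: [1]},   # machine 1 owns odd nodes
]

def owner(node):
    return node % 2

def sample_one_hop(nodes, fanout, rng):
    """Each machine samples neighbors for the seeds it owns; the
    per-machine results are then merged -- the synchronisation step
    that forces this layer loop to live in Python."""
    per_machine = [[] for _ in partitions]
    for n in nodes:
        nbrs = partitions[owner(n)].get(n, [])
        per_machine[owner(n)].extend(rng.sample(nbrs, min(fanout, len(nbrs))))
    merged = []
    for part in per_machine:  # merge results after every hop
        merged.extend(part)
    return merged

def sample(seeds, fanouts, seed=0):
    rng = random.Random(seed)
    layers, frontier = [list(seeds)], list(seeds)
    for fanout in fanouts:
        frontier = sample_one_hop(frontier, fanout, rng)
        layers.append(frontier)
    return layers

layers = sample([0, 1], fanouts=[2, 1])
print(len(layers))  # 3 entries: the seeds plus one layer per hop
```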

Added support for the following sampling methods:
- node, edge, negative, disjoint, temporal

**TODOs:**

- [x] finish hetero part
- [x] subgraph sampling

**This PR should be merged together with other distributed PRs:**
pyg-lib: [#246](pyg-team/pyg-lib#246),
[#252](pyg-team/pyg-lib#252)
GraphStore/FeatureStore: #8083
DistLoaders:
1.  #8079
2.  #8080
3.  #8085

---------

Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ZhengHongming888 <[email protected]>
Co-authored-by: Jakub Pietrak <[email protected]>
Co-authored-by: Matthias Fey <[email protected]>
@rusty1s rusty1s changed the title Add DistributedNeighborLoader Add DistributedNeighborLoader [3/6] Oct 30, 2023

edge_index = part_data[1]._edge_index[(None, "coo")]

assert "DistNeighborLoader()" in str(loader)
Suggested change:
assert "DistNeighborLoader" in str(loader)
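A minimal sketch of why the looser check is preferable, using a stand-in class (not the real loader): the assertion then depends only on the class name, not on how the repr formats its arguments.

```python
# Stand-in class for illustration only; not the real DistNeighborLoader.
class DistNeighborLoader:
    def __repr__(self):
        # A repr of the form 'ClassName()'; argument formatting may change.
        return f'{self.__class__.__name__}()'

loader = DistNeighborLoader()
# Checking for the bare class name keeps passing even if the repr
# later gains arguments inside the parentheses.
assert 'DistNeighborLoader' in str(loader)
```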

@kgajdamo (Contributor):

Please add `DistNeighborLoader` to the `distributed/__init__.py` file.

rusty1s added a commit that referenced this pull request Nov 6, 2023
This code is part of the overall distributed training support for PyG.

Please be aware that this PR should be merged before the Loaders package! -
@JakubPietrakIntel
Loaders:
1.  #8079
2.  #8080
3.  #8085

Other PRs related to this module:
DistSampler: #7974

---------

Co-authored-by: JakubPietrakIntel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <[email protected]>
rusty1s added a commit that referenced this pull request Nov 10, 2023
**Changes made:**
- added support for temporal sampling
- use `torch.Tensor`s instead of numpy arrays
- moved `_sample_one_hop()` from `NeighborSampler` to `DistNeighborSampler`
- do not follow the disjoint flow in the `_sample()` function - not needed,
because the batch is calculated afterwards
- added tests for node sampling and disjoint sampling (works without
`DistNeighborLoader`)
- added tests for temporal node sampling (works without
`DistNeighborLoader`)
- minor changes such as renaming variables
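The temporal-sampling constraint added above can be illustrated with a toy snippet (hypothetical data and helper name, not the PR's implementation): a neighbor is only a valid candidate if its edge timestamp does not exceed the seed's time.

```python
# Illustrative-only sketch of the temporal constraint; the data and
# the helper name are made up, not taken from the PR.
edges = {  # node -> [(neighbor, edge_time), ...]
    0: [(1, 5), (2, 12), (3, 7)],
    1: [(0, 3)],
}

def temporal_neighbors(node, seed_time):
    """Return only the neighbors whose edge time does not exceed the
    seed time, so no batch can 'see into the future'."""
    return [nbr for nbr, t in edges.get(node, []) if t <= seed_time]

print(temporal_neighbors(0, seed_time=8))  # [1, 3]
```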

This PR is based on #8083, so both must be combined to pass the tests.

Other distributed PRs:
#8083 
#8080 
#8085

---------

Co-authored-by: Matthias Fey <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@pytest.mark.parametrize('num_parts', [2])
@pytest.mark.parametrize('num_workers', [0])
@pytest.mark.parametrize('async_sampling', [True])
@pytest.mark.skip(reason="Breaks with no attribute 'num_hops'")
@JakubPietrakIntel This test currently breaks for me. I am disabling it for now.


codecov bot commented Nov 10, 2023

Codecov Report

Merging #8080 (ed73dcf) into master (9ea2233) will increase coverage by 1.24%.
The report is 1 commit behind head on master.
The diff coverage is 78.76%.

@@            Coverage Diff             @@
##           master    #8080      +/-   ##
==========================================
+ Coverage   87.17%   88.42%   +1.24%     
==========================================
  Files         473      474       +1     
  Lines       28757    28804      +47     
==========================================
+ Hits        25068    25469     +401     
+ Misses       3689     3335     -354     
Files Coverage Δ
torch_geometric/distributed/__init__.py 100.00% <100.00%> (ø)
torch_geometric/distributed/local_graph_store.py 96.93% <100.00%> (+1.06%) ⬆️
torch_geometric/sampler/neighbor_sampler.py 92.02% <100.00%> (+5.55%) ⬆️
...orch_geometric/distributed/dist_neighbor_loader.py 95.45% <95.45%> (ø)
torch_geometric/loader/node_loader.py 95.45% <77.77%> (-0.89%) ⬇️
...rch_geometric/distributed/dist_neighbor_sampler.py 64.09% <70.83%> (+64.09%) ⬆️

... and 8 files with indirect coverage changes


@rusty1s rusty1s merged commit 40cc3b1 into master Nov 10, 2023
@rusty1s rusty1s deleted the intel/dist-neighbor-loader branch November 10, 2023 12:19
rusty1s added a commit that referenced this pull request Nov 10, 2023
**[3/3] Distributed Loaders PRs**
This PR includes `DistributedLinkNeighborLoader`, used for processing
edge sampler output in a distributed training setup.


1.  #8079
2.  #8080
3.  #8085

Other PRs related to this module:
DistSampler: #7974
GraphStore/FeatureStore: #8083

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <[email protected]>
else: # Tuple[FeatureStore, GraphStore]
data = Data(
@rusty1s I think this code was to add the features to the data.

@rusty1s (Member) commented Nov 13, 2023:

Where are `metadata[-3]`, `metadata[-2]` and `metadata[-1]` populated? The sampler does not have access to node features, so I got confused by this part and dropped it.

@kgajdamo (Contributor) commented Nov 13, 2023:

This info is added to the `SamplerOutput` metadata after sampling and collecting features from all machines, in the `DistSampler` `_collate_fn()`.

@JakubPietrakIntel (Contributor, Author) commented Nov 14, 2023:

The output of `_collate_fn()` includes `nfeats`, `nlabels`, `efeats`, which are our `metadata[-3]`, `metadata[-2]` and `metadata[-1]`. Here: L717C1-L718C22
I will add a small PR to revert these lines.


@rusty1s
PR with the fix: #8377


@rusty1s It is different from the single-node case: in the distributed case, after node sampling there needs to be an extra step that puts the labels, nfeats, and efeats gathered from different machines into the PyG data format for the loader.
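A toy sketch of that extra assembly step, with made-up data and names (not PyG's `_collate_fn`): features gathered from each machine's shard are stitched back together in sampled-node order before being handed to the loader.

```python
# Illustrative-only sketch (not PyG's _collate_fn): after sampling,
# node features gathered from each machine are stitched back together
# in the order of the sampled node ids.
feature_parts = [            # machine -> {node: feature}; toy data
    {0: [1.0], 2: [3.0]},    # machine 0 owns even nodes
    {1: [2.0], 3: [4.0]},    # machine 1 owns odd nodes
]

def collate(sampled_nodes, labels):
    """Assemble per-machine feature shards into one batch record,
    mirroring the extra step needed in the distributed case."""
    nfeats = [feature_parts[n % 2][n] for n in sampled_nodes]
    return {"node": sampled_nodes, "x": nfeats, "y": labels}

batch = collate([0, 3, 1], labels=[0, 1, 0])
print(batch["x"])  # [[1.0], [4.0], [2.0]]
```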
