[ENH] MNMG WCC fails on 8 GPUs with Karate data set #1796

Closed
ChuckHastings opened this issue Aug 30, 2021 · 5 comments

@ChuckHastings
Collaborator

At first blush this appears to be due to having some partitions with no vertices and edges assigned to them.

This can be reproduced by running the C++ unit test for Karate on an 8 GPU system. I see the data partitioned on 8 GPUs as follows:

back from construct graph, comm_rank = 0, num local edges = 0, num local vertices = 0
back from construct graph, comm_rank = 1, num local edges = 4, num local vertices = 7
back from construct graph, comm_rank = 2, num local edges = 0, num local vertices = 6
back from construct graph, comm_rank = 3, num local edges = 11, num local vertices = 5
back from construct graph, comm_rank = 4, num local edges = 0, num local vertices = 4
back from construct graph, comm_rank = 5, num local edges = 3, num local vertices = 2
back from construct graph, comm_rank = 6, num local edges = 0, num local vertices = 6
back from construct graph, comm_rank = 7, num local edges = 9, num local vertices = 4

On a 4 GPU run I see the following partitioning:

back from construct graph, comm_rank = 0, num edges = 7, num vertices = 4
back from construct graph, comm_rank = 1, num edges = 7, num vertices = 9
back from construct graph, comm_rank = 2, num edges = 13, num vertices = 12
back from construct graph, comm_rank = 3, num edges = 27, num vertices = 9
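
The two logs above can be checked mechanically for empty partitions. The following is a minimal sketch, not part of cuGraph or its test suite: the helper names `parse_partition_log` and `empty_ranks` are hypothetical, and the regular expression assumes the log lines keep the exact wording shown above (with or without the word "local").

```python
import re

# Matches lines such as:
#   back from construct graph, comm_rank = 3, num local edges = 11, num local vertices = 5
# "local" is optional because the 4-GPU log above omits it.
LINE_RE = re.compile(
    r"comm_rank = (\d+), num (?:local )?edges = (\d+), num (?:local )?vertices = (\d+)"
)


def parse_partition_log(text):
    """Return {rank: (num_edges, num_vertices)} for every matching log line."""
    counts = {}
    for match in LINE_RE.finditer(text):
        rank, edges, vertices = (int(group) for group in match.groups())
        counts[rank] = (edges, vertices)
    return counts


def empty_ranks(counts):
    """Return (ranks with zero edges, ranks with zero vertices), each sorted."""
    no_edges = sorted(rank for rank, (edges, _) in counts.items() if edges == 0)
    no_vertices = sorted(rank for rank, (_, verts) in counts.items() if verts == 0)
    return no_edges, no_vertices


# The 8-GPU log copied from this comment.
LOG_8_GPU = """\
back from construct graph, comm_rank = 0, num local edges = 0, num local vertices = 0
back from construct graph, comm_rank = 1, num local edges = 4, num local vertices = 7
back from construct graph, comm_rank = 2, num local edges = 0, num local vertices = 6
back from construct graph, comm_rank = 3, num local edges = 11, num local vertices = 5
back from construct graph, comm_rank = 4, num local edges = 0, num local vertices = 4
back from construct graph, comm_rank = 5, num local edges = 3, num local vertices = 2
back from construct graph, comm_rank = 6, num local edges = 0, num local vertices = 6
back from construct graph, comm_rank = 7, num local edges = 9, num local vertices = 4
"""

counts = parse_partition_log(LOG_8_GPU)
no_edges, no_vertices = empty_ranks(counts)
print("ranks with no edges:", no_edges)
print("ranks with no vertices:", no_vertices)
```

On the 8-GPU log this reports ranks 0, 2, 4 and 6 as having no edges and rank 0 as having no vertices. On the 4-GPU log both lists come back empty, which matches the observation that only the 8-GPU run fails.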

I suspect other algorithms may fail in the same way when some partitions are empty.

ChuckHastings added the "? - Needs Triage" (Need team to review and classify) label Aug 30, 2021
ChuckHastings added the "improvement" (Improvement / enhancement to an existing function) label Aug 30, 2021
ChuckHastings added this to the 21.10 milestone Aug 30, 2021
@seunghwak
Contributor

This is strange as karate has been in the MG C++ tests for quite a while (https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/tests/components/mg_weakly_connected_components_test.cpp#L231) and I ran these tests many times on DGX.

It seems like something has changed, and I will take a look.

@seunghwak
Contributor

Is this the only test you see failing? When I tried on DGX, almost all of the C++ tests failed (at least the ones I tried). Let me rebuild the conda environment to see whether this is due to my environment.

@ChuckHastings
Collaborator Author

I only ran this test because it was failing in the Python testing. It is entirely possible that all of the tests are failing.

Some recent changes may have exposed this, or introduced it. CI doesn't test MNMG.

@seunghwak
Contributor

It seems like something is broken in our dependencies. All the C++ tests I tried failed even after I updated the conda environment. The CI for the PR with a few Louvain test compile-error fixes (#1797) also failed with seemingly unrelated test errors (https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cugraph/job/prb/job/cugraph-gpu-test/CUDA=11.0,GPU_LABEL=gpu-a100,LINUX_VER=centos7,PYTHON=3.7/683/testReport/).

@ChuckHastings
Collaborator Author

Closed by #1802
