Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the drop-duplicate functionality #4095

Merged

Conversation

jnke2016
Copy link
Contributor

@jnke2016 jnke2016 commented Jan 16, 2024

Our current python API leverages dask to implement the drop-duplicate functionality but it carries a lot of overhead as it draws a significant amount of host memory and results into a crash when processing large graphs (4+ billion edges).

This PR

  1. Leverages the CAPI to internally drop multi edges when creating the PLC graph.
  2. Deprecates the parameter multi which, when set to False, triggers the dask based drop-duplicate functionality
  3. Add flag do_expensive_check to check for NULL values in the edgelist

@jnke2016 jnke2016 requested a review from a team as a code owner January 16, 2024 22:02
@BradReesWork BradReesWork added this to the 24.02 milestone Jan 23, 2024
Comment on lines 89 to 93
warnings.warn(
"'multi' is deprecated as the removal of multi edges is "
"only supported when creating a multi-graph",
FutureWarning,
)
Copy link
Contributor

@rlratzel rlratzel Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing something: does this mean a user who wants a MultiGraph could also want to remove the multiple edges? Why would they want a MultiGraph without multiple edges? It seems like this option would be more useful for a Graph, but it's being deprecated here.

Copy link
Contributor Author

@jnke2016 jnke2016 Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing something: does this mean a user who wants a MultiGraph could also want to remove the multiple edges

No, It means that the only way to drop multi edges is by creating a multi-graph. In fact prior to that, we were using the flag multi in the symmetrize function to drop multi-edges (if creating a directed or undirected Graph) through the _memory_efficient_drop_duplicates function which is dask based. But we can't just get rid of that flag because some users might still have it in their code base if they were directly calling the symmetrize function directly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks. I think my concerns will go away when Vibhu's feedback is addressed.

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good mostly apart from some changes around docstrings.

Can we also quickly also run the graph creation benchmarks here too see how perf may change.

https://github.com/rapidsai/cugraph/blob/branch-24.02/benchmarks/cugraph/standalone/cugraph_graph_creation.py

Comment on lines 57 to 60
multi : bool, optional (default=False)
Deprecated.
Set to True if graph is a Multi(Di)Graph. This allows multiple
edges instead of dropping them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also mention the warning statement along with Deprecated, otherwise it is confusing for the user.

Suggested change
multi : bool, optional (default=False)
Deprecated.
Set to True if graph is a Multi(Di)Graph. This allows multiple
edges instead of dropping them.
multi : bool, optional (default=False)
[Deprecated, Multi will be removed in future version, and the removal of multi
edges will no longer be supported from 'symmetrize'. Multi edges will be removed
upon creation of graph instance directly based on if the graph is
`curgaph.MultiGraph` or `cugraph.Graph`. ]
Set to True if graph is a Multi(Di)Graph. This allows multiple
edges instead of dropping them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also will suggest filing an issue for tracking this deprecation.

@@ -208,6 +224,7 @@ def symmetrize(
weights column name.

multi : bool, optional (default=False)
Deprecated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, just some clarification around if we are dropping duplicates for single GPU's only.

Comment on lines 523 to 525
edgelist_df = edgelist_df.groupby(
by=[*vertex_col_name], as_index=False
).min()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming, edgelist_df is a cuDF data frame right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is a cuDF dataframe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

if not self.properties.multi_edge:
# Drop parallel edges for non MultiGraph
# FIXME: Drop multi edges with the CAPI instead.
df = df.groupby(by=[*vertex_col_name], as_index=False).min()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here , assuming this a cuDF dataframe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is a cuDF dataframe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@@ -1176,10 +1176,12 @@ def test_extract_subgraph_vertex_prop_condition_only(
)
# Should result in two edges, one a "relationship", the other a "referral"
expected_edgelist = cudf.DataFrame(
{"src": [89216, 78634], "dst": [78634, 89216], "weights": [99, 8]}
{"src": [78634, 89216], "dst": [89216, 78634], "weights": [8, 99]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder why we flipped it around ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially removed the drop_duplicate call which was swapping the order of some columns. This expected result was set to match the actual_result. But this is no longer relevant since I am no longer updating this test file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this.

Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving assuming all of @VibhuJawa 's feedback gets addressed.

@jnke2016
Copy link
Contributor Author

jnke2016 commented Feb 1, 2024

just some clarification around if we are dropping duplicates for single GPU's only.

@VibhuJawa that's correct

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BradReesWork BradReesWork added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 1, 2024
@rlratzel
Copy link
Contributor

rlratzel commented Feb 2, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3d52f17 into rapidsai:branch-24.02 Feb 2, 2024
109 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants