Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Set index to _EDGE_ID_ and _VERTEX_ for _vertex_prop_dataframe and _edge_prop_dataframe to make sampling faster #2401

Closed
VibhuJawa opened this issue Jul 11, 2022 · 3 comments · Fixed by #2523
Assignees
Milestone

Comments

@VibhuJawa
Copy link
Member

Describe the solution you'd like and any additional context

We should set index to _EDGE_ID_ and _VERTEX_ for _vertex_prop_dataframe and _edge_prop_dataframe so that when we are fetching for sampling by ids we are fast.

Motivating Example where we see a 3x speed up for fetching a batch of 50k.

from cugraph.experimental import PropertyGraph
import numpy as np
import cudf

pg = PropertyGraph()
n_features = 100
n_rows = 10_000_000

df = cudf.DataFrame({'node_id':np.arange(n_rows)})
for feat_id in range(n_features):
    df[f'feat_{feat_id}'] = np.ones(n_rows)
pg.add_vertex_data(df,vertex_col_name='node_id')


node_ids_to_fetch = np.random.randint(100_000_000, size=50_000)

Without Index:

%%timeit
node_ids_df = cudf.DataFrame({'_VERTEX_':node_ids_to_fetch, 'input_order':np.arange(0,len(node_ids_to_fetch))})
fetched_df = node_ids_df.merge(pg._vertex_prop_dataframe, how='left')
fetched_df = fetched_df.sort_values(by='input_order')
len(fetched_df)
57.9 ms ± 8.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

With Index (3x faster) :

df_with_index =  pg._vertex_prop_dataframe.set_index('_VERTEX_')
%%timeit
fetched_df = df_with_index.loc[node_ids_to_fetch]
18.5 ms
@VibhuJawa VibhuJawa added improvement Improvement / enhancement to an existing function python labels Jul 11, 2022
@rlratzel rlratzel self-assigned this Jul 20, 2022
@rlratzel rlratzel added this to the 22.08 milestone Jul 20, 2022
@rlratzel rlratzel assigned eriknw and unassigned rlratzel Jul 28, 2022
@alexbarghi-nv
Copy link
Member

I'm seeing 10x speedup in my tests by setting the index and using .loc as shown here. Could we increase the priority of this?

@eriknw
Copy link
Contributor

eriknw commented Aug 5, 2022

Wow, nice! Yup, I expect to start work on this today or Monday.

@jarmak-nv jarmak-nv modified the milestones: 22.08, 22.10 Aug 8, 2022
eriknw added a commit to eriknw/cugraph that referenced this issue Aug 9, 2022
Currently, this only does SG version for rapidsai#2401.  MG is still TODO.

This also doesn't change anything user-facing (yet).
@VibhuJawa VibhuJawa added the PG label Aug 17, 2022
@BradReesWork BradReesWork removed improvement Improvement / enhancement to an existing function python labels Aug 18, 2022
@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Sep 30, 2022
~Currently, this only does SG version for #2401.  MG is still TODO.~

Closes #2401

This also doesn't change anything user-facing (yet).

Authors:
  - Erik Welch (https://github.com/eriknw)
  - Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)

URL: #2523
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants