[FEA] Set index to `_EDGE_ID_` and `_VERTEX_` for `_vertex_prop_dataframe` and `_edge_prop_dataframe` to make sampling faster #2401

VibhuJawa · 2022-07-11T20:18:54Z

Describe the solution you'd like and any additional context

We should set the index to `_VERTEX_` for `_vertex_prop_dataframe` and to `_EDGE_ID_` for `_edge_prop_dataframe`, so that fetching rows by ID during sampling is fast.

Motivating example, where we see a 3x speedup when fetching a batch of 50k node IDs:

from cugraph.experimental import PropertyGraph
import numpy as np
import cudf

pg = PropertyGraph()
n_features = 100
n_rows = 10_000_000

# Build a vertex table with 10M rows and 100 float feature columns.
df = cudf.DataFrame({'node_id': np.arange(n_rows)})
for feat_id in range(n_features):
    df[f'feat_{feat_id}'] = np.ones(n_rows)
pg.add_vertex_data(df, vertex_col_name='node_id')

# Random batch of 50k vertex IDs to fetch.
node_ids_to_fetch = np.random.randint(100_000_000, size=50_000)

Without Index:

%%timeit
# Left-merge the requested IDs against the property table, then restore the request order.
node_ids_df = cudf.DataFrame({'_VERTEX_': node_ids_to_fetch, 'input_order': np.arange(len(node_ids_to_fetch))})
fetched_df = node_ids_df.merge(pg._vertex_prop_dataframe, on='_VERTEX_', how='left')
fetched_df = fetched_df.sort_values(by='input_order')
len(fetched_df)

57.9 ms ± 8.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

With index (3x faster):

# Build the index once, outside the timed cell.
df_with_index = pg._vertex_prop_dataframe.set_index('_VERTEX_')

%%timeit
fetched_df = df_with_index.loc[node_ids_to_fetch]

18.5 ms
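
As a sanity check that the two approaches are interchangeable, here is a minimal CPU-only sketch. It makes several assumptions: it uses pandas as a stand-in for cudf (the calls used here exist in both, but this says nothing about GPU timings), `vertex_df` is a hypothetical substitute for `pg._vertex_prop_dataframe`, and the requested IDs are drawn so that every one of them exists in the table. It only checks that the merge-then-sort path and the `set_index` + `.loc` path return the same rows in the same order.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pg._vertex_prop_dataframe (pandas, not cudf).
n_rows = 1_000
vertex_df = pd.DataFrame({
    "_VERTEX_": np.arange(n_rows, dtype="int64"),
    "feat_0": np.arange(n_rows, dtype="float64") * 0.5,
})

# Draw IDs that all exist in the table; duplicates are allowed.
rng = np.random.default_rng(0)
node_ids_to_fetch = rng.integers(0, n_rows, size=50)

# Approach 1: left merge, then restore the request order.
node_ids_df = pd.DataFrame({
    "_VERTEX_": node_ids_to_fetch,
    "input_order": np.arange(len(node_ids_to_fetch)),
})
merged = (
    node_ids_df.merge(vertex_df, on="_VERTEX_", how="left")
    .sort_values(by="input_order")
    .drop(columns="input_order")
    .reset_index(drop=True)
)

# Approach 2: set the index once, then look rows up by label.
df_with_index = vertex_df.set_index("_VERTEX_")
indexed = df_with_index.loc[node_ids_to_fetch].reset_index()

# Both approaches should yield identical frames.
pd.testing.assert_frame_equal(merged, indexed)
```

The index is built once and reused for every batch, so the per-batch work reduces to the `.loc` lookup; the merge path has to rebuild `node_ids_df` and re-sort on every call.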

alexbarghi-nv · 2022-08-05T19:19:02Z

I'm seeing a 10x speedup in my tests from setting the index and using `.loc` as shown here. Could we increase the priority of this?

eriknw · 2022-08-05T20:09:34Z

Wow, nice! Yup, I expect to start work on this today or Monday.

github-actions · 2022-09-17T19:06:16Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

~Currently, this only does SG version for #2401. MG is still TODO.~

Closes #2401

This also doesn't change anything user-facing (yet).

Authors:
- Erik Welch (https://github.com/eriknw)
- Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
- Rick Ratzel (https://github.com/rlratzel)

URL: #2523

VibhuJawa added the improvement (Improvement / enhancement to an existing function) and python labels Jul 11, 2022

rlratzel self-assigned this Jul 20, 2022

rlratzel added this to the 22.08 milestone Jul 20, 2022

rlratzel assigned eriknw and unassigned rlratzel Jul 28, 2022

jarmak-nv modified the milestones: 22.08, 22.10 Aug 8, 2022

eriknw added a commit to eriknw/cugraph that referenced this issue Aug 9, 2022

PropertyGraph set index to vertex and edge ids

2a6c9cf

Currently, this only does SG version for rapidsai#2401. MG is still TODO. This also doesn't change anything user-facing (yet).

eriknw mentioned this issue Aug 9, 2022

PropertyGraph set index to vertex and edge ids #2523

Merged

VibhuJawa added the PG label Aug 17, 2022

BradReesWork removed the improvement (Improvement / enhancement to an existing function) and python labels Aug 18, 2022

github-actions bot added the inactive-30d label Sep 17, 2022

BradReesWork removed the PG label Sep 20, 2022

rapids-bot bot closed this as completed in #2523 Sep 30, 2022
