-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better handle cudf.pandas in from_pandas_edgelist
#4525
Better handle cudf.pandas in from_pandas_edgelist
#4525
Conversation
Optimistically use cupy, but fall back to numpy if necessary
src_array = df[source].to_numpy() | ||
dst_array = df[target].to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy=False
is currently the default here, but should we add copy=False
to be clear about potential data movement? I wonder whether pandas will eventually follow numpy 2 semantics and use copy=None
to only copy if necessary and copy=False
to raise if a copy is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add copy=False to be clear about potential data movement?
That seems like a good idea to me, unless there's a reason you think why it might be better not to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPy 2 now has these semantics: https://numpy.org/doc/stable/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
src_array = df[source].to_numpy() | ||
dst_array = df[target].to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add copy=False to be clear about potential data movement?
That seems like a good idea to me, unless there's a reason you think why it might be better not to.
/merge |
src_indices = cp.array(src_array) | ||
dst_indices = cp.array(dst_array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this may perform an extra/unnecessary copy if cp.asarray
above performed a copy. Maybe we should figure out best practice for determining whether a copy was made or not and avoid do extra copies.
This continues #4525 (and [this comment](#4525 (comment))) to avoid copies and to be more optimal whether using pandas, cudf, or cudf.pandas. Notably, using `s.to_numpy` with cudf will return a _numpy_ array, but `cudf.pandas` may return a _cupy_ array (proxy). Also, `s.to_numpy(copy=False)` ([from comment](#4525 (comment))) is not used, b/c cudf's `to_numpy` raises if `copy=False`. We get the behavior we want by not specifying `copy=`. I don't know if this is the best way to determine whether a copy occurred or not, but this seems like a useful pattern to establish, because we want to make ingest more efficient. CC @rlratzel Authors: - Erik Welch (https://github.com/eriknw) - Ralph Liu (https://github.com/nv-rliu) Approvers: - Rick Ratzel (https://github.com/rlratzel) URL: #4528
Optimistically use cupy, but fall back to numpy if necessary.
Also, bump lint versions.
CC @rlratzel