Better handle cudf.pandas in `from_pandas_edgelist` #4525

eriknw · 2024-07-08T17:21:23Z

Optimistically use cupy, but fall back to numpy if necessary.

Also, bump lint versions.

eriknw · 2024-07-08T17:26:04Z

python/nx-cugraph/nx_cugraph/convert_matrix.py

    src_array = df[source].to_numpy()
    dst_array = df[target].to_numpy()


`copy=False` is currently the default here, but should we pass `copy=False` explicitly to be clear about potential data movement? I wonder whether pandas will eventually follow NumPy 2 semantics, where `copy=None` copies only if necessary and `copy=False` raises if a copy is necessary.

should we add copy=False to be clear about potential data movement?

That seems like a good idea to me, unless you can think of a reason why it might be better not to.

NumPy 2 now has these semantics: https://numpy.org/doc/stable/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword

rlratzel

Thanks!
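The `copy` keyword semantics discussed in this thread can be sketched with plain NumPy. This is a minimal illustration, not code from the PR: it only shows how `np.asarray` and `np.array(..., copy=False)` behave, and it branches on the installed NumPy major version because the `copy=False` behavior differs between NumPy 1.x and NumPy 2.

```python
import numpy as np

a = np.arange(5, dtype=np.int64)

# np.asarray returns the input itself when no conversion is needed.
same = np.asarray(a)
print("asarray returned the same object:", same is a)

# A dtype change cannot be done in place, so asarray silently copies.
converted = np.asarray(a, dtype=np.float64)
print("converted shares memory with a:", np.shares_memory(a, converted))

# copy=False means "never copy" in NumPy 2 (raises if a copy is required),
# but meant "copy only if needed" in NumPy 1.x.
numpy_major = int(np.__version__.split(".")[0])
try:
    np.array(a, dtype=np.float64, copy=False)
    raised = False
except ValueError:
    raised = True
print(f"NumPy {np.__version__}: copy=False raised -> {raised}")
```

If pandas were to adopt the same semantics, an explicit `copy=False` would turn a silent copy into an error, which is the trade-off being weighed above.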

rlratzel · 2024-07-08T17:50:05Z

python/nx-cugraph/nx_cugraph/convert_matrix.py

    src_array = df[source].to_numpy()
    dst_array = df[target].to_numpy()


should we add copy=False to be clear about potential data movement?

That seems like a good idea to me, unless you can think of a reason why it might be better not to.

rlratzel · 2024-07-08T17:52:11Z

/merge

eriknw · 2024-07-09T06:54:38Z

python/nx-cugraph/nx_cugraph/convert_matrix.py

        src_indices = cp.array(src_array)
        dst_indices = cp.array(dst_array)


Note that this may perform an extra, unnecessary copy if `cp.asarray` above already performed a copy. Maybe we should figure out the best practice for determining whether a copy was made, so that we can avoid doing extra copies.

@rlratzel
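One possible way to avoid the double copy raised in this comment is to copy only when the converted array still aliases its source. The sketch below uses NumPy as a stand-in so that it runs without a GPU; `owned_copy` is a hypothetical helper, and whether the same memory-sharing check is the right tool for CuPy device arrays is an assumption that would need to be verified.

```python
import numpy as np


def owned_copy(source, converted):
    """Return an array that does not alias `source`, copying at most once.

    Hypothetical helper for illustration; not part of nx-cugraph.
    """
    if np.shares_memory(source, converted):
        # The conversion was zero-copy, so make the one copy we need.
        return converted.copy()
    # The conversion already allocated fresh memory; do not copy again.
    return converted


src = np.arange(4, dtype=np.int32)

aliased = np.asarray(src)                # no conversion needed, no copy
fresh = np.asarray(src, dtype=np.int64)  # dtype change forces a copy

out1 = owned_copy(src, aliased)  # copies, because `aliased` is `src`
out2 = owned_copy(src, fresh)    # returns `fresh` unchanged
```

The design choice here is to pay for a memory-overlap check instead of an unconditional `copy()`, which is cheap for contiguous 1-D index arrays like the edge lists in this PR.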

This continues #4525 (and [this comment](#4525 (comment))) to avoid copies and to be more optimal whether using pandas, cudf, or cudf.pandas. Notably, using `s.to_numpy` with cudf will return a _numpy_ array, but `cudf.pandas` may return a _cupy_ array (proxy). Also, `s.to_numpy(copy=False)` ([from comment](#4525 (comment))) is not used, because cudf's `to_numpy` raises if `copy=False`. We get the behavior we want by not specifying `copy=`.

I don't know if this is the best way to determine whether a copy occurred or not, but this seems like a useful pattern to establish, because we want to make ingest more efficient.

CC @rlratzel

Authors:
- Erik Welch (https://github.com/eriknw)
- Ralph Liu (https://github.com/nv-rliu)

Approvers:
- Rick Ratzel (https://github.com/rlratzel)

URL: #4528
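The "optimistically use cupy, but fall back to numpy" idea, combined with leaving `copy=` unspecified, might look roughly like the sketch below. It is not the code merged in either PR: `column_to_array` is a hypothetical helper, the caught exception types are an assumption about how the device conversion could fail, and CuPy is treated as optional so the example also runs on a CPU-only machine.

```python
import pandas as pd

try:
    import cupy as cp  # optional; requires a CUDA-capable environment
except ImportError:
    cp = None


def column_to_array(column):
    """Prefer a device array; otherwise fall back to a host array.

    Hypothetical helper for illustration; not part of nx-cugraph.
    """
    if cp is not None:
        try:
            # Optimistic path: let CuPy consume the column directly.
            return cp.asarray(column)
        except (TypeError, ValueError):
            # Assumed failure modes; fall through to the host path.
            pass
    # `copy=` is deliberately not passed, so each backend applies its own
    # default instead of raising on an explicit copy=False.
    return column.to_numpy()


df = pd.DataFrame({"source": [0, 1, 2], "target": [1, 2, 0]})
src_array = column_to_array(df["source"])
dst_array = column_to_array(df["target"])
print(type(src_array).__module__, src_array.tolist(), dst_array.tolist())
```

Callers that need to know where the data ended up can inspect the type of the returned array rather than assuming host memory.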

Better handle cudf.pandas in from_pandas_edgelist

f2529ae

Optimistically use cupy, but fall back to numpy if necessary

eriknw requested a review from a team as a code owner July 8, 2024 17:21

github-actions bot added the python label Jul 8, 2024

eriknw added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change nx-cugraph labels Jul 8, 2024

rlratzel approved these changes Jul 8, 2024

rapids-bot bot merged commit 355efc1 into rapidsai:branch-24.08 Jul 8, 2024
131 checks passed

eriknw mentioned this pull request Jul 9, 2024

Further optimize from_pandas_edgelist with cudf #4528

Merged
