Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add vector properties #2882

Merged
merged 16 commits into from
Nov 18, 2022
Merged

Conversation

eriknw
Copy link
Contributor

@eriknw eriknw commented Nov 3, 2022

WIP. This probably closes #2721.

This uses cudf List dtypes to store vectors. When converting a vector column to arrow, the data appears to be on host, so it's unclear how many copies and moves of the data we're doing, but I don't think we have many easy alternatives besides relying on what cudf gives us. In pandas, vector properties are object dtype and stored as numpy arrays.

I think it makes sense for a vector property to be required to be the same length (i.e., if it's added multiple times).

We may want to add a method to convert a vector property to a numpy or cupy array.

  • How do we handle null values? Raise? Set to 0? Ignore/skip?
  • Should users be able to specify numpy or cupy array (host or device)?

When getting data, should we allow vector properties to be expanded?

Can we create a graph with vector property data?

Should we add a keyword argument to add_vertex_data to say "use all (or the rest) of columns as a vector property of this name"?

Should we allow vector properties to come in already as cupy List dtype?

@eriknw eriknw requested a review from a team as a code owner November 3, 2022 06:18
@BradReesWork BradReesWork added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 3, 2022
@BradReesWork BradReesWork added this to the 22.12 milestone Nov 3, 2022
@BradReesWork
Copy link
Member

@eriknw the code looks good, but I do not see any test associated with the changes

@rlratzel rlratzel marked this pull request as draft November 3, 2022 19:15
@codecov-commenter
Copy link

codecov-commenter commented Nov 4, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.12@403ee20). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.12    #2882   +/-   ##
===============================================
  Coverage                ?   60.35%           
===============================================
  Files                   ?      122           
  Lines                   ?     7158           
  Branches                ?        0           
===============================================
  Hits                    ?     4320           
  Misses                  ?     2838           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@eriknw eriknw marked this pull request as ready for review November 9, 2022 05:01
@alexbarghi-nv
Copy link
Member

@eriknw here are my responses to your questions:

  1. Vector properties should be fixed length; if attempting to update an existing vertex property, the length must match or an error should be thrown.
  2. Null values should be set to 0.
  3. Users should have the choice of a numpy or cupy array, but we can leave numpy support to a future PR.
  4. Could you clarify what you mean by "expanding" a vector property?
  5. Yes, and could we make that on by default with the default name as "x"?
  6. Yes, but let's do that in a future PR.

@eriknw
Copy link
Contributor Author

eriknw commented Nov 14, 2022

Very helpful @alexbarghi-nv, thank you. My replies:

  1. Vector properties should be fixed length; if attempting to update an existing vertex property, the length must match or an error should be thrown.

Perfect. This is as implemented.

  1. Null values should be set to 0.

I added a fillvalue=None argument. I allow this to be skipped though if data is expected to exist in order to be faster.

  1. Users should have the choice of a numpy or cupy array, but we can leave numpy support to a future PR.

K. I'm happy punting on this too and to give users the native object.

  1. Could you clarify what you mean by "expanding" a vector property?

"Expand" to return a dataframe instead of an array, which can kind of match how data comes into PropertyGraph. I don't think this is necessary, since it's easy to go from an array to a dataframe.

  1. Yes, and could we make that on by default with the default name as "x"?

Sounds good, but I don't want it on by default. Users should specify they want a vector property.

I think I want to support this by adding another keyword vector_property=None, which can be a string for the vector property name. How should this behave though? Should it include properties specified in property_columns, or should it include every other property that isn't included in property_columns or vector_properties. Thoughts?

We could squeeze this functionality into vector_properties (i.e., by passing a string or True to default to "x"), but this complicates typing and is less clear imho.

Edit: upon further thought, I think the most sense for vector_property="x" is to use all other properties as per the original proposal.

  1. Yes, but let's do that in a future PR.

Agreed.

@eriknw
Copy link
Contributor Author

eriknw commented Nov 15, 2022

Alright, I think the API and functionality is complete and includes feedback to:

  • fill value upon creation of array from vector property
  • simplify creating vector properties from "the rest" of the columns

I would like to add a couple more tests to wrap this up.

@VibhuJawa VibhuJawa self-requested a review November 15, 2022 07:10
Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small change, looks good otherwise

Comment on lines +2062 to +2065
try:
fill = list(fillvalue)
except Exception:
fill = [fillvalue] * length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not use list here but numpy or cupy/array library here please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to use list here; it's the only way I could get it to work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, I think that's fine for now, we can revisit this when we start perf tuning if this becomes a problem

Copy link
Member

@alexbarghi-nv alexbarghi-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thanks for working on this Erik , I expect some speed up with this PR on the below workflow. #2925

@eriknw eriknw changed the title Begin adding vector properties Add vector properties Nov 16, 2022
@alexbarghi-nv
Copy link
Member

rerun tests

@BradReesWork
Copy link
Member

rerun tests

@BradReesWork
Copy link
Member

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a1f205c into rapidsai:branch-22.12 Nov 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support vector properties in PG
5 participants