Support vector properties in PG #2721
Comments
Question - should we have the concept of a "property" and a "feature", or is everything just a "property"? |
Question - overload add_node and add_edge API to specify where to save, or create new API for adding features? |
@VibhuJawa @alexbarghi-nv |
That sounds good to me. |
If we want to be able to return a vector property as an ndarray, do we need to add new methods such as:
PR #2882 currently saves vector properties as |
How about we have a function like
|
Yeah, that seems like the most natural API, so it's fair to ask "why not that?". First, it's trivial to convert a series with a normal dtype to a cupy or numpy array via Second, it's helpful to know the length of the vector property. We store this info in PropertyGraph for each vector property by column name. Without this, we would need to compute the length of the vector property using the offsets array of the ListDtype column (if we don't need to compute, then creating the array is virtually free--it's zero-copy). It's also useful to know the length of the vector for Dask-enabled PropertyGraph so that we know the number of columns of the returned array. I think we have three possible styles for the API when the user wants to convert a vector column to an array:
I think (3) may be the best for most users: they don't need to keep track of the vector property lengths, the operation to convert to an array should be wicked-fast, and having matching names probably matches their workflow. I'm not a user, though, so I'm open to direction as long as the tradeoffs are understood. |
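To make the offsets point above concrete, here is a minimal pure-Python sketch of how a vector length could be inferred from an Arrow-style list layout, where row `i` spans `flat_values[offsets[i]:offsets[i + 1]]`. The function names `fixed_list_width` and `to_rows` are hypothetical and are not part of cudf or cugraph; a real implementation would work on device arrays rather than Python lists.

```python
from typing import List, Sequence


def fixed_list_width(offsets: Sequence[int]) -> int:
    """Return the common row length implied by an Arrow-style offsets array.

    Hypothetical helper: raises ValueError if rows differ in length.
    """
    if len(offsets) < 2:
        raise ValueError("need at least one row to infer a width")
    width = offsets[1] - offsets[0]
    # Every consecutive pair of offsets must be the same distance apart.
    for start, stop in zip(offsets[1:], offsets[2:]):
        if stop - start != width:
            raise ValueError("rows have different lengths")
    return width


def to_rows(flat_values: Sequence[float], offsets: Sequence[int]) -> List[List[float]]:
    """Reshape the flat child values into equal-length rows (copies the data)."""
    width = fixed_list_width(offsets)
    base = offsets[0]
    n_rows = len(offsets) - 1
    return [
        list(flat_values[base + i * width: base + (i + 1) * width])
        for i in range(n_rows)
    ]
```

If the length is already recorded per column name, as the comment describes for PropertyGraph, the scan in `fixed_list_width` is unnecessary and only the reshape remains.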
Gotcha, this helps put things in perspective. With I think |
~WIP.~ This ~probably~ closes #2721.

This uses cudf List dtypes to store vectors. When converting a vector column to arrow, the data appears to be on host, so it's unclear how many copies and moves of the data we're doing, but I don't think we have many easy alternatives besides relying on what cudf gives us.

In pandas, vector properties are object dtype and stored as numpy arrays.

I think it makes sense for a vector property to be required to be the same length (i.e., if it's added multiple times).

We may want to add a method to convert a vector property to a numpy or cupy array.
- How do we handle null values? Raise? Set to 0? Ignore/skip?
- Should users be able to specify numpy or cupy array (host or device)?

When getting data, should we allow vector properties to be expanded?

Can we create a graph with vector property data?

Should we add a keyword argument to `add_vertex_data` to say "use all (or the rest) of columns as a vector property of this name"?

Should we allow vector properties to come in already as cupy List dtype?

Authors:
- Erik Welch (https://github.com/eriknw)

Approvers:
- Brad Rees (https://github.com/BradReesWork)
- Alex Barghi (https://github.com/alexbarghi-nv)
- Vibhu Jawa (https://github.com/VibhuJawa)

URL: #2882
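The open question about null values (raise, set to 0, or skip) can be illustrated with a small pure-Python sketch. Everything here is an assumption for illustration: the function `stack_vectors`, its `missing` and `fill_value` parameters, and the use of `None` for a missing vector are hypothetical and do not reflect what PR #2882 implements.

```python
from typing import List, Optional, Sequence


def stack_vectors(
    vectors: Sequence[Optional[Sequence[float]]],
    length: int,
    missing: str = "raise",
    fill_value: float = 0.0,
) -> List[List[float]]:
    """Stack per-row vectors into a dense 2-D list.

    missing="raise": error on a missing vector.
    missing="fill":  replace a missing vector with `fill_value` repeated.
    missing="skip":  drop the row (output then has fewer rows than input).
    """
    if missing not in ("raise", "fill", "skip"):
        raise ValueError(f"unknown missing policy: {missing!r}")
    rows: List[List[float]] = []
    for i, vec in enumerate(vectors):
        if vec is None:
            if missing == "raise":
                raise ValueError(f"row {i} has no vector")
            if missing == "fill":
                rows.append([fill_value] * length)
            continue  # "skip" (and "fill", already handled) move on
        if len(vec) != length:
            # Mirrors the proposal that a vector property has one fixed length.
            raise ValueError(f"row {i} has length {len(vec)}, expected {length}")
        rows.append(list(vec))
    return rows
```

One tradeoff the sketch makes visible: "skip" changes the number of rows, so the result no longer lines up with the vertex or edge ids unless those are filtered the same way.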
This would be used ideally for storing individual properties that consist of arrays of values.