Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support vector properties in PG #2721

Closed
rlratzel opened this issue Sep 23, 2022 · 8 comments · Fixed by #2882
Closed

Support vector properties in PG #2721

rlratzel opened this issue Sep 23, 2022 · 8 comments · Fixed by #2882
Assignees
Milestone

Comments

@rlratzel
Copy link
Contributor

This would be used ideally for storing individual properties that consist of arrays of values.

@rlratzel rlratzel added this to the 22.12 milestone Sep 23, 2022
@rapidsai rapidsai deleted a comment from github-actions bot Oct 27, 2022
@BradReesWork
Copy link
Member

Question - should we have the concept of a "property" and a "feature", or is everything just a "property"?

@BradReesWork
Copy link
Member

Question - overload add_node and add_edge API to specify where to save, or create new API for adding features?

@rlratzel
Copy link
Contributor Author

rlratzel commented Nov 3, 2022

@VibhuJawa @alexbarghi-nv
Do you have any thoughts about what the return type for retrieving property data for vector properties should be? For instance, one thought would be to return a numpy ndarray for data already on host (ie. stored using a pandas DF internally) and a cupy ndarray for cudf-stored data.

@rlratzel rlratzel changed the title Research extension arrays in PG Support vector properties in PG Nov 3, 2022
@alexbarghi-nv
Copy link
Member

That sounds good to me.

@eriknw
Copy link
Contributor

eriknw commented Nov 3, 2022

If we want to be able to return a vector property as an ndarray, do we need to add new methods such as:

  • pg.get_vertex_vector_property(name, vertex_ids=vertex_ids, types=types)
  • or pg.convert_vector_property_to_array(df, col) or pg.convert_vector_property_to_array(df[col]) ?

PR #2882 currently saves vector properties as cudf.ListDtype columns in cudf or object dtype of numpy arrays in pandas, so it's easiest (right now anyway) to simply return these as they are in the dataframes. But, we probably want to make it easier to make arrays from these, so when and how will a user want to convert vector properties to arrays?

@VibhuJawa
Copy link
Member

#2882

If we want to be able to return a vector property as an ndarray

How about we have a function like pg.convert_property_to_array(df[col]) ?

  1. if df[col] is a vector property return nd-array
  2. if df[col] is a scaler property return a 1d array

@eriknw
Copy link
Contributor

eriknw commented Nov 7, 2022

How about we have a function like pg.convert_property_to_array(df[col]) ?

Yeah, that seems like the most natural API, so it's fair to ask "why not that?".

First, it's trivial to convert a series with a normal dtype to a cupy or numpy array via s.to_cupy() or s.to_numpy(), so I don't know how useful it would be to support that with an extra API.

Second, it's helpful to know the length of the vector property. We store this info in PropertyGraph for each vector property by column name. Without this, we would need to compute the length of the vector property using the offsets array of the ListDtype column (if we don't need to compute, then creating the array is virtually free--it's zero-copy). It's also useful to know the length of the vector for Dask-enabled PropertyGraph so that we know the number of columns of the returned array.

I think we have three possible styles for the API when the user wants to convert a vector column to an array:

  1. we compute to determine the length of the vector
  2. the user specifies the length of the vector
  3. we keep track of the vector lengths, and the user needs to give us an object whose name (and size) matches a known vector property for edges or vertices

I think (3) may be the best for most users: they don't need to keep track of the vector property lengths, the operation to convert to an array should be wicked-fast, and having matching names probably matches their workflow.

I'm not a user, though, so I'm open to direction as long as the tradeoffs are understood.

@VibhuJawa
Copy link
Member

Second, it's helpful to know the length of the vector property. We store this info in PropertyGraph for each vector property by column name. Without this, we would need to compute the length of the vector property using the offsets array of the ListDtype column (if we don't need to compute, then creating the array is virtually free--it's zero-copy). It's also useful to know the length of the vector for Dask-enabled PropertyGraph so that we know the number of columns of the returned array.

Gotcha, this helps put things in perspective.

With pg.convert_property_to_array(df[col]), I was thinking from a user perspective that they will not need to keep track when a property is a vector or not but i dont think the trade off of computing the size on the fly is worth it

I think pg.get_vertex_vector_property(name, vertex_ids=vertex_ids, types=types) makes sense then and the only change from a user perspective is that they should know if they are fetching a vector or a scaler . Thanks for the input.

rapids-bot bot pushed a commit that referenced this issue Nov 18, 2022
~WIP.~ This ~probably~ closes #2721.

This uses cudf List dtypes to store vectors. When converting a vector column to arrow, the data appears to be on host, so it's unclear how many copies and moves of the data we're doing, but I don't think we have many easy alternatives besides relying on what cudf gives us.  In pandas, vector properties are object dtype and stored as numpy arrays.

I think it makes sense for a vector property to be required to be the same length (i.e., if it's added multiple times).

We may want to add a method to convert a vector property to a numpy or cupy array.
- How do we handle null values? Raise? Set to 0? Ignore/skip?
- Should users be able to specify numpy or cupy array (host or device)?

When getting data, should we allow vector properties to be expanded?

Can we create a graph with vector property data?

Should we add a keyword argument to `add_vertex_data` to say "use all (or the rest) of columns as a vector property of this name"?

Should we allow vector properties to come in already as cupy List dtype?

Authors:
  - Erik Welch (https://github.com/eriknw)

Approvers:
  - Brad Rees (https://github.com/BradReesWork)
  - Alex Barghi (https://github.com/alexbarghi-nv)
  - Vibhu Jawa (https://github.com/VibhuJawa)

URL: #2882
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants