Support vector properties in PG #2721

rlratzel · 2022-09-23T16:52:29Z

This would be used ideally for storing individual properties that consist of arrays of values.

BradReesWork · 2022-10-27T19:43:46Z

Question - should we have the concept of a "property" and a "feature", or is everything just a "property"?

BradReesWork · 2022-10-27T19:45:47Z

Question - overload add_node and add_edge API to specify where to save, or create new API for adding features?

rlratzel · 2022-11-03T19:58:38Z

@VibhuJawa @alexbarghi-nv
Do you have any thoughts about what the return type for retrieving property data for vector properties should be? For instance, one thought would be to return a numpy ndarray for data already on host (ie. stored using a pandas DF internally) and a cupy ndarray for cudf-stored data.

alexbarghi-nv · 2022-11-03T20:14:16Z

That sounds good to me.

eriknw · 2022-11-03T21:29:14Z

If we want to be able to return a vector property as an ndarray, do we need to add new methods such as:

pg.get_vertex_vector_property(name, vertex_ids=vertex_ids, types=types)
or pg.convert_vector_property_to_array(df, col) or pg.convert_vector_property_to_array(df[col]) ?

PR #2882 currently saves vector properties as cudf.ListDtype columns in cudf or object dtype of numpy arrays in pandas, so it's easiest (right now anyway) to simply return these as they are in the dataframes. But, we probably want to make it easier to make arrays from these, so when and how will a user want to convert vector properties to arrays?

VibhuJawa · 2022-11-04T15:17:45Z

#2882

If we want to be able to return a vector property as an ndarray

How about we have a function like pg.convert_property_to_array(df[col]) ?

if df[col] is a vector property return nd-array
if df[col] is a scaler property return a 1d array

eriknw · 2022-11-07T04:00:04Z

How about we have a function like pg.convert_property_to_array(df[col]) ?

Yeah, that seems like the most natural API, so it's fair to ask "why not that?".

First, it's trivial to convert a series with a normal dtype to a cupy or numpy array via s.to_cupy() or s.to_numpy(), so I don't know how useful it would be to support that with an extra API.

Second, it's helpful to know the length of the vector property. We store this info in PropertyGraph for each vector property by column name. Without this, we would need to compute the length of the vector property using the offsets array of the ListDtype column (if we don't need to compute, then creating the array is virtually free--it's zero-copy). It's also useful to know the length of the vector for Dask-enabled PropertyGraph so that we know the number of columns of the returned array.

I think we have three possible styles for the API when the user wants to convert a vector column to an array:

we compute to determine the length of the vector
the user specifies the length of the vector
we keep track of the vector lengths, and the user needs to give us an object whose name (and size) matches a known vector property for edges or vertices

I think (3) may be the best for most users: they don't need to keep track of the vector property lengths, the operation to convert to an array should be wicked-fast, and having matching names probably matches their workflow.

I'm not a user, though, so I'm open to direction as long as the tradeoffs are understood.

VibhuJawa · 2022-11-07T19:58:40Z

Second, it's helpful to know the length of the vector property. We store this info in PropertyGraph for each vector property by column name. Without this, we would need to compute the length of the vector property using the offsets array of the ListDtype column (if we don't need to compute, then creating the array is virtually free--it's zero-copy). It's also useful to know the length of the vector for Dask-enabled PropertyGraph so that we know the number of columns of the returned array.

Gotcha, this helps put things in perspective.

With pg.convert_property_to_array(df[col]), I was thinking from a user perspective that they will not need to keep track when a property is a vector or not but i dont think the trade off of computing the size on the fly is worth it

I think pg.get_vertex_vector_property(name, vertex_ids=vertex_ids, types=types) makes sense then and the only change from a user perspective is that they should know if they are fetching a vector or a scaler . Thanks for the input.

~WIP.~ This ~probably~ closes #2721. This uses cudf List dtypes to store vectors. When converting a vector column to arrow, the data appears to be on host, so it's unclear how many copies and moves of the data we're doing, but I don't think we have many easy alternatives besides relying on what cudf gives us. In pandas, vector properties are object dtype and stored as numpy arrays. I think it makes sense for a vector property to be required to be the same length (i.e., if it's added multiple times). We may want to add a method to convert a vector property to a numpy or cupy array. - How do we handle null values? Raise? Set to 0? Ignore/skip? - Should users be able to specify numpy or cupy array (host or device)? When getting data, should we allow vector properties to be expanded? Can we create a graph with vector property data? Should we add a keyword argument to `add_vertex_data` to say "use all (or the rest) of columns as a vector property of this name"? Should we allow vector properties to come in already as cupy List dtype? Authors: - Erik Welch (https://github.com/eriknw) Approvers: - Brad Rees (https://github.com/BradReesWork) - Alex Barghi (https://github.com/alexbarghi-nv) - Vibhu Jawa (https://github.com/VibhuJawa) URL: #2882

rlratzel added this to the 22.12 milestone Sep 23, 2022

rlratzel assigned eriknw Sep 23, 2022

github-actions bot added the inactive-30d label Oct 23, 2022

rapidsai deleted a comment from github-actions bot Oct 27, 2022

eriknw mentioned this issue Nov 3, 2022

Add vector properties #2882

Merged

jarmak-nv removed the inactive-30d label Nov 3, 2022

rlratzel changed the title ~~Research extension arrays in PG~~ Support vector properties in PG Nov 3, 2022

rapids-bot bot closed this as completed in #2882 Nov 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support vector properties in PG #2721

Support vector properties in PG #2721

rlratzel commented Sep 23, 2022

BradReesWork commented Oct 27, 2022

BradReesWork commented Oct 27, 2022

rlratzel commented Nov 3, 2022

alexbarghi-nv commented Nov 3, 2022

eriknw commented Nov 3, 2022 •

edited

Loading

VibhuJawa commented Nov 4, 2022

eriknw commented Nov 7, 2022

VibhuJawa commented Nov 7, 2022

Support vector properties in PG #2721

Support vector properties in PG #2721

Comments

rlratzel commented Sep 23, 2022

BradReesWork commented Oct 27, 2022

BradReesWork commented Oct 27, 2022

rlratzel commented Nov 3, 2022

alexbarghi-nv commented Nov 3, 2022

eriknw commented Nov 3, 2022 • edited Loading

VibhuJawa commented Nov 4, 2022

eriknw commented Nov 7, 2022

VibhuJawa commented Nov 7, 2022

eriknw commented Nov 3, 2022 •

edited

Loading