Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Support specifying filters and projections with Substrait expressions #33985

Closed
ianmcook opened this issue Feb 1, 2023 · 4 comments · Fixed by #34834
Closed

[C++] Support specifying filters and projections with Substrait expressions #33985

ianmcook opened this issue Feb 1, 2023 · 4 comments · Fixed by #34834

Comments

@ianmcook
Copy link
Member

ianmcook commented Feb 1, 2023

Describe the enhancement requested

In addition to representing full plans, Substrait can also be used to represent expressions (see substrait-io/substrait#405). It would be nice if Dataset and Acero could consume Substrait expressions and use them to specify filters and projections.

I would love to see us expose functions that:

  • Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
  • Receive a list of Substrait scalar expressions and use them to Project a Dataset
  • Receive a Boolean-valued Substrait scalar expression and use it to add a Filter node to the ExecPlan
  • Receive a list of Substrait scalar expressions and use it to add a Project node to the ExecPlan

Component(s)

C++

@ianmcook
Copy link
Member Author

ianmcook commented Feb 1, 2023

This plus substrait-io/substrait-java#128 would allow users to specify filters and projections as SQL expressions and execute them with Dataset and Acero.

@westonpace
Copy link
Member

The Arrow equivalent of Substrait's expression is arrow::compute::Expression. So I think an extended expression proto file would roughly translate to std::vector<std::pair<std::string, arrow::compute::Expression>> (not actually suggesting we use this API, just describing).

If we added an API for that then those compute expressions could be used when building Acero filters & projects.

Note: this sort of implies the user is not using Substrait to express their actual queries. This is fine, we have non-Substrait APIs for filter & project in pyarrow (e.g. Array.filter) and R (dplyr) so there is certainly room for it.

@ianmcook
Copy link
Member Author

ianmcook commented Feb 17, 2023

@westonpace to clarify: with the implementation you envision, could this also give us the ability to pass Substrait expressions to arrow::dataset::ScannerBuilder::Filter() and arrow::dataset::ScannerBuilder::Project(), correct? That would be very, very cool indeed.

@westonpace
Copy link
Member

westonpace commented Feb 17, 2023

You would need to use this variant of Project() but otherwise yes.

westonpace added a commit that referenced this issue Aug 22, 2023
…ssions (#34834)

### Rationale for this change

Substrait provides a library-independent way to represent compute expressions.  By serializing and deserializing pyarrow compute expression to substrait we can allow interoperability with other libraries.

Originally it was thought this would not be needed because users would be sending entire query plans (which contain expressions) back and forth and so there was no need to work with expressions by themselves.

However, as more and more APIs and integration points emerge it turns out there are situations where serializing expressions by themselves is useful.  For example, the proposed datasets protocol, or for the Java JNI datasets implementation (which uses Arrow-C++'s datasets)

### What changes are included in this PR?

In Arrow-C++ we add two new methods to serialize and deserialize a collection of named, bound expressions to Substrait's ExtendedExpression message.

In pyarrow we expose these two methods and also add utility methods to pyarrow.compute.Expression to convert a single expression to/from substrait (these will be encoded as an ExtendedExpression message with one expression named "expression")

In addition, this PR exposed that we do not have very many bindings for arrow-functions to substrait-functions (previous work has mostly focused on the reverse direction).  This PR adds many (though not all) new bindings.

In addition, this PR adds ToProto for cast and both FromProto and ToProto support for the SingularOrList expression type (we convert is_in to SingularOrList and convert SingularOrList to an or list).

This should provide support for all the sargable operators except between (there is no Arrow-C++ between function) and like (we still don't have arrow->substrait bindings for the string functions) which should be a sufficient set of expressions for a first release.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

There are new features, as described above, but no backwards incompatible changes.

### Caveats

There are a fair number of minor inconsistencies or surprises, many of which can be smoothed over by follow-up work.

#### Bound Expressions

Arrow-C++ has long had a distinction between "unbound expressions" (e.g. `a + b`) and "bound expressions" (e.g. `a:i32 + b:i32`).  A bound expression is an expression that has been bound to a schema of some kind.  Field references are resolved and the output type is known for every node of the AST.

Pyarrow has hidden this complexity and most pyarrow compute expressions that the user encounters will be unbound expressions.  Substrait is only capable (currently) of representing bound expressions.  As a result, in order to serialize expressions, the user will need to provide an input schema.  This can be an inconvenience for some workflows.  To resolve this, I would like to eventually add support for unbound expressions to Substrait (substrait-io/substrait#515)

Another minor annoyance of bound expressions is that an unbound pyarrow.compute.Expression object will not be equal to a bound pyarrow.compute.Expression object.  It would make testing easier if we had a `pyarrow.compute.Expression.equals` variant that did not examine bound fields.

#### Named field references

Pyarrow datasets users are used to working with named field references.  For example, one can set a filter `pc.equal(ds.field("x"), 7)`.  Substrait, since it requires everything to be bound, considers named references to be superfluous and does everything in terms of numeric indices into the base schema.  So the above expression, after round tripping, would become something like `pc.equal(ds.field(3), 7)` (assuming `"x"` is at index `3` in the schema used for serialization).  This is something that can be overcome in the future if Substrait adds support for unbound expressions.  Or, if that doesn't happen, it could still be implemented as a Substrait expression hint (this would allow named references to be used even if the user wants to work with bound expressions).

#### UDFs

UDFs ARE supported by this PR.  This covers both "builtin arrow functions that do not exist in substrait (e.g. shift_left)" and "custom UDFs added with `register_scalar_function`".  By default, UDFs will not be allowed when converting to Substrait because the resulting message would not be portable (e.g. you can't expect an external system to know about your custom UDFs).  However, you can set the `allow_udfs` flag to True and these will be allowed.  The Substrait representation will have the URI `urn:arrow:substrait_simple_extension_function`.

**Options**: Although UDFs are allowed we do not yet support UDFs that take function options.  These are trickier to convert to Substrait (though it should be possible in the future if someone is motivated enough).

#### Rough Edges

There are a few corner cases:

 * The function `is_in` converts to Substrait's `SingularOrList`.  On conversion back to Arrow this becomes an or list.  In other words, the function `is_in(5, [1, 2, 5])` converts to `5 == 1 || 5 == 2 || 5 == 5`.  This is because Substrait's or list is more expression and allows things like `5 == field_ref(0) || 5 == 7` which cannot be expressed as an `is_in` function.
 * Arrow functions can either be converted to Substrait or are considered UDFs.  However, there are a small number of functions which can "sometimes" be converted to Substrait depending on the function options.  At the moment I think this is only the `is_null` function.  The `is_null` function has an option `nan_is_null` which will allow you to consider `NaN` as a null value.  Substrait has no single function that evaluates both `NULL` and `NaN` as true.  In the meantime you can use `is_null || is_nan`.  In the future, should someone want to, they could add special logic to convert this case.
* Closes: #33985

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
@westonpace westonpace added this to the 14.0.0 milestone Aug 22, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
… expressions (apache#34834)

### Rationale for this change

Substrait provides a library-independent way to represent compute expressions.  By serializing and deserializing pyarrow compute expression to substrait we can allow interoperability with other libraries.

Originally it was thought this would not be needed because users would be sending entire query plans (which contain expressions) back and forth and so there was no need to work with expressions by themselves.

However, as more and more APIs and integration points emerge it turns out there are situations where serializing expressions by themselves is useful.  For example, the proposed datasets protocol, or for the Java JNI datasets implementation (which uses Arrow-C++'s datasets)

### What changes are included in this PR?

In Arrow-C++ we add two new methods to serialize and deserialize a collection of named, bound expressions to Substrait's ExtendedExpression message.

In pyarrow we expose these two methods and also add utility methods to pyarrow.compute.Expression to convert a single expression to/from substrait (these will be encoded as an ExtendedExpression message with one expression named "expression")

In addition, this PR exposed that we do not have very many bindings for arrow-functions to substrait-functions (previous work has mostly focused on the reverse direction).  This PR adds many (though not all) new bindings.

In addition, this PR adds ToProto for cast and both FromProto and ToProto support for the SingularOrList expression type (we convert is_in to SingularOrList and convert SingularOrList to an or list).

This should provide support for all the sargable operators except between (there is no Arrow-C++ between function) and like (we still don't have arrow->substrait bindings for the string functions) which should be a sufficient set of expressions for a first release.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

There are new features, as described above, but no backwards incompatible changes.

### Caveats

There are a fair number of minor inconsistencies or surprises, many of which can be smoothed over by follow-up work.

#### Bound Expressions

Arrow-C++ has long had a distinction between "unbound expressions" (e.g. `a + b`) and "bound expressions" (e.g. `a:i32 + b:i32`).  A bound expression is an expression that has been bound to a schema of some kind.  Field references are resolved and the output type is known for every node of the AST.

Pyarrow has hidden this complexity and most pyarrow compute expressions that the user encounters will be unbound expressions.  Substrait is only capable (currently) of representing bound expressions.  As a result, in order to serialize expressions, the user will need to provide an input schema.  This can be an inconvenience for some workflows.  To resolve this, I would like to eventually add support for unbound expressions to Substrait (substrait-io/substrait#515)

Another minor annoyance of bound expressions is that an unbound pyarrow.compute.Expression object will not be equal to a bound pyarrow.compute.Expression object.  It would make testing easier if we had a `pyarrow.compute.Expression.equals` variant that did not examine bound fields.

#### Named field references

Pyarrow datasets users are used to working with named field references.  For example, one can set a filter `pc.equal(ds.field("x"), 7)`.  Substrait, since it requires everything to be bound, considers named references to be superfluous and does everything in terms of numeric indices into the base schema.  So the above expression, after round tripping, would become something like `pc.equal(ds.field(3), 7)` (assuming `"x"` is at index `3` in the schema used for serialization).  This is something that can be overcome in the future if Substrait adds support for unbound expressions.  Or, if that doesn't happen, it could still be implemented as a Substrait expression hint (this would allow named references to be used even if the user wants to work with bound expressions).

#### UDFs

UDFs ARE supported by this PR.  This covers both "builtin arrow functions that do not exist in substrait (e.g. shift_left)" and "custom UDFs added with `register_scalar_function`".  By default, UDFs will not be allowed when converting to Substrait because the resulting message would not be portable (e.g. you can't expect an external system to know about your custom UDFs).  However, you can set the `allow_udfs` flag to True and these will be allowed.  The Substrait representation will have the URI `urn:arrow:substrait_simple_extension_function`.

**Options**: Although UDFs are allowed we do not yet support UDFs that take function options.  These are trickier to convert to Substrait (though it should be possible in the future if someone is motivated enough).

#### Rough Edges

There are a few corner cases:

 * The function `is_in` converts to Substrait's `SingularOrList`.  On conversion back to Arrow this becomes an or list.  In other words, the function `is_in(5, [1, 2, 5])` converts to `5 == 1 || 5 == 2 || 5 == 5`.  This is because Substrait's or list is more expression and allows things like `5 == field_ref(0) || 5 == 7` which cannot be expressed as an `is_in` function.
 * Arrow functions can either be converted to Substrait or are considered UDFs.  However, there are a small number of functions which can "sometimes" be converted to Substrait depending on the function options.  At the moment I think this is only the `is_null` function.  The `is_null` function has an option `nan_is_null` which will allow you to consider `NaN` as a null value.  Substrait has no single function that evaluates both `NULL` and `NaN` as true.  In the meantime you can use `is_null || is_nan`.  In the future, should someone want to, they could add special logic to convert this case.
* Closes: apache#33985

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Weston Pace <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants