-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add null input handling options for any_value
#652
feat: add null input handling options for any_value
#652
Conversation
extensions/functions_arithmetic.yaml
Outdated
- name: x | ||
value: any | ||
options: | ||
null_handling: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this from concat, does it make sense here or is better to add e.g. a boolean arg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think options make sense for this. The names ACCEPT_NULLS
and IGNORE_NULLS
sound a little weird to me, but I can't think of better ones currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added that original option to concat, but I agree it does sound a little weird. How about changing the option name to ignore_nulls
and have the options be ["True", "False"]
. I think True/False may need to be quoted. If i recall correctly I didn't do this originally because no other options were quoted, but that's changed since then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed in b18cecd - and quoted after, as that was needed to keep them as strings. Though it hurts my soul a bit to have a string "TRUE". Maybe should make them YES/NO instead to be less confusing 😅
extensions/functions_arithmetic.yaml
Outdated
@@ -1563,6 +1563,43 @@ aggregate_functions: | |||
values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ] | |||
nullability: DECLARED_OUTPUT | |||
return: fp64? | |||
- name: "first" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the window functions are called first/last_value
, should we stick to that naming?
Spark seems to have both first
and first_value
, though first_value is only supported in SQL while first is supported as a method.
DataFusion has first_value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at other engines, Postgres and Trino only have support for first_value
and last_value
as window functions, and don't have first
and last
aggregate functions.
I think it make sense to keep first/last and first_value/last_value as seperate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, looks like this is a mess overall - postgres has only first_value
as window, Spark has first
and first_value
being the same, duckdb has first
for aggregate and both for window, DataFusion has first_value
as aggregate...
The purpose of the functions is the same, though. I actually think it'd make sense for Substrait to only have one set of these (as aggregate), and maybe it should be the _value
option to match already existing any_value
. But dunno if we can remove the window versions, I guess that'd be a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed them to have the "_value" postfix now - small annoyance there is that if someone includes now both these and window functions they'll get duplicate signatures...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the base name has to be unique across all of the files. So having two first_value functions with the same signature will likely mess things up. (And yes, we need a check.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the base name has to be unique across all of the files
I just saw this comment. I believe this is inconsistent with both the spirit and spec of extensions. Extensions should allow people to create new extensions that completely conflict with other extensions. That's why the spec specifies that functions are identified by a combination of their name and URI. Two different URIs can define the same names as entirely different things since they are in different namespaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recent discussion on the topic:
IIRC, the general consensus is "yes, different filenames should be able to have duplicate functio names but we're not there yet, there isn't much motivation to tackle the issue, and keeping the core Substrait functions unique makes life easier in the short term"
Onne thing I'm not sure - is it better to have these as both window and aggregate functions, or could we remove the window function version and just replace with this? |
858438b
to
0d924e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this seems reasonable to me, but I'm curious to see what others think.
extensions/functions_arithmetic.yaml
Outdated
nullability: DECLARED_OUTPUT | ||
decomposable: MANY | ||
intermediate: any? | ||
return: any? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both of these functions should be in functions_aggregate_generic.yaml
instead of functions_arithmetic.yaml
extensions/functions_arithmetic.yaml
Outdated
- name: x | ||
value: any | ||
options: | ||
null_handling: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think options make sense for this. The names ACCEPT_NULLS
and IGNORE_NULLS
sound a little weird to me, but I can't think of better ones currently.
extensions/functions_arithmetic.yaml
Outdated
@@ -1563,6 +1563,43 @@ aggregate_functions: | |||
values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ] | |||
nullability: DECLARED_OUTPUT | |||
return: fp64? | |||
- name: "first" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at other engines, Postgres and Trino only have support for first_value
and last_value
as window functions, and don't have first
and last
aggregate functions.
I think it make sense to keep first/last and first_value/last_value as seperate.
@@ -35,3 +35,41 @@ aggregate_functions: | |||
value: any | |||
nullability: DECLARED_OUTPUT | |||
return: any? | |||
- name: "first" | |||
description: >- | |||
First value from a group of values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably worth noting that order matters for these two functions. Acero will reject plans that don't have a defined ordering on the input which might be a reasonable practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note c93c326!
@@ -35,3 +35,41 @@ aggregate_functions: | |||
value: any | |||
nullability: DECLARED_OUTPUT | |||
return: any? | |||
- name: "first" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call out to first_value with the difference to make it easier for folks to choose one or the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I understand, what do you see as the difference? My hope would be to replace the window versions with these completely, given that aggregate functions are also valid window functions, and the engines I looked at don't seem to make a difference between these - but maybe I missed something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only difference I see is what context they can be called in.
Okays, I revamped the PR a bit - now it moves the first_value and last_value from window funcs into aggregate funcs, and adds the null handling options. This makes most sense to me, but lmk what you think! |
extensions/functions_arithmetic.yaml
Outdated
nullability: DECLARED_OUTPUT | ||
decomposable: NONE | ||
return: any1 | ||
window_type: PARTITION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heads up, moving existing functions is a breaking change, and a relatively painful one to workaround. I would prefer if we could avoid breakage here.
See: #634 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the options are:
- move existing functions (breaking change
- duplicate existing functions (bad)
- add new functions with new name (leads to confusing end state as there's now two functions with same functionality but different name, unless we can deprecate the existing functions somehow)
For my need, I think both 1 and 3 work fine, so I don't have strong opinions - I guess it's a question of breaking now vs keeping a worse state for ever/until breaking later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We could also keep the functions in the same file, but move them into the aggregate functions block.
I guess it's a question of breaking now vs keeping a worse state for ever/until breaking later?
I think it would good to move these eventually, but it would be nice to do after substrait-java can handle duplicate functions in different files, because then we could make the change by duplicating the functions into the new file in one release, and then removing the old functions in the next release.
I've filed substrait-io/substrait-java#275 to track this work in substrait-java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the same file for now, fixing substrait-java, and then moving files works for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I thought "moving" meant moving also from window -> aggregation. Keeping them in the old file is fine for me -like fc3fa78 ?
I don't love it but I won't necessarily vote against it. It isn't supported by some significant engines (e.g. Postgres, SQL server, Snowflake, ). However, it does appear to be supported by My main problem is that
That plan is more expensive than |
By Databricks, do you mean Spark? Turns out Spark has also
FWIW, I don't think those are the main uses for first/last, rather I think the need is more for the "any_value" concept. Now that I think of it, I'm not sure why Spark even has a The main reason I started this PR was to support Spark's We do have some uses of In the end, if it's preferable, I think I can actually turn Spark's Would that be better? Then I could change this PR to instead add the null-handling options into |
An alternative to first is to use a fetch relation but it becomes a lot more complicated to modify a complicated subquery to introduce it. Last does weird me out and it is a available in less places. Probably implemented for completeness and not actual use. |
Yes, I'd prefer that. Sorry for the churn. Agree the null handling is good. |
fc3fa78
to
ff68ec9
Compare
first
and last
aggregate functionsnth_value
any_value can be used in place of e.g. first() in Spark but it's missing an option for whether to ignore nulls in input or not
ff68ec9
to
9563135
Compare
nth_value
any_value
Done! All good, makes sense to be careful when adding stuff into the standard (or standard extensions)! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, but I will point out that "ignore nulls" / "skip nulls" is a property that can apply to many (potentially all) aggregate functions.
This adds a "ignore_nulls" option for
any_value
that can be used when converting e.g. Spark's first()/first_value()/any_value()