Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add null input handling options for any_value #652

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Jun 26, 2024

This adds a "ignore_nulls" option for any_value that can be used when converting e.g. Spark's first()/first_value()/any_value()

- name: x
value: any
options:
null_handling:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied this from concat, does it make sense here or is better to add e.g. a boolean arg?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think options make sense for this. The names ACCEPT_NULLS and IGNORE_NULLS sound a little weird to me, but I can't think of better ones currently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that original option to concat, but I agree it does sound a little weird. How about changing the option name to ignore_nulls and have the options be ["True", "False"]. I think True/False may need to be quoted. If i recall correctly I didn't do this originally because no other options were quoted, but that's changed since then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed in b18cecd - and quoted after, as that was needed to keep them as strings. Though it hurts my soul a bit to have a string "TRUE". Maybe should make them YES/NO instead to be less confusing 😅

@@ -1563,6 +1563,43 @@ aggregate_functions:
values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ]
nullability: DECLARED_OUTPUT
return: fp64?
- name: "first"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the window functions are called first/last_value, should we stick to that naming?

Spark seems to have both first and first_value, though first_value is only supported in SQL while first is supported as a method.
DataFusion has first_value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at other engines, Postgres and Trino only have support for first_value and last_value as window functions, and don't have first and last aggregate functions.

I think it make sense to keep first/last and first_value/last_value as seperate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, looks like this is a mess overall - postgres has only first_value as window, Spark has first and first_value being the same, duckdb has first for aggregate and both for window, DataFusion has first_value as aggregate...

The purpose of the functions is the same, though. I actually think it'd make sense for Substrait to only have one set of these (as aggregate), and maybe it should be the _value option to match already existing any_value. But dunno if we can remove the window versions, I guess that'd be a breaking change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed them to have the "_value" postfix now - small annoyance there is that if someone includes now both these and window functions they'll get duplicate signatures...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the base name has to be unique across all of the files. So having two first_value functions with the same signature will likely mess things up. (And yes, we need a check.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the base name has to be unique across all of the files

I just saw this comment. I believe this is inconsistent with both the spirit and spec of extensions. Extensions should allow people to create new extensions that completely conflict with other extensions. That's why the spec specifies that functions are identified by a combination of their name and URI. Two different URIs can define the same names as entirely different things since they are in different namespaces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recent discussion on the topic:

#631 (comment)
#634

IIRC, the general consensus is "yes, different filenames should be able to have duplicate functio names but we're not there yet, there isn't much motivation to tackle the issue, and keeping the core Substrait functions unique makes life easier in the short term"

@Blizzara
Copy link
Contributor Author

Onne thing I'm not sure - is it better to have these as both window and aggregate functions, or could we remove the window function version and just replace with this?

@Blizzara Blizzara force-pushed the avo/first-last-as-aggregate-functions branch from 858438b to 0d924e9 Compare June 26, 2024 15:01
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this seems reasonable to me, but I'm curious to see what others think.

nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: any?
return: any?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both of these functions should be in functions_aggregate_generic.yaml instead of functions_arithmetic.yaml

extensions/functions_arithmetic.yaml Outdated Show resolved Hide resolved
- name: x
value: any
options:
null_handling:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think options make sense for this. The names ACCEPT_NULLS and IGNORE_NULLS sound a little weird to me, but I can't think of better ones currently.

@@ -1563,6 +1563,43 @@ aggregate_functions:
values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ]
nullability: DECLARED_OUTPUT
return: fp64?
- name: "first"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at other engines, Postgres and Trino only have support for first_value and last_value as window functions, and don't have first and last aggregate functions.

I think it make sense to keep first/last and first_value/last_value as seperate.

@@ -35,3 +35,41 @@ aggregate_functions:
value: any
nullability: DECLARED_OUTPUT
return: any?
- name: "first"
description: >-
First value from a group of values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth noting that order matters for these two functions. Acero will reject plans that don't have a defined ordering on the input which might be a reasonable practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note c93c326!

@@ -35,3 +35,41 @@ aggregate_functions:
value: any
nullability: DECLARED_OUTPUT
return: any?
- name: "first"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call out to first_value with the difference to make it easier for folks to choose one or the other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I understand, what do you see as the difference? My hope would be to replace the window versions with these completely, given that aggregate functions are also valid window functions, and the engines I looked at don't seem to make a difference between these - but maybe I missed something

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only difference I see is what context they can be called in.

@Blizzara
Copy link
Contributor Author

Okays, I revamped the PR a bit - now it moves the first_value and last_value from window funcs into aggregate funcs, and adds the null handling options.

This makes most sense to me, but lmk what you think!

EpsilonPrime
EpsilonPrime previously approved these changes Jun 26, 2024
nullability: DECLARED_OUTPUT
decomposable: NONE
return: any1
window_type: PARTITION
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up, moving existing functions is a breaking change, and a relatively painful one to workaround. I would prefer if we could avoid breakage here.

See: #634 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the options are:

  1. move existing functions (breaking change
  2. duplicate existing functions (bad)
  3. add new functions with new name (leads to confusing end state as there's now two functions with same functionality but different name, unless we can deprecate the existing functions somehow)

For my need, I think both 1 and 3 work fine, so I don't have strong opinions - I guess it's a question of breaking now vs keeping a worse state for ever/until breaking later?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We could also keep the functions in the same file, but move them into the aggregate functions block.

I guess it's a question of breaking now vs keeping a worse state for ever/until breaking later?

I think it would good to move these eventually, but it would be nice to do after substrait-java can handle duplicate functions in different files, because then we could make the change by duplicating the functions into the new file in one release, and then removing the old functions in the next release.

I've filed substrait-io/substrait-java#275 to track this work in substrait-java.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping the same file for now, fixing substrait-java, and then moving files works for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought "moving" meant moving also from window -> aggregation. Keeping them in the old file is fine for me -like fc3fa78 ?

EpsilonPrime
EpsilonPrime previously approved these changes Jun 27, 2024
@westonpace
Copy link
Member

I don't love it but I won't necessarily vote against it. It isn't supported by some significant engines (e.g. Postgres, SQL server, Snowflake, ). However, it does appear to be supported by DataFusion and DuckDB so there is some representation. Databricks has first/last but it marks them as "order is non-deterministics" and it's not clear that Databricks supports "ORDER BY" in an aggregate expression.

My main problem is that first(x order by x) is the same as min(x) and first(x order by y) can be obtained by arg_min(x, y). I think min / arg_min are better since they don't require an order by statement and so they are more easily implemented by engines (yes, any engine can optimize first into min / arg_min but that's just introducing extra steps for no real gain). I think the more common result would be an engine falling back to its order by implementation and introducing an expensive sort into the query (arg_min and arg_max can be calculated without a sort). For example, this appears to be what DataFusion does today:

❯ EXPLAIN SELECT first_value(val ORDER BY val) FROM foo;
+---------------+------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                     |
+---------------+------------------------------------------------------------------------------------------+
| logical_plan  | Aggregate: groupBy=[[]], aggr=[[FIRST_VALUE(foo.val) ORDER BY [foo.val ASC NULLS LAST]]] |
|               |   TableScan: foo projection=[val]                                                        |
| physical_plan | AggregateExec: mode=Single, gby=[], aggr=[FIRST_VALUE(foo.val)]                          |
|               |   SortExec: expr=[val@0 ASC NULLS LAST]                                                  |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                        |
|               |                                                                                          |
+---------------+------------------------------------------------------------------------------------------+

That plan is more expensive than SELECT min(val) FROM foo even though they are identical. That being said, I do think we should finish up the arg_min / arg_max PR (#326)

@Blizzara
Copy link
Contributor Author

I don't love it but I won't necessarily vote against it. It isn't supported by some significant engines (e.g. Postgres, SQL server, Snowflake, ). However, it does appear to be supported by DataFusion and DuckDB so there is some representation. Databricks has first/last but it marks them as "order is non-deterministics" and it's not clear that Databricks supports "ORDER BY" in an aggregate expression.

By Databricks, do you mean Spark? Turns out Spark has also any_value, and they are "interchanged" ie the implementation for any_value is just first. Substrait already has any_value, so one option is to just use that.

My main problem is that first(x order by x) is the same as min(x) and first(x order by y) can be obtained by arg_min(x, y).

FWIW, I don't think those are the main uses for first/last, rather I think the need is more for the "any_value" concept. Now that I think of it, I'm not sure why Spark even has a last, but maybe there's a reason.

The main reason I started this PR was to support Spark's distinct/dropDuplicates. Those get rewritten by the optimizer into a first aggregate: https://github.com/apache/spark/blob/df13ca05c475e98bf5c218a4503513065611a47f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L2262

We do have some uses of last as well, but it's possible those would be within windows, I don't have an easy way to check.

In the end, if it's preferable, I think I can actually turn Spark's first() aggregate into Substrait's any_value() and then that again into DataFusion's first_value(). Then if I do end up needing the last() as aggregate, I can add it as a Spark specific mapping or something.

Would that be better? Then I could change this PR to instead add the null-handling options into any_value, since those would be nice to have still.

@EpsilonPrime
Copy link
Member

An alternative to first is to use a fetch relation but it becomes a lot more complicated to modify a complicated subquery to introduce it. Last does weird me out and it is a available in less places. Probably implemented for completeness and not actual use.

@westonpace
Copy link
Member

Would that be better? Then I could change this PR to instead add the null-handling options into any_value, since those would be nice to have still.

Yes, I'd prefer that. Sorry for the churn. Agree the null handling is good.

@Blizzara Blizzara force-pushed the avo/first-last-as-aggregate-functions branch from fc3fa78 to ff68ec9 Compare July 1, 2024 19:51
@Blizzara Blizzara changed the title feat: add first and last aggregate functions feat: add null input handling options for nth_value Jul 1, 2024
any_value can be used in place of e.g. first() in Spark
but it's missing an option for whether to ignore nulls in
input or not
@Blizzara Blizzara force-pushed the avo/first-last-as-aggregate-functions branch from ff68ec9 to 9563135 Compare July 1, 2024 19:54
@Blizzara Blizzara changed the title feat: add null input handling options for nth_value feat: add null input handling options for any_value Jul 1, 2024
@Blizzara
Copy link
Contributor Author

Blizzara commented Jul 1, 2024

Yes, I'd prefer that. Sorry for the churn. Agree the null handling is good.

Done! All good, makes sense to be careful when adding stuff into the standard (or standard extensions)!

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, but I will point out that "ignore nulls" / "skip nulls" is a property that can apply to many (potentially all) aggregate functions.

@EpsilonPrime EpsilonPrime merged commit 1890e6a into substrait-io:main Jul 3, 2024
17 checks passed
@Blizzara Blizzara deleted the avo/first-last-as-aggregate-functions branch July 8, 2024 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants