-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add extend expression message for expression only evaluation #405
Conversation
Expression expression = 1; | ||
AggregateFunction measure = 2; | ||
} | ||
NamedStruct base_schema = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like maybe this should be stated at the extended expression level as opposed to per ExpressionReference. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Multi expressions should evaluated on same input data thus one base_schema
will be okay. Will update it.
} | ||
NamedStruct base_schema = 2; | ||
// Field names in depth-first order | ||
string name = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename this output_names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what it means to have multiple expressions (more questions below).
// one or more expression trees. | ||
repeated ExpressionReference referred_expr = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean if there are multiple expressions?
Does the order matter? Are these expressions to be evaluated one by one from top to bottom?
Can expressions refer to the value calculated in previous expressions (perhaps as a special sort of reference)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We translate the multi expressions from frontend framework, following the order it occurs in original plan node. But it doesn't requires the consume side to handle it strictly in previous order. Backend can evaluated it one by one OR evaluated the expressions that be referred by other expressions first. For example, 2 expression a+b+c, a+b
are translated in such order, but expression a+b
can be evaluated first and then can be reused in evaluation of a+b+c
. Backend can analyze these expressions and do such optimization correspondingly. However, for the final ouputput batch of multi expressions, the columns should organized in same order as defined in extended expression message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add some comments in the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could one expression explicitly reference another expression? For example, if you added a new "expression reference" message you could have something like:
exprs[0] = a + b
exprs[1] = $exprs[0] + c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for current substrait definition, there is no suitable semantic to represent this reference. Correct me if I am wrong. And I personally don't prefer this way. In original expressions under single rel, like project plan node, they don't explicitly track this reference too. The reuse is often done through CSE(common sub expression) optimization, may be various from different frontend and backend engines.
a8c5a33
to
859fd0b
Compare
@jacques-n Please take a further review and help trigger the check again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good. One minor issue
AggregateFunction measure = 2; | ||
} | ||
// Field names in depth-first order | ||
string output_name = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be repeated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated and rebased.
390ebbb
to
75d80c6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. @westonpace , you good with adding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are still things that are odd here. However, I'll admit that I'm not heavily invested here at the moment. I don't yet have a need for this. So if @jacques-n is content we want to proceed with this I think that is fine.
AggregateFunction measure = 2; | ||
} | ||
// Field names in depth-first order | ||
repeated string output_names = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's strange to me that output_names
are in ExpressionReference
and not ExtendedExpression
but this is a natural consequence of have repeated expressions at the top level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have thought about this two options. Both of them can work as expected but seems it makes things complicated if we put it in ExtendedExpression
. Think about 2 expressions that one has a single output column called output_a
and the other has 2 output columns called output_b, output_c
. If we put it in ExtendedExpression
, then the names should like output_a, output_b, output_c
, then we have to do a mapping job when try to connect the column to its name. Do you think so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am confused. How can an expression have more than one output column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking about a function like SELECT CAST(JSON '{"k1":1,"k2":23,"k3":456}' AS MAP(VARCHAR, INTEGER));
It should have 2 columns output in a columnar computation scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would have multiple output_names, I see now.
Think about 2 expressions that one has a single output column called output_a and the other has 2 output columns called output_b, output_c.
Only one of these should be the top-level expression correct? That is the only one whose names we care about.
For example, if we have three expressions:
expr_ref(0) + expr_ref(1): output_names="final"
field(0) * expr_ref(2): output_names="middle"
field(1) * 20: output_names="first"
The only meaningful value of output_names
is "final" correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any special meaning between names "first", "middle" and "final"? I mean is there any implicit relation between these expressions? if not, in case that they equally exist in a project node, which has three output columns for following plan node, all of these names are meaningful and necessary. Following plan node probably refer the columns by these names. Example pattern likes agg (sum("a"), min("b"), max("c") <- project(expr_0:"a", expr_1:"b", expr_2:"c")
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. Sorry, I was thinking there was only one top-level expression.
// one or more expression trees with same order in plan rel | ||
repeated ExpressionReference referred_expr = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still find it confusing that we have a repeated list of expressions here. I think it would be more intuitive to have:
- One top level expression
- A repeated list of sub-expressions
We could do something similar to what we do with plans:
// An expression with output field names.
//
// This is for use at the root of an `Expression` tree.
message ExpressionRoot {
// The root expression
Expression expression = 1;
// Field names in depth-first order
repeated string names = 2;
}
message StandaloneExpression {
oneof rel_type {
// A sub-expression (used for references)
Expression expr = 1;
// The root of an expression tree
ExpressionRoot root = 2;
}
}
message ExtendedExpression {
...
repeated StandaloneExpression exprs;
...
}
However, I think I would prefer something more straightforward:
message ExtendedExpression {
...
// Top level expression
Expression expr = 3;
repeated string output_field_names = 4;
// Common sub-expressions (referenced by `expr`)
repeated Expression sub_expressions = 5;
...
}
Although, if we end up adopting something like #415 then we could use repeated sub-expressions everywhere and we wouldn't need any of this. It could just be:
message ExtendedExpression {
...
// Top level expression (may contain let expressions)
Expression expr = 3;
repeated string output_field_names = 4;
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your examples. I think the greatest benefit is that a let
expression is explicitly highlighted so that consumers doesn't need expr analysis again for an optimization like "CSE". This is a good idea but it should be contained in message Expression
inside? If so, ExtendedExpression
can keep as current.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we had let
expression then we do not need the top level expression to be repeated.
However, I agree that the current proposal will still work if we introduce let
expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let
expression will make CSE easier but it also need producer does this expression level optimization and stores that info before translation to substrait. We are happy to do refine on this ExtendedExpression
message if needed when this idea finishes adoption.
@@ -0,0 +1,19 @@ | |||
# Extended expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prose can be cleaned up in a future PR but it isn't very clear to me. If you ping me with a reminder if this merges then I will be happy to make an attempt at doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried finding out a suitable name but seems it doesn't quite work. My user scenario is expression level evaluation offloading but current Expression
doesn't contain all required infos, like schema, function, etc. Any suggestion from you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant the entire file. Just minor grammatical things. It's not a big concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please squash all your commits into a single commit that follows the conventional commit specification and then we can go ahead and merge. Let me know if I can help with that process.
579fb82
to
80eddbc
Compare
ACTION NEEDED Substrait follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
d7b9d8a
to
f541a92
Compare
@westonpace PR squashed and description is updated. Please take a look again. Thanks! |
9ed2cdd
to
436ff7a
Compare
This PR proposes a way for using substrait as expression only representation
and will be use for expression evaluation in big data frameworks.