Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add extend expression message for expression only evaluation #405

Merged
merged 1 commit into from
Jan 18, 2023

Conversation

yma11
Copy link
Contributor

@yma11 yma11 commented Dec 12, 2022

This PR proposes a way for using substrait as expression only representation
and will be use for expression evaluation in big data frameworks.

@CLAassistant
Copy link

CLAassistant commented Dec 12, 2022

CLA assistant check
All committers have signed the CLA.

Expression expression = 1;
AggregateFunction measure = 2;
}
NamedStruct base_schema = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like maybe this should be stated at the extended expression level as opposed to per ExpressionReference. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Multi expressions should evaluated on same input data thus one base_schema will be okay. Will update it.

}
NamedStruct base_schema = 2;
// Field names in depth-first order
string name = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename this output_names?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what it means to have multiple expressions (more questions below).

Comment on lines 37 to 39
// one or more expression trees.
repeated ExpressionReference referred_expr = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean if there are multiple expressions?

Does the order matter? Are these expressions to be evaluated one by one from top to bottom?
Can expressions refer to the value calculated in previous expressions (perhaps as a special sort of reference)?

Copy link
Contributor Author

@yma11 yma11 Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We translate the multi expressions from frontend framework, following the order it occurs in original plan node. But it doesn't requires the consume side to handle it strictly in previous order. Backend can evaluated it one by one OR evaluated the expressions that be referred by other expressions first. For example, 2 expression a+b+c, a+b are translated in such order, but expression a+b can be evaluated first and then can be reused in evaluation of a+b+c. Backend can analyze these expressions and do such optimization correspondingly. However, for the final ouputput batch of multi expressions, the columns should organized in same order as defined in extended expression message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add some comments in the doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could one expression explicitly reference another expression? For example, if you added a new "expression reference" message you could have something like:

exprs[0] = a + b
exprs[1] = $exprs[0] + c

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for current substrait definition, there is no suitable semantic to represent this reference. Correct me if I am wrong. And I personally don't prefer this way. In original expressions under single rel, like project plan node, they don't explicitly track this reference too. The reuse is often done through CSE(common sub expression) optimization, may be various from different frontend and backend engines.

@yma11 yma11 force-pushed the extend_expression branch from a8c5a33 to 859fd0b Compare December 15, 2022 01:04
@yma11
Copy link
Contributor Author

yma11 commented Dec 16, 2022

@jacques-n Please take a further review and help trigger the check again.

Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good. One minor issue

AggregateFunction measure = 2;
}
// Field names in depth-first order
string output_name = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be repeated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated and rebased.

@yma11 yma11 force-pushed the extend_expression branch from 390ebbb to 75d80c6 Compare January 6, 2023 03:23
jacques-n
jacques-n previously approved these changes Jan 6, 2023
Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. @westonpace , you good with adding this?

@yma11 yma11 requested review from westonpace and jacques-n and removed request for westonpace and jacques-n January 10, 2023 08:53
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are still things that are odd here. However, I'll admit that I'm not heavily invested here at the moment. I don't yet have a need for this. So if @jacques-n is content we want to proceed with this I think that is fine.

proto/substrait/extended_expression.proto Show resolved Hide resolved
AggregateFunction measure = 2;
}
// Field names in depth-first order
repeated string output_names = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's strange to me that output_names are in ExpressionReference and not ExtendedExpression but this is a natural consequence of have repeated expressions at the top level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about this two options. Both of them can work as expected but seems it makes things complicated if we put it in ExtendedExpression. Think about 2 expressions that one has a single output column called output_a and the other has 2 output columns called output_b, output_c. If we put it in ExtendedExpression, then the names should like output_a, output_b, output_c, then we have to do a mapping job when try to connect the column to its name. Do you think so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am confused. How can an expression have more than one output column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking about a function like SELECT CAST(JSON '{"k1":1,"k2":23,"k3":456}' AS MAP(VARCHAR, INTEGER)); It should have 2 columns output in a columnar computation scenario.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would have multiple output_names, I see now.

Think about 2 expressions that one has a single output column called output_a and the other has 2 output columns called output_b, output_c.

Only one of these should be the top-level expression correct? That is the only one whose names we care about.

For example, if we have three expressions:

expr_ref(0) + expr_ref(1): output_names="final"
field(0) * expr_ref(2): output_names="middle"
field(1) * 20: output_names="first"

The only meaningful value of output_names is "final" correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any special meaning between names "first", "middle" and "final"? I mean is there any implicit relation between these expressions? if not, in case that they equally exist in a project node, which has three output columns for following plan node, all of these names are meaningful and necessary. Following plan node probably refer the columns by these names. Example pattern likes agg (sum("a"), min("b"), max("c") <- project(expr_0:"a", expr_1:"b", expr_2:"c").

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Sorry, I was thinking there was only one top-level expression.

Comment on lines +38 to +39
// one or more expression trees with same order in plan rel
repeated ExpressionReference referred_expr = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still find it confusing that we have a repeated list of expressions here. I think it would be more intuitive to have:

  • One top level expression
  • A repeated list of sub-expressions

We could do something similar to what we do with plans:

// An expression with output field names.
//
// This is for use at the root of an `Expression` tree.
message ExpressionRoot {
  // The root expression
  Expression expression = 1;
  // Field names in depth-first order
  repeated string names = 2;
}

message StandaloneExpression {
  oneof rel_type {
    // A sub-expression (used for references)
    Expression expr = 1;
    // The root of an expression tree
    ExpressionRoot root = 2;
  }
}

message ExtendedExpression {
  ...
  repeated StandaloneExpression exprs;
  ...
}

However, I think I would prefer something more straightforward:

message ExtendedExpression {
...
// Top level expression
Expression expr = 3;
repeated string output_field_names = 4;
// Common sub-expressions (referenced by `expr`)
repeated Expression sub_expressions = 5;
...
}

Although, if we end up adopting something like #415 then we could use repeated sub-expressions everywhere and we wouldn't need any of this. It could just be:

message ExtendedExpression {
...
// Top level expression (may contain let expressions)
Expression expr = 3;
repeated string output_field_names = 4;
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your examples. I think the greatest benefit is that a let expression is explicitly highlighted so that consumers doesn't need expr analysis again for an optimization like "CSE". This is a good idea but it should be contained in message Expression inside? If so, ExtendedExpression can keep as current.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had let expression then we do not need the top level expression to be repeated.

However, I agree that the current proposal will still work if we introduce let expressions.

Copy link
Contributor Author

@yma11 yma11 Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let expression will make CSE easier but it also need producer does this expression level optimization and stores that info before translation to substrait. We are happy to do refine on this ExtendedExpression message if needed when this idea finishes adoption.

@@ -0,0 +1,19 @@
# Extended expression
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prose can be cleaned up in a future PR but it isn't very clear to me. If you ping me with a reminder if this merges then I will be happy to make an attempt at doing so.

Copy link
Contributor Author

@yma11 yma11 Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried finding out a suitable name but seems it doesn't quite work. My user scenario is expression level evaluation offloading but current Expression doesn't contain all required infos, like schema, function, etc. Any suggestion from you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant the entire file. Just minor grammatical things. It's not a big concern.

westonpace
westonpace previously approved these changes Jan 13, 2023
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash all your commits into a single commit that follows the conventional commit specification and then we can go ahead and merge. Let me know if I can help with that process.

@github-actions
Copy link

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@yma11 yma11 force-pushed the extend_expression branch 2 times, most recently from d7b9d8a to f541a92 Compare January 14, 2023 09:30
@yma11
Copy link
Contributor Author

yma11 commented Jan 14, 2023

@westonpace PR squashed and description is updated. Please take a look again. Thanks!

@yma11 yma11 force-pushed the extend_expression branch from 9ed2cdd to 436ff7a Compare January 18, 2023 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants