feat: add extend expression message for expression only evaluation #405

yma11 · 2022-12-12T07:16:29Z

This PR proposes a way for using substrait as expression only representation
and will be use for expression evaluation in big data frameworks.

CLAassistant · 2022-12-12T07:16:38Z

All committers have signed the CLA.

jacques-n · 2022-12-12T17:17:00Z

proto/substrait/extended_expression.proto

+    Expression expression = 1;
+    AggregateFunction measure = 2;
+  }
+  NamedStruct base_schema = 2;


I feel like maybe this should be stated at the extended expression level as opposed to per ExpressionReference. Thoughts?

Correct. Multi expressions should evaluated on same input data thus one base_schema will be okay. Will update it.

jacques-n · 2022-12-12T17:17:47Z

proto/substrait/extended_expression.proto

+  }
+  NamedStruct base_schema = 2;
+  // Field names in depth-first order
+  string name = 3;


Maybe rename this output_names?

westonpace

I'm not sure what it means to have multiple expressions (more questions below).

westonpace · 2022-12-12T21:06:01Z

proto/substrait/extended_expression.proto

+  // one or more expression trees.
+  repeated ExpressionReference referred_expr = 3;


What does it mean if there are multiple expressions?

Does the order matter? Are these expressions to be evaluated one by one from top to bottom?
Can expressions refer to the value calculated in previous expressions (perhaps as a special sort of reference)?

We translate the multi expressions from frontend framework, following the order it occurs in original plan node. But it doesn't requires the consume side to handle it strictly in previous order. Backend can evaluated it one by one OR evaluated the expressions that be referred by other expressions first. For example, 2 expression a+b+c, a+b are translated in such order, but expression a+b can be evaluated first and then can be reused in evaluation of a+b+c. Backend can analyze these expressions and do such optimization correspondingly. However, for the final ouputput batch of multi expressions, the columns should organized in same order as defined in extended expression message.

I add some comments in the doc.

Could one expression explicitly reference another expression? For example, if you added a new "expression reference" message you could have something like:

exprs[0] = a + b exprs[1] = $exprs[0] + c

I think for current substrait definition, there is no suitable semantic to represent this reference. Correct me if I am wrong. And I personally don't prefer this way. In original expressions under single rel, like project plan node, they don't explicitly track this reference too. The reuse is often done through CSE(common sub expression) optimization, may be various from different frontend and backend engines.

yma11 · 2022-12-16T01:33:55Z

@jacques-n Please take a further review and help trigger the check again.

jacques-n

Mostly looks good. One minor issue

jacques-n · 2023-01-06T02:51:12Z

proto/substrait/extended_expression.proto

+    AggregateFunction measure = 2;
+  }
+  // Field names in depth-first order
+  string output_name = 3;


should be repeated

Updated and rebased.

jacques-n

This looks good to me. @westonpace , you good with adding this?

westonpace

I think there are still things that are odd here. However, I'll admit that I'm not heavily invested here at the moment. I don't yet have a need for this. So if @jacques-n is content we want to proceed with this I think that is fine.

proto/substrait/extended_expression.proto

westonpace · 2023-01-11T18:18:47Z

proto/substrait/extended_expression.proto

+    AggregateFunction measure = 2;
+  }
+  // Field names in depth-first order
+  repeated string output_names = 3;


It's strange to me that output_names are in ExpressionReference and not ExtendedExpression but this is a natural consequence of have repeated expressions at the top level.

I have thought about this two options. Both of them can work as expected but seems it makes things complicated if we put it in ExtendedExpression. Think about 2 expressions that one has a single output column called output_a and the other has 2 output columns called output_b, output_c. If we put it in ExtendedExpression, then the names should like output_a, output_b, output_c, then we have to do a mapping job when try to connect the column to its name. Do you think so?

I think I am confused. How can an expression have more than one output column?

I am thinking about a function like SELECT CAST(JSON '{"k1":1,"k2":23,"k3":456}' AS MAP(VARCHAR, INTEGER)); It should have 2 columns output in a columnar computation scenario.

Yes, that would have multiple output_names, I see now.

Think about 2 expressions that one has a single output column called output_a and the other has 2 output columns called output_b, output_c.

Only one of these should be the top-level expression correct? That is the only one whose names we care about.

For example, if we have three expressions:

expr_ref(0) + expr_ref(1): output_names="final" field(0) * expr_ref(2): output_names="middle" field(1) * 20: output_names="first"

The only meaningful value of output_names is "final" correct?

Is there any special meaning between names "first", "middle" and "final"? I mean is there any implicit relation between these expressions? if not, in case that they equally exist in a project node, which has three output columns for following plan node, all of these names are meaningful and necessary. Following plan node probably refer the columns by these names. Example pattern likes agg (sum("a"), min("b"), max("c") <- project(expr_0:"a", expr_1:"b", expr_2:"c").

I see your point. Sorry, I was thinking there was only one top-level expression.

westonpace · 2023-01-11T18:27:30Z

proto/substrait/extended_expression.proto

+  // one or more expression trees with same order in plan rel
+  repeated ExpressionReference referred_expr = 3;


I still find it confusing that we have a repeated list of expressions here. I think it would be more intuitive to have:

One top level expression

A repeated list of sub-expressions

We could do something similar to what we do with plans:

// An expression with output field names. // // This is for use at the root of an `Expression` tree. message ExpressionRoot { // The root expression Expression expression = 1; // Field names in depth-first order repeated string names = 2; } message StandaloneExpression { oneof rel_type { // A sub-expression (used for references) Expression expr = 1; // The root of an expression tree ExpressionRoot root = 2; } } message ExtendedExpression { ... repeated StandaloneExpression exprs; ... }

However, I think I would prefer something more straightforward:

message ExtendedExpression { ... // Top level expression Expression expr = 3; repeated string output_field_names = 4; // Common sub-expressions (referenced by `expr`) repeated Expression sub_expressions = 5; ... }

Although, if we end up adopting something like #415 then we could use repeated sub-expressions everywhere and we wouldn't need any of this. It could just be:

message ExtendedExpression { ... // Top level expression (may contain let expressions) Expression expr = 3; repeated string output_field_names = 4; ... }

Thanks for your examples. I think the greatest benefit is that a let expression is explicitly highlighted so that consumers doesn't need expr analysis again for an optimization like "CSE". This is a good idea but it should be contained in message Expression inside? If so, ExtendedExpression can keep as current.

If we had let expression then we do not need the top level expression to be repeated.

However, I agree that the current proposal will still work if we introduce let expressions.

let expression will make CSE easier but it also need producer does this expression level optimization and stores that info before translation to substrait. We are happy to do refine on this ExtendedExpression message if needed when this idea finishes adoption.

westonpace · 2023-01-11T18:28:29Z

site/docs/expressions/extended_expression.md

@@ -0,0 +1,19 @@
+# Extended expression


This prose can be cleaned up in a future PR but it isn't very clear to me. If you ping me with a reminder if this merges then I will be happy to make an attempt at doing so.

I tried finding out a suitable name but seems it doesn't quite work. My user scenario is expression level evaluation offloading but current Expression doesn't contain all required infos, like schema, function, etc. Any suggestion from you?

Sorry, I meant the entire file. Just minor grammatical things. It's not a big concern.

westonpace

Please squash all your commits into a single commit that follows the conventional commit specification and then we can go ahead and merge. Let me know if I can help with that process.

github-actions · 2023-01-14T09:27:00Z

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

yma11 · 2023-01-14T09:38:00Z

@westonpace PR squashed and description is updated. Please take a look again. Thanks!

jacques-n reviewed Dec 12, 2022

View reviewed changes

westonpace reviewed Dec 12, 2022

View reviewed changes

yma11 force-pushed the extend_expression branch from a8c5a33 to 859fd0b Compare December 15, 2022 01:04

jacques-n reviewed Jan 6, 2023

View reviewed changes

yma11 force-pushed the extend_expression branch from 390ebbb to 75d80c6 Compare January 6, 2023 03:23

jacques-n previously approved these changes Jan 6, 2023

View reviewed changes

yma11 dismissed jacques-n’s stale review via 579fb82 January 9, 2023 11:04

yma11 requested review from westonpace and jacques-n and removed request for westonpace and jacques-n January 10, 2023 08:53

westonpace reviewed Jan 11, 2023

View reviewed changes

westonpace previously approved these changes Jan 13, 2023

View reviewed changes

yma11 dismissed westonpace’s stale review via 80eddbc January 14, 2023 09:26

yma11 force-pushed the extend_expression branch from 579fb82 to 80eddbc Compare January 14, 2023 09:26

yma11 force-pushed the extend_expression branch 2 times, most recently from d7b9d8a to f541a92 Compare January 14, 2023 09:30

feat: add extended expression for expression only evaluation

436ff7a

yma11 force-pushed the extend_expression branch from 9ed2cdd to 436ff7a Compare January 18, 2023 14:30

westonpace approved these changes Jan 18, 2023

View reviewed changes

westonpace merged commit d35f0ed into substrait-io:main Jan 18, 2023

wjones127 mentioned this pull request Feb 1, 2023

[Python][Rust] Create extension point in python for Dataset/Scanner apache/arrow#33986

Open

wjones127 mentioned this pull request Feb 1, 2023

docs: fix prose for extended expression #434

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add extend expression message for expression only evaluation #405

feat: add extend expression message for expression only evaluation #405

yma11 commented Dec 12, 2022 •

edited

Loading

CLAassistant commented Dec 12, 2022 •

edited

Loading

jacques-n Dec 12, 2022

yma11 Dec 13, 2022

jacques-n Dec 12, 2022

yma11 Dec 13, 2022

westonpace left a comment

westonpace Dec 12, 2022

yma11 Dec 13, 2022 •

edited

Loading

yma11 Dec 13, 2022

westonpace Dec 14, 2022

yma11 Dec 15, 2022

yma11 commented Dec 16, 2022

jacques-n left a comment

jacques-n Jan 6, 2023

yma11 Jan 6, 2023

jacques-n left a comment

westonpace left a comment •

edited

Loading

westonpace Jan 11, 2023

yma11 Jan 12, 2023

westonpace Jan 12, 2023

yma11 Jan 13, 2023

westonpace Jan 13, 2023

yma11 Jan 13, 2023

westonpace Jan 13, 2023

westonpace Jan 11, 2023

yma11 Jan 12, 2023

westonpace Jan 12, 2023

yma11 Jan 13, 2023 •

edited

Loading

westonpace Jan 11, 2023

yma11 Jan 12, 2023 •

edited

Loading

westonpace Jan 12, 2023

westonpace left a comment

github-actions bot commented Jan 14, 2023

yma11 commented Jan 14, 2023

		// one or more expression trees.
		repeated ExpressionReference referred_expr = 3;

		// one or more expression trees with same order in plan rel
		repeated ExpressionReference referred_expr = 3;

feat: add extend expression message for expression only evaluation #405

feat: add extend expression message for expression only evaluation #405

Conversation

yma11 commented Dec 12, 2022 • edited Loading

CLAassistant commented Dec 12, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonpace left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yma11 Dec 13, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yma11 commented Dec 16, 2022

jacques-n left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jacques-n left a comment

Choose a reason for hiding this comment

westonpace left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yma11 Jan 13, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yma11 Jan 12, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonpace left a comment

Choose a reason for hiding this comment

github-actions bot commented Jan 14, 2023

yma11 commented Jan 14, 2023

yma11 commented Dec 12, 2022 •

edited

Loading

CLAassistant commented Dec 12, 2022 •

edited

Loading

yma11 Dec 13, 2022 •

edited

Loading

westonpace left a comment •

edited

Loading

yma11 Jan 13, 2023 •

edited

Loading

yma11 Jan 12, 2023 •

edited

Loading