Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add extend expression message for expression only evaluation #405

Merged
merged 1 commit into from
Jan 18, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
51 changes: 51 additions & 0 deletions proto/substrait/extended_expression.proto
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
// SPDX-License-Identifier: Apache-2.0
syntax = "proto3";

package substrait;

import "substrait/algebra.proto";
import "substrait/extensions/extensions.proto";
import "substrait/plan.proto";
import "substrait/type.proto";

option csharp_namespace = "Substrait.Protobuf";
option go_package = "github.com/substrait-io/substrait-go/proto";
option java_multiple_files = true;
option java_package = "io.substrait.proto";

message ExpressionReference {
oneof expr_type {
Expression expression = 1;
AggregateFunction measure = 2;
}
westonpace marked this conversation as resolved.
Show resolved Hide resolved
// Field names in depth-first order
repeated string output_names = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's strange to me that output_names are in ExpressionReference and not ExtendedExpression but this is a natural consequence of have repeated expressions at the top level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about this two options. Both of them can work as expected but seems it makes things complicated if we put it in ExtendedExpression. Think about 2 expressions that one has a single output column called output_a and the other has 2 output columns called output_b, output_c. If we put it in ExtendedExpression, then the names should like output_a, output_b, output_c, then we have to do a mapping job when try to connect the column to its name. Do you think so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am confused. How can an expression have more than one output column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking about a function like SELECT CAST(JSON '{"k1":1,"k2":23,"k3":456}' AS MAP(VARCHAR, INTEGER)); It should have 2 columns output in a columnar computation scenario.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would have multiple output_names, I see now.

Think about 2 expressions that one has a single output column called output_a and the other has 2 output columns called output_b, output_c.

Only one of these should be the top-level expression correct? That is the only one whose names we care about.

For example, if we have three expressions:

expr_ref(0) + expr_ref(1): output_names="final"
field(0) * expr_ref(2): output_names="middle"
field(1) * 20: output_names="first"

The only meaningful value of output_names is "final" correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any special meaning between names "first", "middle" and "final"? I mean is there any implicit relation between these expressions? if not, in case that they equally exist in a project node, which has three output columns for following plan node, all of these names are meaningful and necessary. Following plan node probably refer the columns by these names. Example pattern likes agg (sum("a"), min("b"), max("c") <- project(expr_0:"a", expr_1:"b", expr_2:"c").

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Sorry, I was thinking there was only one top-level expression.

}

// Describe a set of operations to complete.
// For compactness sake, identifiers are normalized at the plan level.
message ExtendedExpression {
// Substrait version of the expression. Optional up to 0.17.0, required for later
// versions.
Version version = 7;

// a list of yaml specifications this expression may depend on
repeated substrait.extensions.SimpleExtensionURI extension_uris = 1;

// a list of extensions this expression may depend on
repeated substrait.extensions.SimpleExtensionDeclaration extensions = 2;

// one or more expression trees with same order in plan rel
repeated ExpressionReference referred_expr = 3;
Comment on lines +38 to +39
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still find it confusing that we have a repeated list of expressions here. I think it would be more intuitive to have:

  • One top level expression
  • A repeated list of sub-expressions

We could do something similar to what we do with plans:

// An expression with output field names.
//
// This is for use at the root of an `Expression` tree.
message ExpressionRoot {
  // The root expression
  Expression expression = 1;
  // Field names in depth-first order
  repeated string names = 2;
}

message StandaloneExpression {
  oneof rel_type {
    // A sub-expression (used for references)
    Expression expr = 1;
    // The root of an expression tree
    ExpressionRoot root = 2;
  }
}

message ExtendedExpression {
  ...
  repeated StandaloneExpression exprs;
  ...
}

However, I think I would prefer something more straightforward:

message ExtendedExpression {
...
// Top level expression
Expression expr = 3;
repeated string output_field_names = 4;
// Common sub-expressions (referenced by `expr`)
repeated Expression sub_expressions = 5;
...
}

Although, if we end up adopting something like #415 then we could use repeated sub-expressions everywhere and we wouldn't need any of this. It could just be:

message ExtendedExpression {
...
// Top level expression (may contain let expressions)
Expression expr = 3;
repeated string output_field_names = 4;
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your examples. I think the greatest benefit is that a let expression is explicitly highlighted so that consumers doesn't need expr analysis again for an optimization like "CSE". This is a good idea but it should be contained in message Expression inside? If so, ExtendedExpression can keep as current.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had let expression then we do not need the top level expression to be repeated.

However, I agree that the current proposal will still work if we introduce let expressions.

Copy link
Contributor Author

@yma11 yma11 Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let expression will make CSE easier but it also need producer does this expression level optimization and stores that info before translation to substrait. We are happy to do refine on this ExtendedExpression message if needed when this idea finishes adoption.


NamedStruct base_schema = 4;
// additional extensions associated with this expression.
substrait.extensions.AdvancedExtension advanced_extensions = 5;

// A list of com.google.Any entities that this plan may use. Can be used to
// warn if some embedded message types are unknown. Note that this list may
// include message types that are ignorable (optimizations) or that are
// unused. In many cases, a consumer may be able to work with a plan even if
// one or more message types defined here are unknown.
repeated string expected_type_urls = 6;
}
19 changes: 19 additions & 0 deletions site/docs/expressions/extended_expression.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
# Extended expression
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prose can be cleaned up in a future PR but it isn't very clear to me. If you ping me with a reminder if this merges then I will be happy to make an attempt at doing so.

Copy link
Contributor Author

@yma11 yma11 Jan 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried finding out a suitable name but seems it doesn't quite work. My user scenario is expression level evaluation offloading but current Expression doesn't contain all required infos, like schema, function, etc. Any suggestion from you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant the entire file. Just minor grammatical things. It's not a big concern.


Extended expression is provided for expression level protocol instead of plan rels. It mainly targets for expression only evaluation, such as those computed in Filter/Project/Aggregation rels. Different from original expression defined in substrait protocol, it requires more information to completely describe the computation context, including input data schema, referred function signatures and output schema.

Besides, as it will be used seperately with plan rel representation, it need include basic fields like Version.

## Input and output data schema

Similar as `base_schema` defined in [ReadRel](https://github.com/substrait-io/substrait/blob/7f272f13f22cd5f5842baea42bcf7961e6251881/proto/substrait/algebra.proto#L58), the input data schema tells name/type/nullibilty and layout info of input data for target expression evalutation. It also has a field `name` to define name of output data.

## Referred expression

It will has one or more referred expressions in this message and the referred expressions can be [Expression](https://github.com/substrait-io/substrait/blob/7f272f13f22cd5f5842baea42bcf7961e6251881/proto/substrait/algebra.proto) or [AggregateFunction](https://github.com/substrait-io/substrait/blob/7f272f13f22cd5f5842baea42bcf7961e6251881/proto/substrait/algebra.proto#L1170). More types of expression can be added for more scenarios.

For multi expressions, user can translate them following same order as it occurs in original plan rel. But it does NOT require the consume side to handle it strictly in previous order. Only need to make sure columns in final output are organized in same order as defined in extended expression message.

## Function extensions

As in the expression message, functions are used by referring function anchor so the related extensions are needed to determine detailed function signature.