Add SessionContext/SessionState::create_physical_expr() to create PhysicalExpressions from Exprs #10330

Merged (6 commits) on May 7, 2024

Contributor

@alamb alamb commented May 1, 2024

Draft as it builds on #10331

Which issue does this PR close?

Closes #10181

Rationale for this change

Rationale:

There are important use cases for creating Exprs and executing them as a
PhysicalExpr outside the context of a query (for example, to apply delete
predicates in delta.rs).

At the moment this is possible but is often tricky as users have to manually
invoke type coercion and simplification rules to get the Expr into a form that
can be executed (see example here)

After #8045, to use certain expressions that translate to functions it is also important
to apply function rewrite rules. See #10181 for more details.

Thus I think having a proper, tested, documented public API that does this is
valuable and will prevent future regressions as we continue refactoring.

What changes are included in this PR?

Changes

  1. Add SessionContext::create_physical_expr() and SessionState::create_physical_expr()
  2. Apply FunctionRewrites
  3. Update examples to use the new API
  4. Add tests for the new API
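
As a rough illustration of why bundling these steps into one call helps, the sketch below models the flow with toy types: coerce a logical expression against a schema, then turn it into something executable. `ToyType`, `ToyExpr`, `ToySchema`, `coerce`, `create_physical`, and this `create_physical_expr` are hypothetical stand-ins defined here, not DataFusion's `Expr`, `DFSchema`, or internals. Only the overall shape (coercion before physical planning) is taken from the description above.

```rust
use std::collections::HashMap;

/// Toy data types (stand-ins, not arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyType {
    Int32,
    Int64,
}

/// Toy logical expression (a stand-in for a logical `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Column(String),
    LitI32(i32),
    LitI64(i64),
    Eq(Box<ToyExpr>, Box<ToyExpr>),
}

/// Toy schema: column name -> type.
type ToySchema = HashMap<String, ToyType>;

/// Toy "physical" expression: evaluates against one row of Int32 values.
type ToyPhysicalExpr = Box<dyn Fn(&HashMap<String, i32>) -> Option<bool>>;

/// Step 1: type coercion. Narrow an i64 literal compared with an Int32 column.
fn coerce(expr: ToyExpr, schema: &ToySchema) -> Result<ToyExpr, String> {
    match expr {
        ToyExpr::Eq(left, right) => match (*left, *right) {
            (ToyExpr::Column(name), ToyExpr::LitI64(value))
                if schema.get(&name) == Some(&ToyType::Int32) =>
            {
                let narrowed = i32::try_from(value).map_err(|e| e.to_string())?;
                Ok(ToyExpr::Eq(
                    Box::new(ToyExpr::Column(name)),
                    Box::new(ToyExpr::LitI32(narrowed)),
                ))
            }
            (left, right) => Ok(ToyExpr::Eq(Box::new(left), Box::new(right))),
        },
        other => Ok(other),
    }
}

/// Step 2: build the executable form. This toy planner only accepts
/// comparisons whose two sides already have the same type.
fn create_physical(expr: &ToyExpr) -> Result<ToyPhysicalExpr, String> {
    if let ToyExpr::Eq(left, right) = expr {
        if let (ToyExpr::Column(name), ToyExpr::LitI32(value)) = (left.as_ref(), right.as_ref()) {
            let (name, value) = (name.clone(), *value);
            let physical: ToyPhysicalExpr =
                Box::new(move |row: &HashMap<String, i32>| row.get(&name).map(|v| *v == value));
            return Ok(physical);
        }
    }
    Err(format!("types not aligned, cannot execute: {expr:?}"))
}

/// The bundled call: coerce first, then plan.
fn create_physical_expr(expr: ToyExpr, schema: &ToySchema) -> Result<ToyPhysicalExpr, String> {
    let coerced = coerce(expr, schema)?;
    create_physical(&coerced)
}
```

In this toy model, calling `create_physical` directly on `a = 1i64` with an Int32 column `a` fails, while the bundled `create_physical_expr` succeeds because the literal is narrowed first.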

Are these changes tested?

Yes, new tests

Are there any user-facing changes?

Yes: New API

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate labels May 1, 2024
@alamb alamb changed the title from "Add SessionContext::create_physical_expr() and SessionState::create_physical_expr()" to "Add APIs to create PhysicalExpressions from Exprs: SessionContext::create_physical_expr() and SessionState::create_physical_expr()" May 1, 2024
@alamb alamb force-pushed the alamb/expression_api branch from 86548e3 to cfd4440 Compare May 1, 2024 17:37
@alamb alamb force-pushed the alamb/expression_api branch from cfd4440 to c326003 Compare May 1, 2024 18:51
@alamb alamb force-pushed the alamb/expression_api branch from c326003 to e896763 Compare May 1, 2024 18:58
@@ -92,7 +90,8 @@ fn evaluate_demo() -> Result<()> {
let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));

// First, you make a "physical expression" from the logical `Expr`
let physical_expr = physical_expr(&batch.schema(), expr)?;
let df_schema = DFSchema::try_from(batch.schema())?;
Contributor Author

This shows how the new API is used - SessionContext::new().create_physical_expr

@@ -248,21 +251,6 @@ fn make_ts_field(name: &str) -> Field {
make_field(name, DataType::Timestamp(TimeUnit::Nanosecond, tz))
}

/// Build a physical expression from a logical one, after applying simplification and type coercion
pub fn physical_expr(schema: &Schema, expr: Expr) -> Result<Arc<dyn PhysicalExpr>> {
Contributor Author

This PR basically ports this code out of the examples and into SessionState and adds documentation and tests

@@ -806,6 +820,12 @@ impl From<&DFSchema> for Schema {
}
}

impl AsRef<Schema> for DFSchema {
Contributor Author

This allows DFSchema to be treated like a &Schema, which is now possible after #9595

Member

🤯

Contributor

Very nice!

Contributor Author

@alamb alamb May 2, 2024

yeah, #9595 was epic -- kudos to @haohuaijin and @matthewmturner for making that happen
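
For readers unfamiliar with the pattern in this thread, here is a minimal sketch of what an `AsRef` impl buys. `ToyArrowSchema`, `ToyDFSchema`, and `field_count` are hypothetical toy types and helpers defined only for this example. They assume, as the comment above suggests, that the wrapper holds the plain schema internally.

```rust
/// Toy stand-in for the plain (arrow) schema.
#[derive(Debug)]
struct ToyArrowSchema {
    fields: Vec<String>,
}

/// Toy stand-in for a richer schema that wraps the plain one.
#[derive(Debug)]
struct ToyDFSchema {
    inner: ToyArrowSchema,
    qualifier: Option<String>,
}

impl AsRef<ToyArrowSchema> for ToyArrowSchema {
    fn as_ref(&self) -> &ToyArrowSchema {
        self
    }
}

/// The wrapper can hand out a borrowed view of the schema it contains,
/// so no copy or conversion is needed.
impl AsRef<ToyArrowSchema> for ToyDFSchema {
    fn as_ref(&self) -> &ToyArrowSchema {
        &self.inner
    }
}

/// Accepts either schema type (or a reference to one).
fn field_count(schema: impl AsRef<ToyArrowSchema>) -> usize {
    schema.as_ref().fields.len()
}
```

A function written against `impl AsRef<ToyArrowSchema>` then accepts `&ToyDFSchema` and `&ToyArrowSchema` alike.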

@@ -510,6 +515,34 @@ impl SessionContext {
}
}

/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type
Contributor Author

New public API with example

@@ -2024,6 +2089,35 @@ impl SessionState {
}
}

struct SessionSimpifyProvider<'a> {
Contributor Author

This avoids cloning schema / schema refs
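
A minimal sketch of the borrowing idea, assuming the provider only needs read access for the duration of one call. `ToySimplifyInfo`, `ToyState`, `ToyColumns`, and `ToyProvider` are hypothetical names for this example, and the real struct's fields and trait are not shown in this excerpt.

```rust
/// Toy trait standing in for the information a coercion or simplification
/// pass might need.
trait ToySimplifyInfo {
    fn column_type(&self, name: &str) -> Option<&str>;
    fn time_zone(&self) -> &str;
}

/// Toy session state (stand-in).
struct ToyState {
    time_zone: String,
}

/// Toy schema: (column name, type name) pairs.
struct ToyColumns {
    columns: Vec<(String, String)>,
}

/// Borrows the state and the schema for `'a` instead of owning clones.
struct ToyProvider<'a> {
    state: &'a ToyState,
    schema: &'a ToyColumns,
}

impl<'a> ToySimplifyInfo for ToyProvider<'a> {
    fn column_type(&self, name: &str) -> Option<&str> {
        self.schema
            .columns
            .iter()
            .find(|(column, _)| column == name)
            .map(|(_, type_name)| type_name.as_str())
    }

    fn time_zone(&self) -> &str {
        &self.state.time_zone
    }
}
```

Because the provider holds plain references, constructing it costs two pointer copies, and it refers to the caller's schema rather than a copy of it.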

let mut expr = simplifier.coerce(expr, df_schema)?;

// rewrite Exprs to functions if necessary
let config_options = self.config_options();
Contributor Author

Applying these rewrites is a new feature (and what actually closes #10181)

}

#[test]
fn test_get_field() {
Contributor Author

This test fails without function rewrites applied
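
To make the idea of a function rewrite concrete, here is a toy sketch. It assumes a syntax-level field access node that an executor cannot run directly and that must first become a call to a named function. `SyntaxExpr`, `rewrite_to_functions`, and `is_executable` are hypothetical, and the function name `get_field` is borrowed from the test name above purely for illustration.

```rust
/// Toy expression tree (stand-in, not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum SyntaxExpr {
    Column(String),
    Literal(String),
    /// Syntax-level field access such as `a["b"]`.
    FieldAccess(Box<SyntaxExpr>, String),
    /// Call to a named function that the toy executor knows how to run.
    Function(String, Vec<SyntaxExpr>),
}

/// Hypothetical rewrite rule: replace field-access syntax with a function call.
fn rewrite_to_functions(expr: SyntaxExpr) -> SyntaxExpr {
    match expr {
        SyntaxExpr::FieldAccess(base, field) => SyntaxExpr::Function(
            "get_field".to_string(),
            vec![rewrite_to_functions(*base), SyntaxExpr::Literal(field)],
        ),
        SyntaxExpr::Function(name, args) => SyntaxExpr::Function(
            name,
            args.into_iter().map(rewrite_to_functions).collect(),
        ),
        other => other,
    }
}

/// Toy planner check: field-access syntax must be rewritten before execution.
fn is_executable(expr: &SyntaxExpr) -> bool {
    match expr {
        SyntaxExpr::FieldAccess(..) => false,
        SyntaxExpr::Function(_, args) => args.iter().all(is_executable),
        SyntaxExpr::Column(_) | SyntaxExpr::Literal(_) => true,
    }
}
```

In this model an expression built by hand with `FieldAccess` is rejected until the rewrite has run, which mirrors why a test like the one above would fail without the rewrite step.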

@alamb alamb changed the title from "Add APIs to create PhysicalExpressions from Exprs: SessionContext::create_physical_expr() and SessionState::create_physical_expr()" to "Add SessionContext/SessionState::create_physical_expr() to create PhysicalExpressions from Exprs" May 1, 2024
Member

@westonpace westonpace left a comment

A few questions (and a typo) but I think this provides what we will need.

datafusion/core/src/execution/context/mod.rs (outdated, resolved)
/// // provide type information that `a` is an Int32
/// let schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]);
/// let df_schema = DFSchema::try_from(schema).unwrap();
/// // Create a PhysicalExpr. Note DataFusion automatically coerces (casts) `1i64` to `1i32`
Member

Nice, we have some (probably duplicated) type coercion logic on our side we might be able to replace.

/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type
/// coercion, and function rewrites.
///
/// Note: The expression is not [simplified]
Member

Why not?

Contributor Author

My rationale was that simplification is substantially more expensive due to the rewrites required and is not strictly necessary for evaluation, so I felt that not simplifying would be less surprising.

Also, if we are going to do simplification, it might also be reasonable to ask "why not other optimizations, like comparison cast unwrapping?" too.

I think it is important to apply coercion, as it is very hard to create an expression whose types are exactly aligned, and mismatches result in hard-to-fix errors. The same goes for function rewriting.

I'll add this context to the comments in this PR
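
The distinction drawn here can be seen in a toy model: folding constants changes the size of an expression but not its result, so it can be left to the caller. `ArithExpr`, `eval`, `simplify`, and `node_count` are hypothetical helpers written for this sketch. A real simplifier does far more than this single rule.

```rust
/// Toy arithmetic expression over a single integer input.
#[derive(Debug, Clone, PartialEq)]
enum ArithExpr {
    /// The input value.
    Column,
    Lit(i64),
    Add(Box<ArithExpr>, Box<ArithExpr>),
}

/// Evaluate the expression as written, with no optimization.
fn eval(expr: &ArithExpr, input: i64) -> i64 {
    match expr {
        ArithExpr::Column => input,
        ArithExpr::Lit(value) => *value,
        ArithExpr::Add(left, right) => eval(left, input) + eval(right, input),
    }
}

/// Toy simplifier with one rule: fold `literal + literal` into a literal.
fn simplify(expr: ArithExpr) -> ArithExpr {
    match expr {
        ArithExpr::Add(left, right) => match (simplify(*left), simplify(*right)) {
            (ArithExpr::Lit(a), ArithExpr::Lit(b)) => ArithExpr::Lit(a + b),
            (left, right) => ArithExpr::Add(Box::new(left), Box::new(right)),
        },
        other => other,
    }
}

/// Number of nodes in the tree, as a rough proxy for evaluation work.
fn node_count(expr: &ArithExpr) -> usize {
    match expr {
        ArithExpr::Add(left, right) => 1 + node_count(left) + node_count(right),
        _ => 1,
    }
}
```

Both the original and the simplified tree evaluate to the same value. Only the node count differs, whereas a missing coercion step would stop evaluation altogether.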

/// Note: The expression is not [simplified]
///
/// # See Also:
/// * [`SessionContext::create_physical_expr`] for a higher-level API
Member

This is somewhat unrelated to this PR, but it's news to me that "SessionContext" is higher level than "SessionState". The only comparison between the two that I can find is https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#sessioncontext-sessionstate-and-taskcontext which says:

A SessionContext can be created from a SessionConfig and stores the state for a particular query session. A single SessionContext can run multiple queries.

SessionState contains information available during query planning (creating LogicalPlans and ExecutionPlans).

From this it is not obvious that a "SessionContext" contains a "SessionState". It's also unclear when I should use "SessionContext" and when I should use "SessionState" in my user application. Currently, I've been going by the philosophy of "if an API method needs a SessionState I'd better create one and if an API method needs a SessionContext I'd better create one."

Some kind of long-form design documentation on these two types would be interesting (or, if something already exists that I've missed or that is in a blog post somewhere, that would be cool too).

Contributor Author

I agree this is confusing -- I will write up some documentation about this

Contributor Author

Made #10350 to hopefully clarify
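
Until that documentation lands, one way to picture the "higher-level" relationship discussed in this thread is a wrapper that owns the state and forwards calls to it. This is only a sketch under that assumption. `ToySessionState` and `ToySessionContext` are hypothetical toy types, and how the real context stores or shares its state is not shown here.

```rust
/// Toy lower-level type: holds what planning needs.
struct ToySessionState {
    session_id: String,
}

impl ToySessionState {
    /// Lower-level entry point (returns a description instead of a real plan).
    fn create_physical_expr(&self, expr: &str) -> String {
        format!("[{}] physical({})", self.session_id, expr)
    }
}

/// Toy higher-level type: owns a state and offers convenience methods.
struct ToySessionContext {
    state: ToySessionState,
}

impl ToySessionContext {
    fn new(session_id: &str) -> Self {
        Self {
            state: ToySessionState {
                session_id: session_id.to_string(),
            },
        }
    }

    /// Borrow the underlying state for callers that want the lower-level API.
    fn state(&self) -> &ToySessionState {
        &self.state
    }

    /// Convenience method that simply delegates to the state.
    fn create_physical_expr(&self, expr: &str) -> String {
        self.state.create_physical_expr(expr)
    }
}
```

Under this picture, code that already has a context can call the convenience method, while code that only receives a state can call the same operation directly.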

@andygrove
Member

@phillipleblanc @westonpace Do you have any further feedback on this PR? I think this is the final item we're waiting on before releasing 38.0.0

@phillipleblanc
Contributor

> @phillipleblanc @westonpace Do you have any further feedback on this PR? I think this is the final item we're waiting on before releasing 38.0.0

Looks good to me.

@alamb
Contributor Author

alamb commented May 7, 2024

Thanks for your reviews @phillipleblanc and @westonpace

@andygrove I can't merge this PR until it is approved by a committer, based on the rules in this repo, but from my perspective it is good to go.

@andygrove andygrove merged commit c8b8c74 into apache:main May 7, 2024
23 checks passed