-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SessionContext
/SessionState::create_physical_expr()
to create PhysicalExpressions
from Expr
s
#10330
Conversation
SessionContext::create_physical_expr()
and SessionState::create_physical_expr()
PhysicalExpressions
from Expr
s: SessionContext::create_physical_expr()
and SessionState::create_physical_expr()
86548e3
to
cfd4440
Compare
cfd4440
to
c326003
Compare
…e_physical_expr()`
c326003
to
e896763
Compare
@@ -92,7 +90,8 @@ fn evaluate_demo() -> Result<()> { | |||
let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8))); | |||
|
|||
// First, you make a "physical expression" from the logical `Expr` | |||
let physical_expr = physical_expr(&batch.schema(), expr)?; | |||
let df_schema = DFSchema::try_from(batch.schema())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shows how the new API is used - SessionContext::new().create_physical_expr
@@ -248,21 +251,6 @@ fn make_ts_field(name: &str) -> Field { | |||
make_field(name, DataType::Timestamp(TimeUnit::Nanosecond, tz)) | |||
} | |||
|
|||
/// Build a physical expression from a logical one, after applying simplification and type coercion | |||
pub fn physical_expr(schema: &Schema, expr: Expr) -> Result<Arc<dyn PhysicalExpr>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR basically ports this code out of the examples and into SessionState
and adds documentation and tests
@@ -806,6 +820,12 @@ impl From<&DFSchema> for Schema { | |||
} | |||
} | |||
|
|||
impl AsRef<Schema> for DFSchema { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows DFSchema to be treated like a &Schema, which is now possible after #9595
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, #9595 was epic -- kudos to @haohuaijin and @matthewmturner for making that happen
@@ -510,6 +515,34 @@ impl SessionContext { | |||
} | |||
} | |||
|
|||
/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New public API with example
@@ -2024,6 +2089,35 @@ impl SessionState { | |||
} | |||
} | |||
|
|||
struct SessionSimpifyProvider<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This avoids cloning schema / schema refs
let mut expr = simplifier.coerce(expr, df_schema)?; | ||
|
||
// rewrite Exprs to functions if necessary | ||
let config_options = self.config_options(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applying these rewrites is a new feature (and what actually closes #10181)
} | ||
|
||
#[test] | ||
fn test_get_field() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tests fail without function rewrites applied
PhysicalExpressions
from Expr
s: SessionContext::create_physical_expr()
and SessionState::create_physical_expr()
SessionContext
/SessionState::create_physical_expr()
to create PhysicalExpressions
from Expr
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions (and a typo) but I think this provides what we will need.
/// // provide type information that `a` is an Int32 | ||
/// let schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]); | ||
/// let df_schema = DFSchema::try_from(schema).unwrap(); | ||
/// // Create a PhysicalExpr. Note DataFusion automatically coerces (casts) `1i64` to `1i32` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, we have some (probably duplicated) type coercion logic on our side we might be able to replace.
/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type | ||
/// coercion, and function rewrites. | ||
/// | ||
/// Note: The expression is not [simplified] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My rationale was that simplification is substantially more expensive due to the rewrites required and is not strictly necessary for evaluation, so I felt not simplifying would be less surprising
Also, if we are going to do simplification, it might also be reasonable to ask "why not other optimization like comparison cast unwrapping" too
I think it is important to apply coercion as is very hard to create the right expression to get the type exactly aligned resulting in hard to fix errors. Same thing for this function rewriting.
I'll add this context to the comments in this PR
/// Note: The expression is not [simplified] | ||
/// | ||
/// # See Also: | ||
/// * [`SessionContext::create_physical_expr`] for a higher-level API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is somewaht unrelated to this PR but It's news to me that "SessionContext" is higher level than "SessionState". The only comparison between the two that I can find is https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#sessioncontext-sessionstate-and-taskcontext which says:
A SessionContext can be created from a SessionConfig and stores the state for a particular query session. A single SessionContext can run multiple queries.
SessionState contains information available during query planning (creating LogicalPlans and ExecutionPlans).
From this it is not obvious that a "SessionContext" contains a "SessionState". It's also very confusing when I would use "SessionContext" and when I would use "SessionState" in my user application. Currently, I've been going off the philosophy of "if an API method needs a SessionState I'd better create one and if an API method needs a SessionContext I'd better create one."
Some kind of long form design documentation on these two types would be interesting (or if something already exists that I've missed or is in a blog post somewhere that would be cool too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is confusing -- I will write up some documentation about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made #10350 to hopefully clarify
@@ -806,6 +820,12 @@ impl From<&DFSchema> for Schema { | |||
} | |||
} | |||
|
|||
impl AsRef<Schema> for DFSchema { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤯
@phillipleblanc @westonpace Do you have any further feedback on this PR? I think this is the final item we're waiting on before releasing 38.0.0 |
Looks good to me. |
Thanks for your reviews @phillipleblanc and @westonpace @andygrove I can't merge this PR until it is approved by a committer, based on the rules in this repo, but from my perspective it is good to go. |
Draft as it builds on #10331Which issue does this PR close?
Closes #10181
Rationale for this change
Rationale:
There important use cases for creating
Expr
s and executing them as aPhysicalExpr
outside the context of a query (for example, to apply deletepredicates in delta.rs)
At the moment this is possible but is often tricky as users have to manually
invoke type coercion and simplification rules to get the
Expr
into a form thatcan be executed (see example here)
After #8045, to use certain expressions that translate to functions it is also important to
to apply function rewrite rules. See #10181 for more details
Thus having a proper, tested, documented public API that does this I think is
valuable and will prevent future regressions as we continue refactoring.
What changes are included in this PR?
Changes
SessionContext::create_physical_expr()
andSessionState::create_physical_expr()
FunctionRewrites
Are these changes tested?
Yes, new tests
Are there any user-facing changes?
Yes: New API