fix: coalesce schema issues #12308
Thanks @mesejo, I think overall looks good to me
Thanks @mesejo and @jayzhan211
cc @findepi who I believe is also working in this area / thinking about functions
Self {
    signature: Signature::one_of(
It seems to me that this PR moves the signature away from a data-driven description (describing "what" is needed and letting some other code compute whether the given arguments match that signature) and towards a more functional approach, where each function has to implement its own custom coercion, likely resulting in significant duplication.
What do you think (perhaps as a follow on PR) of adding DataType::Null support to the Signature calculations somehow rather than inlining / duplicating the coercion logic?
Maybe something like Signature::allow_null(..) that would support automatically coercing arguments from null?
Or maybe we should always support coercing Null to any type
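To make the data-driven idea concrete, here is a minimal, self-contained sketch. It does not use DataFusion's real types: `DataType`, `Signature`, `exact`, `allow_null` and `coerce_args` below are all hypothetical stand-ins invented for illustration. The point is only that one shared matcher can handle Null for every function that opts in, so no function needs its own coercion code.

```rust
/// Toy stand-in for Arrow's `DataType` (assumption: only the variants needed here).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Null,
    Int64,
    Utf8,
}

/// Toy data-driven signature: the declared type of each argument position,
/// plus a flag saying whether a `Null` argument may be coerced to that type.
#[derive(Debug, Clone)]
pub struct Signature {
    pub arg_types: Vec<DataType>,
    pub allow_null: bool,
}

impl Signature {
    /// Arguments must match the declared types exactly.
    pub fn exact(arg_types: Vec<DataType>) -> Self {
        Self { arg_types, allow_null: false }
    }

    /// Hypothetical builder echoing the `Signature::allow_null(..)` idea.
    pub fn allow_null(mut self) -> Self {
        self.allow_null = true;
        self
    }
}

/// Generic matcher shared by every function: returns the types the arguments
/// should be cast to, or an error if the arguments do not fit the signature.
pub fn coerce_args(sig: &Signature, given: &[DataType]) -> Result<Vec<DataType>, String> {
    if given.len() != sig.arg_types.len() {
        return Err(format!(
            "expected {} arguments, got {}",
            sig.arg_types.len(),
            given.len()
        ));
    }
    sig.arg_types
        .iter()
        .zip(given)
        .map(|(want, got)| {
            if want == got || (sig.allow_null && *got == DataType::Null) {
                // Either an exact match or a Null that the signature lets us coerce.
                Ok(want.clone())
            } else {
                Err(format!("cannot coerce {:?} to {:?}", got, want))
            }
        })
        .collect()
}
```

With this shape, a string function would declare `Signature::exact(vec![DataType::Utf8, DataType::Utf8]).allow_null()` and a call with `(Null, Utf8)` would be coerced to `(Utf8, Utf8)` by the shared matcher.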
How about an alternative signature like Signature::String, similar to Signature::numeric, that also covers converting null to string?
I am not sure -- I was just reacting that this "handle null" pattern seems common and it seems like this approach will require custom coerce logic for all functions 🤔
Null to T coercion needs to be handled elsewhere anyway (e.g. when computing the type of a UNION, etc.).
We can free functions from having to bother about coercions at all and let the engine calculate coercions when building the logical plan.
This is actually super fundamental for DataFusion's vision as a composable query engine. Coercion rules are very implementation-specific. If we had functions spiced up with coercions inside them, that would make those functions non-reusable.
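As an illustration of the UNION case mentioned above, here is a small sketch of engine-level coercion that knows nothing about functions. It is a toy model, not DataFusion code: the `DataType` enum and the `common_type` and `union_column_type` helpers are hypothetical names, and the widening rule (Int32 to Int64) is an assumption chosen only to show one non-Null case.

```rust
/// Toy stand-in for Arrow's `DataType` (assumption: only a few variants).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Null,
    Int32,
    Int64,
    Utf8,
}

/// Common supertype of two types, if any. Null coerces to whatever the
/// other side is; this is the "Null to T" rule living in the engine.
pub fn common_type(a: &DataType, b: &DataType) -> Option<DataType> {
    use DataType::*;
    match (a, b) {
        (x, y) if x == y => Some(x.clone()),
        (Null, t) | (t, Null) => Some(t.clone()),
        // Assumed widening rule for the sketch.
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        _ => None,
    }
}

/// Output type of one UNION column, given that column's type in each branch.
/// Returns `None` for an empty input or when the branches have no common type.
pub fn union_column_type(branches: &[DataType]) -> Option<DataType> {
    let mut iter = branches.iter();
    let first = iter.next()?.clone();
    iter.try_fold(first, |acc, t| common_type(&acc, t))
}
```

Because the rule sits in one place, swapping in a different engine's coercion policy would mean replacing `common_type` only, while function implementations stay untouched.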
100%
It seems to me like Signature is supposed to communicate which types the function has a native implementation for, with the coercion applied whenever what the user provided doesn't match one of the supported types
@findepi Are you suggesting something like a general coercion that is not function-specific? But what if we want different coercion rules for different functions? Then we might need to do coercion function by function
Why would we want different coercion rules for different functions?
My idea is that it is more flexible for the user, although, without a real use case, it might be a premature optimization 🤔.
Sorry to chip in late; this PR addresses other issues, such as #12307. I wonder if I could split it and leave the changes regarding the coercion of functions in this one (to keep the discussion in one place) and the others in a new PR. Would that be ok?
func.name(),
func.signature().clone(),
&arg_data_types,
let new_data_types = data_types_with_scalar_udf(&arg_data_types, func)
I think this is the root cause of the issue, and solving it requires other changes as well. Therefore, I think we should go with this change and maybe further optimize the coercion in another PR.
Ok, so should I leave it as it is? Or change it back to how it was:
data_types_with_scalar_udf(&arg_data_types, func)
I defer to @jayzhan211 -- if he is good to merge this PR, let's get the conflicts resolved and merge it in.
If there is additional work we know is needed / could be cleaned up, let's try and file them as tickets
Conflicts solved! 😄
Thanks @mesejo and @alamb @findepi @andygrove for the review
@@ -1907,7 +1907,7 @@ select position('' in '')
1

query error DataFusion error: Execution error: The STRPOS/INSTR/POSITION function can only accept strings, but got Int64.
I agree in principle, but actually why didn't this work?
Looking at the code here, every function that accepts a string should be invocable with a number or pretty much anything
datafusion/datafusion/expr/src/type_coercion/functions.rs
Lines 681 to 682 in 8db30e2
// Any type can be coerced into strings
(Utf8 | LargeUtf8, _) => Some(type_into.clone()),
I am not convinced this is how it should be, but that's not the point here. My question is: why didn't this work?
coerce_types doesn't go to coerce_from. That is the reason. I hope to deprecate coerce_from some day but not yet 😢
Thanks, but I'm not sure I understand. Is this explaining why the query fails before or after this PR?
The query was already failing before this PR. The reason is in the function that computes the return type:
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
utf8_to_int_type(&arg_types[0], "strpos/instr/position")
}
The source of the error is in these lines:
datafusion/datafusion/functions/src/utils.rs
Lines 57 to 67 in 5740774
data_type => {
    return datafusion_common::exec_err!(
        "The {} function can only accept strings, but got {:?}.",
        name.to_uppercase(),
        data_type
    );
}
})
}
};
}
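For readers without the repository at hand, the following is a simplified, self-contained model of how a return-type helper like this produces the error quoted in the test. It is not the real `utf8_to_int_type`: the `DataType` enum is a toy stand-in, the `Utf8 -> Int32` and `LargeUtf8 -> Int64` mappings are assumptions made for the sketch, and a plain `String` replaces DataFusion's error type. Only the message format is taken from the snippet above.

```rust
/// Toy stand-in for Arrow's `DataType` (assumption: only a few variants).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Int64,
    Utf8,
    LargeUtf8,
}

/// Simplified model of a string-to-integer return-type helper.
/// The two success mappings are assumptions; the error message format
/// mirrors the one shown in the quoted snippet.
pub fn utf8_to_int_type(arg_type: &DataType, name: &str) -> Result<DataType, String> {
    match arg_type {
        DataType::Utf8 => Ok(DataType::Int32),
        DataType::LargeUtf8 => Ok(DataType::Int64),
        // Any non-string argument type reaches this arm, which is what
        // happens when no coercion to a string type was applied earlier.
        other => Err(format!(
            "The {} function can only accept strings, but got {:?}.",
            name.to_uppercase(),
            other
        )),
    }
}
```

In this model, calling the helper with an `Int64` argument type and the name "strpos/instr/position" yields the same message text as the `query error` line in the test file, which is consistent with the explanation that the failure comes from computing the return type on uncoerced arguments.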
This PR appears to have introduced another regression found by @progval
Which issue does this PR close?
Closes #12307.
Are these changes tested?
Yes
Are there any user-facing changes?
No