-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Signature::String and return error if input of strpos
is integer
#12751
Changes from all commits
b60ddea
9ead621
86e005f
b1aa23b
3b5695a
f38ef3b
b002c39
b95c66a
10f5638
984b5cc
045a202
1b37176
f10d43d
6b0f2d8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -125,6 +125,11 @@ pub enum TypeSignature { | |
/// Fixed number of arguments of numeric types. | ||
/// See <https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#method.is_numeric> to know which type is considered numeric | ||
Numeric(usize), | ||
/// Fixed number of arguments of all the same string types. | ||
/// The precedence of type from high to low is Utf8View, LargeUtf8 and Utf8. | ||
/// Null is considerd as Utf8 by default | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Null is not Utf8, but may be coercible to that. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I agree I agree the function signature should define what the function accepts There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we found a function that reject null to string, we could define it with |
||
/// Dictionary with string value type is also handled. | ||
String(usize), | ||
} | ||
|
||
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)] | ||
|
@@ -190,8 +195,11 @@ impl TypeSignature { | |
.collect::<Vec<String>>() | ||
.join(", ")] | ||
} | ||
TypeSignature::String(num) => { | ||
vec![format!("String({num})")] | ||
} | ||
TypeSignature::Numeric(num) => { | ||
vec![format!("Numeric({})", num)] | ||
vec![format!("Numeric({num})")] | ||
} | ||
TypeSignature::Exact(types) | TypeSignature::Coercible(types) => { | ||
vec![Self::join_types(types, ", ")] | ||
|
@@ -280,6 +288,14 @@ impl Signature { | |
} | ||
} | ||
|
||
/// A specified number of numeric arguments | ||
pub fn string(arg_count: usize, volatility: Volatility) -> Self { | ||
Self { | ||
type_signature: TypeSignature::String(arg_count), | ||
volatility, | ||
} | ||
} | ||
|
||
/// An arbitrary number of arguments of any type. | ||
pub fn variadic_any(volatility: Volatility) -> Self { | ||
Self { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,8 +26,9 @@ use datafusion_common::{ | |
utils::{coerced_fixed_size_list_to_list, list_ndims}, | ||
Result, | ||
}; | ||
use datafusion_expr_common::signature::{ | ||
ArrayFunctionSignature, FIXED_SIZE_LIST_WILDCARD, TIMEZONE_WILDCARD, | ||
use datafusion_expr_common::{ | ||
signature::{ArrayFunctionSignature, FIXED_SIZE_LIST_WILDCARD, TIMEZONE_WILDCARD}, | ||
type_coercion::binary::string_coercion, | ||
}; | ||
use std::sync::Arc; | ||
|
||
|
@@ -176,6 +177,7 @@ fn is_well_supported_signature(type_signature: &TypeSignature) -> bool { | |
type_signature, | ||
TypeSignature::UserDefined | ||
| TypeSignature::Numeric(_) | ||
| TypeSignature::String(_) | ||
| TypeSignature::Coercible(_) | ||
| TypeSignature::Any(_) | ||
) | ||
|
@@ -381,6 +383,67 @@ fn get_valid_types( | |
.iter() | ||
.map(|valid_type| current_types.iter().map(|_| valid_type.clone()).collect()) | ||
.collect(), | ||
TypeSignature::String(number) => { | ||
if *number < 1 { | ||
return plan_err!( | ||
"The signature expected at least one argument but received {}", | ||
current_types.len() | ||
); | ||
} | ||
if *number != current_types.len() { | ||
return plan_err!( | ||
"The signature expected {} arguments but received {}", | ||
number, | ||
current_types.len() | ||
); | ||
} | ||
|
||
fn coercion_rule( | ||
lhs_type: &DataType, | ||
rhs_type: &DataType, | ||
) -> Result<DataType> { | ||
match (lhs_type, rhs_type) { | ||
(DataType::Null, DataType::Null) => Ok(DataType::Utf8), | ||
(DataType::Null, data_type) | (data_type, DataType::Null) => { | ||
coercion_rule(data_type, &DataType::Utf8) | ||
} | ||
(DataType::Dictionary(_, lhs), DataType::Dictionary(_, rhs)) => { | ||
coercion_rule(lhs, rhs) | ||
} | ||
(DataType::Dictionary(_, v), other) | ||
| (other, DataType::Dictionary(_, v)) => coercion_rule(v, other), | ||
_ => { | ||
if let Some(coerced_type) = string_coercion(lhs_type, rhs_type) { | ||
Ok(coerced_type) | ||
} else { | ||
plan_err!( | ||
"{} and {} are not coercible to a common string type", | ||
Comment on lines
+415
to
+420
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe coercion rules should not be defined separately for each function. In PostgreSQL the docs unfortunately don't say what casts are implicitly added, one has to query the system catalog for that SELECT * FROM pg_cast
WHERE castcontext = 'i'; -- 'i' indicates implicit casts the point is -- the implicit cast / implicit coercion rules are defined on type pairs, and don't called function name as a context. Thus if type A is coercible to B for the sake of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll consider how to implement a global, function-agnostic coercion rule, while still allowing function-specific coercion for customization |
||
lhs_type, | ||
rhs_type | ||
) | ||
} | ||
} | ||
} | ||
} | ||
|
||
// Length checked above, safe to unwrap | ||
let mut coerced_type = current_types.first().unwrap().to_owned(); | ||
for t in current_types.iter().skip(1) { | ||
coerced_type = coercion_rule(&coerced_type, t)?; | ||
Comment on lines
+431
to
+432
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is important decision place. Why would we coerce all arguments to one common super type? Why should a function abstain from accepting a utf8view and utf8 argument pair? fwiw this is exactly what If we want to do this for some reason, I would prefer the functions to be converted one by one, so that we can eliminate dead code. For example if cc @alamb There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think this is a simplification for the common case of functions that don't handle all possible String types. Many of the built in function implementations in DataFusion take multiple string arguments but typically only handle one common type (not all possible combinations of types. There is likely a tradeoff in runtime performance with specialized functions for all possible input parameter type combinations and the compiletime / binary size. I think it is reasomable (and common) to coerce to a common super type There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That is why I think we need function-specific coercion, coercion rule may be different for each function and each downstream project. It is better not to assume the coercion rule for the user, or at least they should be able to define their own coercion rule function-wise if the default one doesn't fit their need. Signature::String is a general signature that fit most of the string related function, so we should define the most common way, so I choose the rule that has only one single type for simplicity. |
||
} | ||
|
||
fn base_type_or_default_type(data_type: &DataType) -> DataType { | ||
if data_type.is_null() { | ||
DataType::Utf8 | ||
} else if let DataType::Dictionary(_, v) = data_type { | ||
base_type_or_default_type(v) | ||
} else { | ||
data_type.to_owned() | ||
} | ||
} | ||
|
||
vec![vec![base_type_or_default_type(&coerced_type); *number]] | ||
} | ||
TypeSignature::Numeric(number) => { | ||
if *number < 1 { | ||
return plan_err!( | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,10 +19,9 @@ | |
|
||
use arrow::datatypes::DataType; | ||
use datafusion_common::{exec_err, DataFusionError, Result}; | ||
use datafusion_expr::ColumnarValue; | ||
use datafusion_expr::{ColumnarValue, TypeSignature}; | ||
|
||
use arrow::array::{ArrayRef, BooleanArray, Float32Array, Float64Array}; | ||
use datafusion_expr::TypeSignature::*; | ||
use datafusion_expr::{ScalarUDFImpl, Signature, Volatility}; | ||
use std::any::Any; | ||
use std::sync::Arc; | ||
|
@@ -43,7 +42,10 @@ impl IsNanFunc { | |
use DataType::*; | ||
Self { | ||
signature: Signature::one_of( | ||
vec![Exact(vec![Float32]), Exact(vec![Float64])], | ||
vec![ | ||
TypeSignature::Exact(vec![Float32]), | ||
TypeSignature::Exact(vec![Float64]), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for improving the code style! i see there is a bunch of codestyle improvement like this. Would it be possible to package them separately so that mechanical changes can be reviewed with less scrutiny and more focus put on important changes? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's clear that the change is necessary due to the name conflict with Signature::String, so it would be better to group them together in this case |
||
], | ||
Volatility::Immutable, | ||
), | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,10 +25,9 @@ use datafusion_common::{ | |
}; | ||
use datafusion_expr::expr::ScalarFunction; | ||
use datafusion_expr::simplify::{ExprSimplifyResult, SimplifyInfo}; | ||
use datafusion_expr::{ColumnarValue, Expr, ScalarUDF}; | ||
use datafusion_expr::{ColumnarValue, Expr, ScalarUDF, TypeSignature}; | ||
|
||
use arrow::array::{ArrayRef, Float64Array, Int64Array}; | ||
use datafusion_expr::TypeSignature::*; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I prefer not to import wildcard in non-test code area, in this PR it conflicts with rust |
||
use datafusion_expr::{ScalarUDFImpl, Signature, Volatility}; | ||
use std::any::Any; | ||
use std::sync::Arc; | ||
|
@@ -52,7 +51,10 @@ impl PowerFunc { | |
use DataType::*; | ||
Self { | ||
signature: Signature::one_of( | ||
vec![Exact(vec![Int64, Int64]), Exact(vec![Float64, Float64])], | ||
vec![ | ||
TypeSignature::Exact(vec![Int64, Int64]), | ||
TypeSignature::Exact(vec![Float64, Float64]), | ||
], | ||
Volatility::Immutable, | ||
), | ||
aliases: vec![String::from("pow")], | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -39,13 +39,8 @@ impl Default for BitLengthFunc { | |
|
||
impl BitLengthFunc { | ||
pub fn new() -> Self { | ||
use DataType::*; | ||
Self { | ||
signature: Signature::uniform( | ||
1, | ||
vec![Utf8, LargeUtf8], | ||
Volatility::Immutable, | ||
), | ||
signature: Signature::string(1, Volatility::Immutable), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The function declared it accepts Does this function work for string views? |
||
} | ||
} | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,9 +22,9 @@ use arrow::datatypes::DataType; | |
use datafusion_common::{exec_err, Result}; | ||
use datafusion_expr::function::Hint; | ||
use datafusion_expr::scalar_doc_sections::DOC_SECTION_STRING; | ||
use datafusion_expr::TypeSignature::*; | ||
use datafusion_expr::{ColumnarValue, Documentation, Volatility}; | ||
use datafusion_expr::{ScalarUDFImpl, Signature}; | ||
use datafusion_expr::{ | ||
ColumnarValue, Documentation, ScalarUDFImpl, Signature, TypeSignature, Volatility, | ||
}; | ||
use std::any::Any; | ||
use std::sync::OnceLock; | ||
|
||
|
@@ -49,18 +49,9 @@ impl Default for BTrimFunc { | |
|
||
impl BTrimFunc { | ||
pub fn new() -> Self { | ||
use DataType::*; | ||
Self { | ||
signature: Signature::one_of( | ||
vec![ | ||
// Planner attempts coercion to the target type starting with the most preferred candidate. | ||
// For example, given input `(Utf8View, Utf8)`, it first tries coercing to `(Utf8View, Utf8View)`. | ||
// If that fails, it proceeds to `(Utf8, Utf8)`. | ||
Exact(vec![Utf8View, Utf8View]), | ||
Exact(vec![Utf8, Utf8]), | ||
Exact(vec![Utf8View]), | ||
Exact(vec![Utf8]), | ||
], | ||
vec![TypeSignature::String(2), TypeSignature::String(1)], | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The function did not declare support for LargeUtf8, so the code below checking arg 0 data type to be LargeUtf8 was probably dead code. Now it no longer is. Should this new capability be test covered? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It would be an excellent use of @goldmedal 's infrastructure for testing in https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files/string#readme |
||
Volatility::Immutable, | ||
), | ||
aliases: vec![String::from("trim")], | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
octet_length
does not support list type, so change toid
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
octet_length takes a string, but everything is supposed to be coercible to string (for some reason). the original example should probably still work, but it is a good idea to add a test with octet_length directly on a string column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you say "everything is supposed to be coercable to a string" from where do you mean? Is that a SQL "standard" behavior? I don't think it is necessairly the behavior of DataFusion (or postgres)
For example, postgres doesn't automatically coerce integers to strings in all cases. Here is one:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i mean this
datafusion/datafusion/expr/src/type_coercion/functions.rs
Lines 649 to 650 in 8aafa54
no, i don't think so.
i am NOT proposing this behavior, or saying that such behavior is warranted.
I am just relying what the code says that DF does.