-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Epic] A new Scalar Function interface #7977
Comments
Out of curiosity to read this ticket. For such as proposal to change important APIs, it will be better to make this issue more trackable and understandable for all who are interested. Could you add what existing APIs lack of and why they cannot meet the requirement into the description? Just only a sentence saying "they are not capable to do this" is not enough and too rough. |
Sure, I updated the description with the summary of previous discussions |
Here is an example of what a NEW api might look like Existing API(BTW this took me non trivial time to get the types to line up and compile correctly) //! Implementation of `to_timestamp` function that
//! overrides the built in version in DataFusion because the semantics changed
//! upstream: <https://github.com/apache/arrow-datafusion/pull/7844>
/// Implementation of to_timestamp
pub(crate) static TO_TIMESTAMP_UDF: Lazy<Arc<ScalarUDF>> = Lazy::new(|| {
Arc::new(ScalarUDF::new(
"to_timestamp",
&Signature::uniform(
1,
vec![
DataType::Int64,
DataType::Timestamp(TimeUnit::Nanosecond, None),
DataType::Timestamp(TimeUnit::Microsecond, None),
DataType::Timestamp(TimeUnit::Millisecond, None),
DataType::Timestamp(TimeUnit::Second, None),
DataType::Utf8,
],
Volatility::Immutable,
),
&TO_TIMESTAMP_RETURN_TYPE,
&TO_TIMESTAMP_IMPL,
))
});
static TO_TIMESTAMP_RETURN_TYPE: Lazy<ReturnTypeFunction> = Lazy::new(|| {
let func =
|_arg_types: &[DataType]| Ok(Arc::new(DataType::Timestamp(TimeUnit::Nanosecond, None)));
Arc::new(func)
});
static TO_TIMESTAMP_IMPL: Lazy<ScalarFunctionImplementation> = Lazy::new(|| {
let func = |args: &[ColumnarValue]| {
if args.len() != 1 {
return internal_err!("to_timestamp expected 1 argument, got {}", args.len());
}
match args[0].data_type() {
// call through to arrow cast kernel
DataType::Int64 | DataType::Timestamp(_, _) => cast_column(
&args[0],
&DataType::Timestamp(TimeUnit::Nanosecond, None),
None,
),
DataType::Utf8 => datetime_expressions::to_timestamp_nanos(args),
dt => internal_err!("to_timestamp does not support argument type '{dt}'"),
}
};
Arc::new(func)
}); Here is what such a function could look like as a trait, which I think is much more approachable to new users (in addition to being much more easily separated out, as @2010YOUY01 mentions above) /// Implementation of to_timestamp
struct ToTimestamp {};
impl ScalarFunction for ToTimestamp {
fn name(&self) -> &str { "to_timestamp" }
fn signature(&self) -> Signature {
Signature::uniform(
1,
vec![
DataType::Int64,
DataType::Timestamp(TimeUnit::Nanosecond, None),
DataType::Timestamp(TimeUnit::Microsecond, None),
DataType::Timestamp(TimeUnit::Millisecond, None),
DataType::Timestamp(TimeUnit::Second, None),
DataType::Utf8,
]
}
fn volatility(&self) -> Volatility {
Volatility::Immutable,
}
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
Ok(DataType::Timestamp(TimeUnit::Nanosecond, None))
}
fn evaluate(&self, args: &[ColumnarValue]) -> Result<Vec<ColumnarValue>>) {
if args.len() != 1 {
return internal_err!("to_timestamp expected 1 argument, got {}", args.len());
}
match args[0].data_type() {
// call through to arrow cast kernel
DataType::Int64 | DataType::Timestamp(_, _) => cast_column(
&args[0],
&DataType::Timestamp(TimeUnit::Nanosecond, None),
None,
),
DataType::Utf8 => datetime_expressions::to_timestamp_nanos(args),
dt => internal_err!("to_timestamp does not support argument type '{dt}'"),
}
}
} |
Here is a slightly different proposal (basically extend ScalarUDF to support all features of |
Here is a slightly different proposal: #8045 |
@2010YOUY01 what do you think about closing this ticket / unifying on #8045 ? |
Is your feature request related to a problem or challenge?
As previously discussed in #7110 #7752 : We want to define Scalar Functions outside the core to reduce Datafusion Core binary size, and also make UDF management easy.
But the current
BuiltinScalarFunction
andScalarUDF
interfaces are not capable to do this:Problems with
BuiltinScalarFunction
Enum BuiltinScalarFunction
, and function implementations likereturn_type()
are large methods that match every enum variant. Adding a new function requires modifications in multiple places (not easy to add functions), also we can't define scalar functions outside the core.Problems with
ScalarUDF
ScalarUDF
s are represented with a struct, it can not cover all the functionalities of existing built-in functionsScalarUDF
requires constructing a struct in an imperative way (see examples/simple_udf.rs), this way is not ergonomic especially when you have to manage large number of functions in a separate crate.So we would like to introduce a new interface
trait ScalarFunctionDef
: it can define a function declaratively and in one place (easier thanEnum BuiltinScalarFunction
's adding arms for pattern matching in multiple places, andstruct ScalarUDF
's imperative way of defining a new function.After introducing the new interface we can gradually migrating existing built-in functions to the new one, the old UDF interface
create_udf()
can keep unchanged.Describe the solution you'd like
Objective
trait ScalarFunctionDef
to solve the above-mentioned issueEnum BuiltinScalarFunction
as the underlying execution mechanism in the core code, and scalar UDFs are usingStruct ScalarUDF
, both will be moving toScalarFunctionDef
as the unified internal representation.Implementation Plan
ScalarFunctionDef
and replace the old built-in functions' execution code1.1 Replace SQL execution code with
ScalarFunctionDef
1.2 Change other relevant execution components, like the Logical Expression Constructor for Scalar Functions, to be compatible with
ScalarFunctionDef
.Struct ScalarUDF
->ScalarFunctionDef
...and migrating existing functions
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: