Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] A new Scalar Function interface #7977

Closed
2010YOUY01 opened this issue Oct 30, 2023 · 6 comments
Closed

[Epic] A new Scalar Function interface #7977

2010YOUY01 opened this issue Oct 30, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@2010YOUY01
Copy link
Contributor

2010YOUY01 commented Oct 30, 2023

Is your feature request related to a problem or challenge?

As previously discussed in #7110 #7752 : We want to define Scalar Functions outside the core to reduce Datafusion Core binary size, and also make UDF management easy.
But the current BuiltinScalarFunction and ScalarUDF interfaces are not capable to do this:

Problems with BuiltinScalarFunction

  • built-in functions are implemented with Enum BuiltinScalarFunction, and function implementations like return_type() are large methods that match every enum variant. Adding a new function requires modifications in multiple places (not easy to add functions), also we can't define scalar functions outside the core.

Problems with ScalarUDF

  • Now ScalarUDFs are represented with a struct, it can not cover all the functionalities of existing built-in functions
  • Defining a new ScalarUDF requires constructing a struct in an imperative way (see examples/simple_udf.rs), this way is not ergonomic especially when you have to manage large number of functions in a separate crate.

So we would like to introduce a new interface trait ScalarFunctionDef: it can define a function declaratively and in one place (easier than Enum BuiltinScalarFunction's adding arms for pattern matching in multiple places, and struct ScalarUDF's imperative way of defining a new function.
After introducing the new interface we can gradually migrating existing built-in functions to the new one, the old UDF interface create_udf() can keep unchanged.

Describe the solution you'd like

Objective

  1. Define a new interface trait ScalarFunctionDef to solve the above-mentioned issue
  2. Now built-in functions are using Enum BuiltinScalarFunction as the underlying execution mechanism in the core code, and scalar UDFs are using Struct ScalarUDF, both will be moving to ScalarFunctionDef as the unified internal representation.

Implementation Plan

  1. Introducing new interface ScalarFunctionDef and replace the old built-in functions' execution code
    1.1 Replace SQL execution code with ScalarFunctionDef
    1.2 Change other relevant execution components, like the Logical Expression Constructor for Scalar Functions, to be compatible with ScalarFunctionDef.
  2. Migrate old UDF Implementations: Struct ScalarUDF -> ScalarFunctionDef
  3. Function package management interface
    ...and migrating existing functions

Describe alternatives you've considered

No response

Additional context

No response

@2010YOUY01 2010YOUY01 added the enhancement New feature or request label Oct 30, 2023
@viirya
Copy link
Member

viirya commented Oct 30, 2023

But the current BuiltinScalarFunction and ScalarUDF interface are not capable to do this, so we would like to introduce a new interface trait ScalarFunctionDef and gradually migrate existing functions into the new one.

Out of curiosity to read this ticket. For such as proposal to change important APIs, it will be better to make this issue more trackable and understandable for all who are interested. Could you add what existing APIs lack of and why they cannot meet the requirement into the description? Just only a sentence saying "they are not capable to do this" is not enough and too rough.

@2010YOUY01
Copy link
Contributor Author

But the current BuiltinScalarFunction and ScalarUDF interface are not capable to do this, so we would like to introduce a new interface trait ScalarFunctionDef and gradually migrate existing functions into the new one.

Out of curiosity to read this ticket. For such as proposal to change important APIs, it will be better to make this issue more trackable and understandable for all who are interested. Could you add what existing APIs lack of and why they cannot meet the requirement into the description? Just only a sentence saying "they are not capable to do this" is not enough and too rough.

Sure, I updated the description with the summary of previous discussions

@alamb alamb changed the title Moving to the new Scalar Function interface Epic: A new Scalar Function interface Oct 31, 2023
@alamb alamb changed the title Epic: A new Scalar Function interface [Epic] A new Scalar Function interface Oct 31, 2023
@alamb
Copy link
Contributor

alamb commented Oct 31, 2023

Here is an example of what a NEW api might look like

Existing API

(BTW this took me non trivial time to get the types to line up and compile correctly)

//! Implementation of `to_timestamp` function that
//! overrides the built in version in DataFusion because the semantics changed
//! upstream: <https://github.com/apache/arrow-datafusion/pull/7844>


/// Implementation of to_timestamp
pub(crate) static TO_TIMESTAMP_UDF: Lazy<Arc<ScalarUDF>> = Lazy::new(|| {
    Arc::new(ScalarUDF::new(
        "to_timestamp",
        &Signature::uniform(
            1,
            vec![
                DataType::Int64,
                DataType::Timestamp(TimeUnit::Nanosecond, None),
                DataType::Timestamp(TimeUnit::Microsecond, None),
                DataType::Timestamp(TimeUnit::Millisecond, None),
                DataType::Timestamp(TimeUnit::Second, None),
                DataType::Utf8,
            ],
            Volatility::Immutable,
        ),
        &TO_TIMESTAMP_RETURN_TYPE,
        &TO_TIMESTAMP_IMPL,
    ))
});

static TO_TIMESTAMP_RETURN_TYPE: Lazy<ReturnTypeFunction> = Lazy::new(|| {
    let func =
        |_arg_types: &[DataType]| Ok(Arc::new(DataType::Timestamp(TimeUnit::Nanosecond, None)));

    Arc::new(func)
});

static TO_TIMESTAMP_IMPL: Lazy<ScalarFunctionImplementation> = Lazy::new(|| {
    let func = |args: &[ColumnarValue]| {
        if args.len() != 1 {
            return internal_err!("to_timestamp expected 1 argument, got {}", args.len());
        }

        match args[0].data_type() {
            // call through to arrow cast kernel
            DataType::Int64 | DataType::Timestamp(_, _) => cast_column(
                &args[0],
                &DataType::Timestamp(TimeUnit::Nanosecond, None),
                None,
            ),
            DataType::Utf8 => datetime_expressions::to_timestamp_nanos(args),
            dt => internal_err!("to_timestamp does not support argument type '{dt}'"),
        }
    };

    Arc::new(func)
});

Here is what such a function could look like as a trait, which I think is much more approachable to new users (in addition to being much more easily separated out, as @2010YOUY01 mentions above)

/// Implementation of to_timestamp
struct ToTimestamp {};
impl ScalarFunction for ToTimestamp {
  fn name(&self) -> &str { "to_timestamp" }

  fn signature(&self) -> Signature {
        Signature::uniform(
            1,
            vec![
                DataType::Int64,
                DataType::Timestamp(TimeUnit::Nanosecond, None),
                DataType::Timestamp(TimeUnit::Microsecond, None),
                DataType::Timestamp(TimeUnit::Millisecond, None),
                DataType::Timestamp(TimeUnit::Second, None),
                DataType::Utf8,
            ]
   }

   fn volatility(&self) -> Volatility {
     Volatility::Immutable,
    }

   fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    Ok(DataType::Timestamp(TimeUnit::Nanosecond, None))
  }

  fn evaluate(&self, args: &[ColumnarValue]) -> Result<Vec<ColumnarValue>>) {
        if args.len() != 1 {
            return internal_err!("to_timestamp expected 1 argument, got {}", args.len());
        }

        match args[0].data_type() {
            // call through to arrow cast kernel
            DataType::Int64 | DataType::Timestamp(_, _) => cast_column(
                &args[0],
                &DataType::Timestamp(TimeUnit::Nanosecond, None),
                None,
            ),
            DataType::Utf8 => datetime_expressions::to_timestamp_nanos(args),
            dt => internal_err!("to_timestamp does not support argument type '{dt}'"),
        }
    }
}

@alamb
Copy link
Contributor

alamb commented Nov 3, 2023

Here is a slightly different proposal (basically extend ScalarUDF to support all features of BuiltInScalarFunction and then remove BuiltInScalarFunction): #8039

@alamb
Copy link
Contributor

alamb commented Nov 3, 2023

Here is a slightly different proposal: #8045

@alamb
Copy link
Contributor

alamb commented Nov 6, 2023

@2010YOUY01 what do you think about closing this ticket / unifying on #8045 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants