Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve consistency and documentation on error handling in in UDFs #11618

Open
edmondop opened this issue Jul 23, 2024 · 11 comments
Open

Improve consistency and documentation on error handling in in UDFs #11618

edmondop opened this issue Jul 23, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@edmondop
Copy link
Contributor

edmondop commented Jul 23, 2024

Is your feature request related to a problem or challenge?

When writing a new UDF, a developer needs to decide how to perform error management in functions that return Result, such as return_type and invoke. Looking at the existing codebase it is not obvious what are the error management best practices

Errors in return type

Errors in invoke

In the resize function, there is a check on argument lengths in invoke which is not present in the return_type function

use datafusion_common::{exec_err, internal_datafusion_err, Result, ScalarValue};

The same function also returns exec_err, and internal_datafusion_err

Describe the solution you'd like

As a developers of custom UDF I would like to know:

  • what errors I need to check for and what are already checked by the planner (number of arguments?)
  • what type of errors need to be raised in which conditions

Describe alternatives you've considered

No response

Additional context

No response

@edmondop edmondop added the enhancement New feature or request label Jul 23, 2024
@2010YOUY01
Copy link
Contributor

We should definitely explain them better in the doc.
Here is my understanding. I'm wondering if anyone has additional thoughts or if I'm understanding something wrong.

Planning Error

  1. The planner will first check the argument type based on the function signature, and raise PlanningError if input arg is wrong. (It could be hard to specify the behavior, link to the actual checking code might help...)
  2. For some edge case planning can't do comprehensive arg type check, further argument check can be performed in return_type(), here we should return PlanningError if input argument is wrong
  3. It's possible there are some existing UDF implementations that hacked input argument checking logic into invoke()...which should not be encouraged

Execution Error

  1. Inside invoke(), when there is any expected error (e.g. something like divide by 0), explicitly return ExecutionError, I think it's better not to directly return lower level error messages like ArrowError, which might be hard to interpret sometimes.
  2. Inside invoke() if there are some branches that should be unreachable, throw InternalError in such branches. If such a branch is executed there might be a potential bug, we want to fail fast in this case.

@alamb
Copy link
Contributor

alamb commented Jul 24, 2024

I agree with what @2010YOUY01 says

Some additional comments about "Planning Error" are that for most UDFs that provide a signature DataFusion will ensure that the function is called with one of the valid types.

The UDF can assume the inputs are one of the valid combinations

If it wants to check and validate, it should return an internal_err if the planner provides arguments that don't match one of its proposed signatures

@Omega359
Copy link
Contributor

I had done some work in aligning the errors in the past, see #9164 for details

@edmondop
Copy link
Contributor Author

@2010YOUY01 just to confirm, for example in this case


it should return a planner error, not an exec_error. Also the check on the length of the arguments should be done in return_type and not in
return exec_err!("array_resize needs two or three arguments");

Is this correct?

@2010YOUY01
Copy link
Contributor

@2010YOUY01 just to confirm, for example in this case

it should return a planner error, not an exec_error. Also the check on the length of the arguments should be done in return_type and not in

return exec_err!("array_resize needs two or three arguments");

Is this correct?

I agree.

The first error can be triggered by bad input, so it's a planning error

> select array_resize('foo', 5, 0);
Execution error: Not reachable, data_type should be List, LargeList or FixedSizeList

The second one is redundant, but it's also okay to change it to internal_err!("Not reachable, arguments number check should already be done")

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 7, 2024

Ideally argument check should be handled by signature not return_type. Types should be coerced correctly when it reaches return_type.

But before the signature are mature enough, we somehow need to add the check (length, types) inside return_type

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 8, 2024

I had file an issue to improve on signature for coercion and length + type checking for functions #10507

Although I'm not sure does it really help or not since Signature::UserDefined might be good enough to handle complex type signature already so I didn't push forward on this and move on other tasks instead

@edmondop
Copy link
Contributor Author

edmondop commented Aug 9, 2024

@jayzhan211 do you think we should start by adding an helper macro for pattern matching branches of UDF invoke that should never happen?

udf_unreachable!()

that might create an InternalError

DataFusionError::Internal("This branch of the UDF should never be executed since pre-requisites are checked through type_signature and return_type")

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 10, 2024

IMO we should improve on signature instead of adding check in different function. This not only add unnecessary overhead and let the code less readable. And ideally we don't need any check in return_type and invoke

@edmondop
Copy link
Contributor Author

IMO we should improve on signature instead of adding check in different function. This not only add unnecessary overhead and let the code less readable. And ideally we don't need any check in return_type and invoke

Wouldn't inevitably be necessary if you do pattern matching to handle the branches that are impossible? (Because of signature enforcement)

Only if we use refined types it would be possible to avoid I think

@jayzhan211
Copy link
Contributor

Only if we use refined types it would be possible to avoid I think

I guess at least we could verified it with UserDefined and check against all the edge case in coerce_types

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants