Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve type coercion and casting #8302

Open
jayzhan211 opened this issue Nov 22, 2023 · 18 comments
Open

Improve type coercion and casting #8302

jayzhan211 opened this issue Nov 22, 2023 · 18 comments
Labels
enhancement New feature or request

Comments

@jayzhan211
Copy link
Contributor

jayzhan211 commented Nov 22, 2023

Is your feature request related to a problem or challenge?

I think there is room for improvement in type coerceion or casting.

Background

comparison_coercion is widely used in datafusion, a lossless conversion
https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/binary.rs

can_coerce_from is used mainly for signature, a lossless conversion
https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/functions.rs

can_cast_types is from arrow-cast, which is a lossy conversion. It is also used in some comparison_coercion building block. https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76-L273

Not sure if there is other coercion I missed

Proposal

comparison_coercion and can_coerce_from seem like doing the similar thing, maybe we can just have one lossless conversion. If lossless conversion is useful for arrow-rs, we can introduce a lossless version of can_cast_types, then rely on it for datafusion.

Lossy conversion vs Lossless

I think the definition for lossy is that the value is not recoverable after casting back, otherwise it is lossless.

Lossy

  • Int32 to Int16 / Int8

Lossless

  • Int32 to Int64

Describe the solution you'd like

  1. Replace can_coerce_from with comparison_coercion's building block numeric coercion, list coercion, string coercion, null coercion, etc
  2. Split list_coercion from string_coercion to make each building block of coercion clear on the task it focus on. list_coercion do list/fixed size list/large list coercion, string_coercion do utf/large utf coercion.
  3. Introduce these lossless coercion to arrow-rs?

Known issue or question I have

  • Introduce list_coercion that currently exist in string_concat_coercion
  • No list coercion for can_coerce_from
  • Decimal128 can cast to Float64 in can_coerce_from, why?

Describe alternatives you've considered

If there are many customize conversion need, then this change might not be helpful at all. We need other approach to let type casting / coercion easy to use.

No response

Additional context

No response

@jayzhan211 jayzhan211 added the enhancement New feature or request label Nov 22, 2023
@jayzhan211
Copy link
Contributor Author

@tustvold Do you think introduce a lossless version of can_cast_types helpful?
cc @alamb

@tustvold
Copy link
Contributor

tustvold commented Nov 22, 2023

Decimal128 can cast to Float64 in can_coerce_from, why?

Floating points inherently permit precision loss. It is fairly standard that otherwise checked conversion permit precision loss when converting between floats.

For example - https://docs.rs/num-traits/latest/num_traits/cast/trait.FromPrimitive.html

A value can be represented by the target type when it lies within the range of scalars supported by the target type. For example, a negative integer cannot be represented by an unsigned integer type, and an i64 with a very high magnitude might not be convertible to an i32. On the other hand, conversions with possible precision loss or truncation are admitted, like an f32 with a decimal part to an integer type, or even a large f64 saturating to f32 infinity.

No list coercion for can_coerce_from

I think this would make sense as an addition, the current support for nested types in arrow-cast is very limited.

If lossless conversion is useful for arrow-rs

In general I think it would be very confusing if coercion can result in silent data loss, perhaps you might articulate why this would be useful? My current thinking is casting should never silently lose data.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Nov 22, 2023

My current thinking is casting should never silently lose data.

https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76C2-L80

Comment in can_cast_types mention that cast is lossy.

https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L212-L216

Casting Int64 to Int32 return true. I'm not sure if the actual casting block this lossy conversion, but if not, could it silently lose data?

@tustvold
Copy link
Contributor

tustvold commented Nov 22, 2023

That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Nov 22, 2023

That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data

Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing.

@jayzhan211
Copy link
Contributor Author

However, I think we can still cleanup comparison_coercion and can_coerce_from and forget about lossy conversion.

@jayzhan211
Copy link
Contributor Author

That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data

Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing.

Maybe Int64(1) cast to Int32(1) is one that is able to safely cast without data loss.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Nov 22, 2023

Maybe we need another verion of can_cast_types that no possible fallible hint. For i64, only u64 or larger type, no i32 or i16 since we don't know the actual data.

@tustvold
Copy link
Contributor

Is that not what can_coerce_from is?

@jayzhan211
Copy link
Contributor Author

can_coerce_from

Yes! Should we have it aside can_cast_types in arrow-rs? Or just let it stay in datafusion for now.

@tustvold
Copy link
Contributor

Given the type coercion logic itself lives in DataFusion I personally think it makes sense for it to remain alongside it

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Nov 23, 2023

Short summary for the next step of this issue

  1. Replace can_coerce_from with component in comparison_coercion, so we dont need to think about which one to use. Reuse the code if possible, call can_cast_types if possible.
  2. Add nested support. Start from List

TODO

Issue

  • Return Result for comparison_coercion, so we can differentiate unimplemented casting or not possible casting Return Option<T> is probably enough.
  • Consider having component of comparison_coercion public, like Allow default value on lag/lead if they are coerceable to value #8308 can have numerice_coercion instead of comparison_coercion. No use case.
  • Float is converted to Decimal with possible overflow in comparison_coercion, while Decimal is coerced to Float in can_coerced_from. Confused 😕 😕 😕

Note

can_coerce_from / comparison_coercion unlike arrow-cast should only check the casting to the larger type since the actual data is unknown.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Nov 29, 2023

https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406

In comparison_binary_numeric_coercion, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?

for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too

@alamb
Copy link
Contributor

alamb commented Nov 29, 2023

https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406

In comparison_binary_numeric_coercion, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?

for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too

The comment in the code you linked seems to say "some information loss is inevitable":

        // The following match arms encode the following logic: Given the two
        // integral types, we choose the narrowest possible integral type that
        // accommodates all values of both types. Note that some information
        // loss is inevitable when we have a signed type and a `UInt64`, in
        // which case we use `Int64`;i.e. the widest signed integral type.

It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast 🤷

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Nov 30, 2023

https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406

In comparison_binary_numeric_coercion, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?
for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too

The comment in the code you linked seems to say "some information loss is inevitable":

        // The following match arms encode the following logic: Given the two
        // integral types, we choose the narrowest possible integral type that
        // accommodates all values of both types. Note that some information
        // loss is inevitable when we have a signed type and a `UInt64`, in
        // which case we use `Int64`;i.e. the widest signed integral type.

It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast 🤷

I don't think that usability is the good reason :( It might cause data loss.

@alamb I would like to change them to the straightforward version, choose the larger type between two. I think it can also be used all the place in datafusion easily (one place is can_coerce_from). Is there any concern about this change?

If such implicit casting is found helpful in the end, we can have two version, one that lossy is acceptable, one is not.

@alamb
Copy link
Contributor

alamb commented Nov 30, 2023

. Is there any concern about this change?

I would recommend you write some tests / try out the change before making a full featured PR. I am not sure what the implications are with the change you propose, but I suspect it could be non trivial and subtle

@jayzhan211
Copy link
Contributor Author

// The following match arms encode the following logic: Given the two
// integral types, we choose the narrowest possible integral type that
// accommodates all values of both types. Note that some information
// loss is inevitable when we have a signed type and a UInt64, in
// which case we use Int64;i.e. the widest signed integral type.

I rethink about this. This is the best we can do, like the comment mentioned, it is inevitable :(

@jayzhan211
Copy link
Contributor Author

After #8385, I found that it is not ideal to have one coercion for all the place (compare op, math op, signature coercion, etc... ).

@jayzhan211 jayzhan211 changed the title Unify type coercion and casting Improve type coercion and casting Dec 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants