-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve type coercion and casting #8302
Comments
Floating points inherently permit precision loss. It is fairly standard that otherwise checked conversion permit precision loss when converting between floats. For example - https://docs.rs/num-traits/latest/num_traits/cast/trait.FromPrimitive.html
I think this would make sense as an addition, the current support for nested types in arrow-cast is very limited.
In general I think it would be very confusing if coercion can result in silent data loss, perhaps you might articulate why this would be useful? My current thinking is casting should never silently lose data. |
Comment in Casting Int64 to Int32 return true. I'm not sure if the |
That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data |
Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing. |
However, I think we can still cleanup |
Maybe Int64(1) cast to Int32(1) is one that is able to safely cast without data loss. |
Maybe we need another verion of |
Is that not what can_coerce_from is? |
Yes! Should we have it aside |
Given the type coercion logic itself lives in DataFusion I personally think it makes sense for it to remain alongside it |
Short summary for the next step of this issue
TODO
Issue
Note
|
In for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too |
The comment in the code you linked seems to say "some information loss is inevitable":
It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast 🤷 |
I don't think that usability is the good reason :( It might cause data loss. @alamb I would like to change them to the straightforward version, choose the larger type between two. I think it can also be used all the place in datafusion easily (one place is If such implicit casting is found helpful in the end, we can have two version, one that lossy is acceptable, one is not. |
I would recommend you write some tests / try out the change before making a full featured PR. I am not sure what the implications are with the change you propose, but I suspect it could be non trivial and subtle |
I rethink about this. This is the best we can do, like the comment mentioned, it is inevitable :( |
After #8385, I found that it is not ideal to have one coercion for all the place (compare op, math op, signature coercion, etc... ). |
Is your feature request related to a problem or challenge?
I think there is room for improvement in
type coerceion
orcasting
.Background
comparison_coercion
is widely used in datafusion, a lossless conversionhttps://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/binary.rs
can_coerce_from
is used mainly for signature, a lossless conversionhttps://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/functions.rs
can_cast_types
is fromarrow-cast
, which is a lossy conversion. It is also used in somecomparison_coercion
building block. https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76-L273Not sure if there is other coercion I missed
Proposal
comparison_coercion
andcan_coerce_from
seem like doing the similar thing, maybe we can just have onelossless
conversion. If lossless conversion is useful for arrow-rs, we can introduce a lossless version ofcan_cast_types
, then rely on it for datafusion.Lossy conversion vs Lossless
I think the definition for lossy is that the value is not recoverable after casting back, otherwise it is lossless.
Lossy
Lossless
Describe the solution you'd like
can_coerce_from
withcomparison_coercion
's building blocknumeric coercion
,list coercion
,string coercion
,null coercion
, etclist_coercion
fromstring_coercion
to make each building block of coercion clear on the task it focus on. list_coercion dolist
/fixed size list
/large list
coercion,string_coercion
doutf
/large utf
coercion.Known issue or question I have
list_coercion
that currently exist instring_concat_coercion
can_coerce_from
can_coerce_from
, why?Describe alternatives you've considered
If there are many customize conversion need, then this change might not be helpful at all. We need other approach to let type casting / coercion easy to use.
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: