-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11496: [Rust] WIP aggregation on NaN values. #9416
Conversation
Thanks @ritchie46 The code is here: The definition of total order is documented here: https://doc.rust-lang.org/std/primitive.f64.html#method.total_cmp |
Look good @Dandandan. Yes, it seems logical to have the |
Thanks @Dandandan and @ritchie46 -- my opinions are:
|
In PostgreSQL, both |
Some background from my perspective: I implemented the NaN sorting behaviour initially in ARROW-9895 to be consistent with Postgres behaviour and then later implemented the same behaviour for the aggregation kernels in ARROW-10216. At that time I did no know about the total order specification. Total order seems a bit more "standard" than Postgres in this case, but the only difference is how negative NaN is handled. The current state is I think that lexicographical sort and simd min/max don't distinguish between negative/positive NaN and consider all NaN to be bigger than all other values, while single column sort and scalar min/max follow total order. Considering that it's not possible to distinguish different NaN without transmuting and I'm not sure whether operations involving NaN are required to keep the sign, this inconsistency is probably not urgent to fix. Another thing to keep in mind is that NaN and Null are two completely separate concepts in both Postgres and the arrow data model and nulls are always excluded when evaluating aggregations. For ignoring NaN values, I think an efficient implementation could be a separate kernel that maps NaN to null. |
Ok, there are some distinct views on the matter (mine included). I would assume users of the kernels also have varying requirements regarding NaN behavior. For this reason, I propose to leave the
|
d4608a9
to
356c300
Compare
@alamb @jorgecarleitao @jhorstmann @Dandandan What do you think of letting the user define one of those predefined NaN behaviors? |
@ritchie46 I think letting the user decide how they want to handle NaN values in min/max kernels sounds like a good idea to me |
FWIW, pandas also seems to ignore
|
The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository Please see the mailing-list thread for more details We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs. |
#10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories. Thank you for understanding and helping to make arrow-rs and datafusion better |
@jorgecarleitao @Dandandan @nevi-me @alamb
Before I continue on this I want to start a discussion on what the behavior of aggregation should be. I started this after I got this issue: pola-rs/polars#328.
It boils down to this:
I initially thought this was a bug, but then I saw that this behavior is asserted in the tests as being valid.
However this is different behavior than that of most numerical tools I know (e.g. numpy, tensorflow, torch, etc.). The IEEE 754 standard states that "Any comparison with a NaN is treated as unordered.".
But if a max aggregation, differs from a min aggregation with regards to NaN this implies to me that we currently treat NaN as ordered and that NaN is larger than any number.
IMO we should return NaN for both the max and the min kernel and may also add a variant that excludes the NaNs.