Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11496: [Rust] WIP aggregation on NaN values. #9416

Closed
wants to merge 3 commits into from

Conversation

ritchie46
Copy link
Contributor

@ritchie46 ritchie46 commented Feb 4, 2021

@jorgecarleitao @Dandandan @nevi-me @alamb

Before I continue on this I want to start a discussion on what the behavior of aggregation should be. I started this after I got this issue: pola-rs/polars#328.

It boils down to this:

        let a = Float64Array::from_iter_values(vec![1.0, f64::NAN]);
        dbg!(min(&a));
        dbg!(max(&a));
[arrow/src/compute/kernels/aggregate.rs:905] min(&a) = Some(
    1.0,
)
[arrow/src/compute/kernels/aggregate.rs:906] max(&a) = Some(
    NaN,
)

I initially thought this was a bug, but then I saw that this behavior is asserted in the tests as being valid.

However this is different behavior than that of most numerical tools I know (e.g. numpy, tensorflow, torch, etc.). The IEEE 754 standard states that "Any comparison with a NaN is treated as unordered.".

But if a max aggregation, differs from a min aggregation with regards to NaN this implies to me that we currently treat NaN as ordered and that NaN is larger than any number.

IMO we should return NaN for both the max and the min kernel and may also add a variant that excludes the NaNs.

@github-actions
Copy link

github-actions bot commented Feb 4, 2021

@Dandandan
Copy link
Contributor

Thanks @ritchie46
A change I did recently in the sort kernel is to use an implementation of totalOrder for floats, which is a revision on IEEE 754.
I think it would be great if we could be consistent with that change as well here + it would probably make the implementation easier and avoids problems you mention as well. What do you think?

The code is here:
https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/sort.rs#L44

The definition of total order is documented here:

https://doc.rust-lang.org/std/primitive.f64.html#method.total_cmp

@ritchie46
Copy link
Contributor Author

Look good @Dandandan. Yes, it seems logical to have the max/min aggregations and sorting mechanics based on the same foundation.

@alamb
Copy link
Contributor

alamb commented Feb 5, 2021

Thanks @Dandandan and @ritchie46 -- my opinions are:

  1. min/max and sort should be consistent with respect to nulls for floating point
  2. We should strive to implement the same null semantics as postgres (which I don't have handy here or remember off the top of my head)

@Dandandan
Copy link
Contributor

In PostgreSQL, both min and max return non-null whenever the expression returns at least 1 non-null value.

@jhorstmann
Copy link
Contributor

Some background from my perspective:

I implemented the NaN sorting behaviour initially in ARROW-9895 to be consistent with Postgres behaviour and then later implemented the same behaviour for the aggregation kernels in ARROW-10216. At that time I did no know about the total order specification. Total order seems a bit more "standard" than Postgres in this case, but the only difference is how negative NaN is handled. The current state is I think that lexicographical sort and simd min/max don't distinguish between negative/positive NaN and consider all NaN to be bigger than all other values, while single column sort and scalar min/max follow total order.

Considering that it's not possible to distinguish different NaN without transmuting and I'm not sure whether operations involving NaN are required to keep the sign, this inconsistency is probably not urgent to fix.

Another thing to keep in mind is that NaN and Null are two completely separate concepts in both Postgres and the arrow data model and nulls are always excluded when evaluating aggregations.

For ignoring NaN values, I think an efficient implementation could be a separate kernel that maps NaN to null.

@ritchie46
Copy link
Contributor Author

Ok, there are some distinct views on the matter (mine included). I would assume users of the kernels also have varying requirements regarding NaN behavior. For this reason, I propose to leave the min, max kernels as is, and add additional min_float and max_float kernels in which users can define the NaN behavior:

  • Propogate: any NaN in array leads to NaN output.
  • Ignore: (Postgres behavior) NaNs are ignore if any is aN.
  • TotalOrder: Follow the total order standard.
  • ?

@ritchie46
Copy link
Contributor Author

Ok, there are some distinct views on the matter (mine included). I would assume users of the kernels also have varying requirements regarding NaN behavior. For this reason, I propose to leave the min, max kernels as is, and add additional min_float and max_float kernels in which users can define the NaN behavior:

Propogate: any NaN in array leads to NaN output.
Ignore: (Postgres behavior) NaNs are ignore if any is aN.
TotalOrder: Follow the total order standard.
?

@alamb @jorgecarleitao @jhorstmann @Dandandan

What do you think of letting the user define one of those predefined NaN behaviors?

@alamb
Copy link
Contributor

alamb commented Mar 19, 2021

@ritchie46 I think letting the user decide how they want to handle NaN values in min/max kernels sounds like a good idea to me

@alamb
Copy link
Contributor

alamb commented Mar 19, 2021

FWIW, pandas also seems to ignore nan values for min/max (which kind of makes sense to me):

Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pa;
>>> from math import nan
>>> s = pa.Series([1,2,3,4,nan])
>>> s.max()
4.0
>>> s.min()
1.0
>>> 
``~

@alamb
Copy link
Contributor

alamb commented Apr 19, 2021

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository

Please see the mailing-list thread for more details

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

@alamb
Copy link
Contributor

alamb commented May 3, 2021

#10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process

Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories.

Thank you for understanding and helping to make arrow-rs and datafusion better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants