ARROW-11496: [Rust] WIP aggregation on NaN values. #9416

ritchie46 · 2021-02-04T14:02:20Z

@jorgecarleitao @Dandandan @nevi-me @alamb

Before I continue on this I want to start a discussion on what the behavior of aggregation should be. I started this after I got this issue: pola-rs/polars#328.

It boils down to this:

        let a = Float64Array::from_iter_values(vec![1.0, f64::NAN]);
        dbg!(min(&a));
        dbg!(max(&a));

[arrow/src/compute/kernels/aggregate.rs:905] min(&a) = Some(
    1.0,
)
[arrow/src/compute/kernels/aggregate.rs:906] max(&a) = Some(
    NaN,
)

I initially thought this was a bug, but then I saw that this behavior is asserted in the tests as being valid.

However this is different behavior than that of most numerical tools I know (e.g. numpy, tensorflow, torch, etc.). The IEEE 754 standard states that "Any comparison with a NaN is treated as unordered.".

But if a max aggregation, differs from a min aggregation with regards to NaN this implies to me that we currently treat NaN as ordered and that NaN is larger than any number.

IMO we should return NaN for both the max and the min kernel and may also add a variant that excludes the NaNs.

github-actions · 2021-02-04T14:09:15Z

https://issues.apache.org/jira/browse/ARROW-11496

Dandandan · 2021-02-04T17:24:07Z

Thanks @ritchie46
A change I did recently in the sort kernel is to use an implementation of totalOrder for floats, which is a revision on IEEE 754.
I think it would be great if we could be consistent with that change as well here + it would probably make the implementation easier and avoids problems you mention as well. What do you think?

The code is here:
https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/sort.rs#L44

The definition of total order is documented here:

https://doc.rust-lang.org/std/primitive.f64.html#method.total_cmp

ritchie46 · 2021-02-05T07:28:26Z

Look good @Dandandan. Yes, it seems logical to have the max/min aggregations and sorting mechanics based on the same foundation.

alamb · 2021-02-05T10:56:27Z

Thanks @Dandandan and @ritchie46 -- my opinions are:

min/max and sort should be consistent with respect to nulls for floating point
We should strive to implement the same null semantics as postgres (which I don't have handy here or remember off the top of my head)

Dandandan · 2021-02-05T11:18:40Z

In PostgreSQL, both min and max return non-null whenever the expression returns at least 1 non-null value.

jhorstmann · 2021-02-05T22:52:17Z

Some background from my perspective:

I implemented the NaN sorting behaviour initially in ARROW-9895 to be consistent with Postgres behaviour and then later implemented the same behaviour for the aggregation kernels in ARROW-10216. At that time I did no know about the total order specification. Total order seems a bit more "standard" than Postgres in this case, but the only difference is how negative NaN is handled. The current state is I think that lexicographical sort and simd min/max don't distinguish between negative/positive NaN and consider all NaN to be bigger than all other values, while single column sort and scalar min/max follow total order.

Considering that it's not possible to distinguish different NaN without transmuting and I'm not sure whether operations involving NaN are required to keep the sign, this inconsistency is probably not urgent to fix.

Another thing to keep in mind is that NaN and Null are two completely separate concepts in both Postgres and the arrow data model and nulls are always excluded when evaluating aggregations.

For ignoring NaN values, I think an efficient implementation could be a separate kernel that maps NaN to null.

…havior

ritchie46 · 2021-02-11T13:37:10Z

Ok, there are some distinct views on the matter (mine included). I would assume users of the kernels also have varying requirements regarding NaN behavior. For this reason, I propose to leave the min, max kernels as is, and add additional min_float and max_float kernels in which users can define the NaN behavior:

Propogate: any NaN in array leads to NaN output.
Ignore: (Postgres behavior) NaNs are ignore if any is aN.
TotalOrder: Follow the total order standard.
?

ritchie46 · 2021-03-19T09:38:42Z

Ok, there are some distinct views on the matter (mine included). I would assume users of the kernels also have varying requirements regarding NaN behavior. For this reason, I propose to leave the min, max kernels as is, and add additional min_float and max_float kernels in which users can define the NaN behavior:

Propogate: any NaN in array leads to NaN output.
Ignore: (Postgres behavior) NaNs are ignore if any is aN.
TotalOrder: Follow the total order standard.
?

@alamb @jorgecarleitao @jhorstmann @Dandandan

What do you think of letting the user define one of those predefined NaN behaviors?

alamb · 2021-03-19T13:14:01Z

@ritchie46 I think letting the user decide how they want to handle NaN values in min/max kernels sounds like a good idea to me

alamb · 2021-03-19T13:23:19Z

FWIW, pandas also seems to ignore nan values for min/max (which kind of makes sense to me):

Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pa;
>>> from math import nan
>>> s = pa.Series([1,2,3,4,nan])
>>> s.max()
4.0
>>> s.min()
1.0
>>> 
``~

alamb · 2021-04-19T10:23:35Z

The Apache Arrow Rust community is moving the Rust implementation into its own dedicated github repositories arrow-rs and arrow-datafusion. It is likely we will not merge this PR into this repository

Please see the mailing-list thread for more details

We expect the process to take a few days and will follow up with a migration plan for the in-flight PRs.

alamb · 2021-05-03T13:00:02Z

#10096 has removed the arrow implementation from this repository (it now resides in https://github.com/apache/arrow-rs and https://github.com/apache/arrow-datafusion) in the hopes of streamlining the development process

Please re-target this PR (let us know if you need help doing so) to one/both of the new repositories.

Thank you for understanding and helping to make arrow-rs and datafusion better

float-variant-aggregation

cf4b4a2

github-actions bot added the Component: Rust label Feb 4, 2021

ritchie46 added 2 commits February 11, 2021 12:20

Distinct float aggregation kernels with more control regarding NaN be…

d24e9e2

…havior

clippy allow float cmp in single test

6fce305

jorgecarleitao force-pushed the master branch from d4608a9 to 356c300 Compare February 14, 2021 12:09

alamb closed this May 3, 2021

alamb mentioned this pull request May 6, 2021

Include NaN in Parquet stats (again) apache/arrow-rs#264

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-11496: [Rust] WIP aggregation on NaN values. #9416

ARROW-11496: [Rust] WIP aggregation on NaN values. #9416

ritchie46 commented Feb 4, 2021 •

edited

Loading

github-actions bot commented Feb 4, 2021

Dandandan commented Feb 4, 2021

ritchie46 commented Feb 5, 2021

alamb commented Feb 5, 2021

Dandandan commented Feb 5, 2021

jhorstmann commented Feb 5, 2021

ritchie46 commented Feb 11, 2021

ritchie46 commented Mar 19, 2021

alamb commented Mar 19, 2021

alamb commented Mar 19, 2021

alamb commented Apr 19, 2021

alamb commented May 3, 2021

ARROW-11496: [Rust] WIP aggregation on NaN values. #9416

ARROW-11496: [Rust] WIP aggregation on NaN values. #9416

Conversation

ritchie46 commented Feb 4, 2021 • edited Loading

github-actions bot commented Feb 4, 2021

Dandandan commented Feb 4, 2021

ritchie46 commented Feb 5, 2021

alamb commented Feb 5, 2021

Dandandan commented Feb 5, 2021

jhorstmann commented Feb 5, 2021

ritchie46 commented Feb 11, 2021

ritchie46 commented Mar 19, 2021

alamb commented Mar 19, 2021

alamb commented Mar 19, 2021

alamb commented Apr 19, 2021

alamb commented May 3, 2021

ritchie46 commented Feb 4, 2021 •

edited

Loading