
API: handle NaN as min/max stats in evaluators #2069

Merged · 2 commits into apache:master on Jan 20, 2021

Conversation

@yyanyy (Contributor) commented Jan 12, 2021

This PR is based on the behavior we observed here:

  • NaN is always populated as the max when present in Parquet, and as both max and min for ORC if the first row of the file is NaN
  • when the lower bound is NaN, the upper bound will also always be NaN
  • in Java comparisons, NaN is considered greater than all other numbers (see the snippet below)

This fixes #1761.
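
To illustrate the last bullet: Java's Double.compare (which Comparator.naturalOrder() delegates to for boxed values) orders NaN above every other double value, including positive infinity:

    System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY)); // prints 1: NaN sorts above +Infinity
    System.out.println(Double.compare(Double.NaN, Double.NaN));               // prints 0: NaN compares equal to itself here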

@github-actions github-actions bot added the API label Jan 12, 2021
        return ROWS_CANNOT_MATCH;
      }

      if (lowerBounds != null && lowerBounds.containsKey(id)) {
        T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));

        int cmp = lit.comparator().compare(lower, lit.value());
        if (cmp > 0) {

          // Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
Contributor

Thanks for noticing this. Instead of having this paragraph everywhere, can we abstract this to a method like checkNaNLowerBound, and make this the documentation of that method?

Contributor Author

Thanks for the quick review! Yeah, I realize that repeating the same comment over and over is a bit annoying, but I wasn't sure where the right balance is. Since I'm hoping to check isNaN after comparing, to avoid unnecessary checking in lt/ltEq, the only thing I can abstract out is NaNUtil.isNaN(lower), so we would essentially be wrapping a wrapper. I also suspect that wouldn't help readability much, since the actual explanation would then live outside the logic flow here, and the reader would have to jump around to understand the full intention. Maybe we can shorten this comment everywhere and keep the full version at the start of the class? Do you or other people have any suggestions?

@jackye1995 (Contributor) commented Jan 14, 2021

Sorry, I did not see the comment. I think it is valuable enough to abstract out the method given the amount of comments, and have something like:

/*
 * all the comments...
 */
private <T> boolean isLowerBoundNaN(T lower) {
  return NaNUtil.isNaN(lower);
}

I am not good at naming, you can probably come up with a better name...

@rdblue (Contributor) commented Jan 16, 2021

I don't think there is a need for an extra method that has just one method call. I'd probably do it like this:

        T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
        if (NaNUtil.isNaN(lower)) {
          // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
          return ROWS_MIGHT_MATCH;
        }

        int cmp = lit.comparator().compare(lower, lit.value());
        if (cmp > 0) {
          return ROWS_CANNOT_MATCH;
        }

The docs would go in the javadoc for the whole class, and each NaN check could simply refer back to it. I also moved the NaN check above the comparison to keep the logic simple: if the value is NaN, the bound is invalid.

@rdblue rdblue added this to the Java 0.11.0 Release milestone Jan 12, 2021
@jbapple (Contributor) commented Jan 14, 2021

I think in Java NaN is neither greater than, nor equal to, nor less than any number. That's the IEEE 754 semantics, at least.

@yyanyy (Contributor, Author) commented Jan 15, 2021

> I think in Java NaN is neither greater than, nor equal to, nor less than any number. That's the IEEE 754 semantics, at least.

Thanks for the interest in this PR! In Java's comparator-based comparisons, NaN is handled as greater than all other floating point numbers, including infinity; but you are right that NaN shouldn't normally be comparable.

@jbapple (Contributor) commented Jan 15, 2021

Odd, maybe we're trying different things. What does this print for you?

public class Comp {
  public static void main(String[] args) {
     System.out.println(Double.NaN > 0.0);
  }
}

I'm on OpenJDK 1.8, and this prints false for me.

@yyanyy (Contributor, Author) commented Jan 16, 2021

> Odd, maybe we're trying different things. What does this print for you? […] I'm on OpenJDK 1.8, and this prints false for me.

Yeah, this prints false for me too, as does Double.NaN == Double.NaN. I think we work around this problem in the evaluators by using a Comparator.naturalOrder() instead of direct operator comparison; the following code demonstrates the behavior described here, where NaN is considered larger than positive infinity (it prints -1, 1, 1, -1, 0 respectively):

    Comparator<Float> comparator = Comparator.naturalOrder();
    System.out.println(comparator.compare(1.2F, Float.NaN));
    System.out.println(comparator.compare(Float.NaN, 1.2F));
    System.out.println(comparator.compare(Float.NaN, Float.POSITIVE_INFINITY));
    System.out.println(comparator.compare(Float.POSITIVE_INFINITY, Float.NaN));
    System.out.println(comparator.compare(Float.NaN, Float.NaN));

@jackye1995 (Contributor)

@jbapple @yyanyy this blog has a good summary: https://blog.csdn.net/jierui001/article/details/3278382

@jbapple (Contributor) commented Jan 16, 2021

LGTM. One note I'll make: not all languages treat NaN like Java. For instance, in C++, there are roughly 2^52 different NaN values, and the totalOrder predicate sorts negative NaNs before negative infinity. As such, you could have columns with a NaN lower bound and a numeric upper bound. I'm not sure what the latest draft Iceberg specification says about this.

@jackye1995 (Contributor)

> not all languages treat NaN like Java. For instance, in C++, there are roughly 2^52 different NaN values, and the totalOrder predicate sorts negative NaNs before negative infinity.

Yeah, that's why I was asking about the spec in #2055. I think what you describe is consistent with the current plan of -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN.
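
For contrast, Java flattens part of that order: Double.compare canonicalizes every NaN bit pattern into a single value that sorts last, so a NaN with the sign bit set is not distinguishable there, while signed zeros are ordered:

    System.out.println(Double.compare(Double.longBitsToDouble(0xFFF8000000000000L), Double.NaN)); // 0: a "negative" NaN still compares equal to NaN
    System.out.println(Double.compare(-0.0, 0.0));                                                // -1: -0 sorts before 0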


shouldRead = new InclusiveMetricsEvaluator(
    SCHEMA, Expressions.in("some_nan_correct_bounds", 30F)).eval(FILE);
Assert.assertFalse("Should not match: 30 not within bounds", shouldRead);
Contributor

Nit: 1F and 30F could be combined into one case, Expressions.in("some_nan_correct_bounds", 1F, 30F); see the sketch below.
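
For reference, the combined case would look something like this (the assertion message is illustrative):

    shouldRead = new InclusiveMetricsEvaluator(
        SCHEMA, Expressions.in("some_nan_correct_bounds", 1F, 30F)).eval(FILE);
    Assert.assertFalse("Should not match: 1 and 30 not within bounds", shouldRead);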

Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);

shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 10F)).eval(FILE);
Assert.assertFalse("Should not match: nan value exists", shouldRead);
Contributor

This case would return false anyway because the correct upper bound is > 10. I think you should change this test case to 30F so that it matches the whole range, but does not match because there are some NaN values.
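
I.e., something like this (a sketch; the assertion message is illustrative):

    shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 30F)).eval(FILE);
    Assert.assertFalse("Should not match: NaN values prevent the predicate from holding for all rows", shouldRead);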

@rdblue (Contributor) commented Jan 16, 2021

@yyanyy, the tests look good other than one case that I mentioned. I also commented on the implementation thread, but overall I think this is correct. Nice work!

@kbendick (Contributor) commented Jan 18, 2021

For ORC metrics, which currently don't populate nanValueCounts in the metrics and for which it seems difficult to do so, will this still work when we create Metrics using the non-deprecated constructors that require passing nanValueCounts? This is in reference to #2044, the issue for removing the Metrics constructors that don't take nanValueCounts maps.

I suppose when we go to deprecate the old Metrics constructors, we can simply pass null into the Metrics constructor for nanValueCounts with ORC in order to keep the current behavior. Or we could update the various evaluator code paths to ensure that a missing key in this map does not indicate anything special, if we choose to pass in an empty map.

This is all based on my understanding of ORC's ColumnStatistics and the notes in this PR about upperBound and lowerBound being NaN in ORC if any value in the file is NaN, so please correct me if I'm mistaken and somebody sees an easy way to get NaN stats from an ORC file. I fail to see how we can get the count of NaN values in the code block below (though, as I mentioned, we can always continue to pass in null to match the existing behavior, provided the various metrics evaluators are aware that a null nanValueCounts entry doesn't convey much meaning for ORC).

We could also potentially track NaN values in the various ORC writers, like the Parquet writers do; a sketch follows.
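
A minimal sketch of that idea, with hypothetical names rather than actual Iceberg code: count NaNs at write time, since ORC's ColumnStatistics gives us no NaN count to read back.

    // Hypothetical sketch: track the NaN count alongside the real write path.
    class NaNCountingDoubleWriter {
      private long nanValueCount = 0L;

      void write(double value) {
        if (Double.isNaN(value)) {
          nanValueCount++;
        }
        // ... delegate to the underlying ORC writer here ...
      }

      long nanValueCount() {
        return nanValueCount;  // would feed the nanValueCounts map passed to Metrics
      }
    }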

The relevant code for populating ORC metrics is here:

private static Metrics buildOrcMetrics(final long numOfRows, final TypeDescription orcSchema,
                                       final ColumnStatistics[] colStats, final MetricsConfig metricsConfig,
                                       final NameMapping mapping) {
  final TypeDescription orcSchemaWithIds = (!ORCSchemaUtil.hasIds(orcSchema) && mapping != null) ?
      ORCSchemaUtil.applyNameMapping(orcSchema, mapping) : orcSchema;
  final Set<Integer> statsColumns = statsColumns(orcSchemaWithIds);
  final MetricsConfig effectiveMetricsConfig = Optional.ofNullable(metricsConfig)
      .orElseGet(MetricsConfig::getDefault);
  Map<Integer, Long> columnSizes = Maps.newHashMapWithExpectedSize(colStats.length);
  Map<Integer, Long> valueCounts = Maps.newHashMapWithExpectedSize(colStats.length);
  Map<Integer, Long> nullCounts = Maps.newHashMapWithExpectedSize(colStats.length);

  if (!ORCSchemaUtil.hasIds(orcSchemaWithIds)) {
    return new Metrics(numOfRows,
        columnSizes,
        valueCounts,
        nullCounts,
        null,
        null);
  }

  final Schema schema = ORCSchemaUtil.convert(orcSchemaWithIds);
  Map<Integer, ByteBuffer> lowerBounds = Maps.newHashMap();
  Map<Integer, ByteBuffer> upperBounds = Maps.newHashMap();

  for (int i = 0; i < colStats.length; i++) {
    final ColumnStatistics colStat = colStats[i];
    final TypeDescription orcCol = orcSchemaWithIds.findSubtype(i);
    final Optional<Types.NestedField> icebergColOpt = ORCSchemaUtil.icebergID(orcCol)
        .map(schema::findField);

    if (icebergColOpt.isPresent()) {
      final Types.NestedField icebergCol = icebergColOpt.get();
      final int fieldId = icebergCol.fieldId();
      final MetricsMode metricsMode = effectiveMetricsConfig.columnMode(icebergCol.name());
      columnSizes.put(fieldId, colStat.getBytesOnDisk());

      if (metricsMode == MetricsModes.None.get()) {
        continue;
      }

      if (statsColumns.contains(fieldId)) {
        // Since ORC does not track null values nor repeated ones, the value count for columns in
        // containers (maps, lists) may be larger than what it actually is, however these are not
        // used in expressions right now. For such cases, we use the number of values
        // directly stored in ORC.
        if (colStat.hasNull()) {
          nullCounts.put(fieldId, numOfRows - colStat.getNumberOfValues());
        } else {
          nullCounts.put(fieldId, 0L);
        }
        valueCounts.put(fieldId, colStat.getNumberOfValues() + nullCounts.get(fieldId));

        if (metricsMode != MetricsModes.Counts.get()) {
          Optional<ByteBuffer> orcMin = (colStat.getNumberOfValues() > 0) ?
              fromOrcMin(icebergCol.type(), colStat, metricsMode) : Optional.empty();
          orcMin.ifPresent(byteBuffer -> lowerBounds.put(icebergCol.fieldId(), byteBuffer));
          Optional<ByteBuffer> orcMax = (colStat.getNumberOfValues() > 0) ?
              fromOrcMax(icebergCol.type(), colStat, metricsMode) : Optional.empty();
          orcMax.ifPresent(byteBuffer -> upperBounds.put(icebergCol.fieldId(), byteBuffer));
        }
      }
    }
  }

  return new Metrics(numOfRows,
      columnSizes,
      valueCounts,
      nullCounts,
      lowerBounds,
      upperBounds);
}

@yyanyy (Contributor, Author) commented Jan 19, 2021

> For ORC metrics, which currently don't populate nanValueCounts in the metrics and for which it seems difficult to do so, will this still work when we create Metrics using the non-deprecated constructors that require passing nanValueCounts? […] We could also potentially track NaN values in the various ORC writers, like the Parquet writers do. […]

Thanks for your interest in this PR! You are right that parts of this PR won't work without a NaN counter for checking whether a field contains partial/all NaN values, and it's hard to get nanValueCounts from an ORC file directly. I think today the evaluators should treat missing keys in this map, or the lack of such a map, the same way as not having NaN value count information for a given field; or do you think we aren't doing this in some places?
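
The pattern is roughly this (a hedged sketch with illustrative names, not the exact evaluator code): a null map or a missing key must read as "unknown", never as "zero NaNs", so the evaluator cannot prune on it.

    // Illustrative only: absent NaN counts mean "unknown", so assume NaNs are possible.
    private boolean mayContainNaN(Map<Integer, Long> nanCounts, int fieldId) {
      if (nanCounts == null || !nanCounts.containsKey(fieldId)) {
        return true;  // no information for this field; cannot prune the file
      }
      return nanCounts.get(fieldId) > 0;
    }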

Regarding populating the NaN counter for ORC files: yes, we are going to track NaN values in the various ORC writers just as we do for the Parquet writers; that change is currently in review in #1790.

@yyanyy (Contributor, Author) commented Jan 20, 2021

I have updated the PR based on the comments. Thank you everyone for your interest in this PR!

@rdblue rdblue merged commit b018e21 into apache:master Jan 20, 2021
@rdblue (Contributor) commented Jan 20, 2021

Thank you for fixing this, @yyanyy! Great work.

Thanks to @jbapple, @jackye1995, and @kbendick for reviewing!

XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021
Successfully merging this pull request may close issue #1761: Update evaluators to consider NaN for evaluating upper/lower bounds.