
API: handle NaN as min/max stats in evaluators #2069

Merged · 2 commits into apache:master on Jan 20, 2021

Conversation

@yyanyy (Contributor) commented Jan 12, 2021

This PR is based on the behavior we observed here:

  • NaN is always populated as the max when present in Parquet, and as both max and min for ORC if the first row of the file is NaN
  • when the lower bound is NaN, the upper bound will also always be NaN
  • in Java comparisons, NaN is considered greater than all other numbers (see the snippet below)

This fixes #1761.
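
To illustrate the last bullet: Java's Double.compare (which Comparator.naturalOrder() delegates to for boxed values) orders NaN above every other double value, including positive infinity:

    System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY)); // prints 1: NaN sorts above +Infinity
    System.out.println(Double.compare(Double.NaN, Double.NaN));               // prints 0: NaN compares equal to itself here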

@github-actions github-actions bot added the API label Jan 12, 2021
        return ROWS_CANNOT_MATCH;
      }

      if (lowerBounds != null && lowerBounds.containsKey(id)) {
        T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));

        int cmp = lit.comparator().compare(lower, lit.value());
        if (cmp > 0) {

          // Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
Contributor

Thanks for noticing this. Instead of having this paragraph everywhere, can we abstract this to a method like checkNaNLowerBound, and make this the documentation of that method?

Contributor Author

Thanks for the quick review! Yeah, I realize that repeating the same comment over and over is a bit annoying, but I wasn't sure where the right balance is. Since I'm hoping to check isNaN after comparing, to avoid unnecessary checking in lt/ltEq, the only thing I can abstract out is NaNUtil.isNaN(lower), so we would essentially be wrapping a wrapper. I also suspect that wouldn't help readability much, since the actual explanation would then live outside the logic flow here, and the reader would have to jump around to understand the full intention. Maybe we can shorten this comment everywhere and keep the full version at the start of the class? Do you or other people have any suggestions?

@jackye1995 (Contributor) commented Jan 14, 2021

Sorry, I did not see the comment. I think it is valuable enough to abstract out the method given the amount of comments, and have something like:

/*
 * all the comments...
 */
private <T> boolean isLowerBoundNaN(T lower) {
  return NaNUtil.isNaN(lower);
}

I am not good at naming, you can probably come up with a better name...

@rdblue (Contributor) commented Jan 16, 2021

I don't think there is a need for an extra method that has just one method call. I'd probably do it like this:

        T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
        if (NaNUtil.isNaN(lower)) {
          // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
          return ROWS_MIGHT_MATCH;
        }

        int cmp = lit.comparator().compare(lower, lit.value());
        if (cmp > 0) {
          return ROWS_CANNOT_MATCH;
        }

The docs would go in the javadoc for the whole class, and each NaN check could simply refer back to it. I also moved the NaN check above the comparison to keep the logic simple: if the value is NaN, the bound is invalid.

@rdblue rdblue added this to the Java 0.11.0 Release milestone Jan 12, 2021
@jbapple (Contributor) commented Jan 14, 2021

I think in Java NaN is neither greater than, nor equal to, nor less than any number. That's the IEEE 754 semantics, at least.

@yyanyy (Contributor, Author) commented Jan 15, 2021

> I think in Java NaN is neither greater than, nor equal to, nor less than any number. That's the IEEE 754 semantics, at least.

Thanks for the interest in this PR! In Java's comparator-based comparisons, NaN is handled as greater than all other floating point numbers, including infinity; but you are right that NaN shouldn't normally be comparable.

@jbapple (Contributor) commented Jan 15, 2021

Odd, maybe we're trying different things. What does this print for you?

public class Comp {
  public static void main(String[] args) {
     System.out.println(Double.NaN > 0.0);
  }
}

I'm on OpenJDK 1.8, and this prints false for me.

@yyanyy (Contributor, Author) commented Jan 16, 2021

> Odd, maybe we're trying different things. What does this print for you? […] I'm on OpenJDK 1.8, and this prints false for me.

Yeah, this prints false for me too, as does Double.NaN == Double.NaN. I think we work around this problem in the evaluators by using a Comparator.naturalOrder() instead of direct operator comparison; the following code demonstrates the behavior described here, where NaN is considered larger than positive infinity (it prints -1, 1, 1, -1, 0 respectively):

    Comparator<Float> comparator = Comparator.naturalOrder();
    System.out.println(comparator.compare(1.2F, Float.NaN));
    System.out.println(comparator.compare(Float.NaN, 1.2F));
    System.out.println(comparator.compare(Float.NaN, Float.POSITIVE_INFINITY));
    System.out.println(comparator.compare(Float.POSITIVE_INFINITY, Float.NaN));
    System.out.println(comparator.compare(Float.NaN, Float.NaN));

@jackye1995 (Contributor)

@jbapple @yyanyy this blog has a good summary: https://blog.csdn.net/jierui001/article/details/3278382

@jbapple (Contributor) commented Jan 16, 2021

LGTM. One note I'll make: not all languages treat NaN like Java. For instance, in C++, there are roughly 2^52 different NaN values, and the totalOrder predicate sorts negative NaNs before negative infinity. As such, you could have columns with a NaN lower bound and a numeric upper bound. I'm not sure what the latest draft Iceberg specification says about this.

@jackye1995 (Contributor)

> not all languages treat NaN like Java. For instance, in C++, there are roughly 2^52 different NaN values, and the totalOrder predicate sorts negative NaNs before negative infinity.

Yeah, that's why I was asking about the spec in #2055. I think what you describe is consistent with the current plan of -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN.
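
For contrast, Java flattens part of that order: Double.compare canonicalizes every NaN bit pattern into a single value that sorts last, so a NaN with the sign bit set is not distinguishable there, while signed zeros are ordered:

    System.out.println(Double.compare(Double.longBitsToDouble(0xFFF8000000000000L), Double.NaN)); // 0: a "negative" NaN still compares equal to NaN
    System.out.println(Double.compare(-0.0, 0.0));                                                // -1: -0 sorts before 0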


shouldRead = new InclusiveMetricsEvaluator(
    SCHEMA, Expressions.in("some_nan_correct_bounds", 30F)).eval(FILE);
Assert.assertFalse("Should not match: 30 not within bounds", shouldRead);
Contributor

Nit: 1F and 30F could be combined into one case, Expressions.in("some_nan_correct_bounds", 1F, 30F); see the sketch below.
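
For reference, the combined case would look something like this (the assertion message is illustrative):

    shouldRead = new InclusiveMetricsEvaluator(
        SCHEMA, Expressions.in("some_nan_correct_bounds", 1F, 30F)).eval(FILE);
    Assert.assertFalse("Should not match: 1 and 30 not within bounds", shouldRead);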

Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead);

shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 10F)).eval(FILE);
Assert.assertFalse("Should not match: nan value exists", shouldRead);
Contributor

This case would return false anyway because the correct upper bound is > 10. I think you should change this test case to 30F so that it matches the whole range, but does not match because there are some NaN values.
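
I.e., something like this (a sketch; the assertion message is illustrative):

    shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 30F)).eval(FILE);
    Assert.assertFalse("Should not match: NaN values prevent the predicate from holding for all rows", shouldRead);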

@rdblue (Contributor) commented Jan 16, 2021

@yyanyy, the tests look good other than one case that I mentioned. I also commented on the implementation thread, but overall I think this is correct. Nice work!

@kbendick (Contributor) commented Jan 18, 2021

For ORC metrics, which currently don't populate nanValueCounts in the metrics and for which it seems difficult to do so, will this still work when we create Metrics using the non-deprecated constructors that require passing nanValueCounts? This is in reference to #2044, the issue for removing the Metrics constructors that don't take nanValueCounts maps.

I suppose when we go to deprecate the old Metrics constructors, we can simply pass null into the Metrics constructor for nanValueCounts with ORC in order to keep the current behavior. Or we could update the various evaluator code paths to ensure that a missing key in this map does not indicate anything special, if we choose to pass in an empty map.

This is all based on my understanding of ORC's ColumnStatistics and the notes in this PR about upperBound and lowerBound being NaN in ORC if any value in the file is NaN, so please correct me if I'm mistaken and somebody sees an easy way to get NaN stats from an ORC file. I fail to see how we can get the count of NaN values in the code block below (though, as I mentioned, we can always continue to pass in null to match the existing behavior, provided the various metrics evaluators are aware that a null nanValueCounts entry doesn't convey much meaning for ORC).

We could also potentially track NaN values in the various ORC writers, like the Parquet writers do; a sketch follows.
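
A minimal sketch of that idea, with hypothetical names rather than actual Iceberg code: count NaNs at write time, since ORC's ColumnStatistics gives us no NaN count to read back.

    // Hypothetical sketch: track the NaN count alongside the real write path.
    class NaNCountingDoubleWriter {
      private long nanValueCount = 0L;

      void write(double value) {
        if (Double.isNaN(value)) {
          nanValueCount++;
        }
        // ... delegate to the underlying ORC writer here ...
      }

      long nanValueCount() {
        return nanValueCount;  // would feed the nanValueCounts map passed to Metrics
      }
    }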

The relevant code for populating ORC metrics is here:

private static Metrics buildOrcMetrics(final long numOfRows, final TypeDescription orcSchema,
                                       final ColumnStatistics[] colStats, final MetricsConfig metricsConfig,
                                       final NameMapping mapping) {
  final TypeDescription orcSchemaWithIds = (!ORCSchemaUtil.hasIds(orcSchema) && mapping != null) ?
      ORCSchemaUtil.applyNameMapping(orcSchema, mapping) : orcSchema;
  final Set<Integer> statsColumns = statsColumns(orcSchemaWithIds);
  final MetricsConfig effectiveMetricsConfig = Optional.ofNullable(metricsConfig)
      .orElseGet(MetricsConfig::getDefault);
  Map<Integer, Long> columnSizes = Maps.newHashMapWithExpectedSize(colStats.length);
  Map<Integer, Long> valueCounts = Maps.newHashMapWithExpectedSize(colStats.length);
  Map<Integer, Long> nullCounts = Maps.newHashMapWithExpectedSize(colStats.length);

  if (!ORCSchemaUtil.hasIds(orcSchemaWithIds)) {
    return new Metrics(numOfRows,
        columnSizes,
        valueCounts,
        nullCounts,
        null,
        null);
  }

  final Schema schema = ORCSchemaUtil.convert(orcSchemaWithIds);
  Map<Integer, ByteBuffer> lowerBounds = Maps.newHashMap();
  Map<Integer, ByteBuffer> upperBounds = Maps.newHashMap();

  for (int i = 0; i < colStats.length; i++) {
    final ColumnStatistics colStat = colStats[i];
    final TypeDescription orcCol = orcSchemaWithIds.findSubtype(i);
    final Optional<Types.NestedField> icebergColOpt = ORCSchemaUtil.icebergID(orcCol)
        .map(schema::findField);

    if (icebergColOpt.isPresent()) {
      final Types.NestedField icebergCol = icebergColOpt.get();
      final int fieldId = icebergCol.fieldId();
      final MetricsMode metricsMode = effectiveMetricsConfig.columnMode(icebergCol.name());
      columnSizes.put(fieldId, colStat.getBytesOnDisk());

      if (metricsMode == MetricsModes.None.get()) {
        continue;
      }

      if (statsColumns.contains(fieldId)) {
        // Since ORC does not track null values nor repeated ones, the value count for columns in
        // containers (maps, lists) may be larger than what it actually is, however these are not
        // used in expressions right now. For such cases, we use the number of values
        // directly stored in ORC.
        if (colStat.hasNull()) {
          nullCounts.put(fieldId, numOfRows - colStat.getNumberOfValues());
        } else {
          nullCounts.put(fieldId, 0L);
        }
        valueCounts.put(fieldId, colStat.getNumberOfValues() + nullCounts.get(fieldId));

        if (metricsMode != MetricsModes.Counts.get()) {
          Optional<ByteBuffer> orcMin = (colStat.getNumberOfValues() > 0) ?
              fromOrcMin(icebergCol.type(), colStat, metricsMode) : Optional.empty();
          orcMin.ifPresent(byteBuffer -> lowerBounds.put(icebergCol.fieldId(), byteBuffer));
          Optional<ByteBuffer> orcMax = (colStat.getNumberOfValues() > 0) ?
              fromOrcMax(icebergCol.type(), colStat, metricsMode) : Optional.empty();
          orcMax.ifPresent(byteBuffer -> upperBounds.put(icebergCol.fieldId(), byteBuffer));
        }
      }
    }
  }

  return new Metrics(numOfRows,
      columnSizes,
      valueCounts,
      nullCounts,
      lowerBounds,
      upperBounds);
}

@yyanyy (Contributor, Author) commented Jan 19, 2021

> For ORC metrics, which currently don't populate nanValueCounts in the metrics and for which it seems difficult to do so, will this still work when we create Metrics using the non-deprecated constructors that require passing nanValueCounts? […] We could also potentially track NaN values in the various ORC writers, like the Parquet writers do. […]

Thanks for your interest in this PR! You are right that parts of this PR won't work without a NaN counter for checking whether a field contains partial/all NaN values, and it's hard to get nanValueCounts from an ORC file directly. I think today the evaluators should treat missing keys in this map, or the lack of such a map, the same way as not having NaN value count information for a given field; or do you think we aren't doing this in some places?
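
The pattern is roughly this (a hedged sketch with illustrative names, not the exact evaluator code): a null map or a missing key must read as "unknown", never as "zero NaNs", so the evaluator cannot prune on it.

    // Illustrative only: absent NaN counts mean "unknown", so assume NaNs are possible.
    private boolean mayContainNaN(Map<Integer, Long> nanCounts, int fieldId) {
      if (nanCounts == null || !nanCounts.containsKey(fieldId)) {
        return true;  // no information for this field; cannot prune the file
      }
      return nanCounts.get(fieldId) > 0;
    }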

Regarding populating the NaN counter for ORC files: yes, we are going to track NaN values in the various ORC writers just as we do for the Parquet writers; that change is currently in review in #1790.

@yyanyy (Contributor, Author) commented Jan 20, 2021

I have updated the PR based on the comments. Thank you everyone for your interest in this PR!

@rdblue rdblue merged commit b018e21 into apache:master Jan 20, 2021
@rdblue (Contributor) commented Jan 20, 2021

Thank you for fixing this, @yyanyy! Great work.

Thanks to @jbapple, @jackye1995, and @kbendick for reviewing!

XuQianJin-Stars pushed a commit to XuQianJin-Stars/iceberg that referenced this pull request Mar 22, 2021
Successfully merging this pull request may close issue #1761: Update evaluators to consider NaN for evaluating upper/lower bounds.