API: handle NaN as min/max stats in evaluators #2069

Merged: 2 commits, Jan 20, 2021
Changes from 1 commit
(First changed file: the inclusive metrics evaluator)
@@ -33,6 +33,7 @@
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types.StructType;
import org.apache.iceberg.util.BinaryUtil;
import org.apache.iceberg.util.NaNUtil;

import static org.apache.iceberg.expressions.Expressions.rewriteNot;

@@ -184,15 +185,20 @@ public <T> Boolean notNaN(BoundReference<T> ref) {
public <T> Boolean lt(BoundReference<T> ref, Literal<T> lit) {
Integer id = ref.fieldId();

if (containsNullsOnly(id)) {
if (containsNullsOnly(id) || containsNaNsOnly(id)) {
return ROWS_CANNOT_MATCH;
}

if (lowerBounds != null && lowerBounds.containsKey(id)) {
T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));

int cmp = lit.comparator().compare(lower, lit.value());
if (cmp >= 0) {

// Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
// if the first value in a file is NaN, metrics of this file will report NaN for both upper and
// lower bound despite that the column could contain non-NaN data.
// Without this NaN check below, we may skip including a file that contains matching data.
if (cmp >= 0 && !NaNUtil.isNaN(lower)) {
return ROWS_CANNOT_MATCH;
}
}
@@ -204,15 +210,20 @@ public <T> Boolean lt(BoundReference<T> ref, Literal<T> lit) {
public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
Integer id = ref.fieldId();

if (containsNullsOnly(id)) {
if (containsNullsOnly(id) || containsNaNsOnly(id)) {
return ROWS_CANNOT_MATCH;
}

if (lowerBounds != null && lowerBounds.containsKey(id)) {
T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));

int cmp = lit.comparator().compare(lower, lit.value());
if (cmp > 0) {

// Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
Contributor:

Thanks for noticing this. Instead of having this paragraph everywhere, can we abstract this to a method like checkNaNLowerBound, and make this the documentation of that method?

Contributor Author:

Thanks for the quick review! Yeah, I do realize that repeating the same comment over and over is a bit annoying, but I wasn't sure where the right balance is. Since I'm hoping to check isNaN after comparing, to avoid an unnecessary check in lt/ltEq, the only thing I can abstract out is NaNUtil.isNaN(lower), so we'd essentially be wrapping a wrapper; and I guess that might not help much with readability either, since the actual explanation would then live outside the logic flow here and the reader would have to jump around to understand the full intention. Maybe we can shorten this comment everywhere and keep the full version at the start of the class? Do you or other people have any suggestions?

Contributor @jackye1995 (Jan 14, 2021):

Sorry, I did not see the comment. Given the amount of commenting needed, I think it is valuable enough to abstract out the method and have something like:

/*
 * all the comments...
 */
private <T> boolean isLowerBoundNaN(T lower) {
  return NaNUtil.isNaN(lower);
}

I am not good at naming; you can probably come up with a better name...

Contributor @rdblue (Jan 16, 2021):

I don't think that there is a need for an extra method that has just one method call. I'd probably do it like this:

        T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
        if (NaNUtil.isNaN(lower)) {
          // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
          return ROWS_MIGHT_MATCH;
        }

        int cmp = lit.comparator().compare(lower, lit.value());
        if (cmp > 0) {
          return ROWS_CANNOT_MATCH;
        }

The docs would go in the javadoc for the whole class, and each NaN check could simply refer back to it. I also moved the NaN check above the comparison to keep the logic simple: if the value is NaN, the bound is invalid.

// if the first value in a file is NaN, metrics of this file will report NaN for both upper and
// lower bound despite that the column could contain non-NaN data.
// Without this NaN check below, we may skip including a file that contains matching data.
if (cmp > 0 && !NaNUtil.isNaN(lower)) {
return ROWS_CANNOT_MATCH;
}
}
@@ -224,7 +235,7 @@ public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
public <T> Boolean gt(BoundReference<T> ref, Literal<T> lit) {
Integer id = ref.fieldId();

if (containsNullsOnly(id)) {
if (containsNullsOnly(id) || containsNaNsOnly(id)) {
return ROWS_CANNOT_MATCH;
}

@@ -244,7 +255,7 @@ public <T> Boolean gt(BoundReference<T> ref, Literal<T> lit) {
public <T> Boolean gtEq(BoundReference<T> ref, Literal<T> lit) {
Integer id = ref.fieldId();

if (containsNullsOnly(id)) {
if (containsNullsOnly(id) || containsNaNsOnly(id)) {
return ROWS_CANNOT_MATCH;
}

@@ -264,13 +275,21 @@ public <T> Boolean gtEq(BoundReference<T> ref, Literal<T> lit) {
public <T> Boolean eq(BoundReference<T> ref, Literal<T> lit) {
Integer id = ref.fieldId();

if (containsNullsOnly(id)) {
if (containsNullsOnly(id) || containsNaNsOnly(id)) {
return ROWS_CANNOT_MATCH;
}

if (lowerBounds != null && lowerBounds.containsKey(id)) {
T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));

// Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
// if the first value in a file is NaN, metrics of this file will report NaN for both upper and
// lower bound despite that the column could contain non-NaN data.
// Without this NaN check below, we may skip including a file that contains matching data.
if (NaNUtil.isNaN(lower)) {
return ROWS_MIGHT_MATCH;
}

int cmp = lit.comparator().compare(lower, lit.value());
if (cmp > 0) {
return ROWS_CANNOT_MATCH;
@@ -300,7 +319,7 @@ public <T> Boolean notEq(BoundReference<T> ref, Literal<T> lit) {
public <T> Boolean in(BoundReference<T> ref, Set<T> literalSet) {
Integer id = ref.fieldId();

if (containsNullsOnly(id)) {
if (containsNullsOnly(id) || containsNaNsOnly(id)) {
return ROWS_CANNOT_MATCH;
}

@@ -313,6 +332,15 @@ public <T> Boolean in(BoundReference<T> ref, Set<T> literalSet) {

if (lowerBounds != null && lowerBounds.containsKey(id)) {
T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));

// Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
// if the first value in a file is NaN, metrics of this file will report NaN for both upper and
// lower bound despite that the column could contain non-NaN data.
// Without this NaN check below, we may skip including a file that contains matching data.
if (NaNUtil.isNaN(lower)) {
return ROWS_MIGHT_MATCH;
}

literals = literals.stream().filter(v -> ref.comparator().compare(lower, v) <= 0).collect(Collectors.toList());
if (literals.isEmpty()) { // if all values are less than lower bound, rows cannot match.
return ROWS_CANNOT_MATCH;
(Second changed file: the strict metrics evaluator)
@@ -32,6 +32,7 @@
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.Types.StructType;
import org.apache.iceberg.util.NaNUtil;

import static org.apache.iceberg.expressions.Expressions.rewriteNot;

@@ -179,7 +180,7 @@ public <T> Boolean lt(BoundReference<T> ref, Literal<T> lit) {
Types.NestedField field = struct.field(id);
Preconditions.checkNotNull(field, "Cannot filter by nested column: %s", schema.findField(id));

if (canContainNulls(id)) {
if (canContainNulls(id) || canContainNaNs(id)) {
return ROWS_MIGHT_NOT_MATCH;
}

@@ -202,7 +203,7 @@ public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
Types.NestedField field = struct.field(id);
Preconditions.checkNotNull(field, "Cannot filter by nested column: %s", schema.findField(id));

if (canContainNulls(id)) {
if (canContainNulls(id) || canContainNaNs(id)) {
return ROWS_MIGHT_NOT_MATCH;
}

@@ -225,15 +226,20 @@ public <T> Boolean gt(BoundReference<T> ref, Literal<T> lit) {
Types.NestedField field = struct.field(id);
Preconditions.checkNotNull(field, "Cannot filter by nested column: %s", schema.findField(id));

if (canContainNulls(id)) {
if (canContainNulls(id) || canContainNaNs(id)) {
return ROWS_MIGHT_NOT_MATCH;
}

if (lowerBounds != null && lowerBounds.containsKey(id)) {
T lower = Conversions.fromByteBuffer(field.type(), lowerBounds.get(id));

int cmp = lit.comparator().compare(lower, lit.value());
if (cmp > 0) {

// Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
// if the first value in a file is NaN, metrics of this file will report NaN for both upper and
// lower bound despite that the column could contain non-NaN data.
// Without this NaN check below, we may include a file that contains rows that don't match.
if (cmp > 0 && !NaNUtil.isNaN(lower)) {
return ROWS_MUST_MATCH;
}
}
@@ -248,15 +254,20 @@ public <T> Boolean gtEq(BoundReference<T> ref, Literal<T> lit) {
Types.NestedField field = struct.field(id);
Preconditions.checkNotNull(field, "Cannot filter by nested column: %s", schema.findField(id));

if (canContainNulls(id)) {
if (canContainNulls(id) || canContainNaNs(id)) {
return ROWS_MIGHT_NOT_MATCH;
}

if (lowerBounds != null && lowerBounds.containsKey(id)) {
T lower = Conversions.fromByteBuffer(field.type(), lowerBounds.get(id));

int cmp = lit.comparator().compare(lower, lit.value());
if (cmp >= 0) {

// Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
// if the first value in a file is NaN, metrics of this file will report NaN for both upper and
// lower bound despite that the column could contain non-NaN data.
// Without this NaN check below, we may include a file that contains rows that don't match.
if (cmp >= 0 && !NaNUtil.isNaN(lower)) {
return ROWS_MUST_MATCH;
}
}
@@ -271,7 +282,7 @@ public <T> Boolean eq(BoundReference<T> ref, Literal<T> lit) {
Types.NestedField field = struct.field(id);
Preconditions.checkNotNull(field, "Cannot filter by nested column: %s", schema.findField(id));

if (canContainNulls(id)) {
if (canContainNulls(id) || canContainNaNs(id)) {
return ROWS_MIGHT_NOT_MATCH;
}

@@ -304,13 +315,21 @@ public <T> Boolean notEq(BoundReference<T> ref, Literal<T> lit) {
Types.NestedField field = struct.field(id);
Preconditions.checkNotNull(field, "Cannot filter by nested column: %s", schema.findField(id));

if (containsNullsOnly(id)) {
if (containsNullsOnly(id) || containsNaNsOnly(id)) {
return ROWS_MUST_MATCH;
}

if (lowerBounds != null && lowerBounds.containsKey(id)) {
T lower = Conversions.fromByteBuffer(struct.field(id).type(), lowerBounds.get(id));

// Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
// if the first value in a file is NaN, metrics of this file will report NaN for both upper and
// lower bound despite that the column could contain non-NaN data.
// Thus we don't have visibility into the stats when lower bound is NaN.
if (NaNUtil.isNaN(lower)) {
return ROWS_MIGHT_NOT_MATCH;
}

int cmp = lit.comparator().compare(lower, lit.value());
if (cmp > 0) {
return ROWS_MUST_MATCH;
@@ -335,7 +354,7 @@ public <T> Boolean in(BoundReference<T> ref, Set<T> literalSet) {
Types.NestedField field = struct.field(id);
Preconditions.checkNotNull(field, "Cannot filter by nested column: %s", schema.findField(id));

if (canContainNulls(id)) {
if (canContainNulls(id) || canContainNaNs(id)) {
return ROWS_MIGHT_NOT_MATCH;
}

@@ -371,7 +390,7 @@ public <T> Boolean notIn(BoundReference<T> ref, Set<T> literalSet) {
Types.NestedField field = struct.field(id);
Preconditions.checkNotNull(field, "Cannot filter by nested column: %s", schema.findField(id));

if (containsNullsOnly(id)) {
if (containsNullsOnly(id) || containsNaNsOnly(id)) {
return ROWS_MUST_MATCH;
}

@@ -380,6 +399,14 @@ public <T> Boolean notIn(BoundReference<T> ref, Set<T> literalSet) {
if (lowerBounds != null && lowerBounds.containsKey(id)) {
T lower = Conversions.fromByteBuffer(struct.field(id).type(), lowerBounds.get(id));

// Due to the comparison implementation of ORC stats, for float/double columns in ORC files,
// if the first value in a file is NaN, metrics of this file will report NaN for both upper and
// lower bound despite that the column could contain non-NaN data.
// Thus we don't have visibility into the stats when lower bound is NaN.
if (NaNUtil.isNaN(lower)) {
return ROWS_MIGHT_NOT_MATCH;
}

literals = literals.stream().filter(v -> ref.comparator().compare(lower, v) <= 0).collect(Collectors.toList());
if (literals.isEmpty()) { // if all values are less than lower bound, rows must match (notIn).
return ROWS_MUST_MATCH;
@@ -406,6 +433,11 @@ private boolean canContainNulls(Integer id) {
return nullCounts == null || (nullCounts.containsKey(id) && nullCounts.get(id) > 0);
}

private boolean canContainNaNs(Integer id) {
// nan counts might be null for early version writers when nan counters are not populated.
return nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) > 0;
}

private boolean containsNullsOnly(Integer id) {
return valueCounts != null && valueCounts.containsKey(id) &&
nullCounts != null && nullCounts.containsKey(id) &&
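
All of the new guards hinge on the same ordering quirk: Java's default float/double comparison treats NaN as greater than every other value, so a file whose recorded lower bound is NaN compares as greater than (or equal to) any literal and, without the isNaN check, would be pruned even though it may hold matching non-NaN rows. A minimal standalone sketch of that ordering, using only plain java.lang calls (the class name and values below are illustrative, not taken from this PR):

public class NaNBoundSketch {
  public static void main(String[] args) {
    // Pretend ORC reported NaN as the lower bound because the first value written was NaN.
    double nanLowerBound = Double.NaN;
    double literal = 1.0; // e.g. the literal in a `col < 1.0` predicate

    // Double.compare orders NaN above every other value, so the "lower bound"
    // looks greater than the literal even though the file may contain values below it.
    int cmp = Double.compare(nanLowerBound, literal);
    System.out.println(cmp > 0); // prints: true

    // Acting on cmp alone would return ROWS_CANNOT_MATCH and skip the file,
    // which is why the evaluators in this change add an explicit NaNUtil.isNaN(lower) guard.
  }
}

Hoisting the isNaN check above the comparison, as suggested in the review, yields the same result while making the intent explicit: a NaN bound carries no usable ordering information.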