PARQUET-34: Add #contains FilterPredicate for Array columns #1328

clairemcginty · 2024-04-23T19:43:52Z

Proposal to add a new FilterPredicate, Contains, that can be applied to List types to check whether the specified element is present among the repeated values. It can be composed using And or Or:

FilterPredicate predicate = contains(
  eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("cell"))
)

FilterPredicate predicate = or(
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("cell"))),
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("home")))
)

FilterPredicate predicate = and(
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("cell"))),
  contains(eq(binaryColumn("phoneNumbers.phone.kind"), Binary.fromString("home")))
)

The filtering logic is largely based on existing Eq predicates to apply filtering at the page/rowgroup level using statistics/dictionaries, with a specialized implementation in IncrementallyUpdatedFilterPredicateBuilder to do individual record-level filtering.
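
The intended record-level semantics can be sketched in plain Java without any parquet-mr dependency. `ContainsSemantics` and `containsEq` below are hypothetical names chosen for illustration, and `java.util.function.Predicate` stands in for `FilterPredicate`. The sketch only models "some element of the repeated field matches"; it does not model the statistics, dictionary, or column-index paths.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical illustration of Contains semantics; not parquet-mr code. */
final class ContainsSemantics {

  /** Matches a record whose repeated field holds at least one element equal to target. */
  static <T> Predicate<List<T>> containsEq(T target) {
    return values -> values.stream().anyMatch(target::equals);
  }

  public static void main(String[] args) {
    Predicate<List<String>> cell = containsEq("cell");
    Predicate<List<String>> home = containsEq("home");

    // Stand-in for the repeated values of phoneNumbers.phone.kind in one record.
    List<String> kinds = Arrays.asList("cell", "work");

    System.out.println(cell.test(kinds));           // true
    System.out.println(cell.or(home).test(kinds));  // true: "cell" is present
    System.out.println(cell.and(home).test(kinds)); // false: "home" is missing
  }
}
```

Under this reading, `and(contains(a), contains(b))` is satisfied when different elements match `a` and `b`; no single element has to match both.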

Jira

My PR addresses the following Parquet Jira issues and references
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-34
- In case you are adding a dependency, check if the license complies with
  the ASF 3rd Party License Policy.

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Style

My contribution adheres to the code style guidelines and Spotless passes.
- To apply the necessary changes, run mvn spotless:apply -Pvector-plugins

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does

parquet-hadoop/src/main/java/org/apache/parquet/filter2/bloomfilterlevel/BloomFilterImpl.java

clairemcginty · 2024-04-24T13:44:23Z

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java

@@ -257,6 +266,16 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
    return new NotIn<>(column, values);
  }

+  public static <T extends Comparable<T>, C extends Column<T> & SupportsContains> Contains<T> contains(


Thinking to the future, it could eventually be useful to support Array FilterPredicates like "repeated field X contains an int field greater than Y". We could refactor this API to be written like:

FilterApi.contains( FilterApi.eq( FilterApi.arrayColumn(FilterApi.intColumn("repeated_int_field")), 100 ) )

& we could eventually support FilterApi.contains(FilterApi.lt(...)), FilterApi.contains(FilterApi.gt(...)), etc...

It seems that Contains/DoesNotContain is not a standard SQL function. In which cases are they used?

BTW, these new functions currently accept only a single value. Does it make sense to support a variable number of values? I also see that some databases support CONTAINS(a OR b) and CONTAINS(a AND b). They can be composed with logical operators, but the cost would be high compared to supporting them natively.

It's not a standard SQL function, but I've seen it in SQL extension languages such as BigQuery Standard SQL, and I've gotten several requests to support this by users of the Scio Parquet library!

that's a good point about making this composable, I think it would be more efficient to do CONTAINS(a or b) than CONTAINS(a) or CONTAINS(b). What do you think about supporting lt/gt in addition to eq-based Contains? for example, CONTAINS(eq(a) OR gt(b)) ? It would make this PR a lot more complex but I'm happy to try. We could probably re-use a lot of the existing filter code for eq, lt/gt, etc...

Yes, I agree that it is worth making the function composable. Perhaps we can define all the necessary APIs (with implementations of the most common ones) in this PR, and implementations of all other kinds of inputs can be split into separate PRs. Let's not rush this for the coming release. It still takes time to stabilize.

I took a second look, it seems that you were mentioning IN expression which we already have: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L209-L258

Ok, I've pushed my changes to support composable contains predicates! example usage:

FilterApi.containsOr(
  FilterApi.containsEq(longColumn("phoneNumbers.phone.number"), 5555555555L),
  FilterApi.containsOr(
    FilterApi.containsEq(longColumn("phoneNumbers.phone.number"), -10000000L),
    FilterApi.containsEq(longColumn("phoneNumbers.phone.number"), 2222222222L)));

I left out implementing DoesNotContain/ContainsNotEq for now to keep the PR simpler to parse. Let me know if the API looks ok 👍 Then I can fill out the rest of the implementations/make the unit tests more thorough.

Sorry I didn't make it clear. I thought we could still leverage the existing composite expressions and combine CONTAINS(x) sub-expressions under the same level of an AND or OR expression during the rewrite process, like https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/compat/FilterCompat.java#L79. It would be harder to maintain the code if we introduce a new specialized composite expression like containsOr.

ahh, ok. So we’d have a single public Contains predicate class that’s registered with the Visitor API, that can support one or more sub-predicates… and the Rewriter would inspect any Ands or Ors to see if the left/right side is an instance of Contains… If so, we rewrite it into a single Contains predicate containing both clauses?

I would expect so, but it seems to require a lot more work. We can support the simplest case at this point.

ok, I think I mostly got this working, supporting just Contains Eq (+ composition using And/Or) to start. hopefully this is closer to what you had in mind!
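
The rewrite discussed in this thread can be sketched on a toy predicate tree. Everything below (`ContainsRewriteSketch`, `Pred`, `Eq`, `Contains`, `Logical`, `rewrite`) is a hypothetical mini-AST written for illustration and is not the PR's ContainsRewriter; it only shows the structural idea of folding two Contains nodes on the same column, joined by And or Or, into a single Contains node.

```java
/** Hypothetical mini-AST illustrating the rewrite idea; not parquet-mr code. */
final class ContainsRewriteSketch {
  interface Pred {}

  static final class Eq implements Pred {
    final String column;
    final Object value;
    Eq(String column, Object value) { this.column = column; this.value = value; }
    @Override public String toString() { return "eq(" + column + ", " + value + ")"; }
  }

  /** "Some element of `column` satisfies `inner`"; inner may itself be composed. */
  static final class Contains implements Pred {
    final String column;
    final Pred inner;
    Contains(String column, Pred inner) { this.column = column; this.inner = inner; }
    @Override public String toString() { return "contains(" + inner + ")"; }
  }

  /** A binary And/Or node; op is "and" or "or". */
  static final class Logical implements Pred {
    final String op;
    final Pred left;
    final Pred right;
    Logical(String op, Pred left, Pred right) { this.op = op; this.left = left; this.right = right; }
    @Override public String toString() { return op + "(" + left + ", " + right + ")"; }
  }

  static Pred contains(Eq eq) { return new Contains(eq.column, eq); }
  static Pred and(Pred l, Pred r) { return new Logical("and", l, r); }
  static Pred or(Pred l, Pred r) { return new Logical("or", l, r); }

  /** Bottom-up rewrite: merge sibling Contains nodes that target the same column. */
  static Pred rewrite(Pred p) {
    if (!(p instanceof Logical)) {
      return p; // leaves (Eq, Contains) are returned unchanged
    }
    Logical node = (Logical) p;
    Pred l = rewrite(node.left);
    Pred r = rewrite(node.right);
    if (l instanceof Contains && r instanceof Contains) {
      Contains lc = (Contains) l;
      Contains rc = (Contains) r;
      if (lc.column.equals(rc.column)) {
        // The merged node keeps the original meaning: for "and", some element
        // matches the left clause and some (possibly other) element matches the right.
        return new Contains(lc.column, new Logical(node.op, lc.inner, rc.inner));
      }
    }
    return new Logical(node.op, l, r);
  }

  public static void main(String[] args) {
    Pred p = or(
        contains(new Eq("phoneNumbers.phone.kind", "cell")),
        contains(new Eq("phoneNumbers.phone.kind", "home")));
    // prints contains(or(eq(phoneNumbers.phone.kind, cell), eq(phoneNumbers.phone.kind, home)))
    System.out.println(rewrite(p));
  }
}
```

Contains nodes on different columns are left as separate children of the And/Or, since a single column reader cannot evaluate both.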

wgtmac · 2024-04-24T14:37:29Z

Thanks for the new feature! I will try to take a look soon and make it to the 1.14.0 release.

wgtmac · 2024-04-25T06:04:01Z

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java

@@ -257,6 +266,16 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
    return new NotIn<>(column, values);
  }

+  public static <T extends Comparable<T>, C extends Column<T> & SupportsContains> Contains<T> contains(


It seems that Contains/DoesNotContain is not a standard SQL function. In which cases are they used?

wgtmac · 2024-04-25T06:06:24Z

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java

@@ -257,6 +266,16 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
    return new NotIn<>(column, values);
  }

+  public static <T extends Comparable<T>, C extends Column<T> & SupportsContains> Contains<T> contains(


BTW, these new functions currently accept only a single value. Does it make sense to support a variable number of values? I also see that some databases support CONTAINS(a OR b) and CONTAINS(a AND b). They can be composed with logical operators, but the cost would be high compared to supporting them natively.

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java

parquet-hadoop/src/main/java/org/apache/parquet/filter2/bloomfilterlevel/BloomFilterImpl.java

pom.xml

wgtmac · 2024-04-26T09:49:19Z

As requested on the dev@parquet ML, I'll wait for this PR before starting the release process for 1.14.0.

gszadovszky

Thanks @clairemcginty for working on this. This is a great improvement to support nested structures in filtering!

I agree with @wgtmac's comments/questions. Added some more.

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java

parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java

wgtmac · 2024-04-29T06:17:36Z

BTW, should we postpone this feature to the 1.15.0 release? We can always release a new version as needed, and I can volunteer to be the release manager once this is ready.

gszadovszky · 2024-04-29T07:22:16Z

I agree with @wgtmac. Let's create smaller PRs, make improvements, and then release when we feel it is stable. During development, we may advertise this feature so that others can start experimenting with it and give feedback before we actually release it. Releasing too early might make later improvements harder because we need to stay backward compatible.

clairemcginty · 2024-04-29T08:28:37Z

Sounds good to me! This PR might take another week or two to get right. It would also be nice to release support for operations like Array#size at the same time, so pushing it to 1.15 would give us time to do that 👍

...-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java

wgtmac · 2024-05-15T05:24:13Z

Sorry for the delay. I will try to finish another pass by the end of this week.

wgtmac

This overall LGTM! Thanks @clairemcginty for working on this and adding exhaustive test!

cc @gszadovszky @julienledem @rdblue Sorry for bothering. I think this PR requires visibility to more reviewers.

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ContainsRewriter.java

...r/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java

...-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java

clairemcginty · 2024-05-17T12:22:21Z

great, I'm glad this implementation looks ok! I have a few more tests that I'd like to add around null handling + behavior of the Contains predicate on map types (I think that they should just work, but I haven't tried it out yet...). Will try to add those + address PR comments on Monday or Tuesday next week 👍

clairemcginty · 2024-05-20T12:11:32Z

I tried adding a test case to TestRecordLevelFilters to test contains(eq(null)). My expectation was that if you have an array schema with an optional element type, this should return true if the array contains one or more null elements. However, I don't think this is possible to make work: I set a debugger breakpoint on ValueInspector#update and ValueInspector#updateNull, and found that ValueInspector#update is only invoked for non-null elements, while ValueInspector#updateNull is only invoked if the entire array is null, which isn't exactly what we want either.

So based on my current understanding, I don't think we can support a contains(eq(null)) predicate, and we can probably add a precondition check to the Contains constructor against a null predicate value. Wdyt @wgtmac ?
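
A minimal sketch of such a precondition check, assuming a simplified, hypothetical `NonNullContains` class rather than the PR's actual Contains operator: the constructor rejects a null element value up front, so the unsupported case fails at predicate-construction time instead of silently never matching.

```java
import java.util.Objects;

/** Hypothetical stand-in for a Contains predicate; not the parquet-mr class. */
final class NonNullContains<T extends Comparable<T>> {
  private final String columnPath;
  private final T value;

  NonNullContains(String columnPath, T value) {
    this.columnPath = Objects.requireNonNull(columnPath, "columnPath cannot be null");
    // Per the observation above, null elements never reach the record-level
    // inspector, so a null target could never match; reject it early.
    this.value = Objects.requireNonNull(value, "Contains predicate does not support null element value");
  }

  @Override
  public String toString() {
    return "contains(eq(" + columnPath + ", " + value + "))";
  }

  public static void main(String[] args) {
    System.out.println(new NonNullContains<String>("phoneNumbers.phone.kind", "cell"));
    try {
      new NonNullContains<String>("phoneNumbers.phone.kind", null);
    } catch (NullPointerException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```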

clairemcginty · 2024-05-21T17:15:44Z

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java

@@ -257,6 +258,10 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
    return new NotIn<>(column, values);
  }

+  public static <T extends Comparable<T>> Contains<T> contains(Eq<T> pred) {


In a fast-follow-up PR we can just replace this signature with:

public static <T extends Comparable<T>> Contains<T> contains(ColumnFilterPredicate<T> pred) { ... }

and remove the SupportsContains annotation. I did a few tests of Contains(lt) and Contains(gt) and everything is working as expected 👍

Sounds good, but keep in mind that this one and the follow ups shall be released together otherwise we'll have API compatibility issues.

understood -- I can commit to working on the follow-up PR right away 👍

clairemcginty

I did some ad-hoc testing of this on larger datasets (1m+ elements with various page/row group distributions) and confirmed that everything looks right! so I think this PR is finally complete/ready for review.

The only outstanding task is to include support for NotEq/Lt/Gt/Lte/Gte/In/NotIn with Contains, which is just updating the FilterApi signature and adding a few lines to IncrementallyUpdatedFilterPredicateGenerator. I think that can be a followup PR 👍

gszadovszky · 2024-05-22T07:14:18Z

@wgtmac, completely agree to have more people reviewing this. Thanks for pinging.
I'll try to take a look this week.

gszadovszky

Great work, @clairemcginty!
I've added some comments.

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ContainsRewriter.java

gszadovszky · 2024-05-24T08:25:33Z

parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java

@@ -257,6 +258,10 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N
    return new NotIn<>(column, values);
  }

+  public static <T extends Comparable<T>> Contains<T> contains(Eq<T> pred) {


Sounds good, but keep in mind that this one and the follow ups shall be released together otherwise we'll have API compatibility issues.

gszadovszky · 2024-05-24T10:55:26Z

...-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java

+            public int nextInt() {
+              int result = next;
+              next = null;
+              return result;
+            }


I don't think this implementation follows the contract of an iterator. Iterators do not require the user to call hasNext before calling next. If you are sure about the size, you may just call next repeatedly. next may throw a NoSuchElementException if there are no more values.
The trick is to calculate the next value beforehand and check the existence of that value at hasNext and calculate the "next-next" value at next before returning the pre-calculated next value. (Hope it makes sense 😄)

See IndexIterator as an example. You may even put static methods there for intersection and union and it will be more readable here.
I would not use boxing/unboxing here only to have an additional null value. Indices here can only be non-negative, so you may use -1 to mark the case of no more values.

makes sense! I had put some of the next-loading logic into hasNext() because as far as I could tell, this iterator would be used via the forEachRemaining() method, which IIRC does sequentially call hasNext()/next()... but it does make this a bit unstable 😅 I can rewrite it and move into static methods in IndexIterator.
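
The look-ahead pattern described in this thread can be sketched as follows. `SortedIndexIntersection` is a hypothetical class written for illustration, not the code that ended up in IndexIterator; it assumes both inputs yield strictly ascending, non-negative indices, and it uses -1 as the "no more values" marker instead of a boxed null.

```java
import java.util.NoSuchElementException;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

/** Hypothetical look-ahead intersection of two ascending index iterators. */
final class SortedIndexIntersection implements PrimitiveIterator.OfInt {
  private static final int EXHAUSTED = -1; // valid indices are non-negative

  private final PrimitiveIterator.OfInt lhs;
  private final PrimitiveIterator.OfInt rhs;
  private int next; // pre-computed next value, or EXHAUSTED

  SortedIndexIntersection(PrimitiveIterator.OfInt lhs, PrimitiveIterator.OfInt rhs) {
    this.lhs = lhs;
    this.rhs = rhs;
    this.next = advance(); // compute the first value eagerly
  }

  /** Finds the next index present in both inputs, or EXHAUSTED. */
  private int advance() {
    if (!lhs.hasNext() || !rhs.hasNext()) {
      return EXHAUSTED;
    }
    int l = lhs.nextInt();
    int r = rhs.nextInt();
    while (l != r) {
      if (l < r) {
        if (!lhs.hasNext()) return EXHAUSTED;
        l = lhs.nextInt();
      } else {
        if (!rhs.hasNext()) return EXHAUSTED;
        r = rhs.nextInt();
      }
    }
    return l;
  }

  @Override
  public boolean hasNext() {
    return next != EXHAUSTED; // no side effects, safe to call any number of times
  }

  @Override
  public int nextInt() {
    if (next == EXHAUSTED) {
      throw new NoSuchElementException();
    }
    int result = next;
    next = advance(); // compute the "next-next" value before returning
    return result;
  }

  public static void main(String[] args) {
    PrimitiveIterator.OfInt it = new SortedIndexIntersection(
        IntStream.of(0, 2, 4, 6).iterator(),
        IntStream.of(2, 3, 6).iterator());
    while (it.hasNext()) {
      System.out.println(it.nextInt()); // prints 2, then 6
    }
  }
}
```

Because the next value is computed ahead of time, callers may call nextInt() repeatedly without calling hasNext() in between, and an exhausted iterator throws NoSuchElementException rather than failing on unboxing.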

parquet-column/src/test/java/org/apache/parquet/filter2/predicate/TestFilterApiMethods.java

parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java

gszadovszky · 2024-05-24T13:00:44Z

parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/TestRecordLevelFilters.java

@@ -20,6 +20,7 @@

 import static org.apache.parquet.filter2.predicate.FilterApi.and;
 import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
+import static org.apache.parquet.filter2.predicate.FilterApi.contains;


While checking the code I found that this record-level testing class does not seem to be correct. In several places (just like in your code) the method PhoneBookWriter.readFile is used, which calls PhoneBookWriter.createReader to create the reader instance. The issue is that, by default, all the other filters (statistics/dictionary/column index) are active. As a result, we cannot be sure that the read results we are getting are really filtered only by the record-level filter.
I know this is not your fault but could you please try using a properly set up reader for this test?

// ...
Configuration conf = new Configuration();
GroupWriteSupport.setSchema(schema, conf);
return ParquetReader.builder(new GroupReadSupport(), file)
    .withConf(conf)
    .withFilter(filter)
    .withAllocator(allocator)
    .useBloomFilter(false)
    .useDictionaryFilter(false)
    .useStatsFilter(false)
    .useColumnIndexFilter(false)
    .build();

If unrelated errors may occur, let's handle this separately, but I want to avoid not catching a potential issue in your code by leaving this test as is.

Ha, I did notice that -- I'd been refactoring StatisticsFilter, made a mistake, and was confused why TestRecordLevelFilter suddenly started failing.

updated! All tests continued to pass after disabling the other filters.

clairemcginty · 2024-06-03T13:03:02Z

hey @gszadovszky! all your requested changes have been addressed - anything else that's missing?

gszadovszky

I've found an issue missed before.

parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/IndexIterator.java

gszadovszky

Sorry @clairemcginty, my most important comment got lost from the previous review, somehow.

parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/IndexIterator.java

gszadovszky

Thanks, @clairemcginty. It looks good to me. Let's wait for the tests to pass.

@wgtmac, since no one else has chimed in on this review, I think we can move forward and merge it (after CI approval). There will be follow-up changes anyway, and the next release is probably a couple of months away, which gives others time to comment on this feature.

wgtmac · 2024-06-04T01:51:58Z

Thanks @gszadovszky for the detailed review! I'll take another pass shortly to get familiar with the latest changes and then merge it if there are no concerns.

wgtmac

+1

clairemcginty commented Apr 23, 2024

clairemcginty marked this pull request as ready for review April 24, 2024 13:40

clairemcginty commented Apr 24, 2024

wgtmac requested a review from gszadovszky April 24, 2024 14:18

wgtmac reviewed Apr 25, 2024

gszadovszky requested changes Apr 26, 2024

clairemcginty force-pushed the parquet-34 branch 2 times, most recently from a597fcf to 9cd2711 Compare April 28, 2024 10:25

clairemcginty commented May 3, 2024

wgtmac reviewed May 17, 2024

clairemcginty added 7 commits May 17, 2024 14:31

PARQUET-34: Add support for #contains predicate on repeated column

ec704ef

PARQUET-34: Update japicmp exclusions

10efa49

PARQUET-34: Make Contains() a composable operator

861eb7a

PARQUET-34: Simplify operator inheritance

a59a497

PARQUET-34: Implement single Contains predicate that supports composition

d570261

PARQUET-34: Improve toString for Contains predicates

c6e0c18

PARQUET-34: Test Contains predicate on Map logical types

908f9b4

clairemcginty force-pushed the parquet-34 branch from f2acc89 to 908f9b4 Compare May 17, 2024 18:31

PARQUET-34: Supply new field value for PhoneBookWriter.User

5196d1b

clairemcginty added 2 commits May 20, 2024 10:49

PARQUET-34: Throw exception for Not() in ContainsRewriter

a50f3ac

PARQUET-34: Add no-op addContainsEnd() to record-level filter

a1c8762

clairemcginty added 4 commits May 20, 2024 12:24

PARQUET-34: Implement contains.and and contains.or for ColumnIndexBuilder

62e71d2

PARQUET-34: Disallow null values for Contains predicate

dc70219

PARQUET-34: Fix and test Contains construction and toString

2feabfd

PARQUET-34: Fix type inference issue with FilterApi.contains constructor

645577b

clairemcginty commented May 21, 2024

gszadovszky requested changes May 24, 2024

clairemcginty added 5 commits May 24, 2024 10:00

PARQUET-34: Refactor ColumnIndexBuilder Iterators

a32b594

PARQUET-34: Use AssertThrows to test invalid Contains construction

fce6e11

PARQUET-34: Use MAP logical type in PhoneBookWriter

1fbc9e1

PARQUET-34: Disable non-record-level filters in PhoneBookWriter#readFile

dfd7b0f

PARQUET-34: Fix typo in ContainsRewriter doc

b776f6c

gszadovszky requested changes Jun 3, 2024

PARQUET-34: Throw NoSuchElementException when IndexIterators are exhausted

0d1e589

gszadovszky requested changes Jun 3, 2024

clairemcginty added 2 commits June 3, 2024 11:36

PARQUET-34: Fix IndexIterator#intersection outer loop to exhaust RHS

206ac07

PARQUET-34: fix typo

badc5f0

gszadovszky approved these changes Jun 3, 2024

wgtmac approved these changes Jun 4, 2024

wgtmac merged commit dab5aae into apache:master Jun 4, 2024
9 checks passed

clairemcginty mentioned this pull request Jun 4, 2024

PARQUET-34: Extend Contains support to all ColumnFilterPredicate types #1370

Merged

5 tasks

clairemcginty deleted the parquet-34 branch June 5, 2024 12:15

asfimport mentioned this pull request Jun 14, 2024

Add support for repeated columns in the filter2 API #1452

Open

wgtmac added this to the 1.15.0 milestone Sep 30, 2024
