-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-34: Add #contains FilterPredicate for Array columns #1328
Conversation
parquet-hadoop/src/main/java/org/apache/parquet/filter2/bloomfilterlevel/BloomFilterImpl.java
Outdated
Show resolved
Hide resolved
@@ -257,6 +266,16 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N | |||
return new NotIn<>(column, values); | |||
} | |||
|
|||
public static <T extends Comparable<T>, C extends Column<T> & SupportsContains> Contains<T> contains( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking to the future, it could eventually be useful to support Array FilterPredicates like "repeated field X contains an int field greater than Y". We could refactor this API to be written like:
FilterApi.contains(
FilterApi.eq(
FilterApi.arrayColumn(FilterApi.intColumn("repeated_int_field")),
100
)
)
& we could eventually support FilterApi.contains(FilterApi.lt(...))
, FilterApi.contains(FilterApi.gt(...))
, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that Contains/DoesNotContain is not a standard SQL function. In which case they are used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, now these new functions only accept single value. Does it make sense to support variable number of values? I also see some databases support CONTAINS(a OR b) and CONTAINS(a AND b). They can be composited by logical operators but the cost would be high compared to support them natively,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a standard SQL function, but I've seen it in SQL extension languages such as BigQuery Standard SQL, and I've gotten several requests to support this by users of the Scio Parquet library!
that's a good point about making this composable, I think it would be more efficient to do CONTAINS(a or b)
than CONTAINS(a) or CONTAINS(b)
. What do you think about supporting lt/gt
in addition to eq
-based Contains? for example, CONTAINS(eq(a) OR gt(b))
? It would make this PR a lot more complex but I'm happy to try. We could probably re-use a lot of the existing filter code for eq
, lt/gt
, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that it worths making the function composable. Perhaps we can define all the necessary API (with implementation of the most common ones) in the PR and implementations of all other kinds of inputs can be split into separate PRs. Let's not hurry for the coming release. It still takes time to stabilize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a standard SQL function, but I've seen it in SQL extension languages such as BigQuery Standard SQL, and I've gotten several requests to support this by users of the Scio Parquet library!
that's a good point about making this composable, I think it would be more efficient to do
CONTAINS(a or b)
thanCONTAINS(a) or CONTAINS(b)
. What do you think about supportinglt/gt
in addition toeq
-based Contains? for example,CONTAINS(eq(a) OR gt(b))
? It would make this PR a lot more complex but I'm happy to try. We could probably re-use a lot of the existing filter code foreq
,lt/gt
, etc...
I took a second look, it seems that you were mentioning IN
expression which we already have: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L209-L258
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've pushed my changes to support composable contains predicates! example usage:
FilterApi.containsOr( FilterApi.containsEq(longColumn("phoneNumbers.phone.number"), 5555555555L), FilterApi.containsOr( FilterApi.containsEq(longColumn("phoneNumbers.phone.number"), -10000000L), FilterApi.containsEq(longColumn("phoneNumbers.phone.number"), 2222222222L)));I left out implementing DoesNotContain/ContainsNotEq for now to keep the PR simpler to parse. Let me know if the API looks ok 👍 Then I can fill out the rest of the implementations/make the unit tests more thorough.
Sorry I didn't make it clear. I thought we could still leverage existing composite expression and combine CONTAINS(x)
sub-expressions under the same level of AND
or OR
expression during rewrite process like https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/compat/FilterCompat.java#L79. It would be harder to maintain the code if introduce new specialized composite expression like containsOr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, ok. So we’d have a single public Contains
predicate class that’s registered with the Visitor API, that can support one or more sub-predicates… and the Rewriter would inspect any Ands or Ors to see see if the left/right side is an instance of Contains… If so, we rewrite it into a single Contains
predicate containing both clauses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect so, but it seems requiring a lot more work. We can support the simplest case at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I think I mostly got this working, supporting just Contains Eq (+ composition using And/Or) to start. hopefully this is closer to what you had in mind!
Thanks for the new feature! I will try to take a look soon and make it to the 1.14.0 release. |
@@ -257,6 +266,16 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N | |||
return new NotIn<>(column, values); | |||
} | |||
|
|||
public static <T extends Comparable<T>, C extends Column<T> & SupportsContains> Contains<T> contains( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that Contains/DoesNotContain is not a standard SQL function. In which case they are used?
@@ -257,6 +266,16 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N | |||
return new NotIn<>(column, values); | |||
} | |||
|
|||
public static <T extends Comparable<T>, C extends Column<T> & SupportsContains> Contains<T> contains( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, now these new functions only accept single value. Does it make sense to support variable number of values? I also see some databases support CONTAINS(a OR b) and CONTAINS(a AND b). They can be composited by logical operators but the cost would be high compared to support them natively,
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/filter2/bloomfilterlevel/BloomFilterImpl.java
Outdated
Show resolved
Hide resolved
As requested on the dev@parquet ML, I'll wait for this PR before starting the releasing process of 1.14.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @clairemcginty for working on this. This is a great improvement to support nested structures in filtering!
I agree with @wgtmac's comments/questions. Added some more.
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/Operators.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java
Outdated
Show resolved
Hide resolved
a597fcf
to
9cd2711
Compare
BTW, should we postpone this feature to 1.15.0 release? We can always release new version as needed and I can volunteer to be the release manager once this is ready. |
I agree with @wgtmac. Let's create smaller PRs and make improvements then release when we feel it stable. During the development, we may advertise this feature so others may start experimenting on it and give feedback before actually releasing it. A too early release might make later improvements harder because we need to be backward compatible. |
Sounds good to me! this PR might take another week or two to get right. It would also be nice to release support for operations like Array#size at the same time, so pushing it to 0.15 would give us time to do that 👍 |
...-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
Show resolved
Hide resolved
Sorry for the delay. I will try to finish another pass by the end of this week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overall LGTM! Thanks @clairemcginty for working on this and adding exhaustive test!
cc @gszadovszky @julienledem @rdblue Sorry for bothering. I think this PR requires visibility to more reviewers.
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ContainsRewriter.java
Outdated
Show resolved
Hide resolved
...r/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
Show resolved
Hide resolved
...-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
Show resolved
Hide resolved
great, I'm glad this implementation looks ok! I have a few more tests that I'd like to add around null handling + behavior of the |
I tried adding a test case to So based on my current understanding, I don't think we can support a |
@@ -257,6 +258,10 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N | |||
return new NotIn<>(column, values); | |||
} | |||
|
|||
public static <T extends Comparable<T>> Contains<T> contains(Eq<T> pred) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a fast-follow-up PR we can just replace this signature with:
public static <T extends Comparable<T>> Contains<T> contains(ColumnFilterPredicate<T> pred) {
...
}
and remove the SupportsContains
annotation. I did a few tests of Contains(lt)
and Contains(gt)
and everything is working as expected 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, but keep in mind that this one and the follow ups shall be released together otherwise we'll have API compatibility issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
understood -- I can commit to working on the follow-up PR right away 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some ad-hoc testing of this on larger datasets (1m+ elements with various page/row group distributions) and confirmed that everything looks right! so I think this PR is finally complete/ready for review.
The only outstanding task is to include support for NotEq/Lt/Gt/Lte/Gte/In/NotIn
with Contains, which is just updating the FilterApi
signature and adding a few lines to IncrementallyUpdatedFilterPredicateGenerator
. I think that can be a followup PR 👍
@wgtmac, completely agree to have more people reviewing this. Thanks for pinging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, @clairemcginty!
I've added some comments.
parquet-column/src/main/java/org/apache/parquet/filter2/predicate/ContainsRewriter.java
Outdated
Show resolved
Hide resolved
@@ -257,6 +258,10 @@ public static <T extends Comparable<T>, C extends Column<T> & SupportsEqNotEq> N | |||
return new NotIn<>(column, values); | |||
} | |||
|
|||
public static <T extends Comparable<T>> Contains<T> contains(Eq<T> pred) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, but keep in mind that this one and the follow ups shall be released together otherwise we'll have API compatibility issues.
public int nextInt() { | ||
int result = next; | ||
next = null; | ||
return result; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this implementation follows the contract of an iterator. Iterators do not enforce the user to call hasNext
before calling next
. If you sure about the size, you may just call one next
after another. It may throw a NoSuchElementException
if there are no more values.
The trick is to calculate the next value beforehand and check the existence of that value at hasNext
and calculate the "next-next" value at next
before returning the pre-calculated next value. (Hope it makes sense 😄)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See IndexIterator as an example. You may even put static methods there for intersection
and union
and it will be more readable here.
I would not use boxing/unboxing here only to have a additional null
value. Indices here can only be non-negative so you may use -1
to mark the case of no more values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense! I had put some of the next-loading logic into hasNext()
because as far as I could tell, this iterator would be used via the forEachRemaining()
method, which IIRC does sequentially call hasNext()/next()
... but it does make this a bit unstable 😅 I can rewrite it and move into static methods in IndexIterator
.
parquet-column/src/test/java/org/apache/parquet/filter2/predicate/TestFilterApiMethods.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
Outdated
Show resolved
Hide resolved
@@ -20,6 +20,7 @@ | |||
|
|||
import static org.apache.parquet.filter2.predicate.FilterApi.and; | |||
import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn; | |||
import static org.apache.parquet.filter2.predicate.FilterApi.contains; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During checking the code I've found that this record level testing class does not seem to be correct. In several places (just like in your code) the method PhoneBookWriter.readFile
is used which calls PhoneBookWriter.createReader to create the reader instance. The issue is, by default all the filtering (statistics/dictionary/column index) are working. As a result, we cannot be sure that the read results we are getting are really filtered only by the record level filter.
I know this is not your fault but could you please try using a properly set up reader for this test?
// ...
Configuration conf = new Configuration();
GroupWriteSupport.setSchema(schema, conf);
return ParquetReader.builder(new GroupReadSupport(), file)
.withConf(conf)
.withFilter(filter)
.withAllocator(allocator)
.useBloomFilter(false)
.useDictionaryFilter(false)
.useStatsFilter(false)
.useColumnIndexFilter(false)
.build();
If unrelated errors may occur, let's handle this separately, but I want to avoid not catching a potential issue in your code by leaving this test as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, I did notice that -- I'd been refactoring StatisticsFilter, made a mistake, and was confused why TestRecordLevelFilter
suddenly started failing.
updated! All tests continued to pass after disabling the other filters.
hey @gszadovszky! all your requested changes have been addressed - anything else that's missing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've found an issue missed before.
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/IndexIterator.java
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/IndexIterator.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @clairemcginty, my most important comment got lost from the previous review, somehow.
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/IndexIterator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @clairemcginty. It looks good to me. Let's wait for the tests to pass.
@wgtmac, since no one else chimed in this review, I think, we can step forward and merge it (after CI approval). There will be follow up changes anyway and the next release is probably a couple months away that gives time for others to comment on this feature.
Thanks @gszadovszky for the detail review! I'll take another pass shortly to be familiar with the latest change and then merge it if no concern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Proposal to add a new FilterPredicate,
Contains
, that can be applied to List types, and check if the specified element is present among the repeated values. It can be composed using And or Or:The filtering logic is largely based on existing Eq predicates to apply filtering at the page/rowgroup level using statistics/dictionaries, with a specialized implementation in
IncrementallyUpdatedFilterPredicateBuilder
to do individual record-level filtering.Jira
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
the ASF 3rd Party License Policy.
Tests
Commits
from "How to write a good git commit message":
Style
mvn spotless:apply -Pvector-plugins
Documentation