Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add native 'array contains element' filter #15366

Merged
merged 10 commits into from
Nov 29, 2023

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Nov 14, 2023

Description

Alternative mentioned in #15350, which is simplified and only matches a single array element rather than being a full implementation of 'ARRAY_CONTAINS'. Decided to do this one when I realized that multi-value dimensions would be handled incorrectly with the other filter when using array type element matches.

Array element indexes are already hiding inside of array type columns, this PR adds a native arrayContainsElement filter so we can actually use them to implement things like ARRAY_CONTAINS (in this PR) and ARRAY_OVERLAP (not in this PR).

No documentation currently added, will do that in a follow-up PR.

As expected, this is a giant perf improvement when using ARRAY_CONTAINS on array columns. For the query added in the benchmarks:

SELECT string1, long1 FROM foo WHERE ARRAY_CONTAINS("multi-string3", 100) GROUP BY 1,2

Before:

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt     Score    Error  Units
SqlExpressionBenchmark.querySql       38           5000000      auto        false  avgt    5  4839.508 ± 89.496  ms/op

After:

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlExpressionBenchmark.querySql       38           5000000      auto        false  avgt    5  133.610 ± 3.750  ms/op

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Copy link
Contributor

@soumyava soumyava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took one pass. LGTM so far. Will run locally today and do a 2nd pass

@@ -413,7 +413,24 @@ public class SqlBenchmark
"SELECT APPROX_COUNT_DISTINCT_BUILTIN(dimZipf) FROM foo",
"SELECT APPROX_COUNT_DISTINCT_DS_HLL(dimZipf) FROM foo",
"SELECT APPROX_COUNT_DISTINCT_DS_HLL_UTF8(dimZipf) FROM foo",
"SELECT APPROX_COUNT_DISTINCT_DS_THETA(dimZipf) FROM foo"
"SELECT APPROX_COUNT_DISTINCT_DS_THETA(dimZipf) FROM foo",
// 32: LATEST aggregator long
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for moving these from expression benchmarks to here

Copy link
Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall seems good. Though tbh, the whole matchers and processors classes are a maze of abstractions that I find difficult to navigate through. So I might have missed a thing here and there.

Comment on lines -3330 to 3335
ExpressionType type = ExpressionType.LONG;
ExpressionType type = null;
for (Expr arg : args) {
type = ExpressionTypeConversion.function(type, arg.getOutputType(inspector));
type = ExpressionTypeConversion.leastRestrictiveType(type, arg.getOutputType(inspector));
}
return ExpressionType.asArrayType(type);
return type == null ? null : ExpressionTypeFactory.getInstance().ofArray(type);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was wrong in some cases and written before leastRestrictiveType existed, for example if you try to use the array constructor function on an array type input, you end up with

org.apache.druid.segment.column.Types$IncompatibleTypeException: Cannot implicitly cast [LONG] to [ARRAY<STRING>]

which doesn't happen after the change. Will add a test.

@@ -53,7 +53,8 @@
@JsonSubTypes.Type(name = "equals", value = EqualityFilter.class),
@JsonSubTypes.Type(name = "range", value = RangeFilter.class),
@JsonSubTypes.Type(name = "isfalse", value = IsFalseDimFilter.class),
@JsonSubTypes.Type(name = "istrue", value = IsTrueDimFilter.class)
@JsonSubTypes.Type(name = "istrue", value = IsTrueDimFilter.class),
@JsonSubTypes.Type(name = "arrayContainsElement", value = ArrayContainsElementFilter.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a shorter name such as arrayContains sounds better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to confuse this native filter with the array_contains function which this filter can be used to implement but doesn't completely implement (see PR description), so I'd prefer to leave it like this to avoid confusion

}
}

private static class TypedConstantElementValueMatcherFactory extends EqualityFilter.TypedConstantValueMatcherFactory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some javadocs here would be useful.

this.doublePredicateSupplier = equalityPredicateFactory::makeDoublePredicate;
this.floatPredicateSupplier = equalityPredicateFactory::makeFloatPredicate;
}
this.objectPredicateSupplier = makeObjectPredicateSupplier();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method results in a call stack that refers to typeDetectingArrayPredicateSupplier which is not initialized yet.
makeObjectPredicateSupplier -> makeArrayPredicate -> typeDetectingArrayPredicateSupplier

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing will get these predicates until after the constructor though, what is the problem you're worried about?

@Nullable ColumnCapabilities columnCapabilities
)
{
return super.makeArrayProcessor(selector, columnCapabilities);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while it kinda does the right thing right now, it feels too tightly coupled with EqualityFilter. It confused me for some time till I realized that you are passing the predicateMatcherFactory to super class. The way I read it as that elementMatchValue can be compared to input array through "equals" comparison. I think it would be better to just call predicateMatcherFactory.makeArrayProcessor(selector, columnCapabilities); and in makeComplexProcessor

Copy link
Member Author

@clintropolis clintropolis Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while it kinda does the right thing right now, it feels too tightly coupled with EqualityFilter

that is the point though, if the filter is matching a scalar column and the match value is a non-null scalar value then this filter should behave identical to equality filter, the ask during review of #15350 was to re-use as much of the equality filter as possible because in certain cases it is the equality filter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, i just removed these overrides since they were pointless

@abhishekagarwal87
Copy link
Contributor

btw how do the benchmark numbers look like if you disable bitmap indexes?

@clintropolis
Copy link
Member Author

btw how do the benchmark numbers look like if you disable bitmap indexes?

like with filterTuning? I didn't check because SQL has no way to control filterTuning, I would guess somewhere between the expression filter numbers and index numbers, but can check in a bit

@clintropolis
Copy link
Member Author

btw how do the benchmark numbers look like if you disable bitmap indexes?

ok re-ran benchmarks (my laptop a bit busier this time than last time so stuff a bit slower)

expression filter:

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt     Score    Error  Units
SqlExpressionBenchmark.querySql       38           5000000      auto        false  avgt    5  5135.236 ± 77.511  ms/op

arrayContainsElement filter (indexes commented out):

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt     Score     Error  Units
SqlExpressionBenchmark.querySql       38           5000000      auto        false  avgt    5  5140.199 ± 141.967  ms/op

arrayContainsElement filter (with indexes):

Benchmark                        (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt    Score   Error  Units
SqlExpressionBenchmark.querySql       38           5000000      auto        false  avgt    5  149.909 ± 2.783  ms/op

@clintropolis clintropolis merged commit 64fcb32 into apache:master Nov 29, 2023
83 checks passed
@clintropolis clintropolis deleted the array-contains-element-filter branch November 29, 2023 11:33
abhishekagarwal87 pushed a commit that referenced this pull request Nov 30, 2023
…mns (#15451)

Updates ARRAY_OVERLAP to use the same ArrayContainsElement filter added in #15366 when filtering ARRAY typed columns so that it can also use indexes like ARRAY_CONTAINS.
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
* add native arrayContainsElement filter to use array column element indexes
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
…mns (apache#15451)

Updates ARRAY_OVERLAP to use the same ArrayContainsElement filter added in apache#15366 when filtering ARRAY typed columns so that it can also use indexes like ARRAY_CONTAINS.
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
* add native arrayContainsElement filter to use array column element indexes
yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
…mns (apache#15451)

Updates ARRAY_OVERLAP to use the same ArrayContainsElement filter added in apache#15366 when filtering ARRAY typed columns so that it can also use indexes like ARRAY_CONTAINS.
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants