Support IN lists with more than three constants in predicates for bloom filters #8436
Comments
I'd like to have a try.
@my-vegetable-has-exploded -- thank you -- can you please implement it in terms of
@alamb @my-vegetable-has-exploded I found that if there are more than three values, the bloom filter also can't work.
This works:
SELECT * FROM tbl WHERE (trace_id='3c7dbf90d1a66e3faffa344519c3bac3' OR trace_id='1' OR trace_id='2') LIMIT 150;
This doesn't work:
SELECT * FROM tbl WHERE (trace_id='3c7dbf90d1a66e3faffa344519c3bac3' OR trace_id='1' OR trace_id='2' OR trace_id='3') LIMIT 150;
This is the same problem, because more than three values will convert
I think we can use short-circuit evaluation here: if the left side satisfies some condition, we can skip the right side.
I am in the middle of rewriting the bloom filter implementation to be more general (see #8442). I believe the new (not yet merged) code correctly handles predicates like
However, the new code does not handle explicit
This ticket was perhaps a bit overeager -- basically I recommend not changing the existing implementation as I am rewriting it. However, if you would like to change the existing code, that is also fine; I will manage the conflicts as part of my PRs.
Got it.
The relevant PRs have been merged now, so I think it would be possible to add support to
Thanks @hengfeiyang @alamb
If there are more than 10 items, it still can't use the bloom filter.
Can you provide an example that reproduces the problem? Thanks.
SELECT * FROM tbl WHERE (trace_id IN ('3c7dbf90d1a66e3faffa344519c3bac3', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29')) LIMIT 10;
How do you know the bloom filter isn't being used? Is there a reproducer (a parquet file) you can share? It appears that there is no good way to know if the bloom filter code is working via logging or metrics 🤔
I wonder if #8669 from @yahoNanJing is related (which basically adds IN list pruning based on
I ran a test locally by writing 200GB of data. With the bloom filter, the query takes only 0.1 seconds; without it, the query takes 1 second. So if a query takes 1 second, I can infer that it is not using the bloom filter, because a query that uses the bloom filter should return within 0.1 seconds.
Is your feature request related to a problem or challenge?
BloomFilter support was added in #7821 by @hengfeiyang ❤️
There is partial support for optimizing queries that have IN list predicates, as suggested by @Ted-Jiang: #7821 (comment) and tested via https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L1056-L1084
However, this only supports queries where there are three or fewer items in the IN list:
It only works for small numbers of constants because the current implementation only checks for predicates like col = 'foo' OR col = 'bar'. The reason this works for InLists is that IN lists with small numbers of items (3 or fewer) are rewritten to OR chains by this code in the optimizer: https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L500-L549
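As a rough illustration of that rewrite (not the optimizer's actual code), the sketch below uses a toy Expr enum and a hypothetical inline_in_list function. The threshold value of 3 is an assumption chosen to match the "three or fewer" behavior described in this issue.

```rust
/// Toy expression type; a hypothetical stand-in, not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// `col = 'literal'`
    Eq(String, String),
    /// `left OR right`
    Or(Box<Expr>, Box<Expr>),
    /// `col IN (v1, v2, ...)`
    InList(String, Vec<String>),
}

/// Assumed value, matching the "three or fewer" behavior described above.
const THRESHOLD_INLINE_INLIST: usize = 3;

/// Rewrites a short IN list into a chain of OR-ed equalities.
/// Longer (or empty) lists and all other expressions pass through unchanged.
fn inline_in_list(expr: Expr) -> Expr {
    match expr {
        Expr::InList(col, values)
            if !values.is_empty() && values.len() <= THRESHOLD_INLINE_INLIST =>
        {
            // Build `col = v1`, `col = v2`, ... and fold them left to right.
            let mut eqs = values.into_iter().map(|v| Expr::Eq(col.clone(), v));
            let first = eqs.next().expect("list is non-empty");
            eqs.fold(first, |acc, eq| Expr::Or(Box::new(acc), Box::new(eq)))
        }
        other => other,
    }
}
```

With this sketch, a two-item list becomes an OR of two equalities, while a four-item list stays an InList, which is the shape the equality-only bloom filter check never sees.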
Thus, the current bloom filter code will not work for queries with large numbers (more than the THRESHOLD_INLINE_INLIST) of constants in the IN list, such as
Describe the solution you'd like
I would like the bloom filter code to directly support InListExpr and thus also support IN / NOT IN queries with large numbers of constants.
In terms of implementation, after #8437 is merged and #8376 is closed, this should be a straightforward matter of:
LiteralGuarantee code (see "Add LiteralGuarantee on columns to extract conditions required for PhysicalExpr expressions to evaluate to true" #8437)
LiteralGuarantee datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs
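To make the intended behavior concrete, here is a minimal sketch of row group pruning for col IN (...) against a bloom filter. ToyBloom and can_prune_in_list are hypothetical names for illustration only; this is not the Parquet split-block bloom filter and not DataFusion's LiteralGuarantee API. The idea is that a row group can be skipped only when the filter reports every listed constant as definitely absent, regardless of how many constants the list holds.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter for illustration (assumes `num_bits > 0`).
struct ToyBloom {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl ToyBloom {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], num_hashes }
    }

    /// Maps `value` to a bit position, varying the hash by `seed`.
    fn index(&self, value: &str, seed: u64) -> usize {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        value.hash(&mut h);
        (h.finish() % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, value: &str) {
        for seed in 0..self.num_hashes {
            let i = self.index(value, seed);
            self.bits[i] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn might_contain(&self, value: &str) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.index(value, seed)])
    }
}

/// Returns `true` when the row group can be skipped for `col IN (values)`:
/// that is, when no listed constant might be present in the row group.
fn can_prune_in_list(filter: &ToyBloom, values: &[&str]) -> bool {
    !values.iter().any(|v| filter.might_contain(v))
}
```

Note that `any` stops at the first constant that might be present, so a row group that must be read is identified without probing the rest of the list.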
Describe alternatives you've considered
No response
Additional context
Found while I was working on #8376