Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC Make BloomFilter application general, add PruningPredicate::contains #8397

Closed
wants to merge 20 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 1, 2023

Which issue does this PR close?

POC for #8376

Rationale for this change

See #8376

TLDR is I would like to use PruningPredicate for "bloom filter like" operations downstream in IOx, and thus I would like to refactor the code that applies Bloom Filters today into the PruningPredicate.

What changes are included in this PR?

  1. New PruningStatistics::contains API proposed in Support general pruning based on <col> = 'const' in PruningPredicate #8376
  2. Add LiteralGuarantee for extracting col = <const> predicates, along with many new tests
  3. Rewrite the BloomFilter code to use the new predicate API

TODO

Are these changes tested?

  • New tests for PruningPredicate contains (=, !=, = with OR, IN and NOT IN)
  • Existing tests pass

Are there any user-facing changes?

The PruningPredicate has a new API

Plans for merging

I do not plan to propose merging this PR as is. Instead I plan a series of PRs:

  1. A PR with some new bloom filter tests
  2. A PR that introduces the contains API with tests
  3. A second PR that refactors the the row group pruning code to use it
  4. A third PR that refactors the RowGroupPruning code so it runs on more than one RowGroupMetadatat at a time.

🤔 musing: At the moment, the row group pruning happens in two phases -- first the min/max statistics are applied, and then the bloom filters are applied which means the potentially bloom filters are only fetched for groups that passed the first predicate.

@github-actions github-actions bot added the core Core DataFusion crate label Dec 1, 2023
Rewrite BloomFilterPruningPredicate in terms of BloomFilterPruningPredicate
@alamb alamb force-pushed the alamb/general_bloom_filter branch from 298da7f to 76ee618 Compare December 4, 2023 21:28
@github-actions github-actions bot added the physical-expr Physical Expressions label Dec 4, 2023
@alamb
Copy link
Contributor Author

alamb commented Dec 18, 2023

The pieces of this PR are in, so closing this PR and focusing on merging the parts

@alamb alamb closed this Dec 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant