Rewrite simple regex expressions #4370

Closed
crepererum opened this issue Nov 25, 2022 · 1 comment · Fixed by #4646
Labels
enhancement New feature or request

Comments

@crepererum
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In InfluxDB IOx, some users query data with simple regex expressions that do not really need a regex. I guess regexes are used for convenience or for technical reasons (e.g. auto-generated expressions). For "regex match" and "regex not match", we have the following cases:

| Case | Example | Description | Logical Rewrite (for "match") |
|---|---|---|---|
| Empty | `''` | Match all | `col IS NOT NULL` |
| OR-chain | `'foo\|bar\|baz'` | Any of | `(col = 'foo') OR (col = 'bar') OR (col = 'baz')` or `col IN ('foo', 'bar', 'baz')` |
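As a rough illustration of the two cases in the table, here is a minimal, standard-library-only sketch that classifies a pattern string and emits the rewritten predicate as SQL text. The names `SimpleRegex`, `classify`, and `rewrite_match` are hypothetical and are not DataFusion APIs. The sketch follows the table literally, so it assumes the match is against the whole string; with unanchored (substring) regex semantics the equality rewrite would not be equivalent. Anything it does not recognize is left as a regex.

```rust
/// Hypothetical classification of a regex pattern, mirroring the table above.
#[derive(Debug, PartialEq)]
enum SimpleRegex {
    /// Empty pattern: matches every non-null value.
    Empty,
    /// Alternation of plain literals, e.g. `foo|bar|baz`.
    OrChain(Vec<String>),
    /// Anything else: keep evaluating it as a regex.
    Other,
}

/// Characters (besides `|`) that have special meaning in common regex dialects.
const META: &[char] = &['\\', '.', '+', '*', '?', '(', ')', '[', ']', '{', '}', '^', '$'];

fn classify(pattern: &str) -> SimpleRegex {
    if pattern.is_empty() {
        return SimpleRegex::Empty;
    }
    // Be conservative: any metacharacter means "not simple".
    if pattern.contains(META) {
        return SimpleRegex::Other;
    }
    let parts: Vec<String> = pattern.split('|').map(str::to_string).collect();
    // An empty alternative (e.g. `foo|`) also matches the empty string, so skip it.
    if parts.iter().any(|p| p.is_empty()) {
        return SimpleRegex::Other;
    }
    SimpleRegex::OrChain(parts)
}

/// Returns the rewritten "match" predicate as SQL text, or `None` if the
/// pattern is not one of the simple cases.
fn rewrite_match(col: &str, pattern: &str) -> Option<String> {
    match classify(pattern) {
        SimpleRegex::Empty => Some(format!("{col} IS NOT NULL")),
        SimpleRegex::OrChain(literals) => {
            let list: Vec<String> = literals
                .iter()
                // Escape single quotes the SQL way.
                .map(|l| format!("'{}'", l.replace('\'', "''")))
                .collect();
            Some(format!("{col} IN ({})", list.join(", ")))
        }
        SimpleRegex::Other => None,
    }
}
```

For example, `rewrite_match("col", "foo|bar|baz")` yields `col IN ('foo', 'bar', 'baz')`, while a pattern such as `fo+` yields `None` and stays a regex. A real implementation would work on the parsed regex and on expression trees rather than on strings.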

Expressing these predicates as regexes rather than in their simpler rewritten form has several performance consequences. Regex predicates are NOT considered for pruning (because how would you prune on an arbitrary regex):

https://github.com/apache/arrow-datafusion/blob/e1204a5bf72c119123404463befb716adbdcff25/datafusion/core/src/physical_optimizer/pruning.rs#L818-L871

Finally they are NOT pushed down into ParquetExec.
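To illustrate why the rewritten form matters for pruning, here is a small sketch of min/max-based pruning for one string column. `ColumnStats` and the `may_match_*` functions are hypothetical names made up for this example; they are not the DataFusion pruning API linked above. The point is only that an equality or `IN` predicate can be checked against statistics, while an arbitrary regex has to fall back to "may match".

```rust
/// Hypothetical min/max statistics for one string column in one file or row group.
struct ColumnStats {
    min: String,
    max: String,
}

/// `col = value` can only be true if `value` lies within [min, max].
fn may_match_eq(stats: &ColumnStats, value: &str) -> bool {
    stats.min.as_str() <= value && value <= stats.max.as_str()
}

/// `col IN (values)` can only be true if at least one candidate is in range.
fn may_match_in(stats: &ColumnStats, values: &[&str]) -> bool {
    values.iter().any(|v| may_match_eq(stats, v))
}

/// Conservative fallback for a predicate that cannot be analyzed,
/// such as an arbitrary regex: the container can never be skipped.
fn may_match_unknown(_stats: &ColumnStats) -> bool {
    true
}
```

With statistics `min = "a"`, `max = "c"`, the predicate `col IN ('foo', 'zed')` allows the container to be skipped, whereas the same condition expressed as a regex would force it to be read.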

Describe the solution you'd like
Transform simple regex expressions into their equivalent logical expression.

Describe alternatives you've considered
Extend the pruning expression framework and ParquetExec to handle regexes. However, this seems unnecessarily complex and maybe even counterproductive, since regexes per se can be really expensive and complex to evaluate.

Additional context
-

@crepererum crepererum added the enhancement New feature or request label Nov 25, 2022
@tustvold
Contributor

This sounds good to me. Certain expressions could also potentially be rewritten into LIKE expressions.

FWIW I would rewrite 'foo|bar|baz' to col IN ('foo', 'bar', 'baz') as we already have an expression rewriter that can rewrite small IN into disjunctive expressions.
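A minimal sketch of that second step, expanding a small `IN` list into a disjunction, might look like the following. The function names and the threshold `MAX_EXPANDED` are assumptions for illustration only; this is not the existing DataFusion rewriter, and the actual cutoff it uses is not stated here.

```rust
/// Hypothetical cutoff: lists longer than this stay in `IN` form.
const MAX_EXPANDED: usize = 3;

/// Quote a string literal for SQL, doubling embedded single quotes.
fn quote_literal(value: &str) -> String {
    format!("'{}'", value.replace('\'', "''"))
}

/// Expand `col IN (v1, v2, ...)` into `(col = v1) OR (col = v2) OR ...`
/// when the list is small; otherwise keep the `IN` form.
/// Returns `None` for an empty list, which this sketch does not handle.
fn expand_small_in(col: &str, values: &[&str]) -> Option<String> {
    if values.is_empty() {
        return None;
    }
    if values.len() > MAX_EXPANDED {
        let list: Vec<String> = values.iter().map(|v| quote_literal(v)).collect();
        return Some(format!("{col} IN ({})", list.join(", ")));
    }
    let terms: Vec<String> = values
        .iter()
        .map(|v| format!("({col} = {})", quote_literal(v)))
        .collect();
    Some(terms.join(" OR "))
}
```

Producing the `IN` form from the regex and leaving the expansion to a separate pass keeps the regex rewrite small and lets long alternations stay compact.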

crepererum added a commit to crepererum/arrow-datafusion that referenced this issue Dec 15, 2022
crepererum added a commit to crepererum/arrow-datafusion that referenced this issue Dec 16, 2022
alamb added a commit that referenced this issue Dec 29, 2022
* feat: simplify regex expressions

Closes #4370.

* Fix typo in constant name, add coverage

* cleanups

* fmt

Co-authored-by: Andrew Lamb <[email protected]>
2 participants