Rewrite simple regex expressions #4370
Labels: enhancement (New feature or request)
Comments
This sounds good to me, certain expressions could also potentially be rewritten into
FWIW I would rewrite 'foo|bar|baz' to |
This was referenced Dec 14, 2022 (Merged)
crepererum added a commit to crepererum/arrow-datafusion that referenced this issue Dec 15, 2022
crepererum added a commit to crepererum/arrow-datafusion that referenced this issue Dec 15, 2022
crepererum added a commit to crepererum/arrow-datafusion that referenced this issue Dec 16, 2022
alamb added a commit that referenced this issue Dec 29, 2022
* feat: simplify regex expressions
  Closes #4370.
* Fix typo in constant name, add coverage
* cleanups
* fmt
Co-authored-by: Andrew Lamb <[email protected]>
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In InfluxDB IOx, some users query data with simple regex expressions that don't really need a regex; (I guess) regexes are used for convenience or for technical reasons (e.g. auto-generated expressions). For "regex match" and "regex not match", we have the following cases:
- regex '' -> col IS NOT NULL
- regex 'foo|bar|baz' -> (col = 'foo') OR (col = 'bar') OR (col = 'baz'), or col IN ('foo', 'bar', 'baz')
Expressing these predicates as regexes instead of the simple rewritten form has several performance consequences. These regex predicates are NOT considered for pruning (because how would you prune an arbitrary regex):
https://github.com/apache/arrow-datafusion/blob/e1204a5bf72c119123404463befb716adbdcff25/datafusion/core/src/physical_optimizer/pruning.rs#L818-L871
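To illustrate why the rewritten forms matter for pruning, here is a minimal sketch of min/max pruning for equality and IN-list predicates. The `MinMax` struct and both function names are hypothetical and only for illustration; they are not DataFusion's pruning API. The point is that an equality has an obvious test against container statistics, while an arbitrary regex does not.

```rust
/// Hypothetical per-container (e.g. row group) statistics for a string column.
/// Illustrative only, not a DataFusion type.
struct MinMax<'a> {
    min: &'a str,
    max: &'a str,
}

/// `col = value` can never be true in a container whose [min, max] range
/// does not contain `value`, so that container can be skipped.
fn can_prune_eq(stats: &MinMax<'_>, value: &str) -> bool {
    value < stats.min || value > stats.max
}

/// `col IN (values...)` can be skipped only if every listed value
/// falls outside the container's [min, max] range.
fn can_prune_in_list(stats: &MinMax<'_>, values: &[&str]) -> bool {
    values.iter().all(|v| can_prune_eq(stats, v))
}
```

With statistics `min = "a"`, `max = "c"`, the predicate `col = 'foo'` can be pruned, while `col IN ('bar', 'foo')` cannot, because `'bar'` lies inside the range.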
Finally, they are NOT pushed down into ParquetExec.
Describe the solution you'd like
Transform simple regex expressions into their equivalent logical expression.
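A rough sketch of what such a transformation could look like for the "regex match" case, written against plain strings rather than DataFusion's expression types. The `Rewrite` enum and all function names here are hypothetical. One assumption to flag: an unanchored regex matches substrings, so this sketch only emits an equality/IN-list rewrite when the pattern is anchored as `^(...)$`, and treats an unanchored literal alternation as "contains any of". It also assumes case-sensitive matching and that `^`/`$` anchor the whole string.

```rust
/// Hypothetical result of simplifying `col ~ pattern`.
/// Illustrative only, not DataFusion's API.
#[derive(Debug, PartialEq)]
enum Rewrite {
    /// `col IS NOT NULL` (the empty pattern matches every non-null string)
    IsNotNull,
    /// `col IN (...)`, for anchored patterns such as `^(foo|bar)$`
    InList(Vec<String>),
    /// `col` contains any of the literals, for unanchored patterns such as
    /// `foo|bar`. Rendering this as LIKE would require escaping `%` and `_`.
    ContainsAny(Vec<String>),
}

/// True if `s` is non-empty and has no regex metacharacters,
/// i.e. it matches only itself.
fn is_literal(s: &str) -> bool {
    !s.is_empty() && !s.chars().any(|c| r"\.+*?()|[]{}^$".contains(c))
}

/// Split `a|b|c` into literals, or return None if any alternative
/// is empty or contains a metacharacter.
fn split_literals(s: &str) -> Option<Vec<String>> {
    let parts: Vec<&str> = s.split('|').collect();
    if parts.iter().all(|p| is_literal(p)) {
        Some(parts.into_iter().map(String::from).collect())
    } else {
        None
    }
}

/// Try to replace a regex match by a simpler predicate.
/// Returns None when the pattern is not one of the simple shapes.
fn simplify_regex_match(pattern: &str) -> Option<Rewrite> {
    if pattern.is_empty() {
        return Some(Rewrite::IsNotNull);
    }
    // Anchored on both sides: `^literal$` or `^(a|b|c)$`.
    // `^a|b$` is deliberately rejected: it means `^a` OR `b$`.
    if let Some(inner) = pattern.strip_prefix('^').and_then(|p| p.strip_suffix('$')) {
        return match inner.strip_prefix('(').and_then(|p| p.strip_suffix(')')) {
            Some(group) => split_literals(group).map(Rewrite::InList),
            None if is_literal(inner) => Some(Rewrite::InList(vec![inner.to_string()])),
            None => None,
        };
    }
    // Unanchored alternation of literals: substring containment.
    split_literals(pattern).map(Rewrite::ContainsAny)
}
```

Anything the sketch does not recognize (for example `fo+` or `^foo|bar$`) returns `None`, meaning the original regex predicate would be kept unchanged.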
Describe alternatives you've considered
Extend the pruning expression framework and ParquetExec to handle regexes. However, this seems unnecessarily complex and maybe even counterproductive, since regexes per se can be really expensive and complex to evaluate.
Additional context
-