Add regexp_like scalar function #9102
Comments
Looks like arrow-rs actually does implement a function that could be used for this issue: regexp_is_match_utf8.
DataFusion also has Also potentially interesting is that we have a version of this function in InfluxDB (though that one is designed to match the behavior of the Go regexp library).
I actually ran into the same thing this morning as I tried to move the regexp code into a different crate -- #9101. In general the code in regexp_expressions.rs is quite messy, and I think it could be significantly simplified now that arrow-rs has more functionality. @Omega359 is there any chance you might be willing to try and remove the copy? Otherwise I will try and find time to do it over the next few days.
Don't worry about it -- I think it will make things easier. The current logic to figure out when the arguments are all constants seems overly convoluted, so the fewer things going on in that module the better.
That is definitely interesting. I spent some time today looking into the differences between the postgresql, Java and Rust implementations of regex. There is, as expected, a very large amount of overlap, but some advanced features are only found in one implementation or another. Postgresql has an expanded version of one of the posix definitions, Java I think is more based on the Perl regex, and the Rust crate specifically calls out that it isn't posix based. Essentially, I was thinking of just sticking with what is in use already, documenting the syntax via references to the rust crate's documentation, and noting that anyone expecting 100% compatibility with postgresql, Java, Perl, Go, Posix, etc. is bound to be disappointed. Once we have proper separation between default, postgres and spark syntaxes, others can have a go at specific versions. Primarily I suspect this might impact the comet contribution being made - I haven't checked to see if they did any work in this area.
BTW, from the looks of things the rust regex crate now allows escapes on a lot more things - see rust-lang/regex#501 (comment). The code you referenced may no longer be required.
I think this is a solid plan FWIW
Additional note - datafusion does have support for the postgres sql regex operators such as text ~ text. To my knowledge, however, that support isn't easily exposed to dataframe use cases. It does have the advantage of optimizer support for simplifying expressions.
Is your feature request related to a problem or challenge?
Currently there are regexp_match and regexp_replace; however, there isn't a corresponding regexp_like function that could be used in the when(..) dataframe method or in sql case statements.
Describe the solution you'd like
An implementation of regexp_like that matches the syntax and style of the postgresql implementation as closely as possible. Note that the Spark version of the regexp_like function is very similar but does not include any flags in the function signature.
Not all of the flags that postgresql supports may be included in the initial implementation; likely only 'i' will be implemented at first.
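To make the requested behavior concrete, here is a minimal sketch of the semantics, written in Python with the standard `re` module rather than as DataFusion code. The name `regexp_like` comes from this issue; the signature, the treatment of NULL as `None`, and rejecting every flag other than 'i' are assumptions for illustration only. Python's regex syntax also differs in places from the Rust regex crate.

```python
import re
from typing import List, Optional, Sequence


def regexp_like(
    values: Sequence[Optional[str]], pattern: str, flags: str = ""
) -> List[Optional[bool]]:
    """Return True/False per value, or None where the input value is None (NULL)."""
    re_flags = 0
    for flag in flags:
        if flag == "i":
            # Case-insensitive matching, the only flag assumed for a first version.
            re_flags |= re.IGNORECASE
        else:
            raise ValueError(f"unsupported flag: {flag!r}")

    # Compile once, then test every value; search() matches anywhere in the string.
    compiled = re.compile(pattern, re_flags)
    return [None if v is None else compiled.search(v) is not None for v in values]
```

Called without flags this behaves like the Spark-style two-argument form; the optional third argument corresponds to the postgresql-style flags string.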
Note that the implementation of the existing regexp_match currently resides in datafusion; with apache/arrow-rs#5235 this functionality was moved into arrow-rs, but it has yet to be removed from datafusion. The implementation of regexp_like may take a similar path: implement it in datafusion first, then move it to arrow-rs if the community thinks that would be a good idea.
Describe alternatives you've considered
You can use regexp_match and test whether the returned list is empty to imitate this function. However, that is not the most performant approach, because it cannot return immediately after the first match.
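The workaround and the direct check can be contrasted with a small sketch, again in Python with the standard `re` module and not DataFusion's actual code. Here `regexp_match` is a hypothetical stand-in that returns the captures of the first match, and returning an empty list when nothing matches is an assumption that follows the wording above. The helper names `like_via_match` and `like_direct` are invented for illustration.

```python
import re
from typing import List, Optional


def regexp_match(value: str, pattern: str) -> List[Optional[str]]:
    """Hypothetical stand-in: captures of the first match, or [] when nothing matches."""
    m = re.search(pattern, value)
    if m is None:
        return []
    # Materialize capture groups, or the whole match when the pattern has no groups.
    return list(m.groups()) if m.groups() else [m.group(0)]


def like_via_match(value: str, pattern: str) -> bool:
    # Workaround: build the capture list, then only check whether it is empty.
    return len(regexp_match(value, pattern)) > 0


def like_direct(value: str, pattern: str) -> bool:
    # Direct check: only asks whether a match exists, nothing is materialized.
    return re.search(pattern, value) is not None
```

Both helpers give the same answer; the difference is that the workaround builds a result the caller immediately discards, which is the overhead a dedicated regexp_like would avoid.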
Additional context
No response