-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24884][SQL] add regexp_extract_all support #21985
Conversation
$groupArray.add(UTF8String.fromString($matchResult.group($idx))); | ||
} | ||
} | ||
${ev.value} = new $arrayClass($groupArray.toArray(new UTF8String[$groupArray.size()])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need consider about setting ev.isNull?
> SELECT _FUNC_('100-200,300-400', '(\\d+)-(\\d+)', 1); | ||
[100, 300] | ||
""") | ||
case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an abstract class to reduce duplicated code between RegExpExtractAll
and RegExpExtract
?
@xueyumusic gentle ping |
Thanks for ping, @kiszk , I will update the code these days. |
thanks for the patch @xueyumusic, we have been waiting for such feature in Spark for awhile :) any update when your hard work will be publicly available. |
2a96238
to
2707d59
Compare
Hi, @kiszk @afsanjar very sorry for long delay update. I updated to address feedback of @xuanyuanking , thanks |
} | ||
|
||
(classNamePattern, matcher, matchResult, termLastRegex, termPattern, setEvNotNull) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: delete empty line
|
||
nullSafeCodeGen(ctx, ev, (subject, regexp, idx) => { | ||
s""" | ||
if (!$regexp.equals($termLastRegex)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could you please use the following pattern?
s"""
|
|
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @kiszk , updated.
We recently updated the PR template: https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE, so could you please follow that, first? |
ok to test |
Also, could you check if the other databases support the similar function, e.g., oracle, potgresql, mysql, ..., and then describe the result in the PR description? |
Test build #109723 has finished for PR 21985 at commit
|
Hi, @maropu , I updated PR description based on new template, thanks. |
protected def getMatcher(s: Any, p: Any) = { | ||
if (!p.equals(lastRegex)) { | ||
// regex value changed | ||
lastRegex = p.asInstanceOf[UTF8String].clone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Do we need to clone p
?
nvm, it follows the original code
Have you checked this? pig and presto only support this function? |
Can one of the admins verify this patch? |
ping @xueyumusic |
Hey, was this feature merged in the master. If yes, in what version of spark? |
@justgfather Not merged yet. If this would be merged now, you can use this function in Spark 3.0. |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
cc @beliefer Could you take this over? |
@gatorsmile OK. |
### What changes were proposed in this pull request? `regexp_extract_all` is a very useful function expanded the capabilities of `regexp_extract`. There are some description of this function. ``` SELECT regexp_extract('1a 2b 14m', '\d+', 0); -- 1 SELECT regexp_extract_all('1a 2b 14m', '\d+', 0); -- [1, 2, 14] SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a' SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm'] ``` There are some mainstream database support the syntax. **Presto:** https://prestodb.io/docs/current/functions/regexp.html **Pig:** https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html Note: This PR pick up the work of #21985 ### Why are the changes needed? `regexp_extract_all` is a very useful function and make work easier. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT Closes #27507 from beliefer/support-regexp_extract_all. Lead-authored-by: beliefer <[email protected]> Co-authored-by: gengjiaan <[email protected]> Co-authored-by: Jiaan Geng <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR tried to add
regexp_extract_all
sql function. It is supported in presto and pigAlso it refactored regexp_extract since there were some duplicated codes.
Why are the changes needed?
Added
regexp_extract_all
functionDoes this PR introduce any user-facing change?
No
How was this patch tested?
Unit test