Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-24884][SQL] add regexp_extract_all support #21985

Closed
wants to merge 4 commits into from

Conversation

xueyumusic
Copy link
Contributor

@xueyumusic xueyumusic commented Aug 3, 2018

What changes were proposed in this pull request?

This PR tried to add regexp_extract_all sql function. It is supported in presto and pig
Also it refactored regexp_extract since there were some duplicated codes.

Why are the changes needed?

Added regexp_extract_all function

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

$groupArray.add(UTF8String.fromString($matchResult.group($idx)));
}
}
${ev.value} = new $arrayClass($groupArray.toArray(new UTF8String[$groupArray.size()]));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need consider about setting ev.isNull?

> SELECT _FUNC_('100-200,300-400', '(\\d+)-(\\d+)', 1);
[100, 300]
""")
case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an abstract class to reduce duplicated code between RegExpExtractAll and RegExpExtract?

@kiszk
Copy link
Member

kiszk commented Jul 19, 2019

@xueyumusic gentle ping

@xueyumusic
Copy link
Contributor Author

Thanks for ping, @kiszk , I will update the code these days.

@afsanjar
Copy link

thanks for the patch @xueyumusic, we have been waiting for such feature in Spark for awhile :) any update when your hard work will be publicly available.

@xueyumusic
Copy link
Contributor Author

Hi, @kiszk @afsanjar very sorry for long delay update. I updated to address feedback of @xuanyuanking , thanks

}

(classNamePattern, matcher, matchResult, termLastRegex, termPattern, setEvNotNull)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: delete empty line


nullSafeCodeGen(ctx, ev, (subject, regexp, idx) => {
s"""
if (!$regexp.equals($termLastRegex)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could you please use the following pattern?

s"""
   |
   |
 """

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @kiszk , updated.

@maropu
Copy link
Member

maropu commented Aug 26, 2019

We recently updated the PR template: https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE, so could you please follow that, first?

@maropu
Copy link
Member

maropu commented Aug 26, 2019

ok to test

@maropu
Copy link
Member

maropu commented Aug 26, 2019

Also, could you check if the other databases support the similar function, e.g., oracle, potgresql, mysql, ..., and then describe the result in the PR description?

@SparkQA
Copy link

SparkQA commented Aug 26, 2019

Test build #109723 has finished for PR 21985 at commit 4373a82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expression)
  • case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)

@xueyumusic
Copy link
Contributor Author

Hi, @maropu , I updated PR description based on new template, thanks.

protected def getMatcher(s: Any, p: Any) = {
if (!p.equals(lastRegex)) {
// regex value changed
lastRegex = p.asInstanceOf[UTF8String].clone()
Copy link
Member

@kiszk kiszk Sep 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do we need to clone p?
nvm, it follows the original code

@maropu
Copy link
Member

maropu commented Sep 4, 2019

Also, could you check if the other databases support the similar function, e.g., oracle, potgresql, mysql, ..., and then describe the result in the PR description?

Have you checked this? pig and presto only support this function?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

ping @xueyumusic

@justgfather
Copy link

Hey, was this feature merged in the master. If yes, in what version of spark?

@kiszk
Copy link
Member

kiszk commented Sep 27, 2019

@justgfather Not merged yet. If this would be merged now, you can use this function in Spark 3.0.

@github-actions
Copy link

github-actions bot commented Jan 8, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 8, 2020
@maropu maropu closed this Jan 8, 2020
@gatorsmile
Copy link
Member

cc @beliefer Could you take this over?

@beliefer
Copy link
Contributor

beliefer commented Feb 8, 2020

@gatorsmile OK.

cloud-fan pushed a commit that referenced this pull request Aug 3, 2020
### What changes were proposed in this pull request?
`regexp_extract_all` is a very useful function expanded the capabilities of `regexp_extract`.
There are some description of this function.
```
SELECT regexp_extract('1a 2b 14m', '\d+', 0); -- 1
SELECT regexp_extract_all('1a 2b 14m', '\d+', 0); -- [1, 2, 14]
SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a'
SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm']
```
There are some mainstream database support the syntax.
**Presto:**
https://prestodb.io/docs/current/functions/regexp.html

**Pig:**
https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html

Note: This PR pick up the work of #21985
### Why are the changes needed?
`regexp_extract_all` is a very useful function and make work easier.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT

Closes #27507 from beliefer/support-regexp_extract_all.

Lead-authored-by: beliefer <[email protected]>
Co-authored-by: gengjiaan <[email protected]>
Co-authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.