-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24884][SQL] Support regexp function regexp_extract_all #27507
Conversation
Test build #118091 has finished for PR 27507 at commit
|
Test build #118096 has finished for PR 27507 at commit
|
"" | ||
} | ||
|
||
(classNamePattern, matcher, matchResult, termLastRegex, termPattern, setEvNotNull) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a little hard to read at the caller side.
can we implement doGenCode
in the base class, which calls an abstract method. Sub-classes need to implement the abstract method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Good idea.
* NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status. | ||
*/ | ||
@ExpressionDescription( | ||
usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we explain the semantic of idx
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
Test build #118274 has finished for PR 27507 at commit
|
s"${ev.isNull} = false;" | ||
} else { | ||
"" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I don't think there is much common code to share. Maybe we can have a
protected def setNotNullCode(ev: ExprCode) = ...
but that's all.
How about we just let each sub-class implement doGenCode
individually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
fallback to the Spark 1.6 behavior regarding string literal parsing. For example, | ||
if the config is enabled, the `regexp` that can match "\abc" is "^\abc$". | ||
* idx - a int expression. The regex maybe contains multiple groups. `idx` represents the | ||
index of regex group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idx indicates which regex group to extract.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
/** | ||
* Extract a specific(idx) group identified by a Java regex. | ||
* | ||
* NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status. | ||
*/ | ||
@ExpressionDescription( | ||
usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.", | ||
arguments = """ | ||
Arguments: | ||
* str - a string expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a string expression of the input string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
arguments = """ | ||
Arguments: | ||
* str - a string expression | ||
* regexp - a string expression. The regex string should be a Java regular expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a string expression of the regex string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is Java regular expression
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. There just references the comment of RLIKE
.
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to | ||
fallback to the Spark 1.6 behavior regarding string literal parsing. For example, | ||
if the config is enabled, the `regexp` that can match "\abc" is "^\abc$". | ||
* idx - a int expression. The regex maybe contains multiple groups. `idx` represents the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an int expression of the regex group index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -508,3 +568,96 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio | |||
}) | |||
} | |||
} | |||
|
|||
/** | |||
* Extract all specific(idx) group identified by a Java regex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
group
-> groups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -2383,6 +2383,17 @@ object functions { | |||
RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr) | |||
} | |||
|
|||
/** | |||
* Extract all specific group matched by a Java regex, from the specified string column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
@@ -2383,6 +2383,17 @@ object functions { | |||
RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr) | |||
} | |||
|
|||
/** | |||
* Extract all specific group matched by a Java regex, from the specified string column. | |||
* If the regex did not match, or the specified group did not match, an empty array is returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior seems to be
- If the regex does not match, return an empty array
- if the specified group does not match, put an empty string to the result array.
Can we document the behavior in SQL expression? And can you verify this is the standard behavior in other databases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- should throw a IllegalArgumentException.
[SPARK-30763][SQL] Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract #27508
the behavior of Hive is :
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ‘2’: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@2cf5e0f0 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {x=a3&x=18abc&x=2&y=3&x=4:java.lang.String, x=([0-9]+)[a-z]:java.lang.String, 2:java.lang.Integer} of size 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's document the behavior clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
Test build #118297 has finished for PR 27507 at commit
|
There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to | ||
fallback to the Spark 1.6 behavior regarding string literal parsing. For example, | ||
if the config is enabled, the `regexp` that can match "\abc" is "^\abc$". | ||
* idx - an int expression of the regex group index. The regex maybe contains multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe contains
-> may contain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
I have a high-level question. Do we have huge advantage to generate Java code? One advantage is to store the result of |
override def nullSafeEval(s: Any, p: Any, r: Any): Any = { | ||
val m = getLastMatcher(s, p) | ||
val matchResults = new ArrayBuffer[UTF8String]() | ||
val mr: MatchResult = m.toMatchResult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where do we use this mr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove it.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
Outdated
Show resolved
Hide resolved
Test build #118304 has finished for PR 27507 at commit
|
Test build #118306 has finished for PR 27507 at commit
|
LIKE and RLIKE cache the result of spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala Line 438 in 5e3c092
If the pattern string is a constant, the two approaches to the same goal. If the pattern string is a variable, the performance issue seems cannot to avoid. |
// invalid group index | ||
val row8 = create_row("100-200,300-400,500-600", "(\\d+)-(\\d+)", 3) | ||
val row9 = create_row("100-200,300-400,500-600", "(\\d+).*", 2) | ||
val row10 = create_row("100-200,300-400,500-600", "\\d+", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about negative group index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
Line 418 in 5250f98
throw new IllegalArgumentException("The specified group index cannot be less than zero") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we test it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
// This is a hack way to enable the codegen, thus the codegen is enable by default, | ||
// it will still use the interpretProjection if projection followed by a LocalRelation, | ||
// hence we add a filter operator. | ||
// See the optimizer rule `ConvertToLocalRelation` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is out-dated. We already disabled ConvertToLocalRelation
in testing. We can remove these code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 0); | ||
SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 1); | ||
SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 2); | ||
SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we test optional group here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
Test build #126823 has finished for PR 27507 at commit
|
Test build #126853 has finished for PR 27507 at commit
|
Test build #126858 has finished for PR 27507 at commit
|
@@ -322,6 +322,48 @@ class RegexpExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { | |||
RegExpExtract(Literal("\"quote"), Literal("\"quote"), Literal(1)) :: Nil) | |||
} | |||
|
|||
test("RegexExtractAll") { | |||
val row1 = create_row("100-200,300-400,500-600", "(\\d+)-(\\d+)", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we test group 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
Test build #126891 has finished for PR 27507 at commit
|
retest this please |
Test build #126919 has finished for PR 27507 at commit
|
Test build #126916 has finished for PR 27507 at commit
|
retest this please |
Test build #126924 has finished for PR 27507 at commit
|
retest this please |
Test build #126943 has finished for PR 27507 at commit
|
thanks, merging to master! |
@cloud-fan @kiszk Thanks for your review. |
…aSuite to sql-expression-schema.md ### What changes were proposed in this pull request? `sql-expression-schema.md` automatically generated by `ExpressionsSchemaSuite`, but only expressions entries are checked in `ExpressionsSchemaSuite`. So if we manually modify the contents of the file, `ExpressionsSchemaSuite` does not necessarily guarantee the correctness of the it some times. For example, [Spark-24884](#27507) added `regexp_extract_all` expression support, and manually modify the `sql-expression-schema.md` but not change the content of `Number of queries` cause file content inconsistency. Some additional checks have been added to `ExpressionsSchemaSuite` to improve the correctness guarantee of `sql-expression-schema.md` as follow: - `Number of queries` should equals size of `expressions entries` in `sql-expression-schema.md` - `Number of expressions that missing example` should equals size of `Expressions missing examples` in `sql-expression-schema.md` - `MissExamples` from case should same as `expectedMissingExamples` from `sql-expression-schema.md` ### Why are the changes needed? Ensure the correctness of `sql-expression-schema.md` content. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Enhanced ExpressionsSchemaSuite Closes #29608 from LuciferYang/sql-expression-schema. Authored-by: yangjie <[email protected]> Signed-off-by: Takeshi Yamamuro <[email protected]>
…t_all ### What changes were proposed in this pull request? #27507 implements `regexp_extract_all` and added the scala function version of it. According https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59, it seems good for remove the scala function version. Although I think is regexp_extract_all is very useful, if we just reference the description. ### Why are the changes needed? `regexp_extract_all` is less common. ### Does this PR introduce _any_ user-facing change? 'No'. `regexp_extract_all` was added in Spark 3.1.0 which isn't released yet. ### How was this patch tested? Jenkins test. Closes #31346 from beliefer/SPARK-24884-followup. Authored-by: beliefer <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…t_all ### What changes were proposed in this pull request? #27507 implements `regexp_extract_all` and added the scala function version of it. According https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59, it seems good for remove the scala function version. Although I think is regexp_extract_all is very useful, if we just reference the description. ### Why are the changes needed? `regexp_extract_all` is less common. ### Does this PR introduce _any_ user-facing change? 'No'. `regexp_extract_all` was added in Spark 3.1.0 which isn't released yet. ### How was this patch tested? Jenkins test. Closes #31346 from beliefer/SPARK-24884-followup. Authored-by: beliefer <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 99b6af2) Signed-off-by: Dongjoon Hyun <[email protected]>
…t_all ### What changes were proposed in this pull request? apache#27507 implements `regexp_extract_all` and added the scala function version of it. According https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59, it seems good for remove the scala function version. Although I think is regexp_extract_all is very useful, if we just reference the description. ### Why are the changes needed? `regexp_extract_all` is less common. ### Does this PR introduce _any_ user-facing change? 'No'. `regexp_extract_all` was added in Spark 3.1.0 which isn't released yet. ### How was this patch tested? Jenkins test. Closes apache#31346 from beliefer/SPARK-24884-followup. Authored-by: beliefer <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
regexp_extract_all
is a very useful function expanded the capabilities ofregexp_extract
.There are some description of this function.
There are some mainstream database support the syntax.
Presto:
https://prestodb.io/docs/current/functions/regexp.html
Pig:
https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html
BigQuery
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_extract_all
Note: This PR pick up the work of #21985
Why are the changes needed?
regexp_extract_all
is a very useful function and make work easier.Does this PR introduce any user-facing change?
No
How was this patch tested?
New UT