[SPARK-24884][SQL] Support regexp function regexp_extract_all #27507

beliefer · 2020-02-09T08:05:34Z

What changes were proposed in this pull request?

regexp_extract_all is a very useful function that extends the capabilities of regexp_extract.
The following examples compare the two functions.

SELECT regexp_extract('1a 2b 14m', '\d+', 0); -- 1
SELECT regexp_extract_all('1a 2b 14m', '\d+', 0); -- [1, 2, 14]
SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a'
SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm']
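
For illustration, the semantics shown above can be sketched with plain java.util.regex in Scala. This is a minimal sketch and not the code in this PR; `RegexpExtractAllSketch` and `extractAll` are hypothetical names, and the sketch assumes every requested group takes part in each match.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ArrayBuffer

object RegexpExtractAllSketch {
  // Hypothetical helper (not Spark's implementation): collect group `idx`
  // of every match of `regex` in `subject`, in order of appearance.
  def extractAll(subject: String, regex: String, idx: Int): Seq[String] = {
    val matcher = Pattern.compile(regex).matcher(subject)
    val results = ArrayBuffer.empty[String]
    // find() advances to the next match; regexp_extract would stop after the first.
    while (matcher.find()) {
      results += matcher.group(idx)
    }
    results.toSeq
  }
}
```

With this helper, `extractAll("1a 2b 14m", "\\d+", 0)` collects "1", "2" and "14", while group 2 of `(\\d+)([a-z]+)` collects "a", "b" and "m".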

Several mainstream data systems support this function or a similar one.
Presto:
https://prestodb.io/docs/current/functions/regexp.html

Pig:
https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html

BigQuery:
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_extract_all

Note: This PR picks up the work of #21985.

Why are the changes needed?

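regexp_extract_all is a very useful function that makes this kind of extraction easier: it returns every match of the pattern, whereas regexp_extract returns only the first.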

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests.

SparkQA · 2020-02-09T11:48:56Z

Test build #118091 has finished for PR 27507 at commit 06bb690.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class RegExpExtractBase extends TernaryExpression with ImplicitCastInputTypes
case class RegExpExtractAll(subject: Expression, regexp: Expression, idx: Expression)

SparkQA · 2020-02-09T16:54:48Z

Test build #118096 has finished for PR 27507 at commit 7a1c6d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-02-11T12:49:10Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+      ""
+    }
+
+    (classNamePattern, matcher, matchResult, termLastRegex, termPattern, setEvNotNull)


This is a little hard to read on the caller side.

Can we implement doGenCode in the base class and have it call an abstract method that the sub-classes implement?

OK. Good idea.

cloud-fan · 2020-02-11T12:49:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+ * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(str, regexp[, idx]) - Extracts all group that matches `regexp`.",


Shall we explain the semantics of idx?

SparkQA · 2020-02-12T08:05:02Z

Test build #118274 has finished for PR 27507 at commit 1ed159f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-02-12T13:05:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+      s"${ev.isNull} = false;"
+    } else {
+      ""
+    }


TBH I don't think there is much common code to share. Maybe we can have a
protected def setNotNullCode(ev: ExprCode) = ... but that's all.

How about we just let each sub-class implement doGenCode individually?

cloud-fan · 2020-02-12T13:07:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - a int expression. The regex maybe contains multiple groups. `idx` represents the
+          index of regex group.


idx indicates which regex group to extract.

cloud-fan · 2020-02-12T13:08:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

 /**
 * Extract a specific(idx) group identified by a Java regex.
 *
 * NOTE: this expression is not THREAD-SAFE, as it has some internal mutable status.
 */
 @ExpressionDescription(
  usage = "_FUNC_(str, regexp[, idx]) - Extracts a group that matches `regexp`.",
+  arguments = """
+    Arguments:
+      * str - a string expression


a string expression of the input string.

cloud-fan · 2020-02-12T13:08:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+  arguments = """
+    Arguments:
+      * str - a string expression
+      * regexp - a string expression. The regex string should be a Java regular expression.


a string expression of the regex string.

What is a Java regular expression?

OK. The wording here just follows the comment of RLIKE.

cloud-fan · 2020-02-12T13:09:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - a int expression. The regex maybe contains multiple groups. `idx` represents the


an int expression of the regex group index.

cloud-fan · 2020-02-12T13:12:05Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

@@ -508,3 +568,96 @@ case class RegExpExtract(subject: Expression, regexp: Expression, idx: Expressio
    })
  }
 }
+
+/**
+ * Extract all specific(idx) group identified by a Java regex.


group -> groups

cloud-fan · 2020-02-12T13:12:26Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

@@ -2383,6 +2383,17 @@ object functions {
    RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
  }

+  /**
+   * Extract all specific group matched by a Java regex, from the specified string column.


cloud-fan · 2020-02-12T13:15:18Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

@@ -2383,6 +2383,17 @@ object functions {
    RegExpExtract(e.expr, lit(exp).expr, lit(groupIdx).expr)
  }

+  /**
+   * Extract all specific group matched by a Java regex, from the specified string column.
+   * If the regex did not match, or the specified group did not match, an empty array is returned.


The behavior seems to be:

If the regex does not match, return an empty array.

If the specified group does not match, put an empty string into the result array.

Can we document the behavior in the SQL expression? And can you verify that this is the standard behavior in other databases?

It should throw an IllegalArgumentException.
[SPARK-30763][SQL] Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract #27508
The behavior of Hive is:
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ‘2’: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@2cf5e0f0 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {x=a3&x=18abc&x=2&y=3&x=4:java.lang.String, x=([0-9]+)[a-z]:java.lang.String, 2:java.lang.Integer} of size 3

let's document the behavior clearly.
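
The two cases described in this thread can be sketched outside Spark with java.util.regex. This is only an illustration under stated assumptions: `OptionalGroupSketch` and its `extractAll` are hypothetical names, not the expression added by this PR, and the sketch does not validate the group index.

```scala
import java.util.regex.Pattern

object OptionalGroupSketch {
  // Hypothetical helper: returns group `idx` of every match.
  // - No match at all            -> empty list.
  // - Group absent in some match -> "" is stored for that match.
  def extractAll(subject: String, regex: String, idx: Int): List[String] = {
    val matcher = Pattern.compile(regex).matcher(subject)
    var results = List.empty[String]
    while (matcher.find()) {
      // Matcher.group returns null when an optional group did not participate.
      val group = matcher.group(idx)
      results = (if (group == null) "" else group) :: results
    }
    results.reverse
  }
}
```

For example, with the optional second group in `(\\d+)([a-z]+)?`, the input "1a 2 14m" yields "a", an empty string, and "m" for group 2, while an input with no digits yields an empty list.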

SparkQA · 2020-02-12T16:23:52Z

Test build #118297 has finished for PR 27507 at commit ea29d66.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2020-02-12T16:38:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+          There is a SQL config 'spark.sql.parser.escapedStringLiterals' that can be used to
+          fallback to the Spark 1.6 behavior regarding string literal parsing. For example,
+          if the config is enabled, the `regexp` that can match "\abc" is "^\abc$".
+      * idx - an int expression of the regex group index. The regex maybe contains multiple


nit: maybe contains -> may contain

kiszk · 2020-02-12T16:49:00Z

I have a high-level question. Is there a significant advantage to generating Java code?

One advantage is that the generated code stores the result of Pattern.compile() in its own global variable for caching, while the non-generated code shares one variable for the cache.
On the other hand, the size of the result is not small. Which side of the trade-off do we choose: space or performance?

cloud-fan · 2020-02-12T17:55:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+  override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
+    val m = getLastMatcher(s, p)
+    val matchResults = new ArrayBuffer[UTF8String]()
+    val mr: MatchResult = m.toMatchResult


where do we use this mr?

I will remove it.

SparkQA · 2020-02-12T19:41:55Z

Test build #118304 has finished for PR 27507 at commit e50f010.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-12T19:43:58Z

Test build #118306 has finished for PR 27507 at commit 517251a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-02-13T02:30:30Z

I have a high-level question. Is there a significant advantage to generating Java code?

One advantage is that the generated code stores the result of Pattern.compile() in its own global variable for caching, while the non-generated code shares one variable for the cache.
On the other hand, the size of the result is not small. Which side of the trade-off do we choose: space or performance?

LIKE and RLIKE cache the result of Pattern.compile().
RegExpReplace and RegExpExtract take a different approach: they recompile only when the pattern differs from the last one seen.

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

Line 438 in 5e3c092

if (!p.equals(lastRegex)) {

If the pattern string is a constant, the two approaches achieve the same goal.
If the pattern string is a variable, the performance cost seems unavoidable.
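
The `lastRegex` comparison quoted above can be sketched as a single-slot cache. This is a minimal sketch assuming a simplified, non-thread-safe cache; `PatternCacheSketch`, `LastRegexCache`, `patternFor`, and `compilesFor` are hypothetical names, and `compileCount` exists only to make the effect observable.

```scala
import java.util.regex.Pattern

object PatternCacheSketch {
  // Hypothetical single-slot cache: remembers only the most recent regex string.
  class LastRegexCache {
    var compileCount = 0 // instrumentation for this sketch only
    private var lastRegex: String = null
    private var pattern: Pattern = null

    def patternFor(regex: String): Pattern = {
      if (!regex.equals(lastRegex)) {
        // The regex changed (or this is the first call): recompile and remember it.
        lastRegex = regex
        pattern = Pattern.compile(regex)
        compileCount += 1
      }
      pattern
    }
  }

  // Number of compilations needed to process a sequence of per-row regex strings.
  def compilesFor(regexes: Seq[String]): Int = {
    val cache = new LastRegexCache
    regexes.foreach(r => cache.patternFor(r))
    cache.compileCount
  }
}
```

A constant pattern is compiled once no matter how many rows are processed, which matches the constant-pattern case described above. A pattern that alternates between rows is recompiled on every row, which is the variable-pattern cost the comment calls unavoidable.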

cloud-fan · 2020-07-30T13:34:51Z

...talyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala

+    // invalid group index
+    val row8 = create_row("100-200,300-400,500-600", "(\\d+)-(\\d+)", 3)
+    val row9 = create_row("100-200,300-400,500-600", "(\\d+).*", 2)
+    val row10 = create_row("100-200,300-400,500-600", "\\d+", 1)


how about negative group index?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

Line 418 in 5250f98

throw new IllegalArgumentException("The specified group index cannot be less than zero")

can we test it?
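
A check of this kind could be sketched as follows. This is an illustration only: `GroupIndexCheckSketch` and `checkGroupIndex` are hypothetical names, the first message is the one quoted above, and the wording of the second message is an assumption made for the sketch.

```scala
import java.util.regex.Pattern

object GroupIndexCheckSketch {
  // Hypothetical helper: validate a requested group index against the
  // number of capturing groups declared in `regex`.
  def checkGroupIndex(regex: String, groupIndex: Int): Unit = {
    // groupCount() counts capturing groups; group 0 (the whole match) is always valid.
    val groupCount = Pattern.compile(regex).matcher("").groupCount()
    if (groupIndex < 0) {
      throw new IllegalArgumentException(
        "The specified group index cannot be less than zero")
    } else if (groupCount < groupIndex) {
      // Message wording here is an assumption for illustration.
      throw new IllegalArgumentException(
        s"Regex group count is $groupCount, but the specified group index is $groupIndex")
    }
  }
}
```

The invalid-index rows in the test above, such as group 3 of `(\\d+)-(\\d+)` or group 1 of `\\d+`, fail this check, as does any negative index, while group 0 always passes.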

cloud-fan · 2020-07-30T13:36:28Z

sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala

+    // This is a hack way to enable the codegen, thus the codegen is enable by default,
+    // it will still use the interpretProjection if projection followed by a LocalRelation,
+    // hence we add a filter operator.
+    // See the optimizer rule `ConvertToLocalRelation`


This is outdated. We already disabled ConvertToLocalRelation in testing, so we can remove this code.

cloud-fan · 2020-07-30T13:37:05Z

sql/core/src/test/resources/sql-tests/inputs/regexp-functions.sql

+SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 0);
+SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 1);
+SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 2);
+SELECT regexp_extract_all('1a 2b 14m', '(\\d+)([a-z]+)', 3);


Can we test an optional group here?

SparkQA · 2020-07-31T05:19:31Z

Test build #126823 has finished for PR 27507 at commit c096de4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-31T07:05:02Z

Test build #126853 has finished for PR 27507 at commit af88ad1.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-31T09:23:53Z

Test build #126858 has finished for PR 27507 at commit 3a98bbe.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-31T09:40:39Z

...talyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala

@@ -322,6 +322,48 @@ class RegexpExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
      RegExpExtract(Literal("\"quote"), Literal("\"quote"), Literal(1)) :: Nil)
  }

+  test("RegexExtractAll") {
+    val row1 = create_row("100-200,300-400,500-600", "(\\d+)-(\\d+)", 1)


can we test group 0?

SparkQA · 2020-07-31T16:48:14Z

Test build #126891 has finished for PR 27507 at commit 82a384e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-08-01T12:40:12Z

retest this please

SparkQA · 2020-08-01T14:35:14Z

Test build #126919 has finished for PR 27507 at commit 2125bff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-01T16:56:43Z

Test build #126916 has finished for PR 27507 at commit 82a384e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-08-02T00:26:08Z

retest this please

SparkQA · 2020-08-02T03:53:49Z

Test build #126924 has finished for PR 27507 at commit 2125bff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2020-08-02T18:11:48Z

retest this please

SparkQA · 2020-08-02T22:33:26Z

Test build #126943 has finished for PR 27507 at commit 2125bff.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-08-03T06:03:52Z

thanks, merging to master!

beliefer · 2020-08-03T07:35:02Z

@cloud-fan @kiszk Thanks for your review.

…aSuite to sql-expression-schema.md

### What changes were proposed in this pull request?
`sql-expression-schema.md` is automatically generated by `ExpressionsSchemaSuite`, but `ExpressionsSchemaSuite` only checks the expression entries. So if the file is modified manually, `ExpressionsSchemaSuite` does not necessarily guarantee its correctness. For example, [Spark-24884](#27507) added `regexp_extract_all` expression support and manually modified `sql-expression-schema.md` without updating `Number of queries`, which left the file content inconsistent.

Some additional checks have been added to `ExpressionsSchemaSuite` to improve the correctness guarantee of `sql-expression-schema.md`, as follows:
- `Number of queries` should equal the size of `expressions entries` in `sql-expression-schema.md`
- `Number of expressions that missing example` should equal the size of `Expressions missing examples` in `sql-expression-schema.md`
- `MissExamples` from the test case should be the same as `expectedMissingExamples` from `sql-expression-schema.md`

### Why are the changes needed?
Ensure the correctness of `sql-expression-schema.md` content.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Enhanced ExpressionsSchemaSuite

Closes #29608 from LuciferYang/sql-expression-schema.

Authored-by: yangjie
Signed-off-by: Takeshi Yamamuro

…t_all

### What changes were proposed in this pull request?
#27507 implements `regexp_extract_all` and added the Scala function version of it. According to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59, it seems better to remove the Scala function version if we just follow that description, although I think `regexp_extract_all` is very useful.

### Why are the changes needed?
`regexp_extract_all` is less common.

### Does this PR introduce _any_ user-facing change?
'No'. `regexp_extract_all` was added in Spark 3.1.0, which isn't released yet.

### How was this patch tested?
Jenkins test.

Closes #31346 from beliefer/SPARK-24884-followup.

Authored-by: beliefer
Signed-off-by: Dongjoon Hyun
(cherry picked from commit 99b6af2)

Support regexp_extract_all

06bb690

Update description

7a1c6d3

beliefer changed the title ~~[SPARK-24884][SQL] Support string function regexp_extract_all~~ [SPARK-24884][SQL] Support regexp function regexp_extract_all Feb 10, 2020

dongjoon-hyun added the SQL label Feb 10, 2020

cloud-fan reviewed Feb 11, 2020

Optimize code

1ed159f

Resolve conflict.

ea29d66

cloud-fan reviewed Feb 12, 2020

beliefer added 2 commits February 12, 2020 22:53

Optimize code

e50f010

Optimize code

517251a

kiszk reviewed Feb 12, 2020

cloud-fan reviewed Feb 12, 2020

Add test case for group value is null

5e3c092

cloud-fan reviewed Jul 30, 2020

beliefer and others added 3 commits July 30, 2020 23:13

Optimize code

c096de4

Remove unreachable code path

af88ad1

Add test case for optional group.

3a98bbe

cloud-fan reviewed Jul 31, 2020

Add test cases.

82a384e

Add test cases

2125bff

cloud-fan approved these changes Aug 3, 2020

cloud-fan closed this in 42f9ee4 Aug 3, 2020

LuciferYang mentioned this pull request Sep 1, 2020

[SPARK-32762][SQL][TEST] Enhance the verification of ExpressionsSchemaSuite to sql-expression-schema.md #29608

Closed

beliefer mentioned this pull request Jan 26, 2021

[SPARK-34244][SQL] Remove the Scala function version of regexp_extract_all #31346

Closed
