
[SPARK-35329][SQL] Split generated switch code into pieces in ExpandExec #32457

Closed
wants to merge 7 commits into from

Conversation

maropu
Member

@maropu maropu commented May 6, 2021

What changes were proposed in this pull request?

This PR intends to split the generated switch code into smaller pieces in ExpandExec. In the current master, even a simple query like the one below generates a large method whose size (maxMethodCodeSize:7448) is close to 8000 (CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT):

scala> val df = Seq(("2016-03-27 19:39:34", 1, "a"), ("2016-03-27 19:39:56", 2, "a"), ("2016-03-27 19:39:27", 4, "b")).toDF("time", "value", "id")
scala> val rdf = df.select(window($"time", "10 seconds", "3 seconds", "0 second"), $"value").orderBy($"window.start".asc, $"value".desc).select("value")
scala> sql("SET spark.sql.adaptive.enabled=false")
scala> import org.apache.spark.sql.execution.debug._ 
scala> rdf.debugCodegen

Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxMethodCodeSize:7448; maxConstantPoolSize:189(0.29% used); numInnerClasses:0) ==
                                    ^^^^
*(1) Project [window#34.start AS _gen_alias_39#39, value#11]
+- *(1) Filter ((isnotnull(window#34) AND (cast(time#10 as timestamp) >= window#34.start)) AND (cast(time#10 as timestamp) < window#34.end))
   +- *(1) Expand [List(named_struct(start, precisetimestampcon...

/* 028 */   private void expand_doConsume_0(InternalRow localtablescan_row_0, UTF8String expand_expr_0_0, boolean expand_exprIsNull_0_0, int expand_expr_1_0) throws java.io.IOException {
/* 029 */     boolean expand_isNull_0 = true;
/* 030 */     InternalRow expand_value_0 =
/* 031 */     null;
/* 032 */     for (int expand_i_0 = 0; expand_i_0 < 4; expand_i_0 ++) {
/* 033 */       switch (expand_i_0) {
/* 034 */       case 0:
                  (too many code lines)
/* 517 */         break;
/* 518 */
/* 519 */       case 1:
                  (too many code lines)
/* 1002 */         break;
/* 1003 */
/* 1004 */       case 2:
                  (too many code lines)
/* 1487 */         break;
/* 1488 */
/* 1489 */       case 3:
                  (too many code lines)
/* 1972 */         break;
/* 1973 */       }
/* 1974 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[33] /* numOutputRows */).add(1);
/* 1975 */
/* 1976 */       do {
/* 1977 */         boolean filter_value_2 = !expand_isNull_0;
/* 1978 */         if (!filter_value_2) continue;

The fix in this PR can make the method smaller as follows;

Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxMethodCodeSize:1713; maxConstantPoolSize:210(0.32% used); numInnerClasses:0) ==
                                    ^^^^
*(1) Project [window#17.start AS _gen_alias_32#32, value#11]
+- *(1) Filter ((isnotnull(window#17) AND (cast(time#10 as timestamp) >= window#17.start)) AND (cast(time#10 as timestamp) < window#17.end))
   +- *(1) Expand [List(named_struct(start, precisetimestampcon...

/* 032 */   private void expand_doConsume_0(InternalRow localtablescan_row_0, UTF8String expand_expr_0_0, boolean expand_exprIsNull_0_0, int expand_expr_1_0) throws java.io.IOException {
/* 033 */     for (int expand_i_0 = 0; expand_i_0 < 4; expand_i_0 ++) {
/* 034 */       switch (expand_i_0) {
/* 035 */       case 0:
/* 036 */         expand_switchCaseCode_0(expand_exprIsNull_0_0, expand_expr_0_0);
/* 037 */         break;
/* 038 */
/* 039 */       case 1:
/* 040 */         expand_switchCaseCode_1(expand_exprIsNull_0_0, expand_expr_0_0);
/* 041 */         break;
/* 042 */
/* 043 */       case 2:
/* 044 */         expand_switchCaseCode_2(expand_exprIsNull_0_0, expand_expr_0_0);
/* 045 */         break;
/* 046 */
/* 047 */       case 3:
/* 048 */         expand_switchCaseCode_3(expand_exprIsNull_0_0, expand_expr_0_0);
/* 049 */         break;
/* 050 */       }
/* 051 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[33] /* numOutputRows */).add(1);
/* 052 */
/* 053 */       do {
/* 054 */         boolean filter_value_2 = !expand_resultIsNull_0;
/* 055 */         if (!filter_value_2) continue;
/* 056 */
...
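The shape of the fix can be sketched in plain Java (hypothetical names and bodies — the real methods are emitted by Spark's CodeGenerator). HotSpot declines to JIT-compile a method whose bytecode exceeds -XX:HugeMethodLimit (8000 bytes by default), so moving each switch arm into its own small method keeps every generated method compilable:

```java
// Hedged sketch, not Spark's actual generated code: splitting a large
// switch into per-case methods, mirroring the before/after shown above.
public class ExpandSketch {
    static long numOutputRows = 0;
    static int accumulator = 0;

    // Before the fix, the bodies of all four cases were inlined here,
    // pushing this single method past HotSpot's huge-method limit.
    static void doConsume(int value) {
        for (int i = 0; i < 4; i++) {
            switch (i) {
                case 0: switchCaseCode0(value); break;
                case 1: switchCaseCode1(value); break;
                case 2: switchCaseCode2(value); break;
                case 3: switchCaseCode3(value); break;
            }
            numOutputRows++; // the metric update stays in the caller
        }
    }

    // In the real generated code each body is hundreds of lines; each
    // extracted method stays small enough for the JIT to compile.
    static void switchCaseCode0(int v) { accumulator += v; }
    static void switchCaseCode1(int v) { accumulator += v * 2; }
    static void switchCaseCode2(int v) { accumulator += v * 3; }
    static void switchCaseCode3(int v) { accumulator += v * 4; }
}
```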

Why are the changes needed?

For better generated code.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

GA passed.

@github-actions github-actions bot added the SQL label May 6, 2021
@SparkQA

SparkQA commented May 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42736/

@SparkQA

SparkQA commented May 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42736/

@SparkQA

SparkQA commented May 6, 2021

Test build #138214 has finished for PR 32457 at commit 5448708.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42742/

@SparkQA

SparkQA commented May 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42742/

@SparkQA

SparkQA commented May 7, 2021

Test build #138220 has finished for PR 32457 at commit cb182b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented May 7, 2021

This code seems to generate a number of methods equal to the number of columns. Am I correct?
Does this splitting cause any performance degradation? If that is a possibility, it would be good to introduce a threshold.

@viirya
Member

viirya commented May 7, 2021

The number of added methods should be the number of columns (distinct outputs) * the number of projections? Maybe, as @kiszk said, add a threshold?

@viirya viirya (Member) left a comment

Looks okay otherwise.

@viirya viirya (Member) left a comment

Another idea: maybe add one method for all columns per projection, instead of one per column?

@kiszk
Member

kiszk commented May 7, 2021

Totally. Two points:

  • Creating many methods may degrade performance.
  • Creating many methods may exceed JVM limitations.

I am a bit afraid that the number of methods may go over the per-class limit if there are multiple projections. For example, 100 columns x 700 projections (65535 methods per class is the upper limit; see https://docs.oracle.com/javase/specs/jvms/se16/html/jvms-4.html#jvms-4.11).
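This concern can be checked with back-of-envelope arithmetic (illustrative figures only): the class-file format stores methods_count as an unsigned 16-bit value, so a scheme that emits one method per column per projection can overflow the per-class limit:

```java
// Illustrative arithmetic only: a method-per-column-per-projection
// scheme can exceed the JVM's per-class method limit.
public class MethodCountCheck {
    // methods_count is a u2 in the class file, so at most 65535 methods.
    static final int MAX_METHODS_PER_CLASS = 65535;

    static boolean exceedsLimit(int columns, int projections) {
        return columns * projections > MAX_METHODS_PER_CLASS;
    }
}
```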

@maropu
Member Author

maropu commented May 7, 2021

@kiszk @viirya Yea, you're right; the previous approach seems bad. So I applied two improvements based on your suggestions:

  • Checks if eval code size goes over the limit (spark.sql.codegen.methodSplitThreshold)
  • Creates a single method per projection (@viirya suggested above)

Could you check the latest code?
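A minimal sketch of the split decision described above, with assumed names (in Spark the threshold comes from spark.sql.codegen.methodSplitThreshold, and parameter-length validity is checked by CodeGenerator.isValidParamLength):

```java
// Sketch (hypothetical names) of the two guards described above: split
// only when the total switch-case code size exceeds the configured
// threshold AND the extracted method's parameter list is valid.
public class SplitDecision {
    // JVM limit: a method descriptor allows at most 255 parameter slots
    // (longs/doubles count as two slots each).
    static final int MAX_JVM_METHOD_PARAM_LENGTH = 255;

    static boolean shouldSplit(int[] caseCodeSizes, int paramLength,
                               int splitThreshold) {
        int total = 0;
        for (int size : caseCodeSizes) {
            total += size;
        }
        return total > splitThreshold
            && paramLength <= MAX_JVM_METHOD_PARAM_LENGTH;
    }
}
```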

@SparkQA

SparkQA commented May 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42779/

@SparkQA

SparkQA commented May 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42779/

@SparkQA

SparkQA commented May 7, 2021

Test build #138257 has finished for PR 32457 at commit 18e87d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42791/

@SparkQA

SparkQA commented May 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42791/

@SparkQA

SparkQA commented May 8, 2021

Test build #138269 has finished for PR 32457 at commit a1873d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented May 9, 2021

Is the description updated? Looks good except for that.

val inputVars = inputVarSets.foldLeft(Set.empty[VariableValue])(_ ++ _)
val paramLength = CodeGenerator.calculateParamLengthFromExprValues(inputVars.toSeq)
val maybeSplitUpdateCode = if (CodeGenerator.isValidParamLength(paramLength) &&
exprCodesWithIndices.map(_._2.code.length).sum > splitThreshold) {
A reviewer (Member) commented:

Here we only check whether the code under the current switch case is under the threshold. It seems to me we need to accumulate the non-split code too; if the accumulated code goes over the threshold, we need to split it out in later switch cases.

maropu (Member Author) replied:

I see. That's a good insight. More simply, how about checking if the overall size of switch cases goes over the limit? Could you check the latest code?

@viirya viirya (Member) left a comment

lgtm if CI passes.

@SparkQA

SparkQA commented May 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42868/

@SparkQA

SparkQA commented May 11, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42868/

@SparkQA

SparkQA commented May 11, 2021

Test build #138346 has finished for PR 32457 at commit b01bb8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented May 11, 2021

@kiszk Do you want to take another look before I merge this? There have been some changes since you approved it.

val argList = inputVars.map { v =>
s"${CodeGenerator.typeName(v.javaType)} ${v.variableName}"
val splitThreshold = SQLConf.get.methodSplitThreshold
val cases = if (switchCaseExprs.flatMap(_._2.map(_._2.code.length)).sum > splitThreshold) {
A reviewer (Member) commented:

Can we call generateUpdateCode only once, before line 198?
IMHO, the code in all three cases (lines 203-, 216-, and 225-) is generated by generateUpdateCode().

@SparkQA

SparkQA commented May 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42943/

@SparkQA

SparkQA commented May 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42943/

@SparkQA

SparkQA commented May 12, 2021

Test build #138421 has finished for PR 32457 at commit cdbbffe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented May 12, 2021

Looks like a related test failure.

@maropu maropu force-pushed the splitSwitchCode branch from cdbbffe to 9e74ca6 Compare May 12, 2021 07:07
@SparkQA

SparkQA commented May 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42958/

@SparkQA

SparkQA commented May 12, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42958/

@SparkQA

SparkQA commented May 12, 2021

Test build #138437 has finished for PR 32457 at commit 9e74ca6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented May 13, 2021

LGTM if the failures disappear.

@viirya viirya (Member) left a comment

lgtm

@viirya
Member

viirya commented May 14, 2021

Thanks @maropu and @kiszk! Merging to master.

@viirya viirya closed this in 8fa739f May 14, 2021
@maropu
Member Author

maropu commented May 14, 2021

Thank you, @viirya and @kiszk ~

4 participants