
[SPARK-29655][SQL] Read bucketed tables obeys spark.sql.shuffle.partitions #26409

Closed
wants to merge 10 commits into from

Conversation

wangyum (Member) commented Nov 6, 2019

What changes were proposed in this pull request?

In order to avoid frequently changing the value of spark.sql.adaptive.shuffle.maxNumPostShufflePartitions, we usually set it much larger than spark.sql.shuffle.partitions after enabling adaptive execution. This causes some bucket map joins to lose efficacy and adds more ShuffleExchange nodes.

How to reproduce:

val bucketedTableName = "bucketed_table"
spark.range(10000).write.bucketBy(500, "id").sortBy("id").mode(org.apache.spark.sql.SaveMode.Overwrite).saveAsTable(bucketedTableName)
val bucketedTable = spark.table(bucketedTableName)
val df = spark.range(8)

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
// Spark 2.4. spark.sql.adaptive.enabled=false
// We set spark.sql.shuffle.partitions <= 500 every time based on our data in this case.
spark.conf.set("spark.sql.shuffle.partitions", 500)
bucketedTable.join(df, "id").explain()
// Since 3.0. We enabled adaptive execution and set spark.sql.adaptive.shuffle.maxNumPostShufflePartitions to a larger value to cover more cases.
spark.conf.set("spark.sql.adaptive.enabled", true)
spark.conf.set("spark.sql.adaptive.shuffle.maxNumPostShufflePartitions", 1000)
bucketedTable.join(df, "id").explain()
scala> bucketedTable.join(df, "id").explain()
== Physical Plan ==
*(4) Project [id#5L]
+- *(4) SortMergeJoin [id#5L], [id#7L], Inner
   :- *(1) Sort [id#5L ASC NULLS FIRST], false, 0
   :  +- *(1) Project [id#5L]
   :     +- *(1) Filter isnotnull(id#5L)
   :        +- *(1) ColumnarToRow
   :           +- FileScan parquet default.bucketed_table[id#5L] Batched: true, DataFilters: [isnotnull(id#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/apache-spark/spark-3.0.0-SNAPSHOT-bin-3.2.0/spark-warehou..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 500 out of 500
   +- *(3) Sort [id#7L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#7L, 500), true, [id=#49]
         +- *(2) Range (0, 8, step=1, splits=16)

vs

scala> bucketedTable.join(df, "id").explain()
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Project [id#5L]
   +- SortMergeJoin [id#5L], [id#7L], Inner
      :- Sort [id#5L ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(id#5L, 1000), true, [id=#93]
      :     +- Project [id#5L]
      :        +- Filter isnotnull(id#5L)
      :           +- FileScan parquet default.bucketed_table[id#5L] Batched: true, DataFilters: [isnotnull(id#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/apache-spark/spark-3.0.0-SNAPSHOT-bin-3.2.0/spark-warehou..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 500 out of 500
      +- Sort [id#7L ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(id#7L, 1000), true, [id=#92]
            +- Range (0, 8, step=1, splits=16)

This PR makes reading bucketed tables always obey spark.sql.shuffle.partitions, even when adaptive execution is enabled and spark.sql.adaptive.shuffle.maxNumPostShufflePartitions is set, to avoid adding more ShuffleExchange nodes.

Why are the changes needed?

To avoid degrading performance after enabling adaptive execution.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

SparkQA commented Nov 6, 2019

Test build #113307 has finished for PR 26409 at commit 16e8f5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 6, 2019

Test build #113316 has finished for PR 26409 at commit 1ba9edf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum (Member Author) commented Nov 6, 2019

cc @cloud-fan

cloud-fan (Contributor):

Can you give a step-by-step explanation of how this happens? I can't get it from either the PR description or the code changes.

cloud-fan (Contributor):

This is not an AQE problem. IIUC, we will add an extra shuffle when spark.sql.shuffle.partitions is larger than the number of buckets.

Can we make EnsureRequirements smarter and fix the underlying problem?

cloud-fan (Contributor):

And this is not a trivial problem. Imagine a table with only 1 bucket: it may be faster to add an extra shuffle to increase the parallelism.

cloud-fan (Contributor) commented Nov 11, 2019

Let's think about the expected behavior. This is truly a cost problem, but we should figure out a simple rule, as estimating cost is not realistic in Spark for now.

  1. We should not blindly avoid shuffles: very few buckets can lead to poor performance because parallelism is low.
  2. We should avoid shuffles if the number of buckets is reasonable.

Imagine we join a bucketed table with a big table. We can avoid shuffling the bucketed table if the number of buckets is reasonable, using it as the number of partitions to shuffle the big table. It's hard to define "reasonable" here, and I think it's OK to take the value of spark.sql.shuffle.partitions as the reasonable number; see the sketch below.
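
A minimal Scala sketch of this rule (the helper name and example values are illustrative, not from the patch):

// Use the bucket count as the shuffle partition number when it already gives
// at least as much parallelism as spark.sql.shuffle.partitions; otherwise
// fall back to the config value and shuffle both sides.
def expectedShufflePartitions(numBuckets: Int, numShufflePartitions: Int): Int =
  math.max(numBuckets, numShufflePartitions)

expectedShufflePartitions(500, 200) // 500: shuffle only the big table, keep the bucketing
expectedShufflePartitions(1, 200)   // 200: too few buckets, shuffle for parallelism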

// maxNumPostShufflePartitions is usually larger than numShufflePartitions,
// which causes some bucket map join lose efficacy after enabling adaptive execution.
// Please see SPARK-29655 for more details.
val expectedChildrenNumPartitions = if (conf.adaptiveExecutionEnabled) {
Contributor:

The logic is convoluted here, as we've already added the shuffles with maxNumPostShufflePartitions and then need to revert them.

Can we make the implementation clearer? Basically, we are picking the targetNumPartitions as follows (see the sketch after this list):

  1. If there are no non-shuffle children, keep the previous behavior.
  2. For non-shuffle children, get the max num partitions among them.
    2.1. If the max num partitions is larger than conf.numShufflePartitions, pick it as targetNumPartitions.
    2.2. Otherwise, pick conf.numShufflePartitions as targetNumPartitions.
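
A Scala sketch of that selection, assuming the names (children, childrenIndexes, requiredNumPartitions, childrenNumPartitions) from the surrounding EnsureRequirements code in this PR; it follows the list above, not necessarily the exact final patch:

// 1. Collect the partition counts of children that are not ShuffleExchangeExec.
val nonShuffleChildrenNumPartitions =
  childrenIndexes.map(children).filterNot(_.isInstanceOf[ShuffleExchangeExec])
    .map(_.outputPartitioning.numPartitions)
// 2. If every child is a shuffle, keep the previous behavior; otherwise take
//    the max among the non-shuffle children (2.1), bounded below by
//    conf.numShufflePartitions (2.2).
val expectedChildrenNumPartitions = if (nonShuffleChildrenNumPartitions.isEmpty) {
  childrenNumPartitions.max
} else {
  math.max(nonShuffleChildrenNumPartitions.max, conf.numShufflePartitions)
}
val targetNumPartitions = requiredNumPartitions.getOrElse(expectedChildrenNumPartitions)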

Member Author:

Yes. This implementation could make it clearer.

Member Author:

What do you think of another implementation: 45109c7? It may be more reasonable, but personally, I do not like adding a new conf.

Contributor:

It's indeed clearer to have a dedicated config, but I'd like a better solution that makes decisions automatically instead of relying on configs. We can think about it later.

@wangyum wangyum changed the title [SPARK-29655][SQL] Enable adaptive execution should not add more ShuffleExchange [SPARK-29655][SQL] Read bucketed tables obeys spark.sql.shuffle.partitions Nov 12, 2019
val targetNumPartitions = requiredNumPartitions.getOrElse(childrenNumPartitions.max)
val nonShuffleChildrenNumPartitions =
  childrenIndexes.filterNot(children(_).isInstanceOf[ShuffleExchangeExec])
    .map(children(_).outputPartitioning.numPartitions).toSet
cloud-fan (Contributor) commented Nov 12, 2019:

nit: childrenIndexes.map(children).filterNot(_.isInstanceOf[ShuffleExchangeExec])...

@@ -795,4 +804,22 @@ abstract class BucketedReadSuite extends QueryTest with SQLTestUtils {
}
}

test("Read bucketed tables obeys numShufflePartitions") {
Contributor:

Let's add the JIRA ID in the test name, as this is a regression.

SparkQA commented Nov 12, 2019

Test build #113634 has started for PR 26409 at commit 73a4943.

val targetNumPartitions = requiredNumPartitions.getOrElse(childrenNumPartitions.max)
val nonShuffleChildrenNumPartitions =
  childrenIndexes.map(children).filterNot(_.isInstanceOf[ShuffleExchangeExec])
    .map(_.outputPartitioning.numPartitions).toSet
Contributor:

nit: in practice there will be at most 2 children, so toSet is not really needed.

SparkQA commented Nov 12, 2019

Test build #113638 has started for PR 26409 at commit 8ec1518.

SparkQA commented Nov 13, 2019

Test build #113670 has finished for PR 26409 at commit a50122b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

wangyum (Member Author) commented Nov 13, 2019

retest this please

SparkQA commented Nov 13, 2019

Test build #113679 has finished for PR 26409 at commit a50122b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

childrenIndexes.map(children).filterNot(_.isInstanceOf[ShuffleExchangeExec])
  .map(_.outputPartitioning.numPartitions)
val expectedChildrenNumPartitions = if (nonShuffleChildrenNumPartitions.nonEmpty &&
    conf.maxNumPostShufflePartitions > conf.numShufflePartitions) {
Member Author:

@cloud-fan @viirya I added conf.maxNumPostShufflePartitions > conf.numShufflePartitions to fix these test failures:

org.apache.spark.sql.execution.ReduceNumShufflePartitionsSuite.determining the number of reducers: plan already partitioned(minNumPostShufflePartitions: 5)
org.apache.spark.sql.execution.ReduceNumShufflePartitionsSuite.determining the number of reducers: plan already partitioned

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/113679/testReport/

SparkQA commented Nov 13, 2019

Test build #113701 has finished for PR 26409 at commit 57c50b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 14, 2019

Test build #113768 has finished for PR 26409 at commit a4f7611.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val targetNumPartitions = requiredNumPartitions.getOrElse(childrenNumPartitions.max)
// Read bucketed tables always obeys numShufflePartitions because maxNumPostShufflePartitions
// is usually much larger than numShufflePartitions,
// which causes some bucket map join lose efficacy after enabling adaptive execution.
cloud-fan (Contributor) commented Nov 14, 2019:

The comment is hard to understand. How about

If there are non-shuffle children that satisfy the required distribution, we have some tradeoffs
when picking the expected number of shuffle partitions:
1. we should avoid shuffling these children
2. we should have a reasonable parallelism

Here we pick the max number of partitions among these non-shuffle children as the expected number
of shuffle partitions. However, if it's smaller than `conf.numShufflePartitions`, we pick 
`conf.numShufflePartitions` as the expected number of shuffle partitions.

Member Author:

Done.

SparkQA commented Nov 14, 2019

Test build #113795 has finished for PR 26409 at commit eb4f65f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 4f10e54 Nov 15, 2019
@wangyum wangyum deleted the SPARK-29655 branch November 15, 2019 08:10