
[SPARK-9853][Core] Optimize shuffle fetch of continuous partition IDs #26040

Closed · wants to merge 10 commits

Conversation

@xuanyuanking (Member) commented Oct 7, 2019:

This PR takes over #19788. After we split the shuffle fetch protocol from OpenBlock in #24565, this optimization can be extended in the new shuffle protocol. Credit to @yucai, closes #19788.

What changes were proposed in this pull request?

This PR adds the support for continuous shuffle block fetching in batch:

  • Shuffle client changes:
    • Add a new feature tag, spark.shuffle.fetchContinuousBlocksInBatch, and implement the decision logic in BlockStoreShuffleReader.
    • Merge the continuous shuffle block ids in batch if needed in ShuffleBlockFetcherIterator.
  • Shuffle server changes:
    • Add support in ExternalBlockHandler for the external shuffle service side.
    • Make ShuffleBlockResolver.getBlockData accept getting block data by range.
  • Protocol changes:
    • Add a new block ID type, ShuffleBlockBatchId, and extend the newly added shuffle fetch protocol so that a continuous range of reduce partitions can be requested as one batch.

Why are the changes needed?

In adaptive execution, one reducer may fetch multiple continuous shuffle blocks from one map output file. However, with the original approach, each reducer needs to fetch those blocks one by one, which requires many IOs and hurts performance. This PR adds support for fetching continuous shuffle blocks in one IO (a batch fetch). See the example below:

The shuffle blocks are stored as below:
[image: layout of shuffle blocks in a map output file]
The ShuffleBlockId name format is s"shuffle_$shuffleId_$mapId_$reduceId" (see BlockId.scala).

In adaptive execution, one reducer may want to read the output for reducers 5 to 14, whose block IDs range from shuffle_0_x_5 to shuffle_0_x_14.
Before this PR, Spark needs 10 disk IOs + 10 network IOs for each output file.
After this PR, Spark only needs 1 disk IO and 1 network IO, which reduces IO dramatically.
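
To make the arithmetic concrete, here is a minimal Scala sketch (illustrative only: it builds the block-name strings by hand rather than using the real BlockId classes, and mapId 3 stands in for the "x" above):

```scala
object BatchFetchExample extends App {
  val shuffleId = 0
  val mapId = 3

  // Before this PR: 10 separate block IDs, fetched one by one.
  val singleBlockIds = (5 to 14).map(r => s"shuffle_${shuffleId}_${mapId}_$r")
  singleBlockIds.foreach(println) // shuffle_0_3_5 ... shuffle_0_3_14

  // After this PR: one batch ID covering the half-open range [5, 15).
  println(s"shuffle_${shuffleId}_${mapId}_5_15")
}
```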

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.
Integration tested with spark.sql.adaptive.enabled=true.

@xuanyuanking (Member, Author):

The difference between this PR and the original one is mainly caused by the following proposals from my side:

  • Instead of treating a merged batch fetch as several blocks, this approach treats the combined batch ID as a single block. This not only keeps the FetchResult interfaces unchanged everywhere, but also removes the duplicated continuous-block-ID merging code in BlockManager and ExternalShuffleBlockHandler.
  • Extend the newly added shuffle fetch protocol by reusing the reduceIds field for ShuffleBlockBatchId; it keeps both the start and end reduce ID for the half-open range [startReduceId, endReduceId) (sketched below).
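
To illustrate the second point, a rough sketch of the range encoding (the helper name and signature are assumptions for illustration, not the actual protocol classes):

```scala
// For a batch fetch, the per-map reduceIds array carries only the two
// endpoints of the half-open range instead of listing every reduce ID.
def encodeReduceIds(batchFetch: Boolean, reduceIds: Seq[Int]): Array[Int] =
  if (batchFetch) Array(reduceIds.head, reduceIds.last + 1) // [start, end)
  else reduceIds.toArray

// encodeReduceIds(batchFetch = true, 5 to 14)  => Array(5, 15)
// encodeReduceIds(batchFetch = false, 5 to 14) => Array(5, 6, ..., 14)
```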

@xuanyuanking (Member, Author):

cc @yucai @cloud-fan @jiangxb1987

@SparkQA commented Oct 7, 2019:

Test build #111831 has finished for PR 26040 at commit 413120f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ShuffleBlockBatchId(

@cloud-fan (Contributor):

retest this please

@SparkQA commented Oct 7, 2019:

Test build #111835 has finished for PR 26040 at commit 413120f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ShuffleBlockBatchId(

@SparkQA commented Oct 8, 2019:

Test build #111891 has finished for PR 26040 at commit 407b1e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yucai (Contributor) commented Oct 9, 2019:

@xuanyuanking @cloud-fan thanks for taking care of this!

```scala
    startReduceId: Int,
    endReduceId: Int) extends BlockId {
  override def name: String = {
    "shuffle_" + shuffleId + "_" + mapId + "_" + startReduceId + "_" + endReduceId
```
Contributor:
Shall we use "_" + startReduceId + "-" + endReduceId instead? startReduceId and endReduceId belong to the same semantic group, and this also gives more room to extend ShuffleBlockBatchId in the future.

Member Author:
The format will also influence the protocol side. I suggest making that change when it's needed, in the same patch as the extension.

```scala
if (curBlocks.isEmpty) {
  curBlocks += info
} else {
  if (curBlockId.mapId != curBlocks.head.blockId.asInstanceOf[ShuffleBlockId].mapId) {
```
Contributor:
How about we keep track using preMapId, startReduceId, endReduceId, and mergedBlockSize, and avoid curBlocks: ArrayBuffer[FetchBlockInfo], since we don't need all the info in curBlocks?

Member Author:
I don't think it makes much difference, except using more vars.
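
For readers following the thread, a self-contained sketch of the merging under discussion (heavily simplified: the real ShuffleBlockFetcherIterator works with FetchBlockInfo and also tracks block sizes):

```scala
// Collapse runs of consecutive reduce IDs from the same map output into
// (mapId, startReduceId, endReduceId) batches, with endReduceId exclusive.
case class Block(mapId: Int, reduceId: Int)

def mergeContinuousBlocks(blocks: Seq[Block]): Seq[(Int, Int, Int)] =
  blocks.foldLeft(List.empty[(Int, Int, Int)]) {
    // Extend the current batch when the map matches and the reduce ID is next.
    case ((mapId, start, end) :: rest, b) if b.mapId == mapId && b.reduceId == end =>
      (mapId, start, end + 1) :: rest
    // Otherwise open a new batch covering just this block.
    case (acc, b) =>
      (b.mapId, b.reduceId, b.reduceId + 1) :: acc
  }.reverse

// mergeContinuousBlocks(Seq(Block(0, 5), Block(0, 6), Block(1, 5)))
//   => List((0, 5, 7), (1, 5, 6))
```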

@SparkQA commented Oct 16, 2019:

Test build #112148 has finished for PR 26040 at commit d9ea5af.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
val res = featureEnabled && fetchMultiPartitions &&
  serializerRelocatable && (!compressed || codecConcatenation)
if (featureEnabled && !res) {
  logDebug("The feature tag of continuous shuffle block fetching is set to true, but " +
```
@cloud-fan (Contributor) commented Oct 16, 2019:
AQE is off by default, and the continuous fetch is on by default. This means we always log. I think we should log only if the compressor/serializer is not suitable.

Contributor:
e.g. the code can be

```scala
val shouldBatchFetch = fetchMultiPartitions && context.getLocalProperties...
...
val doBatchFetch = ...
if (shouldBatchFetch && !doBatchFetch) ...
```

Member Author:
Makes sense, done in afd74e5.
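
For reference, the reshaped check might look roughly like this (a sketch following the suggestion above; the names mirror the snippet, and println stands in for logDebug):

```scala
def doBatchFetch(
    featureEnabled: Boolean,
    fetchMultiPartitions: Boolean,
    serializerRelocatable: Boolean,
    compressed: Boolean,
    codecConcatenation: Boolean): Boolean = {
  // What the user asked for vs. what the serializer/codec actually allows.
  val shouldBatchFetch = featureEnabled && fetchMultiPartitions
  val canBatchFetch = serializerRelocatable && (!compressed || codecConcatenation)
  if (shouldBatchFetch && !canBatchFetch) {
    println("Continuous shuffle block fetching is enabled, but the " +
      "serializer or codec does not support it, so it is disabled.")
  }
  shouldBatchFetch && canBatchFetch
}
```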

```diff
@@ -35,11 +35,35 @@ private[spark] class BlockStoreShuffleReader[K, C](
     readMetrics: ShuffleReadMetricsReporter,
     serializerManager: SerializerManager = SparkEnv.get.serializerManager,
     blockManager: BlockManager = SparkEnv.get.blockManager,
-    mapOutputTracker: MapOutputTracker = SparkEnv.get.mapOutputTracker)
+    mapOutputTracker: MapOutputTracker = SparkEnv.get.mapOutputTracker,
+    fetchMultiPartitions: Boolean)
```
Contributor:
To reduce the diff, how about we give it a default value of true? Existing tests won't enable batch fetch, since they don't set the local property.

Contributor:
We can even remove this parameter. The caller side can drop the special local property if it doesn't need to fetch multiple partitions.

Contributor:
Hmm, it's not good to handle the special local property in too many places. How about we read the local property on the caller side, check the number of partitions that need to be fetched, and pass a single shouldBatchFetch flag to BlockStoreShuffleReader?

Member Author:
Yeah, I'll pick the last approach; that's clearer. Also moved the property's key into SortShuffleManager.
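
A minimal sketch of that direction (the object and signature are illustrative, not the exact SortShuffleManager code; the property key reuses the feature tag named in the PR description):

```scala
object BatchFetchFlag {
  // Feature tag from the PR description, used here as the local property key.
  val FETCH_CONTINUOUS_BLOCKS_IN_BATCH = "spark.shuffle.fetchContinuousBlocksInBatch"

  // The caller reads the local property once, checks how many partitions are
  // requested, and hands a single flag to BlockStoreShuffleReader.
  def shouldBatchFetch(
      localPropertyValue: String,
      startPartition: Int,
      endPartition: Int): Boolean = {
    // parseBoolean returns false for null, i.e. when the property is unset.
    java.lang.Boolean.parseBoolean(localPropertyValue) &&
      endPartition - startPartition > 1
  }
}
```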

```diff
@@ -355,6 +355,16 @@ object SQLConf {
     .bytesConf(ByteUnit.BYTE)
     .createWithDefault(64 * 1024 * 1024)
 
+  val SHUFFLE_FETCH_CONTINUOUS_BLOCKS_IN_BATCH_ENABLED =
+    buildConf("spark.sql.adaptive.shuffle.fetchContinuousBlocksInBatch.enabled")
```
Contributor:
Instead of creating the "shuffle" namespace, how about spark.sql.adaptive.fetchShuffleBlocksInBatch.enabled?

Member Author:
Sure, done in afd74e5. There's already a spark.sql.adaptive.shuffle namespace.
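
For context, a SQLConf entry of this shape would be defined roughly as follows (a sketch inside object SQLConf; the doc text is invented here, and the default of true matches the "on by default" behavior discussed above):

```scala
val FETCH_SHUFFLE_BLOCKS_IN_BATCH_ENABLED =
  buildConf("spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled")
    .doc("Whether to fetch continuous shuffle blocks in batch.")
    .booleanConf
    .createWithDefault(true)
```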

@cloud-fan (Contributor) left a comment:
LGTM except a few code style comments.

@SparkQA commented Oct 16, 2019:

Test build #112170 has finished for PR 26040 at commit fa7c272.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 17, 2019:

Test build #112196 has finished for PR 26040 at commit afd74e5.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 17, 2019:

Test build #112197 has finished for PR 26040 at commit 8f855a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

I'm merging it. We can have a follow-up PR to address the last code style comment. Thanks for the work, @yucai and @xuanyuanking!

@cloud-fan closed this in 239ee3f on Oct 17, 2019.
@xuanyuanking (Member, Author):

Thanks for the review and help.

@yucai (Contributor) commented Oct 17, 2019:

Thanks for solving this! @xuanyuanking @cloud-fan

cloud-fan pushed a commit that referenced this pull request on Oct 18, 2019:
Regularize all the shuffle configurations related to adaptive execution

### What changes were proposed in this pull request?
1. Regularize all the shuffle configurations related to adaptive execution.
2. Add default value for `BlockStoreShuffleReader.shouldBatchFetch`.

### Why are the changes needed?
It's a follow-up PR for #26040.
Regularize the existing `spark.sql.adaptive.shuffle` namespace in SQLConf.

### Does this PR introduce any user-facing change?
Rename one released user config, spark.sql.adaptive.minNumPostShufflePartitions, to spark.sql.adaptive.shuffle.minNumPostShufflePartitions; the other changed configs have not been released yet.

### How was this patch tested?
Existing UT.

Closes #26147 from xuanyuanking/SPARK-9853.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>