[SPARK-25341][Core] Support rolling back a shuffle map stage and re-generate the shuffle files #25620

xuanyuanking · 2019-08-29T14:23:25Z

With the newly added shuffle block fetching protocol in #24565, we can continue this work by extending the FetchShuffleBlocks message.

What changes were proposed in this pull request?

In this patch, we support rerunning an indeterminate shuffle map stage by using the task attempt id (a unique id within an application) as part of the shuffle block id, so that each shuffle write attempt has a different file name. For an indeterminate stage, when the stage is resubmitted, we clear all existing map statuses and rerun all partitions.

All changes are summarized as follows:

Change the mapId to mapTaskAttemptId in shuffle related id.
Record the mapTaskAttemptId in MapStatus.
Still keep mapId in ShuffleBlockFetcherIterator for the fetch-failed scenario.
Add the determinate flag in Stage and use it in DAGScheduler and the cleaning work for the intermediate stage.
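
To illustrate the first item, here is a minimal sketch of why keying the block name by task attempt id avoids collisions between write attempts. It assumes the `shuffle_<shuffleId>_<mapId>_<reduceId>` naming pattern; `ShuffleNameSketch`, `blockName`, and the attempt ids are hypothetical names and values for illustration, not the classes touched by this patch.

```scala
// Toy model of shuffle block naming; illustrative only, not Spark's BlockId classes.
object ShuffleNameSketch {
  // Builds a block name from the shuffle id, a map identifier, and the reduce id.
  def blockName(shuffleId: Int, mapId: Long, reduceId: Int): String =
    s"shuffle_${shuffleId}_${mapId}_${reduceId}"

  def main(args: Array[String]): Unit = {
    val mapIndex = 7            // position of the map task within its stage
    val firstAttemptId = 41L    // hypothetical task attempt ids
    val retryAttemptId = 95L

    // Keyed by map index, every attempt of the same task resolves to one name.
    println(blockName(3, mapIndex, 0))
    // Keyed by task attempt id, each attempt gets its own name.
    println(blockName(3, firstAttemptId, 0))
    println(blockName(3, retryAttemptId, 0))
  }
}
```

Because the attempt id is unique within an application, a rerun never overwrites or silently reuses the files of an earlier attempt.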

Why are the changes needed?

This is follow-up work for the future improvement noted in #22112 [1]: currently we can't roll back and rerun a shuffle map stage, so the job just fails.

Spark reruns a finished shuffle write stage when it meets fetch failures. Currently, the rerun shuffle map stage only resubmits tasks for the missing partitions and reuses the output of the other partitions. This logic is fine in most scenarios, but for indeterminate operations (like repartition), multiple shuffle write attempts may write different data, so rerunning only the missing partitions leads to a correctness bug. For the shuffle map stage of indeterminate operations, we therefore need to support rolling back the shuffle map stage and regenerating the shuffle files.
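
The correctness problem can be reproduced without Spark in a toy model. The sketch below assumes a single map task that assigns records to reducers round-robin by arrival position (loosely modelled on repartition) and two write attempts that see their input in different orders. `IndeterminateRerunSketch` and everything in it is hypothetical.

```scala
// Toy simulation of mixing two write attempts of an indeterminate map task.
object IndeterminateRerunSketch {
  // Round-robin partitioning by arrival position: the reducer a record lands in
  // depends on the order in which the map task happens to see its input.
  def mapOutput(input: Seq[Int], numReducers: Int): Map[Int, Seq[Int]] =
    input.zipWithIndex
      .groupBy { case (_, pos) => pos % numReducers }
      .map { case (reducer, recs) => reducer -> recs.map(_._1) }

  // Same records, different arrival order in the two attempts.
  val attemptA: Map[Int, Seq[Int]] = mapOutput(Seq(0, 1, 2, 3, 4, 5), 2)
  val attemptB: Map[Int, Seq[Int]] = mapOutput(Seq(1, 0, 3, 2, 5, 4), 2)

  // Reducer 0 already consumed attempt A; only reducer 1 reads the regenerated output.
  val mixed: Seq[Int] = attemptA(0) ++ attemptB(1)
  // Whole-stage rerun: every reducer reads from the same attempt.
  val consistent: Seq[Int] = attemptB(0) ++ attemptB(1)

  def main(args: Array[String]): Unit = {
    println(s"mixed attempts: ${mixed.sorted}")
    println(s"single attempt: ${consistent.sorted}")
  }
}
```

In the mixed case some records appear twice and others are lost, which is the kind of wrong answer the manual test in this PR is meant to expose.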

Does this PR introduce any user-facing change?

Yes. After this PR, an indeterminate stage rerun is handled by rerunning the whole stage. The original behavior is to abort the stage and fail the job.

How was this patch tested?

UT: Added UTs for all changed code and newly added functions.
Manual Test: Also provided a manual test to verify the effect.

import scala.sys.process._
import org.apache.spark.TaskContext

val determinateStage0 = sc.parallelize(0 until 1000 * 1000 * 100, 10)
val indeterminateStage1 = determinateStage0.repartition(200)
val indeterminateStage2 = indeterminateStage1.repartition(200)
val indeterminateStage3 = indeterminateStage2.repartition(100)
val indeterminateStage4 = indeterminateStage3.repartition(300)
val fetchFailIndeterminateStage4 = indeterminateStage4.map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId == 190 &&
      TaskContext.get.stageAttemptNumber == 0) {
    throw new Exception("pkill -f -n java".!!)
  }
  x
}
val indeterminateStage5 = fetchFailIndeterminateStage4.repartition(200)
val finalStage6 = indeterminateStage5.repartition(100).collect().distinct.length

It's a simple job with multiple indeterminate stages. It gets a wrong answer on old Spark versions like 2.2/2.3, and is killed after #22112. With this fix, the job retries all indeterminate stages, as shown in the screenshot below, and gets the right result.

SparkQA · 2019-08-29T14:40:54Z

Test build #109907 has finished for PR 25620 at commit c6cbb06.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FetchRequest(address: BlockManagerId, blocks: Seq[(BlockId, Long, Int)])

xuanyuanking · 2019-08-29T14:41:53Z

core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala

   */
-  private[this] val numMapsForShuffle = new ConcurrentHashMap[Int, Int]()
+  private[this] val taskIdMapsForShuffle = new ConcurrentHashMap[Int, ArrayBuffer[Long]]()


After using the map task attempt id as part of the shuffle file name, we must record all the map task attempt ids for each shuffle id here. Compared with the original implementation, this wastes some memory. But considering the number of shuffle map tasks, maybe it's an acceptable change?

mostly it's hundreds or thousands of mappers, which is about several KB. I think it's fine. If there are many many mappers, our DAG scheduler will burn out first.

Well, @xuanyuanking should already be aware that 100K mappers is not that rare for large production jobs. That would be ~10MB for a single map stage.

Maybe we should remove old shuffleId data just like the scheduler removes old stages. However, I do believe it's fine for now. Let's revisit this when it actually hits.

Yeah, that's why I highlighted this question here, because we have seen such huge jobs before. :)
In the current implementation of ContextCleaner, cleanup of the in-memory shuffle metrics is bound to JVM GC, so the config spark.cleaner.periodicGC.interval can help us.
Sure, let's keep tracking.

but this map is only touched on executors, right? so it's only the number of tasks which run on one executor which matters here.

This doesn't seem too worrying at first (even 10MB per stage isn't that much overhead if it's cleaned up eventually). But perhaps using OpenHashSet can help with larger stages (vs. an ArrayBuffer), although it will use more memory for smaller ones.

Thanks for the guidance, changed it to use OpenHashSet with a smaller initial size of 16.

core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

xuanyuanking · 2019-08-29T14:50:09Z

Following @squito's suggestion in #24892 (comment), this PR reuses the map task attempt id as part of the shuffle file name.
cc @cloud-fan @vanzin

cloud-fan · 2019-08-29T14:58:13Z

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java

@@ -300,7 +300,7 @@ public ShuffleMetrics() {
    }

    ManagedBufferIterator(FetchShuffleBlocks msg, int numBlockIds) {
-      final int[] mapIdAndReduceIds = new int[2 * numBlockIds];
+      final long[] mapIdAndReduceIds = new long[2 * numBlockIds];


maybe we should use one long[] for map id and one int[] for reduce id.

Actually we already have a long[] for map ids and an int[] for reduce ids in the message; what we need here is a kind of assembly work that flattens each reduce id together with its corresponding map id.
The current way wastes memory. We could also do it in a CPU-consuming way: for each index, calculate which map id and reduce id correspond to it.

After taking a further look, I split out the managed buffer iterator for the new protocol in 539d725, which gives us more flexibility to control the iterator and creates no extra array.
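
A rough sketch of the cursor-based idea, for readers following along: instead of flattening into one array, walk the map ids and their per-map reduce ids with two indices. This assumes one array of reduce ids per map id; `FlatBlockIteratorSketch` and `MapReducePairs` are hypothetical names and this is not the ShuffleManagedBufferIterator from the patch.

```scala
// Illustrative two-cursor iterator over (mapId, reduceId) pairs.
object FlatBlockIteratorSketch {
  final class MapReducePairs(mapIds: Array[Long], reduceIds: Array[Array[Int]])
      extends Iterator[(Long, Int)] {
    require(mapIds.length == reduceIds.length, "one reduce-id array per map id")

    private var mapIdx = 0
    private var reduceIdx = 0

    // Move to the next map id once its reduce ids are exhausted
    // (this also skips map ids whose reduce-id array is empty).
    private def advance(): Unit =
      while (mapIdx < mapIds.length && reduceIdx >= reduceIds(mapIdx).length) {
        mapIdx += 1
        reduceIdx = 0
      }

    override def hasNext: Boolean = {
      advance()
      mapIdx < mapIds.length
    }

    override def next(): (Long, Int) = {
      if (!hasNext) throw new NoSuchElementException("no more blocks")
      val pair = (mapIds(mapIdx), reduceIds(mapIdx)(reduceIdx))
      reduceIdx += 1
      pair
    }
  }

  def main(args: Array[String]): Unit = {
    val it = new MapReducePairs(Array(10L, 11L), Array(Array(0, 1), Array(2)))
    it.foreach { case (mapId, reduceId) => println(s"map=$mapId reduce=$reduceId") }
  }
}
```

The trade-off is a little bookkeeping per call in exchange for not allocating an array of size 2 * numBlockIds.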

cloud-fan · 2019-08-29T14:59:45Z

core/src/main/java/org/apache/spark/shuffle/api/ShuffleExecutorComponents.java

   */
  ShuffleMapOutputWriter createMapOutputWriter(
      int shuffleId,
-      int mapId,
      long mapTaskAttemptId,


shall we name it mapId? To be consistent with the codebase.

The original mapId is still used in mapOutputTracker and the scheduler. I worry that people may be confused if these two ids use the same name?

I changed all the mapTaskAttemptId stuff to mapId in 539d725.
So after the change, mapId is the unique id of a map task. If we think it's confusing to also have a mapId that represents the map index within a stage or a task set, mapIndex may be a much better name for that.

cloud-fan · 2019-08-29T15:08:15Z

project/MimaExcludes.scala

+    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.image.ImageSchema.readImages"),
+
+    // [SPARK-25341][CORE] Support rolling back a shuffle map stage and re-generate the shuffle files
+    ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.shuffle.sort.UnsafeShuffleWriter.this"),


I'm surprised that this is tracked by mima. It's obviously an internal class. cc @srowen @HyukjinKwon

You can probably customize what is ignored by adding some logic to GenerateMIMAIgnore. I think it's OK to just add exclusions here too.

Maybe GenerateMIMAIgnore isn't really ignoring classes annotated with @Private.

SparkQA · 2019-08-29T17:09:34Z

Test build #109909 has finished for PR 25620 at commit fa25005.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-08-30T10:13:16Z

Test build #109940 has finished for PR 25620 at commit 539d725.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-08-30T12:47:38Z

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java

@@ -106,7 +106,7 @@ protected void handleMessage(
            numBlockIds += ids.length;
          }
          streamId = streamManager.registerStream(client.getClientId(),
-            new ManagedBufferIterator(msg, numBlockIds), client.getChannel());
+            new ShuffleManagedBufferIterator(msg), client.getChannel());


we can also remove

numBlockIds = 0;
for (int[] ids: msg.reduceIds) {
  numBlockIds += ids.length;
}

numBlockIds is still used in the callback:

callback.onSuccess(new StreamHandle(streamId, numBlockIds).toByteBuffer());

core/src/main/java/org/apache/spark/shuffle/api/ShuffleExecutorComponents.java

core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

cloud-fan · 2019-08-30T13:12:27Z

core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

@@ -706,6 +714,7 @@ object ShuffleBlockFetcherIterator {
   */
  private[storage] case class SuccessFetchResult(
      blockId: BlockId,
+      mapId: Int,


why do we need the map index here?

if I follow correctly, the reason is that even a SuccessFetchResult still sometimes results in a FetchFailure back to driver (eg. error decompressing the buffer). And the FetchFailure needs the mapIndex, because the mapstatus is still stored by mapIndex, so this tells us what we need to remove in the handling in DAGScheduler.

Yeah, that's right. Here we need to guarantee that every path to throwFetchFailedException has the mapIndex passed through, since even a SuccessFetchResult can still trigger a fetch failed exception.

squito

do I understand right that here in DAGScheduler:

spark/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

Lines 1534 to 1537 in ea90ea6

    
    } else if (mapId != -1) {
      // Mark the map whose fetch failed as broken in the map stage
      mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
    }

you'd only have mapId == -1 if using the old shuffle protocol? is that worth an assert?

I also find it tough to follow whether an id is the index within the stage or the global task id -- @cloud-fan pointed out a couple of cases where things could be named mapIndex. It's unfortunate we already have confusing names here ... what do you think of using mapTid consistently for all the places you mean the global id? I am at least used to seeing "TID" in spark logs for the global id, so maybe that would make it more clear?

squito · 2019-08-30T18:45:29Z

core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala

-      handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
+    taskIdMapsForShuffle.synchronized {
+      taskIdMapsForShuffle.putIfAbsent(handle.shuffleId, ArrayBuffer.empty[Long])
+      taskIdMapsForShuffle.get(handle.shuffleId).append(context.taskAttemptId())


you're trying to protect concurrent access to the ArrayBuffer[Long] with that synchronized block, right? minor, but you could avoid locking the entire map, and instead do

val mapTaskIds = taskIdMapsForShuffle.putIfAbsent(handle.shuffleId, ArrayBuffer.empty[Long])
mapTaskIds.synchronized {
  mapTaskIds.append(context.taskAttemptId())
}

Yep, many thanks for the optimization advice! Done in 4bd9e00.
(I think you meant taskIdMapsForShuffle.computeIfAbsent; I used that in the new commit.)
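
For reference, a small standalone sketch of the per-buffer locking pattern discussed in this thread. The reason computeIfAbsent fits better than putIfAbsent is that `ConcurrentHashMap.putIfAbsent` returns the previous value, which is null the first time a key is inserted, while `computeIfAbsent` returns the buffer that is actually in the map. `TaskIdRegistrySketch`, `register`, and `registered` are hypothetical names, not the SortShuffleManager code.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable.ArrayBuffer

// Illustrative registry of task attempt ids per shuffle id.
object TaskIdRegistrySketch {
  private val taskIdsForShuffle = new ConcurrentHashMap[Int, ArrayBuffer[Long]]()

  def register(shuffleId: Int, taskAttemptId: Long): Unit = {
    // computeIfAbsent returns the existing buffer or the newly created one,
    // so the result is never null here.
    val ids = taskIdsForShuffle.computeIfAbsent(shuffleId, (_: Int) => ArrayBuffer.empty[Long])
    // Lock only this shuffle's buffer rather than the whole map.
    ids.synchronized { ids += taskAttemptId }
  }

  def registered(shuffleId: Int): Seq[Long] = {
    val ids = taskIdsForShuffle.get(shuffleId)
    if (ids == null) Seq.empty else ids.synchronized { ids.toList }
  }

  def main(args: Array[String]): Unit = {
    register(1, 5L)
    register(1, 6L)
    println(registered(1))
  }
}
```

Locking the individual buffer keeps registrations for different shuffle ids from contending with each other.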

squito · 2019-08-30T18:47:06Z

core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala

   */
-  private[this] val numMapsForShuffle = new ConcurrentHashMap[Int, Int]()
+  private[this] val taskIdMapsForShuffle = new ConcurrentHashMap[Int, ArrayBuffer[Long]]()


but this map is only touched on executors, right? so it's only the number of tasks which run on one executor which matters here.

squito · 2019-08-30T18:55:30Z

core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

@@ -706,6 +714,7 @@ object ShuffleBlockFetcherIterator {
   */
  private[storage] case class SuccessFetchResult(
      blockId: BlockId,
+      mapId: Int,


if I follow correctly, the reason is that even a SuccessFetchResult still sometimes results in a FetchFailure back to driver (eg. error decompressing the buffer). And the FetchFailure needs the mapIndex, because the mapstatus is still stored by mapIndex, so this tells us what we need to remove in the handling in DAGScheduler.

xuanyuanking · 2019-08-31T13:36:14Z

@squito Thanks for reviewing this. Following @cloud-fan's suggestion, I'll do follow-up work to normalize all the names using mapIndex and mapId: mapIndex indicates the index of the map task in the task set or stage, and mapId refers to the unique id of the task. WDYT?

SparkQA · 2019-08-31T15:43:32Z

Test build #109986 has finished for PR 25620 at commit 4bd9e00.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-09-03T06:17:00Z

I agree with @squito that mapId is vague now. How about mapTaskId and mapIndex?

jiangxb1987 · 2019-09-03T23:13:16Z

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java

+
+    @Override
+    public boolean hasNext() {
+      // mapIds.length must equal to reduceIds.length, and the passed in FetchShuffleBlocks


Shall we add check logic here to be safe?

Does Xingbo mean a double check here? Basically there are existing checks for both the length and non-emptiness.

spark/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/FetchShuffleBlocks.java

Line 51 in 36f8e53

assert(mapIds.length == reduceIds.length);

spark/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java

Lines 88 to 90 in 36f8e53

if (blockIds.length == 0) {
  throw new IllegalArgumentException("Zero-sized blockIds array");
}

Yea if the place is not super performance critical I'd prefer a double check here.

Sure, added the double check in 00e78b2.

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java

core/src/main/java/org/apache/spark/shuffle/api/ShuffleExecutorComponents.java

jiangxb1987 · 2019-09-03T23:51:29Z

core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala

+    // Before find missing partition, do the intermediate state clean work first.
+    // The operation here can make sure for the intermediate stage, `findMissingPartitions()`
+    // returns all partitions every time.
+    stage match {


Just curious, why not do unregister during failure handling?

That's for the ExecutorLost scenario. When an executor is lost, it's possible for an indeterminate stage rerun to be triggered by submitParentStage.
So if we only unregister during failure handling, only the fetch-failed stage and its parent stage are unregistered; that logic would not cover the scenario where its parent's parent stage is indeterminate and has missing tasks.

It seems we can still put the unregister logic into the block:
https://github.com/apache/spark/pull/25620/files#diff-6a9ff7fb74fd490a50462d45db2d5e11R1626 ?

Actually the place pointed out by Xingbo was just my first attempt :). During integration testing I found that it's not enough to fix this problem.
This is because the logic that calculates stagesToRollback in collectStagesToRollback only cares about the downstream stages of the current fetch-failed stage. Putting the unregister logic in failure handling doesn't cover upstream indeterminate stages, so the correctness bug would still happen.
The newly added UT "SPARK-25341: retry all the succeeding stages when the map stage is indeterminate" also covers this check; if we do the unregister in failure handling, the UT fails.

I'm a little concerned about putting this here, as you'll see lower down in this method there is some handling for the case that submitMissingTasks is called but there are actually no tasks to run. I'm not seeing how that happens now, but your change would make those cases always re-evaluate all partitions of the stage.

I think @jiangxb1987's suggestion makes sense, couldn't you do the unregistering there? I agree the logic is currently insufficient as it's not building up the full set of stages that need to be recomputed, but maybe we need to combine both.

or maybe we understand the old cases of submitting a stage with no missing partitions and my concern is not relevant?

Thanks for the detailed comment.
I tried to combine both at first, but it ended up as somewhat duplicated code.
For the concern about submitting a stage with no missing partitions, I now check shuffleMapStage.isAvailable before the unregister in aa3a409; what we need here is to make sure a partially completed indeterminate stage is rerun as a whole stage.
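
To make the intended rule concrete, here is a toy model of it, written under the assumption that the check is "indeterminate and not fully available means drop every registered output before looking for missing partitions". `ToyMapStage` and `prepareForResubmit` are hypothetical; this is not the DAGScheduler or Stage code.

```scala
// Toy stand-in for a shuffle map stage and its registered map outputs.
object StageRerunSketch {
  final class ToyMapStage(val numPartitions: Int, val isIndeterminate: Boolean) {
    private val outputs = Array.fill[Option[String]](numPartitions)(None)

    def register(partition: Int, location: String): Unit = {
      outputs(partition) = Some(location)
    }

    def unregister(partition: Int): Unit = {
      outputs(partition) = None
    }

    def isAvailable: Boolean = outputs.forall(_.isDefined)

    def findMissingPartitions(): Seq[Int] =
      outputs.indices.filter(i => outputs(i).isEmpty)

    // A partially completed indeterminate stage drops all of its outputs, so that
    // findMissingPartitions() returns every partition. A fully available stage and
    // a determinate stage are left untouched.
    def prepareForResubmit(): Unit =
      if (isIndeterminate && !isAvailable) outputs.indices.foreach(unregister)
  }

  def main(args: Array[String]): Unit = {
    val stage = new ToyMapStage(4, isIndeterminate = true)
    (0 until 4).foreach(p => stage.register(p, "exec-1"))
    stage.unregister(2) // simulate one lost map output
    stage.prepareForResubmit()
    println(stage.findMissingPartitions())
  }
}
```

The isAvailable guard is what keeps a stage with no missing partitions from being recomputed from scratch.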

vanzin

I see people already commented on the mapId vs. mapIndex thing, so not gonna touch that.

vanzin · 2019-09-04T18:36:53Z

core/src/main/scala/org/apache/spark/MapOutputTracker.scala

+      startPartition: Int,
+      endPartition: Int,
+      useOldFetchProtocol: Boolean)
+    : Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])] = {


nit: indent more or less. It should be at a different indent level than the method body.

Thanks, done in 3bfb6e6.
I was pretty confused before about how to indent the line starting with :, thanks for your guidance :)

core/src/main/scala/org/apache/spark/internal/config/package.scala

core/src/main/scala/org/apache/spark/scheduler/Stage.scala

vanzin · 2019-09-04T18:52:40Z

core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala

   */
-  private[this] val numMapsForShuffle = new ConcurrentHashMap[Int, Int]()
+  private[this] val taskIdMapsForShuffle = new ConcurrentHashMap[Int, ArrayBuffer[Long]]()


This doesn't seem too worrying at first (even 10MB per stage isn't that much overhead if it's cleaned up eventually). But perhaps using OpenHashSet can help with larger stages (vs. an ArrayBuffer), although it will use more memory for smaller ones.

core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala

vanzin · 2019-09-04T19:05:36Z

project/MimaExcludes.scala

+    ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.image.ImageSchema.readImages"),
+
+    // [SPARK-25341][CORE] Support rolling back a shuffle map stage and re-generate the shuffle files
+    ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.shuffle.sort.UnsafeShuffleWriter.this"),


Maybe GenerateMIMAIgnore isn't really ignoring classes annotated with @Private.

vanzin · 2019-09-04T19:07:32Z

core/src/main/scala/org/apache/spark/internal/config/package.scala

@@ -1047,6 +1047,14 @@ package object config {
      .checkValue(v => v > 0, "The value should be a positive integer.")
      .createWithDefault(2000)

+  private[spark] val SHUFFLE_USE_OLD_FETCH_PROTOCOL =


What happens if you connect to an old shuffle service without setting this? Will things just fail (and always fail)?

Probably ok if they do. Although a more user-friendly error, if possible, might be good.

> What happens if you connect to an old shuffle service without setting this? Will things just fail (and always fail)?

We'll always fail with UnsupportedOperationException: Unexpected message: FetchShuffleBlocks. This work was done in #24565.

> Although a more user-friendly error, if possible, might be good.

Yeah, in this PR we have a detailed log in DAGScheduler.
Let me think about how to add a more user-friendly error for the old shuffle service in follow-up work; currently, we only have some documentation in the migration guide. https://github.com/apache/spark/pull/24565/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9R139

xuanyuanking · 2019-09-05T03:15:27Z

Many thanks for all the comments; something came up yesterday. Today (Beijing time) I'll address the comments from Xingbo and Vanzin and make all the name changes suggested by Wenchen and Squito.

SparkQA · 2019-09-05T13:13:07Z

Test build #110185 has finished for PR 25620 at commit b527fe7.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ShuffleBlockId(shuffleId: Int, mapTaskId: Long, reduceId: Int) extends BlockId
case class ShuffleDataBlockId(shuffleId: Int, mapTaskId: Long, reduceId: Int) extends BlockId
case class ShuffleIndexBlockId(shuffleId: Int, mapTaskId: Long, reduceId: Int) extends BlockId

SparkQA · 2019-09-05T16:02:29Z

Test build #110186 has finished for PR 25620 at commit 3bfb6e6.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ShuffleBlockId(shuffleId: Int, mapTaskId: Long, reduceId: Int) extends BlockId
case class ShuffleDataBlockId(shuffleId: Int, mapTaskId: Long, reduceId: Int) extends BlockId
case class ShuffleIndexBlockId(shuffleId: Int, mapTaskId: Long, reduceId: Int) extends BlockId

SparkQA · 2019-09-06T09:48:57Z

Test build #110229 has finished for PR 25620 at commit 8b51720.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockHandler.java

cloud-fan · 2019-09-17T16:41:33Z

core/src/main/scala/org/apache/spark/MapOutputTracker.scala

@@ -88,23 +88,23 @@ private class ShuffleStatus(numPartitions: Int) {
   * Register a map output. If there is already a registered location for the map output then it
   * will be replaced by the new location.
   */
-  def addMapOutput(mapId: Int, status: MapStatus): Unit = synchronized {
-    if (mapStatuses(mapId) == null) {
+  def addMapOutput(mapIndex: Int, status: MapStatus): Unit = synchronized {


This is a place that I think mapIndex makes sense.

Thanks for pointing this out.

cloud-fan · 2019-09-17T16:41:44Z

core/src/main/scala/org/apache/spark/MapOutputTracker.scala

  }

  /**
   * Remove the map output which was served by the specified block manager.
   * This is a no-op if there is no registered map output or if the registered output is from a
   * different block manager.
   */
-  def removeMapOutput(mapId: Int, bmAddress: BlockManagerId): Unit = synchronized {
-    if (mapStatuses(mapId) != null && mapStatuses(mapId).location == bmAddress) {
+  def removeMapOutput(mapIndex: Int, bmAddress: BlockManagerId): Unit = synchronized {


cloud-fan · 2019-09-17T17:07:45Z

@xuanyuanking thanks for the renaming work! After taking a quick look, I think we can go further.

It looks to me that we should only use the name mapIndex and mapTaskId when we really mean it. e.g. ShuffleStatus.addMapOutput, MapStatus.mapTaskId, etc. When we refer to an identifier of the map, then we should use mapId.

There are only a few places where we explicitly mean mapIndex and/or mapTaskId; we can keep the name mapId unchanged in other places to reduce the diff.

What do you think?

xuanyuanking · 2019-09-18T16:02:20Z

Sure, done in c86f6cc. Keeping mapId works in most cases, and we can still get the real meaning from context.

SparkQA · 2019-09-18T16:07:42Z

Test build #110923 has finished for PR 25620 at commit caa949d.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan

LGTM except some code style comments

core/src/main/java/org/apache/spark/shuffle/api/ShuffleExecutorComponents.java

core/src/main/java/org/apache/spark/shuffle/sort/UnsafeShuffleWriter.java

core/src/main/scala/org/apache/spark/shuffle/ShuffleManager.scala

core/src/main/scala/org/apache/spark/shuffle/ShuffleWriteProcessor.scala

cloud-fan · 2019-09-18T16:47:55Z

core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala

   */
-  case class FetchRequest(address: BlockManagerId, blocks: Seq[(BlockId, Long)]) {
+  case class FetchRequest(address: BlockManagerId, blocks: Seq[(BlockId, Long, Int)]) {


Now it's a tuple3 with int and long elements. I think it's better to create a class for it to make the code easier to read.

Thanks, added a FetchBlockInfo class for this in d2215b2.
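
A sketch of what the readability gain looks like. The field names below (`blockId`, `size`, `mapIndex`) are an assumption derived from the `(BlockId, Long, Int)` tuple in the diff above, and `String` stands in for the BlockId and BlockManagerId types so the snippet runs on its own.

```scala
// Illustrative replacement of Seq[(BlockId, Long, Int)] with a named case class.
object FetchBlockInfoSketch {
  // Assumed field names; String is a stand-in for Spark's BlockId.
  final case class FetchBlockInfo(blockId: String, size: Long, mapIndex: Int)

  // String is a stand-in for Spark's BlockManagerId.
  final case class FetchRequest(address: String, blocks: Seq[FetchBlockInfo]) {
    // Named fields read better than tuple accessors such as _._2.
    val size: Long = blocks.map(_.size).sum
  }

  def main(args: Array[String]): Unit = {
    val request = FetchRequest(
      "exec-1",
      Seq(FetchBlockInfo("shuffle_0_5_1", 128L, 2), FetchBlockInfo("shuffle_0_6_1", 64L, 3)))
    println(s"${request.blocks.length} blocks, ${request.size} bytes")
  }
}
```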

core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala

SparkQA · 2019-09-18T19:07:14Z

Test build #110924 has finished for PR 25620 at commit c86f6cc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-18T20:37:48Z

Test build #110930 has finished for PR 25620 at commit 28c9f9c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-18T21:02:15Z

Test build #110928 has finished for PR 25620 at commit d2215b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class FetchRequest(address: BlockManagerId, blocks: Seq[FetchBlockInfo])

cloud-fan · 2019-09-19T01:24:32Z

@squito @vanzin @jiangxb1987 do you have any more comments? This looks good to me now and I'd like to merge it within a few days if none of you object.

cloud-fan · 2019-09-23T08:17:13Z

since there is no objection, I'm merging it, thanks!

xuanyuanking · 2019-09-24T13:03:10Z

Finally! Thank you all for the review.

zsxwing · 2020-02-26T22:30:51Z

...ork-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java

@@ -172,7 +172,7 @@ public ManagedBuffer getBlockData(
      String appId,
      String execId,
      int shuffleId,
-      int mapId,
+      long mapId,


@xuanyuanking why change this from int to long? Is it possible that a mapId can be greater than 2^31?

Previously the map id was the index of the mapper, which could conflict when we re-run the task. Now the map id is the task id, which is unique. The task id needs to be a long.

Yes, after this patch we set mapId to the taskAttemptId of the map task, which is a unique id within the same SparkContext. See the comment #25620 (comment).

xuanyuanking mentioned this pull request Aug 29, 2019

[SPARK-25341][Core] Support rolling back a shuffle map stage and re-generate the shuffle files #24892

Closed

dongjoon-hyun added the SPARK CORE label Aug 29, 2019

xuanyuanking force-pushed the SPARK-25341-8.27 branch from 539d725 to 4bd9e00 Compare August 31, 2019 13:24

xuanyuanking force-pushed the SPARK-25341-8.27 branch from b527fe7 to 3bfb6e6 Compare September 5, 2019 13:22

further rename

c86f6cc

xuanyuanking force-pushed the SPARK-25341-8.27 branch from caa949d to c86f6cc Compare September 18, 2019 16:05

cloud-fan approved these changes Sep 18, 2019

xuanyuanking added 2 commits September 19, 2019 02:05

Keep mapId in shuffle writer and other comments

d2215b2

last comment

28c9f9c

cloud-fan closed this in f725d47 Sep 23, 2019

xuanyuanking mentioned this pull request Sep 24, 2019

[SPARK-25341][Core] Support rolling back a shuffle map stage and re-generate the shuffle files #24110

Closed

xuanyuanking deleted the SPARK-25341-8.27 branch September 24, 2019 13:01

xuanyuanking mentioned this pull request Sep 25, 2019

[SPARK-28625][Core] Indeterminate shuffle support in Shuffle Writer API #25361

Closed

cloud-fan mentioned this pull request Sep 26, 2019

[WIP][SPARK-29257][Core][Shuffle] Use task attempt number as noop reduce id to handle disk failures during shuffle #25941

Closed

abellina mentioned this pull request Jul 6, 2020

[DISCUSS] Shuffle read-side error handling NVIDIA/spark-rapids#326

Closed

xuanyuanking mentioned this pull request Mar 2, 2021

[SPARK-34541][CORE] Fixed an issue where data could not be cleaned up when unregisterShuffle. #31664

Closed
