[SPARK-28699][Core] Fix a corner case for aborting indeterminate stage #25498
Conversation
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala (outdated review thread, resolved)
good catch! LGTM
Test build #109349 has finished for PR 25498 at commit
```
@@ -2741,27 +2741,8 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
    FetchFailed(makeBlockManagerId("hostC"), shuffleId2, 0, 0, "ignored"),
    null))

  val failedStages = scheduler.failedStages.toSeq
  assert(failedStages.length == 2)
```
Not a big deal, but I think this assert still applies?
After this change, `failedStages.length == 0`, because we do the cleanup work in `failJobAndIndependentStages` via `cleanupStateForJobAndIndependentStages`.
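For reference, a minimal sketch of the check that holds after this change (reusing the suite's `scheduler` reference from the removed lines above; not the exact test code):

```
// failJobAndIndependentStages now cleans up via cleanupStateForJobAndIndependentStages,
// so the scheduler no longer keeps these stages registered as failed.
assert(scheduler.failedStages.isEmpty)
```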
Test build #109378 has finished for PR 25498 at commit
Thanks, merging to master! @xuanyuanking can you send PRs for branches 2.3 and 2.4? The code conflicts.
Sure, I'm doing the backport now.
Change the logic of collecting the indeterminate stages: when handling FetchFailed, we should look at stages from mapStage, not failedStage. The original fetch-failure handling collected the indeterminate stage from the stage that hit the fetch failure. When the fetch failure happens in the first task of that stage, this causes the indeterminate stage to be resubmitted only partially, which can eventually lead to a correctness bug. This change makes the corner case abort the indeterminate stage as expected. Tested with a new UT in DAGSchedulerSuite. Running the integration test below with `local-cluster[5, 2, 5120]` and `spark.sql.execution.sortBeforeRepartition=false` aborts the indeterminate stage as expected:

```
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 10000 * 10000, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).map{ x => (x._1 + 1, x._2)}.repartition(239).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 &&
      TaskContext.get.stageAttemptNumber == 0) {
    throw new Exception("pkill -f -n java".!!)
  }
  x
}
val r2 = df.distinct.count()
```

Closes apache#25498 from xuanyuanking/SPARK-28699-followup.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 0d3a783)
Signed-off-by: Yuanjian Li <[email protected]>
Thank you, @xuanyuanking, @cloud-fan and @viirya! Also, cc @kiszk for
What changes were proposed in this pull request?
When collecting the indeterminate stages while handling FetchFailed, we should look at stages from mapStage instead of failedStage.
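Roughly, the idea (a simplified sketch with toy types, not Spark's actual DAGScheduler code): the indeterminacy check and the rollback traversal should be driven by `mapStage`, the shuffle map stage whose output was lost, rather than by `failedStage`, the stage that observed the fetch failure.

```
// Toy model: Stage and its parent links stand in for the scheduler's real classes;
// only the choice of the "seed" stage matters here.
case class Stage(id: Int, parents: Seq[Stage] = Nil, indeterminate: Boolean = false)

// Every stage that (transitively) consumes the seed's output must roll back with it,
// and a rollback is only needed when the seed's output is indeterminate.
def stagesToRollback(activeResultStages: Seq[Stage], seed: Stage): Set[Stage] = {
  if (!seed.indeterminate) Set.empty
  else {
    def mustRollback(s: Stage): Boolean = s == seed || s.parents.exists(mustRollback)
    def allStages(s: Stage): Set[Stage] = s.parents.flatMap(allStages).toSet + s
    activeResultStages.flatMap(allStages).toSet.filter(mustRollback)
  }
}

// The fix corresponds to seeding this with the map stage whose shuffle output was lost
// (mapStage), not with the stage that saw the FetchFailed (failedStage), so no consumer
// of the recomputed map output is left out of the rollback.
```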
Why are the changes needed?
In the fetch-failure handling logic, the original code collected the indeterminate stage from the stage that hit the fetch failure. When the fetch failure happens in the first task of that stage, this causes the indeterminate stage to be resubmitted only partially, which can eventually lead to a correctness bug.
Does this PR introduce any user-facing change?
It makes this corner case abort the indeterminate stage as expected.
How was this patch tested?
New UT in DAGSchedulerSuite.
Run the integration test below with `local-cluster[5, 2, 5120]` and set `spark.sql.execution.sortBeforeRepartition=false`; it will abort the indeterminate stage as expected:
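```
import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 10000 * 10000, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).map{ x => (x._1 + 1, x._2)}.repartition(239).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 &&
      TaskContext.get.stageAttemptNumber == 0) {
    throw new Exception("pkill -f -n java".!!)
  }
  x
}
val r2 = df.distinct.count()
```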