[SPARK-13343] speculative tasks that didn't commit shouldn't be marked as success #21653
Conversation
cc @tgravescs
ok to test
Test build #92425 has finished for PR 21653 at commit
Test build #92426 has finished for PR 21653 at commit
+1, changes look good to me. @squito see any problems with this approach?
@hthuynh2 can you fix the description "as success of failures", this is just a copy of my typo in the jira. Can you just change it to be "as success"?
I updated it. Thanks.
approach makes sense to me; I have some suggestions for making the test a bit better.
"exec1" -> "host1", | ||
"exec1" -> "host1", | ||
"exec2" -> "host2", | ||
"exec2" -> "host2")) { |
nit: double indent the contents of the List
manager.handleSuccessfulTask(3, createTaskResult(3, accumUpdatesByTask(3)))
// Verify that it kills other running attempt
verify(sched.backend).killTask(4, "exec1", true, "another attempt succeeded")
// Complete another attempt for the running task
can you expand this comment to explain why you're doing this? Without looking at the bug, it's easy to think this part is wrong, but in fact it's the most important part of your test, e.g.:
There is a race between the scheduler asking to kill the other task, and that task actually finishing. We simulate what happens if the other task finishes before we kill it.
manager.handleSuccessfulTask(4, createTaskResult(3, accumUpdatesByTask(3)))

assert(manager.taskInfos(3).successful == true)
assert(manager.taskInfos(4).killed == true)
it seems the main thing you're trying to change here is what gets passed to DAGScheduler.taskEnded, so shouldn't you be verifying that here?
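For illustration, a rough sketch of such a verification (not the suite's final code): if the test swaps in a mocked DAGScheduler on the fake scheduler, the TaskEndReason reported for the second attempt can be captured and checked. The mockDAGScheduler fixture, the Mockito imports, the settable dagScheduler reference, and the five-argument taskEnded signature below are assumptions about the Spark 2.x test setup, not something taken from this PR.

import org.mockito.ArgumentCaptor
import org.mockito.ArgumentMatchers.any
import org.mockito.Mockito.{mock, times, verify}
import org.apache.spark.{TaskEndReason, TaskKilled}

// Hypothetical: install a mocked DAGScheduler before driving the two status updates.
val mockDAGScheduler = mock(classOf[DAGScheduler])
sched.dagScheduler = mockDAGScheduler  // assumes the fake scheduler exposes a settable reference

// ... handleSuccessfulTask(3, ...) and handleSuccessfulTask(4, ...) as in the fragments above ...

// taskEnded is reported twice (once per attempt); the second attempt should end as killed.
val reasonCaptor = ArgumentCaptor.forClass(classOf[TaskEndReason])
verify(mockDAGScheduler, times(2)).taskEnded(any(), reasonCaptor.capture(), any(), any(), any())
assert(reasonCaptor.getAllValues.get(1).isInstanceOf[TaskKilled])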
IIUC this speculative task is not really killed, right? It is actually ignored. Is it worth adding a new TaskState for this case?
@jiangxb1987 Yes, you are correct that it is actually ignored. I don't think it is worth adding a new TaskState, because we would need to make changes in many places without getting much benefit from it. Instead, I think we can add a message to the kill reason to differentiate it from a task that was actually killed and to inform the user.
Test build #92688 has finished for PR 21653 at commit
@squito Thanks for the suggestions. I updated it. Could you please have a look at it to see if there is anything else I need to change? Thanks.
Test build #92690 has finished for PR 21653 at commit
@@ -723,6 +723,13 @@ private[spark] class TaskSetManager(
def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
  val info = taskInfos(tid)
  val index = info.index
  // Check if any other attempt succeeded before this and this attempt has not been handled
  if (successful(index) && killedByOtherAttempt(index)) {
For completeness, we will also need to 'undo' the changes in enqueueSuccessfulTask: to reverse the counters in canFetchMoreResults.
(Orthogonal to this PR): Looking at the use of killedByOtherAttempt, I see that there is a bug in executorLost w.r.t. how it is updated - hopefully a fix for SPARK-24755 won't cause issues here.
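For context, this is roughly what canFetchMoreResults bumps when a successful result is enqueued; it is a paraphrased sketch of TaskSetManager rather than the verbatim Spark source, so treat the exact body and abort message as assumptions. The decrements added later in this PR undo exactly these two updates for the already-handled partition.

// Paraphrased sketch, not the verbatim Spark source.
def canFetchMoreResults(size: Long): Boolean = sched.synchronized {
  totalResultSize += size   // counted against spark.driver.maxResultSize
  calculatedTasks += 1      // counted toward the number of results handled so far
  if (maxResultSize > 0 && totalResultSize > maxResultSize) {
    // Too much serialized result data: abort the task set instead of fetching more.
    abort("Total size of serialized results is bigger than spark.driver.maxResultSize")
    false
  } else {
    true
  }
}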
Let's review #21729 before this since it's changing the type on killedByOtherAttempt.
…3343 resolve conflict with SPARK-24755
@tgravescs I updated it. Can you please have a look at it when you have time? Thank you.
Test build #93302 has finished for PR 21653 at commit
@@ -723,6 +723,21 @@ private[spark] class TaskSetManager(
def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
  val info = taskInfos(tid)
  val index = info.index
  // Check if any other attempt succeeded before this and this attempt has not been handled
  if (successful(index) && killedByOtherAttempt.contains(tid)) {
    calculatedTasks -= 1
please add a comment here about cleaning up the things that were incremented earlier while handling it as successful
val resultSizeAcc = result.accumUpdates.find(a =>
  a.name == Some(InternalAccumulator.RESULT_SIZE))
if (resultSizeAcc.isDefined) {
  totalResultSize -= resultSizeAcc.get.asInstanceOf[LongAccumulator].value
the downside here is we already incremented and other tasks could have checked and failed before we decrement, but unless someone else has a better idea this is better than it is now.
I agree, I don't see a better option.
LGTM, pending @tgravescs's suggestions.
LGTM
Test build #93418 has finished for PR 21653 at commit
@tgravescs Can you please run the test again? Thank you.
test this please
Test build #93447 has finished for PR 21653 at commit
test this please
Test build #93474 has finished for PR 21653 at commit
test this please
3 similar comments
test this please
test this please
test this please
I kicked off the test manually at https://spark-prs.appspot.com/users/hthuynh2. I dunno why the test triggering via comments stops working on some PRs.
Test build #4222 has finished for PR 21653 at commit
+1
merged to master, thanks @hthuynh2
Description
Currently, speculative tasks that didn't commit can show up as success (depending on the timing of the commit). This is a bit confusing because the task didn't really succeed in the sense that it didn't write anything.
I think these tasks should be marked as KILLED, or something that makes it more obvious to the user exactly what happened. If the task happened to hit the timing where it got a commit-denied exception, then it shows up as failed and counts against your task failures. It shouldn't count against task failures, since that failure really doesn't matter.
MapReduce handles this situation, so perhaps we can look there for a model.
How can this issue happen?
When both attempts of a task finish before the driver sends the command to kill one of them, both of them send the status update FINISHED to the driver. The driver calls TaskSchedulerImpl to handle one successful task at a time. When it handles the first successful task, it sends the command to kill the other copy of the task; however, because that task has already finished, the executor ignores the command. After finishing handling the first attempt, the driver processes the second one. Although all actions on the result of this task are skipped, this copy of the task is still marked as SUCCESS. As a result, even though this issue does not affect the result of the job, it can confuse users because both attempts appear to be successful.
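To make the timing concrete, here is a minimal driver-side sketch mirroring the unit-test fragments quoted in the review above (attempt ids 3 and 4 are the two copies of the same partition, and the helper names come from those fragments):

// Both copies have already reported FINISHED before the driver reacts to either update.
// Handling the first result marks the partition as successful and asks the backend to
// kill the other copy - a request the executor ignores, because that task is already done.
manager.handleSuccessfulTask(3, createTaskResult(3, accumUpdatesByTask(3)))
// Handling the second result: its output was never committed, yet before this PR the
// attempt was still recorded as SUCCESS, which is the confusing state described above.
manager.handleSuccessfulTask(4, createTaskResult(3, accumUpdatesByTask(3)))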
How does this PR fix the issue?
The simple way to fix this issue is that when the TaskSetManager handles a successful task, it checks whether any other attempt has already succeeded. If so, it calls handleFailedTask with state == KILLED and reason == TaskKilled("another attempt succeeded") to handle this task as being killed.
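Assembled from the diff fragments quoted in the review above, the guard looks roughly like this (a sketch; the exact comment wording and kill-reason string in the merged code may differ):

def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
  val info = taskInfos(tid)
  val index = info.index
  // Another attempt for this partition already succeeded and marked this one as killed,
  // so treat this late result as a kill instead of a second success.
  if (successful(index) && killedByOtherAttempt.contains(tid)) {
    // Undo the counters bumped in canFetchMoreResults when this result was enqueued.
    calculatedTasks -= 1
    val resultSizeAcc = result.accumUpdates.find(a =>
      a.name == Some(InternalAccumulator.RESULT_SIZE))
    if (resultSizeAcc.isDefined) {
      totalResultSize -= resultSizeAcc.get.asInstanceOf[LongAccumulator].value
    }
    // Report this attempt as killed rather than successful.
    handleFailedTask(tid, TaskState.KILLED, TaskKilled("another attempt succeeded"))
    return
  }
  // ... existing success handling continues below ...
}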
How was this patch tested?
I tested this manually by running applications that caused the issue before a few times, and observed that the issue does not happen again. I also added a unit test in TaskSetManagerSuite to check that if we call handleSuccessfulTask to handle status updates for two copies of a task, only the one that is handled first is marked as SUCCESS.