
[SPARK-24755][Core] Executor loss can cause task to not be resubmitted #21729

Closed
wants to merge 7 commits

Conversation

hthuynh2

@hthuynh2 hthuynh2 commented Jul 8, 2018

Description
As described in SPARK-24755, when speculation is enabled, there is a scenario in which an executor loss can cause a task to not be resubmitted.
This patch changes the variable killedByOtherAttempt to keep track of the taskIds of tasks that are killed by another attempt. This way, we still avoid resubmitting tasks killed by another attempt, while resubmitting the successful attempt when an executor is lost.
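As a rough illustration of the fix, here is a simplified standalone sketch (not the actual TaskSetManager code; the class and method names below are hypothetical, and only the killedByOtherAttempt name comes from the PR):

```scala
import scala.collection.mutable.HashSet

// Simplified model: before the patch, killedByOtherAttempt was an
// Array[Boolean] indexed by partition, so a partition whose speculative
// attempt was killed would be skipped on resubmission even if the attempt
// that actually succeeded ran on the lost executor. Tracking taskIds
// (unique per attempt) distinguishes the killed attempt from the
// successful one.
class ResubmitTracker {
  private val killedByOtherAttempt = new HashSet[Long]

  def recordKilledByOtherAttempt(taskId: Long): Unit = {
    killedByOtherAttempt += taskId
  }

  // Called when an executor is lost: a finished attempt that ran on it
  // should be resubmitted unless that exact attempt was killed by another
  // attempt of the same task.
  def shouldResubmit(taskId: Long): Boolean = {
    !killedByOtherAttempt.contains(taskId)
  }
}
```

The successful attempt's taskId is never added to the set, so it is resubmitted on executor loss, while the killed speculative attempt (a different taskId for the same partition) is not.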

How was this patch tested?
A UT is added, based on the UT written by @xuanyuanking, with modifications to simulate the scenario described in SPARK-24755.

@hthuynh2
Author

hthuynh2 commented Jul 8, 2018

cc @mridulm @xuanyuanking

@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
// Set the coresponding index of Boolean var when the task killed by other attempt tasks,
Member

typo I made before, coresponding -> corresponding.

("exec2", "host2"), ("exec3", "host3"))
sched.initialize(new FakeSchedulerBackend() {
override def killTask(taskId: Long,
executorId: String,
Member

nit: indent

var resubmittedTasks = new mutable.HashSet[Int]
val dagScheduler = new FakeDAGScheduler(sc, sched) {
override def taskEnded(task: Task[_],
reason: TaskEndReason,
Member

ditto

@xuanyuanking xuanyuanking left a comment

Just some nits in the code changes. The added UT copies a lot of code from SPARK-22074; is there a better way to reuse it, or to combine the two tests?

@xuanyuanking
Member

Please change the title to '[SPARK-24755][Core] Executor loss can cause task to not be resubmitted'

@hthuynh2 hthuynh2 changed the title SPARK-24755 Executor loss can cause task to not be resubmitted [SPARK-24755][Core] Executor loss can cause task to not be resubmitted Jul 9, 2018
@hthuynh2
Author

hthuynh2 commented Jul 9, 2018

@xuanyuanking Thanks for the comments. I also thought about modifying the UT of SPARK-22074 instead of adding a new UT, but I was afraid it might cause confusion since they are two different issues, although they are very close. If you feel it is better to combine them, I can change it. Thanks.

@tgravescs
Contributor

ok to test

@SparkQA

SparkQA commented Jul 10, 2018

Test build #92775 has finished for PR 21729 at commit 093e39c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
// Set the coresponding index of Boolean var when the task killed by other attempt tasks,
Contributor

The comment needs to be changed, since this is no longer an array of Booleans.

Author

I'll update it. Thanks.

Contributor

Please update the comment, given that the HashSet is okay now.

@tgravescs
Contributor

cc @mridulm

@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
// Set the coresponding index of Boolean var when the task killed by other attempt tasks,
// this happened while we set the `spark.speculation` to true. The task killed by others
// should not resubmit while executor lost.
private val killedByOtherAttempt: Array[Boolean] = new Array[Boolean](numTasks)
private val killedByOtherAttempt = new HashSet[Long]
Contributor

@jiangxb1987 jiangxb1987 Jul 10, 2018

super nit: I prefer an Array[Long], so you know the index corresponding to the taskId; that can provide more information while debugging.

Author

@hthuynh2 hthuynh2 Jul 10, 2018

Hi @jiangxb1987, thanks for the comment, but I'm not sure if I understand your suggestion correctly. Do you mean: private val killedByOtherAttempt = new Array[Long] ?

Author

Also, the comment "Set the corresponding index of Boolean var when the task killed ..." is not correct anymore. I'm sorry I forgot to update it.

Contributor

Yea, please also update the comment.

Author

@hthuynh2 hthuynh2 Jul 10, 2018

I think we should use ArrayBuffer[Long] instead of Array[Long], because the number of elements can grow as more tasks are killed.
Also, I think there is a downside to using an array-like data structure for this variable: lookup takes linear time, and that operation is used many times when we check whether a task needs to be resubmitted (inside the executorLost method of TSM). This will not matter much if the array is small, but I still think it is something we might want to consider.
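The trade-off being discussed can be sketched with a small hypothetical snippet (not PR code): a membership test on a HashSet is expected constant time, while contains on an array-like collection scans linearly.

```scala
import scala.collection.mutable

// Both collections can answer "was this taskId killed by another attempt?",
// but with different lookup costs.
val killedSet = mutable.HashSet(10L, 20L, 30L)      // contains: O(1) expected
val killedBuffer = mutable.ArrayBuffer(10L, 20L, 30L) // contains: O(n) scan

assert(killedSet.contains(20L))
assert(killedBuffer.contains(20L))
assert(!killedSet.contains(99L))
```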

Contributor

@mridulm 's approach also sounds good to me.

Contributor

@jiangxb1987 please clarify: is it fine as is, or do you want to use a HashMap and track the index? Can you give an example of when this is used for debugging? For instance, if you are taking a heap dump and looking at the data structures, that might make sense; otherwise it's not accessible without adding further log statements anyway, and it's just extra memory usage.

Contributor

For instance, when you have corrupted shuffle data, you may want to ensure it's not caused by killing tasks, and that requires tracking all killed taskIds corresponding to a partition. With a HashMap as @mridulm proposed, it would be easy to add extra logging to debug. But I just looked at the code again and found that expanding the logInfo in L735 can also resolve my case. So it seems fine to use a HashSet and save some memory.

Contributor

I'm not proposing to expand the logInfo in L735 in this PR; I'm just concerned about whether it's convenient enough to add extra logs to debug a potential issue. Since there is another way to achieve the same effect, I'm okay with using a HashSet here.

Contributor

Great, thanks. Seems like we can go with the code as is, then.

@tgravescs
Contributor

@hthuynh2 please update based on the comments above. You can leave the type as HashSet and fix the other typos, indentations, and comments.

@SparkQA

SparkQA commented Jul 16, 2018

Test build #93122 has finished for PR 21729 at commit b2affd2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2018

Test build #93121 has finished for PR 21729 at commit 9f0e0ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 16, 2018

Test build #93127 has finished for PR 21729 at commit a67bebc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

test this please

@SparkQA

SparkQA commented Jul 17, 2018

Test build #93176 has finished for PR 21729 at commit a67bebc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Contributor

mridulm commented Jul 18, 2018

Looks good to me, thanks for fixing this @hthuynh2 !

@squito
Contributor

squito commented Jul 18, 2018

lgtm

@SparkQA

SparkQA commented Jul 19, 2018

Test build #93248 has finished for PR 21729 at commit f9ed226.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 19, 2018

Test build #93249 has finished for PR 21729 at commit 6316e5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs
Contributor

+1 I'm going to merge, thanks @hthuynh2

asfgit pushed a commit that referenced this pull request Jul 19, 2018
**Description**
As described in [SPARK-24755](https://issues.apache.org/jira/browse/SPARK-24755), when speculation is enabled, there is a scenario in which an executor loss can cause a task to not be resubmitted.
This patch changes the variable killedByOtherAttempt to keep track of the taskIds of tasks that are killed by another attempt. This way, we still avoid resubmitting tasks killed by another attempt, while resubmitting the successful attempt when an executor is lost.

**How was this patch tested?**
A UT is added, based on the UT written by xuanyuanking, with modifications to simulate the scenario described in SPARK-24755.

Author: Hieu Huynh <“[email protected]”>

Closes #21729 from hthuynh2/SPARK_24755.

(cherry picked from commit 8d707b0)
Signed-off-by: Thomas Graves <[email protected]>
@asfgit asfgit closed this in 8d707b0 Jul 19, 2018
@gatorsmile
Member

For other reviewers, this is merged to master/2.3

otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
**Description**
As described in [SPARK-24755](https://issues.apache.org/jira/browse/SPARK-24755), when speculation is enabled, there is a scenario in which an executor loss can cause a task to not be resubmitted.
This patch changes the variable killedByOtherAttempt to keep track of the taskIds of tasks that are killed by another attempt. This way, we still avoid resubmitting tasks killed by another attempt, while resubmitting the successful attempt when an executor is lost.

**How was this patch tested?**
A UT is added, based on the UT written by xuanyuanking, with modifications to simulate the scenario described in SPARK-24755.

Author: Hieu Huynh <“[email protected]”>

Closes apache#21729 from hthuynh2/SPARK_24755.

(cherry-picked from commit 8d707b0)

Ref: LIHADOOP-40171

RB=1414249
BUG=LIHADOOP-40171
R=fli,mshen,yezhou
A=yezhou