[SPARK-1749] Job cancellation when SchedulerBackend does not implement killTask #686

Closed
wants to merge 4 commits

Conversation

markhamstra
Contributor

It turns out that having the DAGScheduler tell the taskScheduler to cancelTasks when the backend does not implement killTask (e.g. Mesos) is not such a good idea.
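
For context, the failure path is roughly the following: cancelTasks ends up asking the backend to kill the running tasks, and a backend that does not override killTask throws, so the exception previously escaped the DAGScheduler's stage-cancellation path. A minimal sketch of the assumed trait shape (for illustration only, not the exact source of this version):

// Sketch, assuming the trait looked roughly like this at the time:
private[spark] trait SchedulerBackend {
  def start(): Unit
  def stop(): Unit
  def reviveOffers(): Unit
  def defaultParallelism(): Int

  // Backends that cannot kill tasks (e.g. the fine-grained Mesos backend) inherit this default,
  // so cancelTasks throws when the DAGScheduler tries to cancel a stage's running tasks.
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    throw new UnsupportedOperationException
}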

@AmplabJenkins

Merged build triggered.

@markhamstra
Contributor Author

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14790/

try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
  taskScheduler.cancelTasks(stageId, shouldInterruptThread)
} catch {
  case e: UnsupportedOperationException => logInfo(s"Could not cancel tasks for stage $stageId", e)
}
Contributor

Can you change this message to explain why it failed: that the scheduler being used (and print which scheduler that is) doesn't support task cancellation?
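
A sketch of one way that wording could look (whether the scheduler's class name is the right thing to print here is an assumption, not part of the patch):

try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
  taskScheduler.cancelTasks(stageId, shouldInterruptThread)
} catch {
  case e: UnsupportedOperationException =>
    // Hypothetical wording: name the scheduler in use so it is clear why cancellation was skipped.
    logInfo(s"Could not cancel tasks for stage $stageId: scheduler " +
      s"${taskScheduler.getClass.getName} does not support killing tasks", e)
}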

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@kayousterhout
Contributor

So it looks like this commit fixes the crash issue (would be nice to note this in the test -- that the point of the test is to make sure that Spark doesn't crash). But it would also be good to get this message back to the user -- since right now if the user tries to cancel her job, Spark will appear to have successfully cancelled the job when, in fact, it has not been cancelled. How hard is it to do this?

@markhamstra
Contributor Author

The INFO log should include the information that tasks were not cancelled. Where/how else do you want to see notification of those facts? Is adding more Listener events something we want to contemplate still in 1.0.0, or should something like that go into 1.1?

@CodingCat
Contributor

What will happen to the to-be-cancelled tasks in Mesos when the user wants to cancel them? Are they still running there? It seems that #686 (diff) marks the stage as cancelled while the tasks are actually still running.

@markhamstra
Contributor Author

If interruptThread is not true, then we are going to leave tasks running on the cluster after cancellation with other backends as well. This is definitely an issue begging for further work, but I don't think we can go any further than #498 right now.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14791/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14899/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15234/

@markhamstra
Contributor Author

@rxin merge to 1.0.1 and 1.1.0?

@markhamstra
Contributor Author

ping: This should go into 1.0.1
@pwendell

override def stop() = {}
override def submitTasks(taskSet: TaskSet) = {
  // normally done by TaskSetManager
  taskSet.tasks.foreach(_.epoch = mapOutputTracker.getEpoch)
Contributor

are these lines necessary (can you just do nothing here?)

Contributor Author

Sure, doing nothing is easy.

On Fri, Jun 20, 2014 at 10:47 AM, Kay Ousterhout [email protected]
wrote:

In core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:

@@ -313,6 +314,47 @@ class DAGSchedulerSuite extends TestKit(ActorSystem("DAGSchedulerSuite")) with F
   assertDataStructuresEmpty
 }

+  test("job cancellation no-kill backend") {
+    // make sure that the DAGScheduler doesn't crash when the TaskScheduler
+    // doesn't implement killTask()
+    val noKillTaskScheduler = new TaskScheduler() {
+      override def rootPool: Pool = null
+      override def schedulingMode: SchedulingMode = SchedulingMode.NONE
+      override def start() = {}
+      override def stop() = {}
+      override def submitTasks(taskSet: TaskSet) = {
+        // normally done by TaskSetManager
+        taskSet.tasks.foreach(_.epoch = mapOutputTracker.getEpoch)

are these lines necessary (can you just do nothing here?)


Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/686/files#r14031997.
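
For reference, the "do nothing" version asked for here (and reflected in the later commit message "Do nothing in no-kill submitTasks") would presumably reduce to an empty override; the exact form below is a sketch:

// Inside the test's no-kill TaskScheduler stub:
override def submitTasks(taskSet: TaskSet): Unit = {
  // Intentionally a no-op: the test only checks that cancelling the job does not crash
  // the DAGScheduler, so the submitted tasks never need to run.
}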

@kayousterhout
Contributor

Just a small comment on the tests, but other than that this looks good.

cancel a job on a SchedulerBackend that does not implement killTask
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15960/

} catch {
  case e: UnsupportedOperationException =>
    logInfo(s"Could not cancel tasks for stage $stageId", e)
}
Contributor

Sorry Mark one more question here -- can we move the job.listener.jobFailed(error) call from line 1041 to here in the "try" clause? It seems weird to tell the user the job has been cancelled when, in fact, it hasn't.

Contributor Author

Hmmm... not sure that I agree. A job being cancelled, stages being cancelled, and tasks being cancelled are all different things. The expectation is that job cancellation will lead to cancellation of independent stages and their associated tasks; but if no stages and tasks get cancelled, it's probably still worthwhile for the information to be sent that the job itself was cancelled. I expect that eventually all of the backends will support task killing, so this whole no-kill path should never be hit. But moving the job cancellation notification within the try-to-cancelTasks block will result in multiple notifications that the parent job was cancelled -- one for each independent stage cancellation. Or am I misunderstanding what you are suggesting?

Contributor

Ah you're right that it doesn't make sense to add that here because it will be called for each stage. My intention was that if the job has running stages that don't get cancelled (because the task scheduler doesn't implement cancelTasks()), then we should not call job.listener.jobFailed() -- do you think that makes sense? Seems like the way to implement that would be to set a boolean flag here if the job can't be successfully cancelled, and then call jobFailed() 0 or 1 times at the end of this function depending on that flag.
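
A minimal sketch of that flag-based approach (the variable names and the surrounding loop are assumptions, not the final patch):

// Hypothetical shape of the stage-cancellation loop with the suggested flag:
var ableToCancelStages = true
for (stageId <- independentStageIds) {   // stages belonging only to this job (assumed name)
  try {
    taskScheduler.cancelTasks(stageId, shouldInterruptThread)
  } catch {
    case e: UnsupportedOperationException =>
      logInfo(s"Could not cancel tasks for stage $stageId", e)
      ableToCancelStages = false         // remember that at least one stage kept running
  }
}
if (ableToCancelStages) {
  // Notify listeners at most once, and only if cancellation actually took effect.
  job.listener.jobFailed(error)
}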

Contributor Author

What you suggest could be done, but there's a question of whether notification of the job's cancellation should be sent regardless of whether any stages and tasks are successfully cancelled as a consequence. I don't really know how to answer that because I don't know how all of the listeners are using the notification or whether they are all expecting the same semantics.

Contributor

@pwendell what do you think here?

Contributor Author

Do we have the meaning of all the listener events fully documented someplace? Or perhaps that needs to be done in a separate PR and then DAGScheduler updated to match the documented expectation?

Contributor

I think the correct semantics here are that we only trigger jobFailed if the job was successfully halted. This is mostly consumed downstream by tests, but it's also used by code supporting asynchronous actions and approximate results. In those cases, it would be better not to notify jobFailed if the cancellation doesn't succeed, because they both assume that the job has finished executing if that message is received.

Separately, we should probably update the documentation for cancel to explain that it is a "best effort" method that only takes effect if the underlying scheduler supports it. Otherwise, we should say it acts as a no-op, i.e. as if cancel was never called. With this approach downstream consumers will only have two cases to worry about (a job was cancelled or it wasn't) rather than a third case where we say it was cancelled but it secretly wasn't.
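
A sketch of the kind of doc update being suggested (the wording, and which method it would attach to, are assumptions):

/**
 * Cancel a given job if it's scheduled or running.
 *
 * Note: cancellation is best-effort. If the underlying scheduler backend does not
 * support killing tasks, this call acts as a no-op and the job keeps running as if
 * cancel had never been called, and listeners are not told the job failed.
 */
def cancelJob(jobId: Int): Unit = {
  dagScheduler.cancelJob(jobId)  // assumed to delegate to the DAGScheduler, as in SparkContext.cancelJob
}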

Contributor Author

Ok, either tonight or tomorrow I can update this PR to reflect that strategy, or you can go ahead and make the change @pwendell. Outside the immediate scope of this PR, what prevents Mesos from being able to kill tasks, and when do we expect that to change?

Contributor

I'll just take care of it... thanks Mark!

Task killing is not supported in fine-grained mode on Mesos because, in that mode, we use Mesos's built-in support for all of the control-plane messages relating to tasks, so we'll have to figure out how to support killing tasks in that model. There are two questions: one is who actually sends the "kill" message to the executor, and the other is how we tell Mesos that the cores which were in use by the task have been freed. In the course of normal operation that's handled by the Mesos launchTask and sendStatusUpdate interfaces.
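
For what it's worth, a very rough sketch of what forwarding the kill might look like if the fine-grained backend used the Mesos SchedulerDriver.killTask API; returning the freed cores to the offer accounting (normally done via status updates) is exactly the open design question above:

import org.apache.mesos.Protos.TaskID
import org.apache.mesos.SchedulerDriver

// Hypothetical only: the backend would need something like this, plus a way to report
// the task's cores as freed once Mesos acknowledges the kill.
def killSparkTask(driver: SchedulerDriver, sparkTaskId: Long): Unit = {
  driver.killTask(TaskID.newBuilder().setValue(sparkTaskId.toString).build())
}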

@pwendell
Contributor

@markhamstra I'd like to pull this into a 1.0.1 RC that's going out today. This looks great, I just had a small comment. Are you around to address it? If not I can just make the change on merge.

asfgit pushed a commit that referenced this pull request Jun 26, 2014
…t killTask

This is a fixed up version of #686 (cc @markhamstra @pwendell).  The last commit (the only one I authored) reflects the changes I made from Mark's original patch.

Author: Mark Hamstra <[email protected]>
Author: Kay Ousterhout <[email protected]>

Closes #1219 from kayousterhout/mark-SPARK-1749 and squashes the following commits:

42dfa7e [Kay Ousterhout] Got rid of terrible double-negative name
80b3205 [Kay Ousterhout] Don't notify listeners of job failure if it wasn't successfully cancelled.
d156d33 [Mark Hamstra] Do nothing in no-kill submitTasks
9312baa [Mark Hamstra] code review update
cc353c8 [Mark Hamstra] scalastyle
e61f7f8 [Mark Hamstra] Catch UnsupportedOperationException when DAGScheduler tries to cancel a job on a SchedulerBackend that does not implement killTask
asfgit pushed a commit that referenced this pull request Jun 26, 2014
…t killTask

(cherry picked from commit b88a59a)

Signed-off-by: Patrick Wendell <[email protected]>
@pwendell
Contributor

pwendell commented Jul 4, 2014

@markhamstra mind closing this? It got merged through Kay's PR.

@markhamstra closed this Jul 4, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…t killTask
