[SPARK-1749] Job cancellation when SchedulerBackend does not implement killTask #686

Closed
wants to merge 4 commits

Conversation

markhamstra
Contributor

It turns out that having the DAGScheduler tell the taskScheduler to cancelTasks when the backend does not implement killTask (e.g. Mesos) is not such a good idea.
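
For context, the failure path is roughly the following: cancelTasks ends up asking the backend to kill the running tasks, and a backend that does not override killTask throws, so the exception previously escaped the DAGScheduler's stage-cancellation path. A minimal sketch of the assumed trait shape (for illustration only, not the exact source of this version):

// Sketch, assuming the trait looked roughly like this at the time:
private[spark] trait SchedulerBackend {
  def start(): Unit
  def stop(): Unit
  def reviveOffers(): Unit
  def defaultParallelism(): Int

  // Backends that cannot kill tasks (e.g. the fine-grained Mesos backend) inherit this default,
  // so cancelTasks throws when the DAGScheduler tries to cancel a stage's running tasks.
  def killTask(taskId: Long, executorId: String, interruptThread: Boolean): Unit =
    throw new UnsupportedOperationException
}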

@AmplabJenkins

Merged build triggered.

@markhamstra
Contributor Author

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14790/

try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
  taskScheduler.cancelTasks(stageId, shouldInterruptThread)
} catch {
  case e: UnsupportedOperationException => logInfo(s"Could not cancel tasks for stage $stageId", e)
}
Contributor

Can you change this message to explain why it failed: that the scheduler being used (and print which scheduler that is) doesn't support task cancellation?
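
A sketch of one way that wording could look (whether the scheduler's class name is the right thing to print here is an assumption, not part of the patch):

try { // cancelTasks will fail if a SchedulerBackend does not implement killTask
  taskScheduler.cancelTasks(stageId, shouldInterruptThread)
} catch {
  case e: UnsupportedOperationException =>
    // Hypothetical wording: name the scheduler in use so it is clear why cancellation was skipped.
    logInfo(s"Could not cancel tasks for stage $stageId: scheduler " +
      s"${taskScheduler.getClass.getName} does not support killing tasks", e)
}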

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@kayousterhout
Contributor

So it looks like this commit fixes the crash issue (would be nice to note this in the test -- that the point of the test is to make sure that Spark doesn't crash). But it would also be good to get this message back to the user -- since right now if the user tries to cancel her job, Spark will appear to have successfully cancelled the job when, in fact, it has not been cancelled. How hard is it to do this?

@markhamstra
Contributor Author

The INFO log should include the information that tasks were not cancelled. Where/how else do you want to see notification of those facts? Is adding more Listener events something we want to contemplate still in 1.0.0, or should something like that go into 1.1?

@CodingCat
Contributor

What will happen to the to-be-cancelled tasks in Mesos when the user wants to cancel them? Are they still running there? It seems that #686 (diff) marks the stage as cancelled while the tasks are actually still running.

@markhamstra
Contributor Author

If interruptThread is not true, then we are going to leave tasks running on the cluster after cancellation with other backends as well. This is definitely an issue begging for further work, but I don't think we can go any further than #498 right now.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14791/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14899/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15234/

@markhamstra
Contributor Author

@rxin merge to 1.0.1 and 1.1.0?

@markhamstra
Contributor Author

ping: This should go into 1.0.1
@pwendell

override def stop() = {}
override def submitTasks(taskSet: TaskSet) = {
  // normally done by TaskSetManager
  taskSet.tasks.foreach(_.epoch = mapOutputTracker.getEpoch)
Contributor

are these lines necessary (can you just do nothing here?)

Contributor Author

Sure, doing nothing is easy.

On Fri, Jun 20, 2014 at 10:47 AM, Kay Ousterhout [email protected]
wrote:

In core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:

@@ -313,6 +314,47 @@ class DAGSchedulerSuite extends TestKit(ActorSystem("DAGSchedulerSuite")) with F
   assertDataStructuresEmpty
 }

+  test("job cancellation no-kill backend") {
+    // make sure that the DAGScheduler doesn't crash when the TaskScheduler
+    // doesn't implement killTask()
+    val noKillTaskScheduler = new TaskScheduler() {
+      override def rootPool: Pool = null
+      override def schedulingMode: SchedulingMode = SchedulingMode.NONE
+      override def start() = {}
+      override def stop() = {}
+      override def submitTasks(taskSet: TaskSet) = {
+        // normally done by TaskSetManager
+        taskSet.tasks.foreach(_.epoch = mapOutputTracker.getEpoch)

are these lines necessary (can you just do nothing here?)


Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/686/files#r14031997.
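
For reference, the "do nothing" version asked for here (and reflected in the later commit message "Do nothing in no-kill submitTasks") would presumably reduce to an empty override; the exact form below is a sketch:

// Inside the test's no-kill TaskScheduler stub:
override def submitTasks(taskSet: TaskSet): Unit = {
  // Intentionally a no-op: the test only checks that cancelling the job does not crash
  // the DAGScheduler, so the submitted tasks never need to run.
}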

@kayousterhout
Contributor

Just a small comment on the tests, but other than that this looks good.

cancel a job on a SchedulerBackend that does not implement killTask
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15960/

} catch {
  case e: UnsupportedOperationException =>
    logInfo(s"Could not cancel tasks for stage $stageId", e)
}
Contributor

Sorry Mark one more question here -- can we move the job.listener.jobFailed(error) call from line 1041 to here in the "try" clause? It seems weird to tell the user the job has been cancelled when, in fact, it hasn't.

Contributor Author

Hmmm... not sure that I agree. A job being cancelled, stages being cancelled, and tasks being cancelled are all different things. The expectation is that job cancellation will lead to cancellation of independent stages and their associated tasks; but if no stages and tasks get cancelled, it's probably still worthwhile for the information to be sent that the job itself was cancelled. I expect that eventually all of the backends will support task killing, so this whole no-kill path should never be hit. But moving the job cancellation notification within the try-to-cancelTasks block will result in multiple notifications that the parent job was cancelled -- one for each independent stage cancellation. Or am I misunderstanding what you are suggesting?

Contributor

Ah you're right that it doesn't make sense to add that here because it will be called for each stage. My intention was that if the job has running stages that don't get cancelled (because the task scheduler doesn't implement cancelTasks()), then we should not call job.listener.jobFailed() -- do you think that makes sense? Seems like the way to implement that would be to set a boolean flag here if the job can't be successfully cancelled, and then call jobFailed() 0 or 1 times at the end of this function depending on that flag.
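
A minimal sketch of that flag-based approach (the variable names and the surrounding loop are assumptions, not the final patch):

// Hypothetical shape of the stage-cancellation loop with the suggested flag:
var ableToCancelStages = true
for (stageId <- independentStageIds) {   // stages belonging only to this job (assumed name)
  try {
    taskScheduler.cancelTasks(stageId, shouldInterruptThread)
  } catch {
    case e: UnsupportedOperationException =>
      logInfo(s"Could not cancel tasks for stage $stageId", e)
      ableToCancelStages = false         // remember that at least one stage kept running
  }
}
if (ableToCancelStages) {
  // Notify listeners at most once, and only if cancellation actually took effect.
  job.listener.jobFailed(error)
}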

Contributor Author

What you suggest could be done, but there's a question of whether notification of the job's cancellation should be sent regardless of whether any stages and tasks are successfully cancelled as a consequence. I don't really know how to answer that because I don't know how all of the listeners are using the notification or whether they are all expecting the same semantics.

Contributor

@pwendell what do you think here?

Contributor Author

Do we have the meaning of all the listener events fully documented someplace? Or perhaps that needs to be done in a separate PR and then DAGScheduler updated to match the documented expectation?

Contributor

I think the correct semantics here are that we only trigger jobFailed if the job was successfully halted. This is mostly consumed downstream by tests, but it's also used by code supporting asynchronous actions and approximate results. In those cases, it would be better not to notify jobFailed if the cancellation doesn't succeed, because they both assume that the job has finished executing if that message is received.

Separately, we should probably update the documentation for cancel to explain that it is a "best effort" method that only takes effect if the underlying scheduler supports it. Otherwise, we should say it acts as a no-op, i.e. as if cancel was never called. With this approach downstream consumers will only have two cases to worry about (a job was cancelled or it wasn't) rather than a third case where we say it was cancelled but it secretly wasn't.
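
A sketch of the kind of doc update being suggested (the wording, and which method it would attach to, are assumptions):

/**
 * Cancel a given job if it's scheduled or running.
 *
 * Note: cancellation is best-effort. If the underlying scheduler backend does not
 * support killing tasks, this call acts as a no-op and the job keeps running as if
 * cancel had never been called, and listeners are not told the job failed.
 */
def cancelJob(jobId: Int): Unit = {
  dagScheduler.cancelJob(jobId)  // assumed to delegate to the DAGScheduler, as in SparkContext.cancelJob
}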

Contributor Author

Ok, either tonight or tomorrow I can update this PR to reflect that strategy, or you can go ahead and make the change @pwendell. Outside the immediate scope of this PR, what prevents Mesos from being able to kill tasks, and when do we expect that to change?

Contributor

I'll just take care of it... thanks Mark!

Task killing is not supported in fine-grained mode on Mesos because, in that mode, we use Mesos's built-in support for all of the control-plane messages relating to tasks, so we'll have to figure out how to support killing tasks in that model. There are two questions: one is who actually sends the "kill" message to the executor, and the other is how we tell Mesos that the cores which were in use by the task have been freed. In the course of normal operation that's handled by the Mesos launchTask and sendStatusUpdate interfaces.
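
For what it's worth, a very rough sketch of what forwarding the kill might look like if the fine-grained backend used the Mesos SchedulerDriver.killTask API; returning the freed cores to the offer accounting (normally done via status updates) is exactly the open design question above:

import org.apache.mesos.Protos.TaskID
import org.apache.mesos.SchedulerDriver

// Hypothetical only: the backend would need something like this, plus a way to report
// the task's cores as freed once Mesos acknowledges the kill.
def killSparkTask(driver: SchedulerDriver, sparkTaskId: Long): Unit = {
  driver.killTask(TaskID.newBuilder().setValue(sparkTaskId.toString).build())
}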

@pwendell
Contributor

@markhamstra I'd like to pull this into a 1.0.1 RC that's going out today. This looks great, I just had a small comment. Are you around to address it? If not I can just make the change on merge.

asfgit pushed a commit that referenced this pull request Jun 26, 2014
…t killTask

This is a fixed up version of #686 (cc @markhamstra @pwendell).  The last commit (the only one I authored) reflects the changes I made from Mark's original patch.

Author: Mark Hamstra <[email protected]>
Author: Kay Ousterhout <[email protected]>

Closes #1219 from kayousterhout/mark-SPARK-1749 and squashes the following commits:

42dfa7e [Kay Ousterhout] Got rid of terrible double-negative name
80b3205 [Kay Ousterhout] Don't notify listeners of job failure if it wasn't successfully cancelled.
d156d33 [Mark Hamstra] Do nothing in no-kill submitTasks
9312baa [Mark Hamstra] code review update
cc353c8 [Mark Hamstra] scalastyle
e61f7f8 [Mark Hamstra] Catch UnsupportedOperationException when DAGScheduler tries to cancel a job on a SchedulerBackend that does not implement killTask
asfgit pushed a commit that referenced this pull request Jun 26, 2014
…t killTask

(cherry picked from commit b88a59a)

Signed-off-by: Patrick Wendell <[email protected]>
@pwendell
Contributor

pwendell commented Jul 4, 2014

@markhamstra mind closing this? It got merged through Kay's PR.

@markhamstra closed this Jul 4, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…t killTask
