[SPARK-19755][Mesos] Blacklist is always active for MesosCoarseGrainedSchedulerBackend #20640
Conversation
Jenkins, ok to test
@IgorBerman I actually think that #17619 is the right approach. As @timout pointed out on that one, this functionality doesn't need to be covered in Mesos-specific code at all, as it's covered by the BlacklistTracker. I don't like introducing new configs when we don't really need them. Other than this being less invasive, is there another advantage here? The modifications I suggested to that PR are relatively small -- I think it's fine if you want to open a PR that is the original updated with my suggestions (credit still to @timout), as I'm not sure if they're still working on it. (My fault too, as there was such a long delay for a proper review.) While I had some open questions, I think it's a clear improvement in any case. I just need a little help on Mesos testing; we can ask on the dev list.
Test build #87556 has finished for PR 20640 at commit
@squito In general I agree with you; however, the current hardcoded maximum of 2 failures is blocking us, so I've created a simple fix that will demand less integration testing. I can close it.
I understand if you want to do something like this for yourself to unblock, but I think I'm -1 on merging this because of adding more configs just for a stopgap. But I think we agree on the right solution here -- if you post that, I will try to review promptly since I've got this paged in (though it might take a bit to merge as we figure out how to test ...).
(force-pushed from ca4eba2 to 2240259)
@squito updated. Actually, I've seen that without any blacklisting there are some corner cases: e.g. suppose there is some transient error on some slave; without blacklisting, the driver will accept the offer and will keep trying to start an executor on the bad slave over and over again. During this time it won't accept any other offers (e.g. max cores reached already), but effectively there will be fewer cores in use (since the executor constantly fails on this slave for some duration).
Test build #87558 has finished for PR 20640 at commit
// Launches a new task on a valid offer from the same slave
offerResources(List(offer2))
// but since it's blacklisted the offer is declined
verifyDeclinedOffer(driver, createOfferId("o1"))
Will this actually pass? I thought it wouldn't, because the filtering is done inside buildMesosTasks, which never calls declineOffer on offers that fail canLaunchTask. (A separate thing which needs fixing -- you could open another issue for that.)
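For context, a minimal sketch of the separate fix hinted at here -- explicitly declining offers that fail the launch check instead of silently skipping them. The helper name declineUnusableOffers is hypothetical; SchedulerDriver.declineOffer and Protos.Offer come from the Mesos Java API, and the real canLaunchTask in the backend takes more arguments than the simplified predicate shown.

```scala
import org.apache.mesos.Protos.Offer
import org.apache.mesos.SchedulerDriver

// Hypothetical helper: decline any offer that fails the launch check so Mesos
// can re-offer those resources to other frameworks promptly.
def declineUnusableOffers(driver: SchedulerDriver,
                          offers: Seq[Offer],
                          canLaunch: Offer => Boolean): Unit = {
  offers.filterNot(canLaunch).foreach { offer =>
    driver.declineOffer(offer.getId)
  }
}
```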
Ah, never mind. I took another look at the code and now I see how this works.
Thanks for updating. Can you also update the PR description? Yeah, it's fine to just update this one. You can't in general update others' PRs unless they give you push permission to their repos. You can start from their branch and then add your changes on top -- that is a little preferable, as that way the commit history includes their work, so when someone merges it's a bit more obvious. If you can adjust this PR that way, that would be nice -- but otherwise it's OK, I will try to remember to adjust attribution when merging.
(force-pushed from 2240259 to 918e066)
Test build #87562 has finished for PR 20640 at commit
(force-pushed from 1eb701c to 2f1db89)
Test build #87563 has finished for PR 20640 at commit
Test build #87564 has finished for PR 20640 at commit
LGTM. @IgorBerman can you clean up the PR description a little? Headers got duplicated. And I'd reword it a bit to something like
(The last "this closes" bit is useful for some tools we have; it will close the other one when this is merged.) Thanks for doing this. I will probably leave this open for a bit in case more Mesos users have thoughts.
slave.taskFailures += 1

if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
  logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
Is it a concern to lose this error message? (I don't know anything about Mesos but it does seem potentially useful?)
@kayousterhout BlacklistTracker has its own logging concerned with blacklisted nodes; won't that be enough? On the other hand, if blacklisting is disabled, which is the default, then we will lose this information.
@IgorBerman Yes, in the default case, it would be nice to have this information about a task failing, especially if it fails repeatedly.
@@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
       cpus + totalCoresAcquired <= maxCores &&
       mem <= offerMem &&
       numExecutors < executorLimit &&
-      slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
+      !scheduler.nodeBlacklist().contains(slaveId) &&
In other places it looks like the hostname is used in the blacklist - why does this check against the slaveId instead of the offerHostname?
You are right, thanks! Here:
if (!nodeIdToBlacklistExpiryTime.contains(host)) {
I've managed to track it to:
taskSetBlacklistHelperOpt.foreach(_.updateBlacklistForFailedTask(
I'll change it to offerHostname.
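A minimal sketch of the corrected condition being discussed, assuming the surrounding variables from the diff above: BlacklistTracker keys its node blacklist by hostname, so the check should use the offer's hostname rather than the Mesos slave ID. This is illustrative only, not the exact patch in MesosCoarseGrainedSchedulerBackend.

```scala
// Sketch only: names mirror the diff above, but this is a standalone illustration.
def canLaunchTaskSketch(
    offerHostname: String,
    nodeBlacklist: Set[String],
    cpus: Int, totalCoresAcquired: Int, maxCores: Int,
    mem: Long, offerMem: Long,
    numExecutors: Int, executorLimit: Int): Boolean = {
  cpus + totalCoresAcquired <= maxCores &&
    mem <= offerMem &&
    numExecutors < executorLimit &&
    !nodeBlacklist.contains(offerHostname) // hostname, not the Mesos slave ID
}
```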
(force-pushed from 2f1db89 to e2ddc1b)
Test build #87580 has finished for PR 20640 at commit
@squito updated to your wording. Where do you see headers duplicated?
@squito so I've run an internal workload with blacklisting on, killing executors for the framework manually. It didn't have much effect (only once did it blacklist 1 executor, and after the timeout it was removed). PS: not strictly connected to this conversation, but as a side note: I've probably found a bug where running in coarse-grained mode on Mesos + dynamic allocation on + blacklisting on (not sure if it's relevant) + client mode, and manually killing executors on some slaves, makes the Spark driver start a 2nd executor on the same slave for the same application (which is against the constraints of coarse-grained mode?).
Thanks @IgorBerman, the description looks fine to me now; maybe I saw it wrong before. Your test sounds pretty good to me ... you could turn on debug logging for MesosCoarseGrainedSchedulerBackend and look for these log msgs: What do you mean "it didn't have much effect" -- sounds like it did exactly the right thing? Sorry, I don't really understand the description of the other bug you mentioned. Why shouldn't it start a 2nd executor on the same slave for the same application? That seems fine until you have enough failures for the node blacklisting to take effect. There is also a small race (that is relatively benign) that would allow you to get an executor on a node which you are in the middle of blacklisting.
Hi, I just wanted to check on the status of this as it looks like it's been stalled. Did this get fixed another way? Or if this is going to be the fix, is there anything holding up progress other than a merge conflict?
@swevrywhere this PR is not merged, so the problem persists (you can always create a custom distro).
Hi @IgorBerman, @dongjoon-hyun, how can I help to move this PR forward and merge it? It's been a long time since the last activity on the issue and I'd be happy to help with it.
@akirillov Hi, unfortunately I don't have any say here. The PR was ready for merge, but none of the maintainers decided to merge it. In the end we are patching our internal Spark version with roughly the same logic (+-).
Thanks, @IgorBerman. Based on the conversation, it looks like the more general idea is to use logic similar to the one in
Speaking more generally, if there's a failed node or a network failure, the scheduler will not receive offers from that node and won't attempt to launch a task (executor) on it. Also, given that the coarse-grained scheduler is the default one and the fine-grained scheduler is deprecated, scheduling happens only on application start (except in the dynamic allocation use case). So, given the nature and duration of the scheduling step, it's not clear whether blacklisting makes sense for scheduling the executors themselves.
@akirillov I'd say it makes sense (e.g. for cases of a temporarily "bad" state of some node: a port collision for some reason, a bad disk, or anything else one can think of), but clearly not as it is implemented now (without any timeout), since it may eventually mark all nodes in the cluster as blacklisted (if we are speaking about a production cluster with long-running jobs).
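To illustrate the timeout point above, here is a minimal, hypothetical sketch (not from this PR and not BlacklistTracker's actual implementation) of a blacklist whose entries expire, so a node with a transient problem becomes schedulable again instead of staying blacklisted forever.

```scala
import scala.collection.mutable

// Hypothetical expiring blacklist: entries are ignored once their expiry passes.
class ExpiringBlacklist(timeoutMs: Long,
                        now: () => Long = () => System.currentTimeMillis()) {
  private val expiryByHost = mutable.Map.empty[String, Long]

  // Record a failure: blacklist the host until now + timeoutMs.
  def add(host: String): Unit = expiryByHost(host) = now() + timeoutMs

  // A host is blacklisted only while its expiry time is still in the future.
  def isBlacklisted(host: String): Boolean =
    expiryByHost.get(host).exists(_ > now())
}
```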
Sorry, looks like we have lost track of this. Since the original change was reviewed, unless there are more changes on the Mesos side (being discussed by @akirillov?) or changes generally (YARN etc., ping @squito), @IgorBerman, could you rebase and ping me again?
@felixcheung kind ping :) I merged master; tell me if you have any comments. I see that this code has changed a bit since I last looked at it...
Retest this please.
Hi, @IgorBerman and @akirillov.
Test build #111343 has finished for PR 20640 at commit
@dongjoon-hyun Yes, I opened SPARK-24567 to track this.
Got it. Thank you for the confirmation!
Hi, @IgorBerman.
My reluctance to merge this in the past is that without SPARK-24567, it kinda seems like a step backwards to me. If Mesos can't start a mesos-task (aka a spark-executor), then before this change we'd stop trying to place more executors there. The problem with the existing code is that the node is rejected indefinitely; I understand why you want to bring in the timeout. But you're only bringing in the timeout for those cases where the executors start successfully but tasks still consistently fail, e.g. if there is one bad disk out of 10. OTOH, if there is something else fundamentally wrong with the node that prevents any executor from starting (e.g. missing libs), then the behavior becomes worse. I don't use Mesos at all myself, so I don't know whether Mesos somehow shields you from one type of problem, which would still make this worthwhile. But without that confidence, I didn't feel great about merging this.
alternative
Got it, @squito. I'll follow your advice.
I'm closing this PR as it has been open for a long time and there is no agreement that it's valuable.
What changes were proposed in this pull request?
This updates the Mesos scheduler to integrate with the common node-blacklisting logic in BlacklistTracker that is shared across all cluster managers. Specifically, it removes the hardcoded MAX_SLAVE_FAILURES = 2 in MesosCoarseGrainedSchedulerBackend and uses the blacklist from the BlacklistTracker, as YARN does.
This closes #17619
Primary author: [email protected]
How was this patch tested?
Created a unit test for blacklisting of slaves.
@timout
@squito (as reviewer of previous PR)