
[SPARK-19755][Mesos] Blacklist is always active for MesosCoarseGrainedSchedulerBackend #20640

Conversation


@IgorBerman IgorBerman commented Feb 20, 2018

What changes were proposed in this pull request?

This updates the Mesos scheduler to integrate with the common logic for node-blacklisting across all cluster managers in BlacklistTracker. Specifically, it removes a hardcoded MAX_SLAVE_FAILURES = 2 in MesosCoarseGrainedSchedulerBackend, and uses the blacklist from the BlacklistTracker, as Yarn does.

This closes #17619

Prime Author: [email protected]

How was this patch tested?

Added a unit test for blacklisting of slaves.

@timout
@squito (as reviewer of previous PR)

@squito
Contributor

squito commented Feb 20, 2018

Jenkins, ok to test

@squito
Contributor

squito commented Feb 20, 2018

@IgorBerman I actually think that #17619 is the right approach. As @timout pointed out on that one, this functionality doesn't need to be covered in Mesos-specific code at all, as it's covered by the BlacklistTracker. I don't like introducing new configs when we don't really need them. Other than this being less invasive, is there another advantage here?

The modifications I suggested on that PR are relatively small -- I think it's fine if you want to open a PR that is the original updated with my suggestions (credit still to @timout), as I'm not sure whether they're still working on it. (My fault too, as there was such a long delay before a proper review.)

While I had some open questions, I think it's a clear improvement in any case. I just need a little help with Mesos testing; we can ask on the dev list.

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87556 has finished for PR 20640 at commit ca4eba2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@IgorBerman
Author

IgorBerman commented Feb 20, 2018

@squito In general I agree with you; however, the current hardcoded maximum of 2 failures is blocking for us, so I've created a simple fix that demands less integration testing. I can close it.
I think that in the context of MesosCoarseGrainedSchedulerBackend, the TaskSchedulerImpl is called scheduler, so to test whether a slaveId is blacklisted, something like the following is probably needed:
!scheduler.nodeBlacklist().contains(slaveId)
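
As a rough, self-contained sketch of the idea (not the actual patch): the offer-acceptance predicate would consult the scheduler's node blacklist alongside the usual resource checks. The Offer case class and canLaunchOn helper below are simplified stand-ins for the real Mesos protobufs and MesosCoarseGrainedSchedulerBackend.canLaunchTask; note that later in this review the lookup key is changed from the slave id to the offer's hostname.

case class Offer(slaveId: String, hostname: String, cpus: Int, memMb: Int)

def canLaunchOn(offer: Offer, nodeBlacklist: Set[String],
                requiredCpus: Int, requiredMemMb: Int): Boolean = {
  requiredCpus <= offer.cpus &&
    requiredMemMb <= offer.memMb &&
    !nodeBlacklist.contains(offer.slaveId) // keyed on slaveId as proposed here; refined to the hostname below
}

// Example: an offer from a blacklisted slave is rejected.
val blacklist = Set("slave-1")
println(canLaunchOn(Offer("slave-1", "host-1", cpus = 4, memMb = 4096), blacklist, requiredCpus = 2, requiredMemMb = 2048)) // false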

@squito
Contributor

squito commented Feb 20, 2018

I understand if you want to do something like this for yourself to unblock, but I think I'm -1 on merging this because of adding more configs just for a stopgap.

But I think we agree on the right solution here -- if you post that, I will try to review promptly since I've got this paged in (though it might take a bit to merge as we figure out how to test ...)

@IgorBerman
Author

@squito OK, I can make this small fix on top of @timout's PR. How can I update his PR/his code? Or can I use this one? WDYT?

@IgorBerman IgorBerman force-pushed the SPARK-19755-Blacklist-is-always-active-for-MesosCoarseGrainedSchedulerBackend branch from ca4eba2 to 2240259 on February 20, 2018 at 16:45
@IgorBerman
Author

IgorBerman commented Feb 20, 2018

@squito updated. Actually, I've noticed that without any blacklisting there are some corner cases: e.g. suppose there is a transient error on some slave; without blacklisting, the driver will accept the offer and keep trying to start an executor on the bad slave over and over again. During this time it won't accept any other offers (e.g. max cores already reached), but effectively fewer cores will be in use (since the executor constantly fails on this slave for some duration).
The best scenario (IMO) is to use a blacklist with a timeout, and BlacklistTracker provides exactly this, which is great.
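
For reference, a minimal sketch (not part of this PR) of enabling that behaviour, so a bad node is declined for a while and retried once the timeout expires; the application name is made up:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("blacklist-timeout-example") // hypothetical application name
  .set("spark.blacklist.enabled", "true")  // blacklisting is off by default
  .set("spark.blacklist.timeout", "1h")    // a blacklisted node/executor is retried after this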

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87558 has finished for PR 20640 at commit 2240259.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Attempts to launch a new task on a valid offer from the same slave
offerResources(List(offer2))
// but the offer is declined because the slave is now blacklisted
verifyDeclinedOffer(driver, createOfferId("o1"))
Contributor

Will this actually pass? I thought it wouldn't, because the filtering is done inside buildMesosTasks, which never calls declineOffer on offers that fail canLaunchTask. (That's a separate thing which needs fixing -- you could open another issue for it.)

Contributor

Ah, never mind. I took another look at the code and now I see how this works.

@squito
Contributor

squito commented Feb 20, 2018

Thanks for updating. Can you also update the PR description?

Yeah, it's fine to just update this one. You can't in general update others' PRs unless they give you push permission to their repos. You can start from their branch and then add your changes on top -- that is a little preferable, as that way the commit history includes their work, so when someone merges it's a bit more obvious. If you can adjust this PR that way, that would be nice -- but otherwise it's OK, I will try to remember to adjust attribution when merging.

@IgorBerman IgorBerman force-pushed the SPARK-19755-Blacklist-is-always-active-for-MesosCoarseGrainedSchedulerBackend branch from 2240259 to 918e066 on February 20, 2018 at 20:21
@SparkQA

SparkQA commented Feb 20, 2018

Test build #87562 has finished for PR 20640 at commit 918e066.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@IgorBerman
Author

IgorBerman commented Feb 20, 2018

@squito OK, I've cherry-picked @timout's 2 commits so the history will contain his contribution, and added a commit with your suggestions on top of it. I hope it's OK now.

@IgorBerman IgorBerman force-pushed the SPARK-19755-Blacklist-is-always-active-for-MesosCoarseGrainedSchedulerBackend branch 2 times, most recently from 1eb701c to 2f1db89 on February 20, 2018 at 20:37
@SparkQA

SparkQA commented Feb 20, 2018

Test build #87563 has finished for PR 20640 at commit 1eb701c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87564 has finished for PR 20640 at commit 2f1db89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Feb 20, 2018

lgtm

@IgorBerman can you clean up the PR description a little? The headers got duplicated. And I'd reword it a bit to something like:

This updates the Mesos scheduler to integrate with the common logic for node-blacklisting across all cluster managers in BlacklistTracker. Specifically, it removes a hardcoded MAX_SLAVE_FAILURES = 2 in MesosCoarseGrainedSchedulerBackend, and uses the blacklist from the BlacklistTracker, as Yarn does.

This closes #17619

(the last "this closes" bit is useful for some tools we have, it will close the other one when this is merged.)

Thanks for doing this. I will probably leave this open for a bit in case more Mesos users have thoughts.

slave.taskFailures += 1

if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
Contributor

Is it a concern to lose this error message? (I don't know anything about Mesos but it does seem potentially useful?)

Author

@kayousterhout BlacklistTracker has its own logging for blacklisted nodes; won't that be enough? On the other hand, if blacklisting is disabled, which is the default, then we will lose this information.

Contributor

@IgorBerman Yes, in the default case, it would be nice to have this information about a task failing, especially if it fails repeatedly.

@@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
cpus + totalCoresAcquired <= maxCores &&
mem <= offerMem &&
numExecutors < executorLimit &&
slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
!scheduler.nodeBlacklist().contains(slaveId) &&
Contributor

In other places it looks like the hostname is used in the blacklist - why does this check against the slaveId instead of the offerHostname?

Author

You are right, thanks!

here:

if (!nodeIdToBlacklistExpiryTime.contains(host)) {

I've managed to track it down to:

taskSetBlacklistHelperOpt.foreach(_.updateBlacklistForFailedTask(

I'll change it to offerHostname
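
A compact sketch of the correction being agreed on here, with simplified stand-ins for the offer and the blacklist (the real change swaps slaveId for offerHostname in canLaunchTask):

case class ResourceOffer(slaveId: String, offerHostname: String)

def isUsable(offer: ResourceOffer, nodeBlacklist: Set[String]): Boolean =
  !nodeBlacklist.contains(offer.offerHostname) // was: !nodeBlacklist.contains(offer.slaveId)

// Example: BlacklistTracker tracks nodes by hostname, so a slave-id lookup would never match.
val blacklist = Set("bad-host.example.com")
println(isUsable(ResourceOffer("slave-42", "bad-host.example.com"), blacklist)) // false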

@IgorBerman IgorBerman force-pushed the SPARK-19755-Blacklist-is-always-active-for-MesosCoarseGrainedSchedulerBackend branch from 2f1db89 to e2ddc1b on February 21, 2018 at 07:46
@SparkQA

SparkQA commented Feb 21, 2018

Test build #87580 has finished for PR 20640 at commit e2ddc1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@IgorBerman
Author

@squito Updated to your wording. Where do you see duplicated headers?
I'm going to test a custom build that contains this patch (actually on top of 2.2.0) to verify that basic executor-level blacklisting works and that offers are rejected until the timeout expires.
I'll update you with my findings/testing results.

@IgorBerman
Author

@squito So I've run an internal workload with blacklisting on, while manually killing executors belonging to the framework. It didn't have much effect (only once did it blacklist an executor, and after the timeout it removed it).
If you have an idea of how to test this properly, I can try to do it, aiming specifically at the scenario of rejecting offers from the Mesos master for a blacklisted slave's resources.

PS: not strictly connected to this conversation, but as a side note: I've probably found a bug where running in coarse-grained mode on Mesos + dynamic allocation on + blacklisting on (not sure if it's relevant) + client mode, and manually killing executors on some slaves, makes the Spark driver start a 2nd executor on the same slave for the same application (which is against the constraints of coarse-grained mode?).

@squito
Contributor

squito commented Feb 21, 2018

Thanks @IgorBerman, the description looks fine to me now; maybe I saw it wrong before.

Your test sounds pretty good to me ... you could turn on debug logging for MesosCoarseGrainedSchedulerBackend and look for these log messages:

https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L603
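
As a hedged sketch (assuming the log4j 1.x setup Spark shipped with at the time): the debug level for that package can be raised via conf/log4j.properties, or programmatically, e.g.:

import org.apache.log4j.{Level, Logger}

// Surface the logDebug calls in MesosSchedulerUtils / MesosCoarseGrainedSchedulerBackend.
Logger.getLogger("org.apache.spark.scheduler.cluster.mesos").setLevel(Level.DEBUG)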

What do you mean by "it didn't have much effect"? It sounds like it did exactly the right thing.

Sorry, I don't really understand the description of the other bug you mentioned. Why shouldn't it start a 2nd executor on the same slave for the same application? That seems fine until you have enough failures for the node blacklisting to take effect. There is also a small race (that is relatively benign) that would allow you to get an executor on a node which you are in the middle of blacklisting.

@swevrywhere

Hi, I just wanted to check on the status of this as it looks like it's been stalled. Did this get fixed another way? Or if this is going to be the fix, is there anything holding up progress other than a merge conflict?

@IgorBerman
Author

@swevrywhere This PR is not merged, so the problem persists (you can always create a custom distro).

@akirillov

Hi @IgorBerman, @dongjoon-hyun, how can I help to move this PR forward and merge it? It's been a long time since the last activity on the issue and I'd be happy to help with it.

@IgorBerman
Author

@akirillov Hi, unfortunately I don't have any say here. The PR was ready for merge, but none of the maintainers decided to merge it. In the end we are patching our internal Spark version with more or less the same logic.
I can rebase the patch if the maintainers decide to go forward.

@akirillov

Thanks, @IgorBerman. Based on the conversation it looks like the more general idea is to use logic similar to the one in BlacklistTracker but for Mesos Task failures. Mesos task launch failure can be caused by multiple reasons including TASK_ERROR due to lack of permissions (not node-specific), TASK_KILLED due to over-commitment or the upcoming node draining Mesos feature. So it doesn't seem that BlacklistTracker can be used for this purpose and another implementation is needed.

Speaking more generally, if there's a failed node or a network failure, the scheduler will not receive offers from that node and won't attempt to launch a task (executor) on it. Also, given that the coarse-grained scheduler is the default one, and the fine-grained scheduler is deprecated, scheduling happens only on application start (except in the dynamic allocation use case). So given the nature and duration of the scheduling step, it's not clear whether blacklisting makes sense for the scheduling of executors themselves.

@IgorBerman
Author

@akirillov I'd say it makes sense (e.g. for cases of a temporarily "bad" state of some node: a port collision for some reason, a bad disk, or anything else one can think of), but clearly not as it is implemented now (without any timeout), since it may eventually mark all nodes in the cluster as blacklisted (if we are speaking about a production cluster with long-running jobs).

@felixcheung
Member

Sorry, it looks like we have lost track of this. Since the original change was already reviewed, we should be able to move forward, unless there are more changes on the Mesos side (being discussed by @akirillov?) or changes generally (YARN etc., ping @squito).

@IgorBerman if you could rebase and ping me again.

@IgorBerman
Author

@felixcheung Kind ping :) I've merged master; tell me if you have any comments. I see that this code has changed a bit since I last looked at it...

@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun
Member

Hi, @IgorBerman and @akirillov.
According to the discussion, this PR still doesn't cover the following?

[SPARK-16630][YARN] Blacklist a node if executors won't launch on it

@SparkQA

SparkQA commented Sep 25, 2019

Test build #111343 has finished for PR 20640 at commit 95ca22c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@IgorBerman
Author

@dongjoon-hyun Yes, I opened SPARK-24567 to track this.

@dongjoon-hyun
Member

Got it. Thank you for the confirmation!

@dongjoon-hyun
Member

Hi, @IgorBerman.
I made a PR against your branch. Please review and merge if you agree.

@squito
Contributor

squito commented Oct 1, 2019

My reluctance to merge this in the past is that, without SPARK-24567, it kinda seems like a step backwards to me. If Mesos can't start a Mesos task (aka a Spark executor), then before this change we'd stop trying to place more executors there.

The problem with the existing code is that the node is rejected indefinitely; I understand why you want to bring in the timeout. But you're only bringing in the timeout for those cases where the executors start successfully but tasks still consistently fail, e.g. if there is one bad disk out of 10. OTOH, if there is something else fundamentally wrong with the node that prevents any executor from starting (e.g. missing libs), then the behavior becomes worse.

I don't use Mesos at all myself, so I don't know whether Mesos somehow shields you from one type of problem in a way that still makes this worthwhile. But without that confidence, I didn't feel great about merging this.

@dongjoon-hyun
Member

Got it, @squito . I'll follow your advice.

@IgorBerman
Author

I'm closing this PR, as it has been open for a long time and there is no agreement that it's valuable.
Anyone who wants to pick it up is welcome to open a new one.

@IgorBerman IgorBerman closed this Oct 10, 2019