
[SPARK-19755][Mesos] Blacklist is always active for MesosCoarseGrainedSchedulerBackend #20640

Conversation


@IgorBerman IgorBerman commented Feb 20, 2018

What changes were proposed in this pull request?

This updates the Mesos scheduler to integrate with the common logic for node-blacklisting across all cluster managers in BlacklistTracker. Specifically, it removes a hardcoded MAX_SLAVE_FAILURES = 2 in MesosCoarseGrainedSchedulerBackend, and uses the blacklist from the BlacklistTracker, as Yarn does.

This closes #17619

Prime Author: [email protected]

How was this patch tested?

Added a unit test for blacklisting of slaves.

@timout
@squito (as reviewer of previous PR)

@squito
Contributor

squito commented Feb 20, 2018

Jenkins, ok to test

@squito
Contributor

squito commented Feb 20, 2018

@IgorBerman I actually think that #17619 is the right approach. As @timout pointed out on that one, this functionality doesn't need to be covered in Mesos-specific code at all, as it's covered by the BlacklistTracker. I don't like introducing new configs when we don't really need them. Other than this being less invasive, is there another advantage here?

The modifications I suggested on that PR are relatively small -- I think it's fine if you want to open a PR that is the original updated with my suggestions (credit still to @timout), as I'm not sure whether they're still working on it. (My fault too, as there was such a long delay before a proper review.)

While I had some open questions, I think it's a clear improvement in any case. I just need a little help with Mesos testing; we can ask on the dev list.

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87556 has finished for PR 20640 at commit ca4eba2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@IgorBerman
Author

IgorBerman commented Feb 20, 2018

@squito In general I agree with you; however, the current hardcoded maximum of 2 failures is blocking for us, so I've created a simple fix that demands less integration testing. I can close it.
I think that in the context of MesosCoarseGrainedSchedulerBackend, the TaskSchedulerImpl is called scheduler, so to test whether a slaveId is blacklisted, something like the following is probably needed:
!scheduler.nodeBlacklist().contains(slaveId)
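
As a rough, self-contained sketch of the idea (not the actual patch): the offer-acceptance predicate would consult the scheduler's node blacklist alongside the usual resource checks. The Offer case class and canLaunchOn helper below are simplified stand-ins for the real Mesos protobufs and MesosCoarseGrainedSchedulerBackend.canLaunchTask; note that later in this review the lookup key is changed from the slave id to the offer's hostname.

case class Offer(slaveId: String, hostname: String, cpus: Int, memMb: Int)

def canLaunchOn(offer: Offer, nodeBlacklist: Set[String],
                requiredCpus: Int, requiredMemMb: Int): Boolean = {
  requiredCpus <= offer.cpus &&
    requiredMemMb <= offer.memMb &&
    !nodeBlacklist.contains(offer.slaveId) // keyed on slaveId as proposed here; refined to the hostname below
}

// Example: an offer from a blacklisted slave is rejected.
val blacklist = Set("slave-1")
println(canLaunchOn(Offer("slave-1", "host-1", cpus = 4, memMb = 4096), blacklist, requiredCpus = 2, requiredMemMb = 2048)) // false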

@squito
Contributor

squito commented Feb 20, 2018

I understand if you want to do something like this for yourself to unblock, but I think I'm -1 on merging this because of adding more configs just for a stopgap.

But I think we agree on the right solution here -- if you post that, I will try to review promptly since I've got this paged in (though it might take a bit to merge as we figure out how to test ...)

@IgorBerman
Author

@squito OK, I can make this small fix on top of @timout's PR. How can I update his PR/his code? Or can I use this one? WDYT?

@IgorBerman IgorBerman force-pushed the SPARK-19755-Blacklist-is-always-active-for-MesosCoarseGrainedSchedulerBackend branch from ca4eba2 to 2240259 on February 20, 2018 at 16:45
@IgorBerman
Author

IgorBerman commented Feb 20, 2018

@squito updated. Actually, I've noticed that without any blacklisting there are some corner cases: e.g. suppose there is a transient error on some slave; without blacklisting, the driver will accept the offer and keep trying to start an executor on the bad slave over and over again. During this time it won't accept any other offers (e.g. max cores already reached), but effectively fewer cores will be in use (since the executor constantly fails on this slave for some duration).
The best scenario (IMO) is to use a blacklist with a timeout, and BlacklistTracker provides exactly this, which is great.
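
For reference, a minimal sketch (not part of this PR) of enabling that behaviour, so a bad node is declined for a while and retried once the timeout expires; the application name is made up:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("blacklist-timeout-example") // hypothetical application name
  .set("spark.blacklist.enabled", "true")  // blacklisting is off by default
  .set("spark.blacklist.timeout", "1h")    // a blacklisted node/executor is retried after this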

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87558 has finished for PR 20640 at commit 2240259.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Attempts to launch a new task on a valid offer from the same slave
offerResources(List(offer2))
// but the offer is declined because the slave is now blacklisted
verifyDeclinedOffer(driver, createOfferId("o1"))
Contributor

Will this actually pass? I thought it wouldn't, because the filtering is done inside buildMesosTasks, which never calls declineOffer on offers that fail canLaunchTask. (That's a separate thing which needs fixing -- you could open another issue for it.)

Contributor

Ah, never mind. I took another look at the code and now I see how this works.

@squito
Contributor

squito commented Feb 20, 2018

Thanks for updating. Can you also update the PR description?

Yeah, it's fine to just update this one. You can't in general update others' PRs unless they give you push permission to their repos. You can start from their branch and then add your changes on top -- that is a little preferable, as that way the commit history includes their work, so when someone merges it's a bit more obvious. If you can adjust this PR that way, that would be nice -- but otherwise it's OK, I will try to remember to adjust attribution when merging.

@IgorBerman IgorBerman force-pushed the SPARK-19755-Blacklist-is-always-active-for-MesosCoarseGrainedSchedulerBackend branch from 2240259 to 918e066 on February 20, 2018 at 20:21
@SparkQA

SparkQA commented Feb 20, 2018

Test build #87562 has finished for PR 20640 at commit 918e066.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@IgorBerman
Author

IgorBerman commented Feb 20, 2018

@squito OK, I've cherry-picked @timout's 2 commits so the history will contain his contribution, and added a commit with your suggestions on top of it. I hope it's OK now.

@IgorBerman IgorBerman force-pushed the SPARK-19755-Blacklist-is-always-active-for-MesosCoarseGrainedSchedulerBackend branch 2 times, most recently from 1eb701c to 2f1db89 on February 20, 2018 at 20:37
@SparkQA

SparkQA commented Feb 20, 2018

Test build #87563 has finished for PR 20640 at commit 1eb701c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 20, 2018

Test build #87564 has finished for PR 20640 at commit 2f1db89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@squito
Contributor

squito commented Feb 20, 2018

lgtm

@IgorBerman can you clean up the PR description a little? The headers got duplicated. And I'd reword it a bit to something like:

This updates the Mesos scheduler to integrate with the common logic for node-blacklisting across all cluster managers in BlacklistTracker. Specifically, it removes a hardcoded MAX_SLAVE_FAILURES = 2 in MesosCoarseGrainedSchedulerBackend, and uses the blacklist from the BlacklistTracker, as Yarn does.

This closes #17619

(the last "this closes" bit is useful for some tools we have, it will close the other one when this is merged.)

Thanks for doing this. I will probably leave this open for a bit in case more Mesos users have thoughts.

slave.taskFailures += 1

if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
Contributor

Is it a concern to lose this error message? (I don't know anything about Mesos but it does seem potentially useful?)

Author

@kayousterhout BlacklistTracker has its own logging for blacklisted nodes; won't that be enough? On the other hand, if blacklisting is disabled, which is the default, then we will lose this information.

Contributor

@IgorBerman Yes, in the default case, it would be nice to have this information about a task failing, especially if it fails repeatedly.

@@ -571,7 +568,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
cpus + totalCoresAcquired <= maxCores &&
mem <= offerMem &&
numExecutors < executorLimit &&
slaves.get(slaveId).map(_.taskFailures).getOrElse(0) < MAX_SLAVE_FAILURES &&
!scheduler.nodeBlacklist().contains(slaveId) &&
Contributor

In other places it looks like the hostname is used in the blacklist - why does this check against the slaveId instead of the offerHostname?

Author

You are right, thanks!

here:

if (!nodeIdToBlacklistExpiryTime.contains(host)) {

I've managed to track it down to:

taskSetBlacklistHelperOpt.foreach(_.updateBlacklistForFailedTask(

I'll change it to offerHostname
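
A compact sketch of the correction being agreed on here, with simplified stand-ins for the offer and the blacklist (the real change swaps slaveId for offerHostname in canLaunchTask):

case class ResourceOffer(slaveId: String, offerHostname: String)

def isUsable(offer: ResourceOffer, nodeBlacklist: Set[String]): Boolean =
  !nodeBlacklist.contains(offer.offerHostname) // was: !nodeBlacklist.contains(offer.slaveId)

// Example: BlacklistTracker tracks nodes by hostname, so a slave-id lookup would never match.
val blacklist = Set("bad-host.example.com")
println(isUsable(ResourceOffer("slave-42", "bad-host.example.com"), blacklist)) // false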

@IgorBerman IgorBerman force-pushed the SPARK-19755-Blacklist-is-always-active-for-MesosCoarseGrainedSchedulerBackend branch from 2f1db89 to e2ddc1b on February 21, 2018 at 07:46
@SparkQA

SparkQA commented Feb 21, 2018

Test build #87580 has finished for PR 20640 at commit e2ddc1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@IgorBerman
Author

@squito Updated to your wording. Where do you see duplicated headers?
I'm going to test a custom build that contains this patch (actually on top of 2.2.0) to verify that basic executor-level blacklisting works and that offers are rejected until the timeout expires.
I'll update you with my findings/testing results.

@IgorBerman
Author

@squito So I've run an internal workload with blacklisting on, while manually killing executors belonging to the framework. It didn't have much effect (only once did it blacklist an executor, and after the timeout it removed it).
If you have an idea of how to test this properly, I can try to do it, aiming specifically at the scenario of rejecting offers from the Mesos master for a blacklisted slave's resources.

PS: not strictly connected to this conversation, but as a side note: I've probably found a bug where running in coarse-grained mode on Mesos + dynamic allocation on + blacklisting on (not sure if it's relevant) + client mode, and manually killing executors on some slaves, makes the Spark driver start a 2nd executor on the same slave for the same application (which is against the constraints of coarse-grained mode?).

@squito
Contributor

squito commented Feb 21, 2018

Thanks @IgorBerman, the description looks fine to me now; maybe I saw it wrong before.

Your test sounds pretty good to me ... you could turn on debug logging for MesosCoarseGrainedSchedulerBackend and look for these log messages:

https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L603
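
As a hedged sketch (assuming the log4j 1.x setup Spark shipped with at the time): the debug level for that package can be raised via conf/log4j.properties, or programmatically, e.g.:

import org.apache.log4j.{Level, Logger}

// Surface the logDebug calls in MesosSchedulerUtils / MesosCoarseGrainedSchedulerBackend.
Logger.getLogger("org.apache.spark.scheduler.cluster.mesos").setLevel(Level.DEBUG)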

What do you mean by "it didn't have much effect"? It sounds like it did exactly the right thing.

Sorry, I don't really understand the description of the other bug you mentioned. Why shouldn't it start a 2nd executor on the same slave for the same application? That seems fine until you have enough failures for the node blacklisting to take effect. There is also a small race (that is relatively benign) that would allow you to get an executor on a node which you are in the middle of blacklisting.

@swevrywhere

Hi, I just wanted to check on the status of this as it looks like it's been stalled. Did this get fixed another way? Or if this is going to be the fix, is there anything holding up progress other than a merge conflict?

@IgorBerman
Author

@swevrywhere This PR is not merged, so the problem persists (you can always create a custom distro).

@akirillov

Hi @IgorBerman, @dongjoon-hyun, how can I help to move this PR forward and merge it? It's been a long time since the last activity on the issue and I'd be happy to help with it.

@IgorBerman
Author

@akirillov Hi, unfortunately I don't have any say here. The PR was ready for merge, but none of the maintainers decided to merge it. In the end we are patching our internal Spark version with more or less the same logic.
I can rebase the patch if the maintainers decide to go forward.

@akirillov

Thanks, @IgorBerman. Based on the conversation it looks like the more general idea is to use logic similar to the one in BlacklistTracker but for Mesos Task failures. Mesos task launch failure can be caused by multiple reasons including TASK_ERROR due to lack of permissions (not node-specific), TASK_KILLED due to over-commitment or the upcoming node draining Mesos feature. So it doesn't seem that BlacklistTracker can be used for this purpose and another implementation is needed.

Speaking more generally, if there's a failed node or a network failure, the scheduler will not receive offers from that node and won't attempt to launch a task (executor) on it. Also, given that the coarse-grained scheduler is the default one, and the fine-grained scheduler is deprecated, scheduling happens only on application start (except in the dynamic allocation use case). So given the nature and duration of the scheduling step, it's not clear whether blacklisting makes sense for the scheduling of executors themselves.

@IgorBerman
Author

@akirillov I'd say it makes sense (e.g. for cases of a temporarily "bad" state of some node: a port collision for some reason, a bad disk, or anything else one can think of), but clearly not as it is implemented now (without any timeout), since it may eventually mark all nodes in the cluster as blacklisted (if we are speaking about a production cluster with long-running jobs).

@felixcheung
Member

Sorry, it looks like we have lost track of this. Since the original change was already reviewed, we should be able to move forward, unless there are more changes on the Mesos side (being discussed by @akirillov?) or changes generally (YARN etc., ping @squito).

@IgorBerman if you could rebase and ping me again.

@IgorBerman
Author

@felixcheung Kind ping :) I've merged master; tell me if you have any comments. I see that this code has changed a bit since I last looked at it...

@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun
Member

Hi, @IgorBerman and @akirillov.
According to the discussion, this PR still doesn't cover the following?

[SPARK-16630][YARN] Blacklist a node if executors won't launch on it

@SparkQA

SparkQA commented Sep 25, 2019

Test build #111343 has finished for PR 20640 at commit 95ca22c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@IgorBerman
Author

@dongjoon-hyun Yes, I opened SPARK-24567 to track this.

@dongjoon-hyun
Member

Got it. Thank you for the confirmation!

@dongjoon-hyun
Member

Hi, @IgorBerman.
I made a PR against your branch. Please review and merge if you agree.

@squito
Contributor

squito commented Oct 1, 2019

My reluctance to merge this in the past is that, without SPARK-24567, it kinda seems like a step backwards to me. If Mesos can't start a Mesos task (aka a Spark executor), then before this change we'd stop trying to place more executors there.

The problem with the existing code is that the node is rejected indefinitely; I understand why you want to bring in the timeout. But you're only bringing in the timeout for those cases where the executors start successfully but tasks still consistently fail, e.g. if there is one bad disk out of 10. OTOH, if there is something else fundamentally wrong with the node that prevents any executor from starting (e.g. missing libs), then the behavior becomes worse.

I don't use Mesos at all myself, so I don't know whether Mesos somehow shields you from one type of problem in a way that still makes this worthwhile. But without that confidence, I didn't feel great about merging this.

@dongjoon-hyun
Member

Got it, @squito . I'll follow your advice.

@IgorBerman
Author

I'm closing this PR, as it has been open for a long time and there is no agreement that it's valuable.
Anyone who wants to pick it up is welcome to open a new one.

@IgorBerman IgorBerman closed this Oct 10, 2019