[SPARK-16630][YARN] Blacklist a node if executors won't launch on it #21068
Conversation
Jenkins, test this please
Jenkins, add to whitelist
Test build #89347 has finished for PR 21068 at commit
Test build #89348 has finished for PR 21068 at commit
Test build #89350 has finished for PR 21068 at commit
just really minor comments from a first read, need to spend more time understanding it all better
// Queue to store the timestamp of failed executors for each host
private val failedExecutorsTimeStampsPerHost = mutable.Map[String, mutable.Queue[Long]]()

private val sumFailedExecutorsTimeStamps = new mutable.Queue[Long]()
Why is this called "sum"? I think the old name failedExecutorTimestamps is more appropriate; same for the other places where you added "sum".
BLACKLIST_SIZE_LIMIT.getOrElse((numClusterNodes * BLACKLIST_SIZE_DEFAULT_WEIGHT).toInt)
val nodesToBlacklist =
  if (schedulerBlacklistedNodesWithExpiry.size +
    allocationBlacklistedNodesWithExpiry.size > limit) {
nit: double-indent the continuation of the if condition. (We don't do this everywhere, but we should; I find it helps.)
I think it's worth considering whether we can make these changes less YARN-specific. Really we're only getting a bit of info from the cluster manager:
- the container failed during allocation
- how many nodes are on the cluster
and we only need to have the combined set of blacklisted nodes available to the cluster manager. The rest of the logic could live within BlacklistTracker (or some similar helper), which doesn't need to know about the cluster manager at all.
Other than renaming, the significant change that would mean is that all the logic in YarnAllocatorBlacklistTracker would need to move to the driver instead of the AM, so it would change the messages somewhat. In particular I think you'd need to change the ExecutorExited message to include whether it was a failure to even allocate the container (a rough sketch follows below).
This way it would be easier to add this for Mesos (there are already Mesos changes that are sort of waiting on this) and Kubernetes.
@tgravescs thoughts?
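To make the suggestion concrete, a hypothetical sketch of such a message is shown here; the extra field name is invented for illustration and is not part of this PR:

// Hypothetical sketch only: an ExecutorExited-style message carrying whether the
// container ever launched on the node (field name invented for illustration).
case class ExecutorExited(
    exitCode: Int,
    exitCausedByApp: Boolean,
    failedBeforeLaunch: Boolean, // new: allocation/launch failed before the executor started
    reason: String)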
Test build #89355 has finished for PR 21068 at commit
Test build #89373 has finished for PR 21068 at commit
Just took a quick look - I would work on Imran's advice first and then see if any of my comments are still valid.
}

private def updateAllocationBlacklistedNodes(hostname: String): Unit = {
  if (IS_YARN_ALLOCATION_BLACKLIST_ENABLED) {
consider just:
if (!IS_YARN_ALLOCATION_BLACKLIST_ENABLED) return;
to save a level of indentation below.
As far as I know, using return in Scala is mostly discouraged. Anyway, here we have only two levels of indentation, so I would keep these if conditions as they are.
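For reference, a minimal sketch of the two styles being discussed; the class name, flag and method bodies are placeholders, not the PR's actual code:

class GuardClauseExample(allocationBlacklistEnabled: Boolean) {
  // Early-return style suggested above (saves one level of indentation):
  def updateWithReturn(hostname: String): Unit = {
    if (!allocationBlacklistEnabled) return
    // ...track the failure and possibly blacklist `hostname`...
  }

  // Nested-if style kept in the PR, since `return` is generally discouraged in Scala:
  def updateWithNestedIf(hostname: String): Unit = {
    if (allocationBlacklistEnabled) {
      // ...track the failure and possibly blacklist `hostname`...
    }
  }
}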
if (IS_YARN_ALLOCATION_BLACKLIST_ENABLED) {
  val failuresOnHost = failureWithinTimeIntervalTracker.getNumExecutorFailuresOnHost(hostname)
  if (failuresOnHost > BLACKLIST_MAX_FAILED_EXEC_PER_NODE) {
    logInfo("blacklisting host as YARN allocation failed: %s".format(hostname))
log msg could include the number of failures
thanks, I will add it
private var currentBlacklistedYarnNodes = Set.empty[String]

private var schedulerBlacklistedNodesWithExpiry = Map.empty[String, Long]
Do you need to keep a separate data structure for the scheduler and allocator blacklisted nodes? Instead, could you add the scheduler ones into a shared map when setSchedulerBlacklistedNodes is called?
We have to store them separately, as there are two sources of blacklisted nodes and they are updated separately via the two setters, where the complete state of each blacklisted set arrives rather than a diff (another reason is that only the expiry of allocator-blacklisted nodes is handled by YarnAllocatorBlacklistTracker).
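A minimal sketch of that arrangement, using the field and setter names quoted in this review (the wrapper class name is invented and this is not necessarily the merged code):

class BlacklistSources {
  // each source keeps its own complete state
  private var schedulerBlacklistedNodesWithExpiry = Map.empty[String, Long]
  private var allocationBlacklistedNodesWithExpiry = Map.empty[String, Long]

  // the scheduler setter replaces the whole scheduler-side state, not a diff
  def setSchedulerBlacklistedNodes(nodes: Map[String, Long]): Unit = {
    schedulerBlacklistedNodesWithExpiry = nodes
  }

  // allocator-side entries are added (and expired) by the tracker itself
  def addAllocationBlacklistedNode(host: String, expiry: Long): Unit = {
    allocationBlacklistedNodesWithExpiry += (host -> expiry)
  }

  // only the union is forwarded to YARN
  def combinedBlacklist: Set[String] =
    schedulerBlacklistedNodesWithExpiry.keySet ++ allocationBlacklistedNodesWithExpiry.keySet
}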
while (executorFailuresValidityInterval > 0
    && failedExecutorsTimeStampsForHost.nonEmpty
    && failedExecutorsTimeStampsForHost.head < endTime - executorFailuresValidityInterval) {
  failedExecutorsTimeStampsForHost.dequeue()
It's counter-intuitive that this get* method mutates state. If I called

getNumFailuresWithinValidityInterval(foo, 0)
getNumFailuresWithinValidityInterval(foo, 10)
getNumFailuresWithinValidityInterval(foo, 0)

the last call can return something different from the first, because all the failures that weren't within 10 - executorFailuresValidityInterval will have been dropped.
Ok, I will take your recommendation and drop endTime, renaming the method to getRecentFailureCount.
    endTime: Long): Int = {
  while (executorFailuresValidityInterval > 0
      && failedExecutorsTimeStampsForHost.nonEmpty
      && failedExecutorsTimeStampsForHost.head < endTime - executorFailuresValidityInterval) {
This relies on the fact that the clock is monotonic, but if it's a SystemClock it's based on System.currentTimeMillis(), which is not monotonic and can time-travel.
This code is coming from YarnAllocator.scala#L175.
As I see it, the common solution in Spark is to use Clock, SystemClock and ManualClock. And here the validity time is much higher than the adjustment NTP can apply.
private val failedExecutorsTimeStamps = new mutable.Queue[Long]()

private def getNumFailuresWithinValidityInterval(
It's not really clear what a 'validity interval' is. I think it means that only failures that have happened recently are considered valid? I think it would be clearer to call this getNumFailuresSince(), or getRecentFailureCount() or similar, and explicitly pass in the timestamp the caller wants to consider failures since.
If you do the latter, and drop the endTime argument, then you partly address the issue I raise below about how this mutates state, because getRecentFailureCount() suggests more clearly that it's expected to take the current time into account.
Thanks, then getRecentFailureCount will be the method name, without the endTime argument.
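A rough sketch of the renamed method, assembled from the snippets quoted in this review; the wrapper class name is invented and a plain () => Long stands in for Spark's Clock to keep the sketch self-contained:

import scala.collection.mutable

class RecentFailureCount(clock: () => Long, executorFailuresValidityInterval: Long) {
  // prunes expired timestamps against the current time, then returns what remains
  def getRecentFailureCount(failedExecutorsTimeStamps: mutable.Queue[Long]): Int = {
    val endTime = clock()
    while (executorFailuresValidityInterval > 0 &&
        failedExecutorsTimeStamps.nonEmpty &&
        failedExecutorsTimeStamps.head < endTime - executorFailuresValidityInterval) {
      failedExecutorsTimeStamps.dequeue()
    }
    failedExecutorsTimeStamps.size
  }
}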
If we can move it to be common I think that would be good.
@squito do you know if mesos and/or kubernetes can provide this same information?
@@ -216,6 +216,10 @@ private[scheduler] class BlacklistTracker (
    }
  }

  private def updateNodeBlacklist(): Unit = {
this function seems unnecessary to me, I don't see it adding any value vs doing it inline.
ok
.booleanConf
.createOptional

private[spark] val YARN_BLACKLIST_SIZE_LIMIT =
why do we want both this and the spark.yarn.blacklist.size.default.weight?
we can remove it
@@ -328,4 +328,26 @@ package object config {
    CACHED_FILES_TYPES,
    CACHED_CONF_ARCHIVE)

  /* YARN allocator-level blacklisting related config entries. */
  private[spark] val YARN_ALLOCATION_BLACKLIST_ENABLED =
    ConfigBuilder("spark.yarn.allocation.blacklist.enabled")
I would say either just call it spark.yarn.blacklist.enabled, or make it more specific that this is executor-launch-failure blacklisting.
First I named it "spark.yarn.blacklist.enabled", but then I wondered whether a user would confuse it with YARN's own blacklisting, so I added the "allocation" part. So I would go for the second option: "spark.yarn.executor.launch.blacklist.enabled".
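Roughly, the renamed entry would look like the sketch below, using the same ConfigBuilder DSL as the surrounding config object; this is illustrative only, and the doc text and default are assumptions here:

private[spark] val YARN_EXECUTOR_LAUNCH_BLACKLIST_ENABLED =
  ConfigBuilder("spark.yarn.executor.launch.blacklist.enabled")
    .doc("Whether to blacklist nodes in the YARN allocator when executors repeatedly " +
      "fail to launch on them.")
    .booleanConf
    .createWithDefault(false)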
private[spark] val YARN_BLACKLIST_SIZE_DEFAULT_WEIGHT =
  ConfigBuilder("spark.yarn.blacklist.size.default.weight")
    .doc("If blacklist size limit is not specified then the default limit will be the number of " +
perhaps rename it to something like spark.yarn.blacklist.maxNodeBlacklistRatio. (Note we are talking about using a Ratio in another config here: #19881.)
ok
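A sketch of the suggested ratio config, reusing the checkValue and default quoted later in this review; treat it as illustrative, since the size-limit approach was eventually replaced by failing the app when the whole cluster is blacklisted:

private[spark] val YARN_BLACKLIST_MAX_NODE_BLACKLIST_RATIO =
  ConfigBuilder("spark.yarn.blacklist.maxNodeBlacklistRatio")
    .doc("Maximum fraction of the cluster's nodes the YARN allocator may blacklist.")
    .doubleConf
    .checkValue(weight => weight >= 0 && weight <= 1, "The value of this ratio must be in [0, 1].")
    .createWithDefault(0.75)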
actually the only other thing I need to make sure of is that there aren't any delays if we now send the information from the yarn allocator back to the scheduler and then, I assume, it would need to get it back again from the scheduler. During that time the yarn allocator could be calling allocate() and updating things, so we need to make sure it gets the most up-to-date blacklist. Also, I need to double check, but the blacklist information isn't being sent to the yarn allocator when dynamic allocation is off, right? We would want that to happen.
yeah both good points. actually, don't we want to update the general node blacklist on the yarn allocator even when dynamic allocation is off? I don't think it gets updated at all unless dynamic allocation is on, it seems all the updates originate in
I don't know about kubernetes at all. Mesos does provide info when a container fails. I don't think it lets you know the total cluster size, but that should be optional. Btw, node count is never going to be totally sufficient, as the remaining nodes might not actually be able to run your executors (smaller hardware, always taken up by higher priority applications, other constraints in a framework like mesos); it's always going to be best effort. @attilapiros and I discussed this briefly yesterday; an alternative to moving everything into the BlacklistTracker on the driver is to just have some abstract base class, which is changed slightly for each cluster manager. Then you could keep the flow like it is here, with the extra blacklisting living in YarnAllocator still.
Yes, we can create an abstract class from it. We just have to make up our minds where to go from here. Any help and suggestions are welcome for the decision.
I think Tom makes a good case for why this should live in the YarnAllocator as you have it. I also don't think you need to worry about creating an abstract class yet; that refactoring can be done when another cluster manager tries to share some code ... it would just be helpful to keep that use in mind. Also, I filed https://issues.apache.org/jira/browse/SPARK-24016 for updating the task-based node blacklist even with static allocation.
thanks for filing that jira @squito, I agree we should have blacklisting work with dynamic allocation disabled as well. (A bit of a tangent from this jira:) I'm actually wondering now about the scheduler blacklisting and whether it should have a max blacklisted ratio as well; I don't remember if we discussed this previously. For this, I'm fine either way: if there are people interested in doing the mesos/kubernetes stuff now, we could certainly coordinate with them to see if there is something common we could do now. I haven't had time to keep up with those jiras to know, though. Otherwise this isn't public facing, so we can do that when they decide to implement it.
Test build #89514 has finished for PR 21068 at commit
@tgravescs on the blacklist ratio for task-based blacklisting -- there is nothing, but there are some related jiras: SPARK-22148 & SPARK-15815. To be honest, I have doubts about the utility of the ratio ... if you really want to make sure blacklisting doesn't lead to starvation, you've got to have some other mechanism, as you could easily have the remaining nodes be occupied or have insufficient resources. Kubernetes doesn't do anything with the node blacklisting currently: SPARK-23485. Mesos already has a notion of blacklisting nodes for failing to allocate containers, but it's currently at odds with the task-based blacklist; #20640 is somewhat stalled because blacklisting based on allocation failures is missing in a general sense. In any case, I still think we shouldn't make the code more complex for something other cluster managers might use in the future, and that the current overall organization is fine.
ok sounds fine to me, so we should review as is then
A couple more high-level thoughts:
I did a first pass and mostly pointed out stylistic stuff... I need a second pass to take a closer look at the functionality. Didn't see any red flags though.
@@ -126,7 +126,7 @@ private[scheduler] class BlacklistTracker (
      nodeIdToBlacklistExpiryTime.remove(node)
      listenerBus.post(SparkListenerNodeUnblacklisted(now, node))
    }
    _nodeBlacklist.set(nodeIdToBlacklistExpiryTime.keySet.toSet)
    _nodeBlacklist.set(collection.immutable.Map(nodeIdToBlacklistExpiryTime.toSeq: _*))
Isn't this the same as calling nodeIdToBlacklistExpiryTime.toMap? (That returns an immutable map.)
At the very least, the collection.immutable. part looks unnecessary. The same thing happens below.
@@ -651,8 +651,8 @@ private[spark] class TaskSchedulerImpl (
   * Get a snapshot of the currently blacklisted nodes for the entire application. This is
   * thread-safe -- it can be called without a lock on the TaskScheduler.
   */
  def nodeBlacklist(): scala.collection.immutable.Set[String] = {
    blacklistTrackerOpt.map(_.nodeBlacklist()).getOrElse(scala.collection.immutable.Set())
  def nodeBlacklistWithExpiryTimes(): scala.collection.immutable.Map[String, Long] = {
Why not just Map[String, Long]?
I kinda find it odd when I see these types used this way, so unless there's a good reason...
@@ -170,8 +170,8 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
    if (executorDataMap.contains(executorId)) {
      executorRef.send(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
      context.reply(true)
    } else if (scheduler.nodeBlacklist != null &&
      scheduler.nodeBlacklist.contains(hostname)) {
    } else if (scheduler.nodeBlacklistWithExpiryTimes != null &&
nodeBlacklistWithExpiryTimes is never null right?
Also calling that method twice causes unnecessary computation...
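One way to address both points, sketched against the names in the quoted diff (illustrative only, not the actual fix):

// compute the blacklist once and reuse it; no null check needed
val blacklistedNodes = scheduler.nodeBlacklistWithExpiryTimes
if (blacklistedNodes.contains(hostname)) {
  // reject the executor registration
}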
import org.apache.spark.internal.Logging
import org.apache.spark.util.{Clock, SystemClock}

private[spark] class FailureWithinTimeIntervalTracker(sparkConf: SparkConf) extends Logging {
Add scaladoc explaining what this does?
private[spark] class FailureWithinTimeIntervalTracker(sparkConf: SparkConf) extends Logging {

  private var clock: Clock = new SystemClock
This should be a constructor argument.
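That is, something along these lines (sketch only; Clock and SystemClock are the imports shown in the quoted diff, and the default value is an assumption):

private[spark] class FailureWithinTimeIntervalTracker(
    sparkConf: SparkConf,
    clock: Clock = new SystemClock) extends Logging {
  // ...
}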
private def synchronizeBlacklistedNodeWithYarn(nodesToBlacklist: Set[String]): Unit = {
  // Update blacklist information to YARN ResourceManager for this application,
  // in order to avoid allocating new Containers on the problematic nodes.
  val blacklistAdditions = (nodesToBlacklist -- currentBlacklistedYarnNodes).toList.sorted
additions, removals are just as good names for these variables.
private def removeExpiredYarnBlacklistedNodes() = {
  val now = clock.getTimeMillis()
  allocationBlacklistedNodesWithExpiry.retain {
Doesn't this work?

allocationBlacklistedNodesWithExpiry.retain { case (_, expiry) =>
  ...
}
.doubleConf
.checkValue(weight => weight >= 0 && weight <= 1, "The value of this ratio must be in [0, 1].")
.createWithDefault(0.75)
Too many blank lines.
test("expiring its own blacklisted nodes") { | ||
clock.setTime(0L) | ||
|
||
1 to MAX_FAILED_EXEC_PER_NODE_VALUE foreach { |
(1 to blah).foreach { _ =>
  ...
}
}

test("not handling the expiry of scheduler blacklisted nodes") {
  clock.setTime(0L)
redundant?
Test build #89889 has finished for PR 21068 at commit
Test build #91307 has finished for PR 21068 at commit
hey, sorry, I have been meaning to respond to this but keep getting sidetracked. As Tom and I are going to meet in person next week anyway, I figure at this point it makes sense to just wait till we chat directly to make sure we're on the same page. It sounds like we're in agreement, but at this point we might as well wait a couple more days, as I haven't had a chance to do a final review anyway.
Tom and I had a chance to discuss this in person, and after some back and forth I think we decided that maybe it's best to remove the limit but have the application fail if the entire cluster is blacklisted. @tgravescs does that sound correct? I mentioned this briefly to @attilapiros and he mentioned that might be hard, but instead you could stop allocation blacklisting, which would result in the usual yarn app failure from too many executors. He's going to look at this a little more closely and report back here. I'd be OK with that -- the main goal is just to make sure that an app doesn't hang if you've blacklisted the entire cluster. I'm pretty sure that's @tgravescs's main concern as well. (If the only reasonable way to do that is with the existing limit, I'm fine with that too.)
Test build #91764 has finished for PR 21068 at commit
Test build #91779 has finished for PR 21068 at commit
a couple of minor things, but overall lgtm
@attilapiros can you please test this latest version on a cluster again?
@tgravescs this version will kill the app when the whole cluster is blacklisted, attila found out it was easy to do.
private val defaultTimeout = "1h"

private val blacklistTimeoutMillis =
BlacklistTracker.getBlacklistTimeout(conf)
Ok, then I'll relax the visibility of BlacklistTracker a bit by changing it from private[scheduler] to private[spark].
val endTime = clock.getTimeMillis()
while (executorFailuresValidityInterval > 0 &&
  failedExecutorsWithTimeStamps.nonEmpty &&
  failedExecutorsWithTimeStamps.head < endTime - executorFailuresValidityInterval) {
double indent the condition
failedExecutorsWithTimeStamps.dequeue()
  failedExecutorsWithTimeStamps.nonEmpty &&
  failedExecutorsWithTimeStamps.head < endTime - executorFailuresValidityInterval) {
failedExecutorsWithTimeStamps.dequeue()
but only single indent the body of the while
Test build #91860 has finished for PR 21068 at commit
Retested manually on a cluster. The PR's description is updated with the result.
Test build #91905 has finished for PR 21068 at commit
Test build #91907 has finished for PR 21068 at commit
Jenkins, retest this please
lgtm, will leave open for a couple of days to let @tgravescs take a look
 *
 * <ul>
 * <li> from the scheduler as task level blacklisted nodes
 * <li> from this class (tracked here) as YARN resource allocation problems
This source never touches the scheduler's blacklist right?
Right. It's just the other way around: the scheduler's blacklisted hosts are sent here to be forwarded to YARN, so that they are taken into account during resource allocation.
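A simplified sketch of that forwarding path, based on the synchronizeBlacklistedNodeWithYarn snippet quoted earlier in this review; updateBlacklist is the existing AMRMClient API, while amClient and currentBlacklistedYarnNodes are assumed to be fields of the tracker:

import scala.collection.JavaConverters._

private def synchronizeBlacklistedNodeWithYarn(nodesToBlacklist: Set[String]): Unit = {
  val additions = (nodesToBlacklist -- currentBlacklistedYarnNodes).toList.sorted
  val removals = (currentBlacklistedYarnNodes -- nodesToBlacklist).toList.sorted
  if (additions.nonEmpty || removals.nonEmpty) {
    amClient.updateBlacklist(additions.asJava, removals.asJava)
  }
  currentBlacklistedYarnNodes = nodesToBlacklist
}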
Test build #91920 has finished for PR 21068 at commit
👍 this is going to be very useful
private def updateAllocationBlacklistedNodes(hostname: String): Unit = {
  val failuresOnHost = failureTracker.numFailuresOnHost(hostname)
  if (failuresOnHost > maxFailuresPerHost) {
    logInfo(s"blacklisting $hostname as YARN allocation failed $failuresOnHost times")
maybe logWarn?
would be great if there is a metric on failuresOnHost count...
Thanks, I am happy you consider this change useful.
Regarding logInfo: I chose it to be consistent with the logging of the existing BlacklistTracker, where blacklisting itself is treated as part of normal behaviour and logInfo is used. But if you feel strongly about logWarn I can make the change.
For the metrics, I did a quick search in the yarn module and it seems no metrics are currently coming from there, so the change is probably not just a few lines. What about me creating a new JIRA task for it? Is that fine with you?
yes, exposing metrics is not a bad idea, but I'd like to leave it out of this change
@felixcheung I have started to gain some experience with metrics (as I worked on SPARK-24594), and it seems to me the structure of the metrics (the metric names) must be known and registered before the metrics system is started. So I can add a new metric for ALL the failures, but not for each host separately; with the console sink it would look like:

-- Gauges ----------------------------------------------------------------------
yarn_cluster.executorFailures.ALL
  value = 3

Aggregated values would also be possible. Any idea what would be the most valuable for Spark users given this restriction?
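For what it's worth, a single aggregated gauge could be registered roughly as sketched below; Source and MetricRegistry are the existing Spark/Dropwizard metrics types, while the class name and the numFailedExecutors provider are assumptions made for illustration:

import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.spark.metrics.source.Source

private[spark] class YarnAllocatorSource(numFailedExecutors: () => Int) extends Source {
  override val sourceName: String = "yarn_cluster"
  override val metricRegistry: MetricRegistry = new MetricRegistry()

  // intended to correspond to the "yarn_cluster.executorFailures.ALL" gauge shown above
  metricRegistry.register(MetricRegistry.name("executorFailures", "ALL"), new Gauge[Int] {
    override def getValue: Int = numFailedExecutors()
  })
}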
I was thinking a bit more about this problem and I have an idea: creating metrics for the number of hosts with a predetermined number of executor failures, like yarn_cluster.numHostWithExecutorFailures.x where x is [1, ..., max(10, spark.blacklist.application.maxFailedExecutorsPerNode if blacklisting is enabled, spark.yarn.max.executor.failures if set)]. What is your opinion?
@@ -328,4 +328,10 @@ package object config {
    CACHED_FILES_TYPES,
    CACHED_CONF_ARCHIVE)

  /* YARN allocator-level blacklisting related config entries. */
  private[spark] val YARN_EXECUTOR_LAUNCH_BLACKLIST_ENABLED =
need to document this in docs/running-on-yarn.md
Looks like it was modified to kill the app if all nodes are blacklisted, so I'm good with this approach.
Test build #92031 has finished for PR 21068 at commit
retest this please
Test build #92040 has finished for PR 21068 at commit
Here is the new task for the metrics: https://issues.apache.org/jira/browse/SPARK-24594.
merged to master. Thanks @attilapiros !
What changes were proposed in this pull request?
This change extends YARN resource allocation handling with blacklisting functionality.
This handles cases when a node is messed up or misconfigured such that a container won't launch on it. Before this change, blacklisting only focused on task execution, but this change introduces YarnAllocatorBlacklistTracker, which tracks allocation failures per host (when enabled via "spark.yarn.blacklist.executor.launch.blacklisting.enabled").
How was this patch tested?
With unit tests
Including a new suite: YarnAllocatorBlacklistTrackerSuite.
Manually
It was tested on a cluster by deleting the Spark jars on one of the nodes.
Behaviour before these changes
Starting Spark as:
Log is:
Behaviour after these changes
Starting Spark as:
And the log is:
Where the most important part is:
And execution was continued (no shutdown called).
Testing the blacklisting of the whole cluster
Starting Spark with YARN blacklisting enabled, then removing the Spark core jar one by one from all the cluster nodes. Then executing a simple Spark job, which fails; checking the YARN log, the expected exit status is present: