[SPARK-3627] - [yarn] - fix exit code and final status reporting to RM #2577
Conversation
QA tests have started for PR 2577 at commit
Also note this does change everything to allow YARN to retry. Previously, when it hit the maximum number of executor failures, it didn't retry the AM. I waffled back and forth on this one. At first the thought was that if that many executors are dying it's probably an issue with the user code, but then again if you have a really long running job I can think of situations where you'd want it to retry. Anyone have a strong opinion on that?
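For context, a minimal sketch of the retry mechanism being discussed, assuming standard YARN behavior: the RM only retries an AM attempt that exits without unregistering, so the decision comes down to whether a failed attempt calls unregister. The helper and parameter names below are illustrative, not the PR's exact code.

```scala
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

// Sketch only: unregistering tells the RM this is the application's final answer;
// exiting WITHOUT unregistering lets YARN retry the AM up to its configured
// maximum number of attempts.
def maybeUnregister(finalStatus: FinalApplicationStatus,
                    isLastAttempt: Boolean,
                    unregister: (FinalApplicationStatus, String) => Unit): Unit = {
  if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
    unregister(finalStatus, "final attempt, reporting status to the RM")
  }
  // otherwise exit with a non-zero code and let the RM schedule another attempt
}
```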
QA tests have finished for PR 2577 at commit
Test PASSed.
@@ -450,6 +539,15 @@ object ApplicationMaster extends Logging {

  val SHUTDOWN_HOOK_PRIORITY: Int = 30

  // exit codes for different causes, no reason behind the values
Can we use this class? ExecutorExitCode
The ApplicationMaster is not an executor, so I chose not to use it. It also doesn't have the same exit reasons, which could be useful if the user has an exit code and wants to know what it matches up to.
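For illustration only, a sketch of what AM-specific exit codes in this style could look like; the names and values below are hypothetical and are not the exact constants added by this PR.

```scala
// Hypothetical AM exit codes, one per failure cause; the values carry no meaning
// beyond being distinct and non-zero.
object ApplicationMasterExitCode {
  val SUCCESS = 0
  val UNCAUGHT_EXCEPTION = 10      // user class threw something unexpected
  val MAX_EXECUTOR_FAILURES = 11   // too many executors died
  val SC_NOT_INITIALIZED = 13      // user class never created a SparkContext
}
```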
Looks ok to me, although the exception handling does feel a little paranoid. :-) Just had a few nits.
val sc = sparkContextRef.get()
if (sc != null) {
  logInfo("Invoking sc stop from finish")
  sc.stop()
I'm feeling a little bit weird about this call.
Feels to me like it would be better to do it after the user thread is interrupted and the user thread stops. And since we already have a shutdown hook that takes care of calling it if the user code doesn't, it's already handled.
Is there a particular case you're thinking about here that is not covered by the current code?
I was thinking it would be nicer (as far as cleanup and such goes) to do the sc.stop() before the interrupt, in case the interrupt didn't end up being handled nicely. Note that under normal exit situations this wouldn't be invoked here. It's when something else goes wrong (like max executor failures, etc.).
Is there some condition where you know it's bad to call it?
I'll do a few more tests on it to see what happens in both cases.
I'm just wondering what will be the side-effects on user code if the context is stopped before the code expects it to. In the end everything will fail anyway, but maybe telling the user code to shut down "nicely" first is better?
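A rough sketch of the ordering being debated, assuming a userClassThread field alongside the sparkContextRef shown above (both treated here as parameters): stopping the context first guards against the interrupt not being handled cleanly by the user code.

```scala
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.SparkContext

// Sketch: stop the SparkContext before interrupting the user thread, so cleanup
// still happens even if the user code ignores the interrupt.
def stopThenInterrupt(sparkContextRef: AtomicReference[SparkContext],
                      userClassThread: Thread): Unit = {
  val sc = sparkContextRef.get()
  if (sc != null) {
    sc.stop()                   // clean shutdown of the context first
  }
  if (userClassThread != null) {
    userClassThread.interrupt() // then ask the user class to wind down
  }
}
```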
Thanks for the review @vanzin. I've updated it.
QA tests have started for PR 2577 at commit
QA tests have finished for PR 2577 at commit
Test PASSed.
LGTM. Thanks!
override def unregister(status: FinalApplicationStatus, diagnostics: String = "") = synchronized {
  if (registered) {
    val finishReq = Records.newRecord(classOf[FinishApplicationMasterRequest])
      .asInstanceOf[FinishApplicationMasterRequest]
You probably don't need this cast
This PR didn't change this code, other than wrapping it with an if. It's also going to be deprecated soon, so I don't see a reason to fix it.
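For reference, Records.newRecord is parameterized by its Class argument, so the result is already typed and the cast could simply be dropped; a sketch of the same call without it:

```scala
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterRequest
import org.apache.hadoop.yarn.util.Records

// Records.newRecord(classOf[T]) already returns a T, so no asInstanceOf is needed.
val finishReq = Records.newRecord(classOf[FinishApplicationMasterRequest])
```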
// spark driver should already be up since it launched us, but we don't want to
// wait forever, so wait 100 seconds max to match the cluster mode setting.
// Leave this config unpublished for now.
val numTries = sparkConf.getInt("spark.yarn.ApplicationMaster.client.waitTries", 1000)
This config should use camel case for applicationMaster. Also, there's already a spark.yarn.applicationMaster.waitTries. Does the extra client mean it's for client mode? Do we want a separate setting for client vs deploy modes here?
By the way, there is a mismatch between what is already there (spark.yarn.ApplicationMatser.waitTries) and what we document (spark.yarn.applicationMaster.waitTries). I think this is a bug that we can fix separately.
Yes, the client was tacked on to mean it's used in client mode, because the timing of the loops is different between the modes. It's an internal config right now, so users shouldn't be setting it. The timing is different because in client mode the driver is already up when this is launched, whereas in cluster mode we are launching the user code, which takes some time (tens of seconds).
I'll file a separate JIRA to fix up the mismatch in the doc/config.
Also, it's kind of inconsistent to use applicationMaster.client.waitTries for client mode but applicationMaster.waitTries for cluster mode, and the existing documentation for the latter makes no mention of cluster mode even though it's only used there. It's fine to keep the client config here, but we should make the other one applicationMaster.cluster.waitTries in a future JIRA and deprecate the less specific one.
OK, for this PR I'll leave it as applicationMaster.waitTries to match cluster mode, and I'll file a separate JIRA to clean it up. The documentation doesn't state how long each loop is, for example. I think these would be better changed to a wait time rather than a number of tries, and then they could be used for both modes.
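A minimal sketch of that direction, assuming a hypothetical time-based config (the key name and default below are made up for illustration) instead of a try count:

```scala
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: wait for the SparkContext for a bounded amount of time rather than a
// fixed number of tries, so the same setting can serve client and cluster mode.
def waitForSparkContext(sparkConf: SparkConf,
                        sparkContextRef: AtomicReference[SparkContext]): Unit = {
  val totalWaitMs = sparkConf.getLong("spark.yarn.applicationMaster.waitTime", 100000L)
  val deadline = System.currentTimeMillis() + totalWaitMs
  while (sparkContextRef.get() == null && System.currentTimeMillis() < deadline) {
    Thread.sleep(100)  // poll until the context shows up or the deadline passes
  }
}
```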
Addressed all the review comments.
QA tests have started for PR 2577 at commit
QA tests have finished for PR 2577 at commit
Test FAILed.
QA tests have started for PR 2577 at commit
QA tests have finished for PR 2577 at commit
Test PASSed.
// spark driver should already be up since it launched us, but we don't want to
// wait forever, so wait 100 seconds max to match the cluster mode setting.
// Leave this config unpublished for now.
minor, but can you add SPARK-3779 to the comment so others know we're tracking this issue?
Hey @tgravescs this LGTM pending a few minor comments.
QA tests have started for PR 2577 at commit
QA tests have finished for PR 2577 at commit
Test PASSed.
LGTM, feel free to merge it.
Thanks @andrewor14. I've merged this into 1.2 |
} catch {
  case e: InvocationTargetException =>
    e.getCause match {
      case _: InterruptedException =>
        // Reporter thread can interrupt to stop user class

      case e => throw e
      case e: Exception =>
I'm curious, should this be Throwable? If my application throws an uncaught Error, shouldn't that also result in FAILED, and would it (still) do so with this change? P.S. my Scala is not that strong.
This was changed in a subsequent PR. Check the current code.
See the description and what's handled in the JIRA comment: https://issues.apache.org/jira/browse/SPARK-3627?focusedCommentId=14150013&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14150013
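To illustrate the distinction being asked about, here is a hedged sketch (not the current code): matching Throwable in the last case also turns uncaught Errors into a FAILED status, whereas case e: Exception would let an Error escape the handler. The finish callback and exit code 15 below are placeholders.

```scala
import java.lang.reflect.{InvocationTargetException, Method}
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus

// Sketch: `finish` stands in for whatever reports the final status to the RM.
def runUserClass(mainMethod: Method, args: Array[String],
                 finish: (FinalApplicationStatus, Int, String) => Unit): Unit = {
  try {
    mainMethod.invoke(null, args)
    finish(FinalApplicationStatus.SUCCEEDED, 0, "user class exited normally")
  } catch {
    case e: InvocationTargetException =>
      e.getCause match {
        case _: InterruptedException =>
          // Reporter thread can interrupt to stop the user class; not a failure
        case cause: Throwable =>
          // Throwable (not just Exception) so uncaught Errors also report FAILED
          finish(FinalApplicationStatus.FAILED, 15, "User class threw: " + cause)
      }
  }
}
```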
This does not handle yarn client mode reporting of the driver to the AM. I think that should be handled when we make it an unmanaged AM.