[SPARK-1978] In some cases, spark-yarn does not automatically restart the failed container #921

Closed
wants to merge 10 commits into from

witgo
Contributor

@witgo witgo commented May 30, 2014

No description provided.

@witgo witgo changed the title In some cases, yarn does not automatically restart the container [WIP] In some cases, yarn does not automatically restart the container May 30, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@sryza
Contributor

sryza commented May 30, 2014

This is already handled in ExecutorLauncher.launchReporterThread and ApplicationMaster.launchReporterThread, no?

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15306/

@witgo witgo changed the title [WIP] In some cases, yarn does not automatically restart the container In some cases, yarn does not automatically restart the container May 31, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@witgo
Contributor Author

witgo commented May 31, 2014

@sryza
When yarnAllocator.getNumExecutorsFailed returns a value greater than zero, yarnAllocator.getNumExecutorsRunning < args.numExecutors stays true forever, because the failed executors are never requested again.
That is to say, in this case the waiting loop only exits once userThread.isAlive or !driverClosed becomes false, and only then will ExecutorLauncher.launchReporterThread or ApplicationMaster.launchReporterThread execute.
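
The stall described above can be reproduced with a small stand-in. This is only a sketch under stated assumptions: `FakeAllocator` is a hypothetical in-memory class, not Spark's real YARN allocator, and `maxIterations` is an invented bound that plays the role of `userThread.isAlive` / `!driverClosed` so the demonstration terminates.

```scala
object StalledAllocationSketch {

  // Hypothetical stand-in for the YARN allocator (NOT Spark's real class).
  // The first `failuresLeft` granted containers are counted as failed.
  final class FakeAllocator(private var failuresLeft: Int) {
    private var pending = 0
    private var running = 0
    private var failed = 0

    def getNumExecutorsRunning: Int = running
    def getNumExecutorsFailed: Int = failed
    def addResourceRequests(n: Int): Unit = pending += n

    // Grants at most one pending request per call.
    def allocateResources(): Unit =
      if (pending > 0) {
        pending -= 1
        if (failuresLeft > 0) { failuresLeft -= 1; failed += 1 }
        else running += 1
      }
  }

  // Requests executors once, then only polls. Returns the number of loop
  // iterations used, or None if the target was not reached within
  // maxIterations (the assumed stand-in for the driver/user thread ending).
  def waitForExecutors(numExecutors: Int, failures: Int, maxIterations: Int): Option[Int] = {
    val allocator = new FakeAllocator(failures)
    allocator.addResourceRequests(numExecutors)
    var iterations = 0
    while (allocator.getNumExecutorsRunning < numExecutors && iterations < maxIterations) {
      allocator.allocateResources()
      iterations += 1
    }
    if (allocator.getNumExecutorsRunning == numExecutors) Some(iterations) else None
  }
}
```

With no failures, `StalledAllocationSketch.waitForExecutors(3, 0, 100)` reaches the target; with a single failure, `waitForExecutors(3, 1, 100)` returns `None`, because nothing re-requests the lost container and the running count can never reach `numExecutors`.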

@witgo witgo changed the title In some cases, yarn does not automatically restart the container In some cases, spark-yarn does not automatically restart the container May 31, 2014
@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15314/

@witgo witgo changed the title In some cases, spark-yarn does not automatically restart the container In some cases, spark-yarn does not automatically restart the failed container May 31, 2014
@witgo witgo changed the title In some cases, spark-yarn does not automatically restart the failed container [SPARK-1978] In some cases, spark-yarn does not automatically restart the failed container May 31, 2014
@@ -256,14 +256,22 @@ class ApplicationMaster(args: ApplicationMasterArguments, conf: Configuration,
// TODO: Handle container failure
Contributor

I believe this is what the TODO refers to, so you can remove that TODO.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15476/

yarnAllocator.addResourceRequests(args.numExecutors)
while ((yarnAllocator.getNumExecutorsRunning < args.numExecutors) && (!driverClosed)) {
yarnAllocator.allocateResources()
allocateMissingExecutor()
Contributor

Similarly here, can you move this up above allocateResources?

Contributor

Can you change this to use similar logic: allocate outside the loop, then inside the loop add the missing executors and then allocate?
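
The ordering requested here might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: `FakeAllocator`, `getNumPendingAllocate`, and `maxIterations` are hypothetical, and the body of `allocateMissingExecutor` is not shown in this thread, so the version here (target minus running minus pending) is a guess at its intent.

```scala
object ReRequestSketch {

  // Hypothetical stand-in for the YARN allocator (NOT Spark's real class).
  // The first `failuresLeft` granted containers are lost instead of running.
  final class FakeAllocator(private var failuresLeft: Int) {
    private var pending = 0
    private var running = 0

    def getNumExecutorsRunning: Int = running
    def getNumPendingAllocate: Int = pending // assumed accessor
    def addResourceRequests(n: Int): Unit = pending += n

    // Grants at most one pending request per call.
    def allocateResources(): Unit =
      if (pending > 0) {
        pending -= 1
        if (failuresLeft > 0) failuresLeft -= 1 else running += 1
      }
  }

  // Assumed behaviour: re-request whatever is neither running nor pending.
  def allocateMissingExecutor(allocator: FakeAllocator, numExecutors: Int): Unit = {
    val missing =
      numExecutors - allocator.getNumExecutorsRunning - allocator.getNumPendingAllocate
    if (missing > 0) allocator.addResourceRequests(missing)
  }

  // Returns the loop iterations used, or None if the target was not reached
  // within maxIterations (an invented bound so the sketch always terminates).
  def waitForExecutors(numExecutors: Int, failures: Int, maxIterations: Int): Option[Int] = {
    val allocator = new FakeAllocator(failures)
    // Initial request and allocation happen once, outside the loop.
    allocator.addResourceRequests(numExecutors)
    allocator.allocateResources()
    var iterations = 0
    while (allocator.getNumExecutorsRunning < numExecutors && iterations < maxIterations) {
      allocateMissingExecutor(allocator, numExecutors) // add missing first
      allocator.allocateResources()                    // then allocate
      iterations += 1
    }
    if (allocator.getNumExecutorsRunning == numExecutors) Some(iterations) else None
  }
}
```

In this model a lost container is re-requested on the next pass, so `ReRequestSketch.waitForExecutors(3, 1, 100)` reaches the target instead of spinning until the bound is hit.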

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15551/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15568/

@tgravescs
Contributor

Thanks @witgo. If you can change the order of the logic in the ExecutorLauncher to match, this looks good.

@AmplabJenkins

Merged build triggered.

@witgo
Contributor Author

witgo commented Jun 10, 2014

Done

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15593/

@tgravescs
Contributor

Looks good, +1. Thanks @witgo

@asfgit asfgit closed this in 884ca71 Jun 10, 2014
@witgo witgo deleted the allocateExecutors branch June 10, 2014 15:38
asfgit pushed a commit that referenced this pull request Jun 10, 2014
… the failed container

Author: witgo <[email protected]>

Closes #921 from witgo/allocateExecutors and squashes the following commits:

bc3aa66 [witgo] review commit
8800eba [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors
32ac7af [witgo] review commit
056b8c7 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors
04c6f7e [witgo] Merge branch 'master' into allocateExecutors
aff827c [witgo] review commit
5c376e0 [witgo] Merge branch 'master' of https://github.com/apache/spark into allocateExecutors
1faf4f4 [witgo] Merge branch 'master' into allocateExecutors
3c464bd [witgo] add time limit to allocateExecutors
e00b656 [witgo] In some cases, yarn does not automatically restart the container
@tgravescs
Contributor

I merged this into branch-1.0 also

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024