Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7451][YARN] Preemption of executors is counted as failure causing Spark job to fail #5993

Closed
wants to merge 2 commits into from

Conversation

ashwinshankar77
Copy link

Added a check to handle container exit status for the preemption scenario, log an INFO message in such cases and move on.
@andrewor14

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@@ -373,7 +373,11 @@ private[yarn] class YarnAllocator(
// Hadoop 2.2.X added a ContainerExitStatus we should switch to use
// there are some exit status' we shouldn't necessarily count against us, but for
// now I think its ok as none of the containers are expected to exit
if (completedContainer.getExitStatus == -103) { // vmem limit exceeded
if (completedContainer.getExitStatus == ContainerExitStatus.PREEMPTED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this constant exist all the way back to Hadoop 2.2? IIRC preemption was added later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the exit status has it back in 2.1.1-beta

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanzin yes its been there since 2.1.1-beta. Here is the JIRA which added that constant: https://issues.apache.org/jira/browse/YARN-1049

@tgravescs
Copy link
Contributor

Note that the AllocateResponse also contains PreemptionMessage which is list of containers RM is asking for back. AM could potentially use this to decide which containers to kill. That can be future enhancement though.

I think this looks good other then @sryza comments.

@ashwinshankar77
Copy link
Author

Thanks all for reviewing !
@sryza Addressed your comment in the latest patch.
@tgravescs Yes, one use case on top of my mind is try to avoid preempting containers
which has RDD's cached. However, I believe the current implementation in the scheduler
doesn't take inputs from AM for preemption and that the list of containers in PreemptionMessage
is more of a notification to AM, no ?

@tgravescs
Copy link
Contributor

Yeah I'm sure it would require more plumbing hence a future enhancement if we decide its useful.

@sryza
Copy link
Contributor

sryza commented May 8, 2015

LGTM

@andrewor14
Copy link
Contributor

Looks like we never ran tests for this. Jenkins test this please

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented May 8, 2015

Test build #32272 has started for PR 5993 at commit 90900cf.

@andrewor14
Copy link
Contributor

LGTM2

@SparkQA
Copy link

SparkQA commented May 8, 2015

Test build #32272 has finished for PR 5993 at commit 90900cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32272/
Test PASSed.

@sryza
Copy link
Contributor

sryza commented May 8, 2015

Committing this

asfgit pushed a commit that referenced this pull request May 9, 2015
…sing Spark job to fail

Added a check to handle container exit status for the preemption scenario, log an INFO message in such cases and move on.
andrewor14

Author: Ashwin Shankar <[email protected]>

Closes #5993 from ashwinshankar77/SPARK-7451 and squashes the following commits:

90900cf [Ashwin Shankar] Fix log info message
cf8b6cf [Ashwin Shankar] Stop counting preemption of executors as failure
(cherry picked from commit b6c797b)

Signed-off-by: Sandy Ryza <[email protected]>
@asfgit asfgit closed this in b6c797b May 9, 2015
@ashwinshankar77
Copy link
Author

Thanks for committing this, Sandy !

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
…sing Spark job to fail

Added a check to handle container exit status for the preemption scenario, log an INFO message in such cases and move on.
andrewor14

Author: Ashwin Shankar <[email protected]>

Closes apache#5993 from ashwinshankar77/SPARK-7451 and squashes the following commits:

90900cf [Ashwin Shankar] Fix log info message
cf8b6cf [Ashwin Shankar] Stop counting preemption of executors as failure
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
…sing Spark job to fail

Added a check to handle container exit status for the preemption scenario, log an INFO message in such cases and move on.
andrewor14

Author: Ashwin Shankar <[email protected]>

Closes apache#5993 from ashwinshankar77/SPARK-7451 and squashes the following commits:

90900cf [Ashwin Shankar] Fix log info message
cf8b6cf [Ashwin Shankar] Stop counting preemption of executors as failure
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…sing Spark job to fail

Added a check to handle container exit status for the preemption scenario, log an INFO message in such cases and move on.
andrewor14

Author: Ashwin Shankar <[email protected]>

Closes apache#5993 from ashwinshankar77/SPARK-7451 and squashes the following commits:

90900cf [Ashwin Shankar] Fix log info message
cf8b6cf [Ashwin Shankar] Stop counting preemption of executors as failure
mingyukim pushed a commit to palantir/spark that referenced this pull request Jul 10, 2015
…sing Spark job to fail

Added a check to handle container exit status for the preemption scenario, log an INFO message in such cases and move on.
andrewor14

Author: Ashwin Shankar <[email protected]>

Closes apache#5993 from ashwinshankar77/SPARK-7451 and squashes the following commits:

90900cf [Ashwin Shankar] Fix log info message
cf8b6cf [Ashwin Shankar] Stop counting preemption of executors as failure
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants