
[SPARK-7503][YARN] Resources in .sparkStaging directory can't be cleaned up on error #6026

Closed

Conversation

@sarutak (Member) commented May 9, 2015

When we run applications on YARN in cluster mode, resources uploaded to the .sparkStaging directory are not cleaned up if uploading the local resources fails.

You can see this issue by running the following command:

bin/spark-submit --master yarn --deploy-mode cluster --class <someClassName> <non-existing-jar>
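The leak can be sketched in plain Scala. This is a hypothetical illustration of the pre-fix control flow, with `java.nio.file` standing in for Hadoop's FileSystem and `uploadResources` as an invented stand-in, not Spark's actual code: the staging directory is created first, and if a copy then fails (e.g. the jar does not exist), nothing deletes the directory.

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, Path, Paths}

// Hypothetical stand-in for Client#prepareLocalResources: create the staging
// directory, then copy each resource into it. A missing resource throws after
// the directory already exists, and no code path cleans it up.
def uploadResources(stagingDir: Path, jars: Seq[String]): Unit = {
  Files.createDirectories(stagingDir)
  for (jar <- jars) {
    val src = Paths.get(jar)
    if (!Files.exists(src)) {
      throw new FileNotFoundException(s"$jar does not exist")
    }
    Files.copy(src, stagingDir.resolve(src.getFileName))
  }
}

val staging = Files.createTempDirectory("sparkStaging-demo").resolve("app")
val failed =
  try { uploadResources(staging, Seq("non-existing.jar")); false }
  catch { case _: FileNotFoundException => true }
println(s"upload failed: $failed, staging dir left behind: ${Files.exists(staging)}")
```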

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@SparkQA commented May 9, 2015

Test build #32299 has started for PR 6026 at commit f61071b.

localResources = prepareLocalResources(appStagingDir)
} catch {
  case e: Throwable =>
    var stagingDirPath: Path = null
Member (inline review comment):

I don't think this variable is needed? appStagingDir can be used everywhere except where it needs to be wrapped for FileSystem. Can localResources remain a val and receive the value of the try block?
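The reviewer's suggestion can be sketched as follows. Since `try` is an expression in Scala, `localResources` can stay a `val` and receive the try block's value, with cleanup happening in the catch before rethrowing. `prepareLocalResources` and `cleanupStagingDir` below are illustrative stand-ins, not Spark's actual code:

```scala
// Hypothetical stand-ins for the real Client internals.
var cleanedUp = false
def cleanupStagingDir(): Unit = { cleanedUp = true }
def prepareLocalResources(stagingDir: String): Map[String, String] =
  Map("app.jar" -> s"$stagingDir/app.jar")

val appStagingDir = ".sparkStaging/app_0001"

// `try` yields a value, so no mutable `var localResources` is needed.
val localResources: Map[String, String] =
  try {
    prepareLocalResources(appStagingDir)
  } catch {
    case e: Throwable =>
      cleanupStagingDir() // delete the staging dir, then propagate the error
      throw e
  }
```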

@srowen (Member) commented May 9, 2015

To be more specific, you are pointing out that the staging dir is set up before the AM runs. The AM also does this cleanup, but if it fails to start, nothing cleans this up.

I agree, but this is a band-aid that doesn't catch most error cases and would still leave this dir lying around.

Can the staging files be set up much later in this initialization? That would greatly narrow the problem.

@SparkQA commented May 9, 2015

Test build #32299 has finished for PR 6026 at commit f61071b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins: Merged build finished. Test FAILed.

@AmplabJenkins: Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32299/

@sarutak (Member, Author) commented May 10, 2015

Thank you for your comment, @srowen.
Yeah, as you mentioned, the staging dir is not cleaned up when the ApplicationMaster fails to start.
But I think it's difficult to set up local resources after we know that the ApplicationMaster has started.
I'll investigate another way.

@srowen (Member) commented May 10, 2015

At least, can it be one of the last things that happens before starting? That would tighten this up.

@tgravescs (Contributor):

How about we just wrap submitApplication in a try/catch block and check whether we need to clean up the staging directory if it fails? That way it should cover more failure scenarios.

You have to do certain things in a certain order to submit: we create the launch context (which includes preparing the resources), create the submission context, and submit to YARN. Personally, I like having the resources set up before setting them in the context.
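This suggestion (the approach ultimately adopted in the PR) can be sketched as follows. Wrapping the whole submission in try/catch covers every step: context creation, resource upload, and the submission itself. All names below are illustrative stand-ins, not Spark's actual Client internals:

```scala
// Hypothetical stand-ins: the "staging dir" is just a flag here, and the
// YARN submission is simulated to fail.
var stagingDeleted = false
def deleteStagingDir(): Unit = { stagingDeleted = true }
def createContainerLaunchContext(): String = "launch-context" // includes preparing resources
def submitToYarn(ctx: String): String =
  throw new RuntimeException("RM rejected the submission") // simulated failure

def submitApplication(): String =
  try {
    val ctx = createContainerLaunchContext()
    submitToYarn(ctx) // would return the application id on success
  } catch {
    case e: Throwable =>
      deleteStagingDir() // clean .sparkStaging before propagating the error
      throw e
  }

val submissionFailed =
  try { submitApplication(); false } catch { case _: RuntimeException => true }
```

Because the catch sits around the whole of `submitApplication`, a failure at any step triggers the cleanup, not just a failure during resource upload.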

@vanzin (Contributor) commented May 12, 2015

How about we just wrap submitApplication in a try/catch block and check whether we need to clean up the staging directory if it fails?

+1

@sarutak (Member, Author) commented May 12, 2015

Thanks for all of your advice.
Yeah, wrapping submitApplication in a try/catch block is a good idea.
But what if the last app attempt fails after submitApplication returns successfully but before the ApplicationMaster starts?
Should we check the host name of the last ApplicationMaster via ApplicationReport#getHost? In that case, the host name should be "N/A".

@vanzin (Contributor) commented May 12, 2015

But what if the last app attempt fails after submitApplication returns successfully but before the ApplicationMaster starts?

That requires the launcher to still be around to be fixed, which is not guaranteed (the user can just ctrl-c out of the launcher; it's not required after the app starts).

If the launcher is still around, though, it can probably just do a final check for the presence of the staging dir after the app finishes, regardless of final status.
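That fallback can be sketched like this: if the launcher is still running when the application reaches a final state, it sweeps up any leftover staging directory regardless of whether the app succeeded or failed. The types and function names here are illustrative only, not Spark's actual API:

```scala
// Hypothetical model of the application's terminal state.
sealed trait FinalStatus
case object Succeeded extends FinalStatus
case object Failed extends FinalStatus

// Hypothetical launcher-side monitor: after the app finishes, do a final
// check for the staging dir and delete it, whatever the final status was.
def monitorApplication(finalStatus: FinalStatus,
                       stagingDirExists: () => Boolean,
                       deleteStagingDir: () => Unit): FinalStatus = {
  // ...poll YARN until the application reaches a final state...
  if (stagingDirExists()) deleteStagingDir() // final sweep, any status
  finalStatus
}

var swept = false
val status = monitorApplication(Failed, () => true, () => swept = true)
```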

@sarutak (Member, Author) commented May 13, 2015

OK, I'll wrap submitApplication in a try/catch block.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@SparkQA commented May 13, 2015

Test build #32570 has started for PR 6026 at commit 882f921.

@SparkQA commented May 13, 2015

Test build #32570 has finished for PR 6026 at commit 882f921.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • final class OneVsRest extends Estimator[OneVsRestModel] with OneVsRestParams
    • decisionTreeCode = '''class DecisionTreeParams(Params):
    • class DecisionTreeParams(Params):
    • class LinearRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class LinearRegressionModel(JavaModel):
    • class TreeRegressorParams(object):
    • class RandomForestParams(object):
    • class GBTParams(object):
    • class DecisionTreeRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol,
    • class DecisionTreeRegressionModel(JavaModel):
    • class RandomForestRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasSeed,
    • class RandomForestRegressionModel(JavaModel):
    • class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
    • class GBTRegressionModel(JavaModel):
    • s"FileOutputCommitter or its subclass is expected, but got a $
    • trait FSBasedRelationProvider
    • abstract class OutputWriter

@AmplabJenkins: Merged build finished. Test FAILed.

@AmplabJenkins: Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32570/

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@SparkQA commented May 13, 2015

Test build #32571 has started for PR 6026 at commit caef9f4.

@SparkQA commented May 13, 2015

Test build #32571 has finished for PR 6026 at commit caef9f4.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins: Merged build finished. Test FAILed.

@AmplabJenkins: Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32571/

@sarutak (Member, Author) commented May 13, 2015

retest this please.

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@SparkQA commented May 13, 2015

Test build #32573 has started for PR 6026 at commit caef9f4.

@SparkQA commented May 13, 2015

Test build #32573 has finished for PR 6026 at commit caef9f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32573/

appId
} catch {
  case e: Throwable =>
    if (appId != null) {
Member (inline review comment):

I suppose there's no great way to share this with similar code in ApplicationMaster.scala? Maybe not. This could be made into a private method, as it is in the other similar code block, but it's not a big deal. LGTM.
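A factored-out helper along the lines the reviewer mentions might look like the sketch below. `java.nio.file` stands in for Hadoop's FileSystem so the sketch is self-contained (Spark's real code uses the Hadoop API), and the cleanup is best-effort so it never masks the original error:

```scala
import java.nio.file.{Files, Path}
import java.util.Comparator
import scala.util.control.NonFatal

// Hypothetical shared helper: delete the staging directory recursively,
// swallowing non-fatal errors so the original failure still propagates.
def cleanupStagingDir(stagingDir: Path): Unit =
  try {
    if (Files.exists(stagingDir)) {
      // Reverse order deletes children before their parent directories.
      Files.walk(stagingDir).sorted(Comparator.reverseOrder[Path]())
        .forEach(p => Files.delete(p))
    }
  } catch {
    case NonFatal(e) =>
      Console.err.println(s"Failed to clean up staging dir $stagingDir: $e")
  }

val demoDir = Files.createTempDirectory("sparkStaging-demo")
Files.createFile(demoDir.resolve("app.jar"))
cleanupStagingDir(demoDir)
```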

@asfgit asfgit closed this in c64ff80 May 15, 2015
asfgit pushed a commit that referenced this pull request May 15, 2015
[SPARK-7503][YARN] Resources in .sparkStaging directory can't be cleaned up on error

When we run applications on YARN with cluster mode, uploaded resources on .sparkStaging directory can't be cleaned up in case of failure of uploading local resources.

You can see this issue by running following command.
```
bin/spark-submit --master yarn --deploy-mode cluster --class <someClassName> <non-existing-jar>
```

Author: Kousuke Saruta <[email protected]>

Closes #6026 from sarutak/delete-uploaded-resources-on-error and squashes the following commits:

caef9f4 [Kousuke Saruta] Fixed style
882f921 [Kousuke Saruta] Wrapped Client#submitApplication with try/catch blocks in order to delete resources on error
1786ca4 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into delete-uploaded-resources-on-error
f61071b [Kousuke Saruta] Fixed cleanup problem

(cherry picked from commit c64ff80)
Signed-off-by: Sean Owen <[email protected]>
@sarutak sarutak deleted the delete-uploaded-resources-on-error branch May 18, 2015 01:52
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015