[SPARK-7503][YARN] Resources in .sparkStaging directory can't be cleaned up on error #6026
Conversation
Merged build triggered.
Merged build started.
Test build #32299 has started for PR 6026 at commit
localResources = prepareLocalResources(appStagingDir)
} catch {
  case e: Throwable =>
    var stagingDirPath: Path = null
I don't think this variable is needed? appStagingDir can be used everywhere except where it needs to be wrapped for FileSystem. Can localResources remain a val and receive the value of the try block?
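A minimal sketch of what that suggestion could look like, assuming prepareLocalResources, appStagingDir, and hadoopConf are in scope as in the surrounding Client code (this is an illustration, not the actual patch):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: prepareLocalResources, appStagingDir and hadoopConf are assumed
// to be in scope, as in the surrounding Client code shown in the diff.
val localResources = try {
  prepareLocalResources(appStagingDir)
} catch {
  case e: Throwable =>
    // Best-effort cleanup of the partially populated staging directory,
    // then rethrow so the caller still sees the original failure.
    val fs = FileSystem.get(hadoopConf)
    fs.delete(new Path(appStagingDir), true)
    throw e
}
```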
To be more specific, you are pointing out that the staging dir is set up before the AM runs. The AM also does this cleanup, but if it fails to start, nothing cleans this up. I agree, but this is a band-aid that doesn't catch most error cases and would still leave this dir lying around. Can the staging files be set up much later in this initialization? That would greatly narrow the problem.
Test build #32299 has finished for PR 6026 at commit
Merged build finished. Test FAILed.
Test FAILed.
Thank you for your comment, @srowen.
At least, can it be one of the last things that happens before starting? That would tighten this up.
How about we just wrap submitApplication in a try/catch block and check whether we need to clean up the staging directory if it fails? That way it should cover more failure scenarios. You have to do certain things in a certain order to submit: we create the launch context (which includes preparing the resources), create the submission context, and submit to YARN. Personally I like having the resources set up before setting them in the context.
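A rough sketch of that approach, assuming a hypothetical cleanupStagingDir helper plus the usual YARN client submission steps; the method and field names here (createContainerLaunchContext, createApplicationSubmissionContext, yarnClient) are illustrative, not the exact patch:

```scala
import org.apache.hadoop.yarn.api.records.ApplicationId

// Sketch: wrap the whole submission path and remove the staging directory
// if anything fails. cleanupStagingDir, createContainerLaunchContext,
// createApplicationSubmissionContext and yarnClient are assumed members.
def submitApplication(): ApplicationId = {
  var appId: ApplicationId = null
  try {
    val newApp = yarnClient.createApplication()
    appId = newApp.getNewApplicationResponse.getApplicationId
    val containerContext = createContainerLaunchContext(newApp.getNewApplicationResponse)
    val appContext = createApplicationSubmissionContext(newApp, containerContext)
    yarnClient.submitApplication(appContext)
    appId
  } catch {
    case e: Throwable =>
      // If resources may already have been uploaded for this application,
      // remove them before rethrowing.
      if (appId != null) {
        cleanupStagingDir(appId)
      }
      throw e
  }
}
```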
+1
Thanks for all of your advice.
That requires the launcher to still be around for this to be fixed, which is not guaranteed (the user can just ctrl-c out of the launcher; it's not required after the app starts). If the launcher is still around, though, it can probably just do a final check for the presence of the staging dir after the app finishes, regardless of final status.
…uploaded-resources-on-error
O.K., I'll wrap
…lete resources on error
Merged build triggered.
Merged build started.
Test build #32570 has started for PR 6026 at commit
Test build #32570 has finished for PR 6026 at commit
Merged build finished. Test FAILed.
Test FAILed.
Merged build triggered.
Merged build started.
Test build #32571 has started for PR 6026 at commit
Test build #32571 has finished for PR 6026 at commit
Merged build finished. Test FAILed.
Test FAILed.
retest this please.
Merged build triggered.
Merged build started.
Test build #32573 has started for PR 6026 at commit
Test build #32573 has finished for PR 6026 at commit
Merged build finished. Test PASSed.
Test PASSed.
appId
} catch {
  case e: Throwable =>
    if (appId != null) {
I suppose there's no great way to share this with the similar code in ApplicationMaster.scala? Maybe not. This could be made into a private method, as it is in the other similar code block, but it's not a big deal. LGTM.
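For illustration, such a private helper might look roughly like the following; stagingBaseDir, hadoopConf, and the logging calls are assumptions about the surrounding class, not the actual code in Client.scala or ApplicationMaster.scala:

```scala
import java.io.IOException
import org.apache.hadoop.fs.Path
import org.apache.hadoop.yarn.api.records.ApplicationId

// Hypothetical helper: stagingBaseDir, hadoopConf, logInfo and logWarning are
// assumed from the surrounding class; the real code may differ.
private def cleanupStagingDir(appId: ApplicationId): Unit = {
  val stagingDirPath = new Path(stagingBaseDir, ".sparkStaging/" + appId.toString)
  try {
    val fs = stagingDirPath.getFileSystem(hadoopConf)
    if (fs.exists(stagingDirPath)) {
      logInfo("Deleting staging directory " + stagingDirPath)
      fs.delete(stagingDirPath, true)
    }
  } catch {
    case e: IOException =>
      logWarning("Failed to clean up staging directory " + stagingDirPath, e)
  }
}
```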
…aned up on error

When we run applications on YARN with cluster mode, uploaded resources on .sparkStaging directory can't be cleaned up in case of failure of uploading local resources. You can see this issue by running following command.

```
bin/spark-submit --master yarn --deploy-mode cluster --class <someClassName> <non-existing-jar>
```

Author: Kousuke Saruta <[email protected]>

Closes #6026 from sarutak/delete-uploaded-resources-on-error and squashes the following commits:

caef9f4 [Kousuke Saruta] Fixed style
882f921 [Kousuke Saruta] Wrapped Client#submitApplication with try/catch blocks in order to delete resources on error
1786ca4 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into delete-uploaded-resources-on-error
f61071b [Kousuke Saruta] Fixed cleanup problem

(cherry picked from commit c64ff80)
Signed-off-by: Sean Owen <[email protected]>
When we run applications on YARN in cluster mode, uploaded resources in the .sparkStaging directory can't be cleaned up if uploading the local resources fails.
You can see this issue by running the following command.
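The reproduction command, as given in the merge commit message above (`<someClassName>` and `<non-existing-jar>` are placeholders from the source):

```
bin/spark-submit --master yarn --deploy-mode cluster --class <someClassName> <non-existing-jar>
```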