BTL vader (and sm) crash when filesystem of session directory is too small #4553
iirc, a similar issue was previously reported.
I had reported this as issue #3251; maybe that one should be reopened and merged.
There were multiple paths that could lead to a fast box allocation. One of them made little sense (in-place send) so it has been removed to allow a rework of the fast-box send function. This should fix a number of issues with hanging/crashing when using the vader btl. References open-mpi#4260, open-mpi#4553. Signed-off-by: Nathan Hjelm <[email protected]>
Ack, oops. #4569 referenced the wrong issue. No effect on this bug.
It might be worth just moving the backing files to /dev/shm and being done with it. The benefit of the session directory is that the orted will blow it away for us if something goes wrong. There is nothing similar for /dev/shm (psm2 leaves garbage there if you ctrl-C an MPI app).
Do we have an MCA param that allows you to put the shmem files somewhere? That would be a better solution. The auto-cleanup of the session directory is a Good Thing for exactly this reason (i.e., removing shared memory files upon crash). Always putting the shared memory files in |
Or perhaps we could have a way for the proc to "register" its shmem backing file with the orted so it gets cleaned up? Would be trivial to do. I realize it leaves a little race condition, but perhaps better than nothing?
+1 on @rhc54's suggestion! As far as I am concerned, Open MPI auto-cleanup should be considered best effort; another (non-portable?) option is to always have
The orted already does that - gets passed in the PMIx download during pmix.init. Only problem right now is that the shared memory file size forces vader to put the backing file somewhere other than under the orted location, hence the idea of registering it. Alternatively, since the orted inits opal anyway, I suppose we could have the orted open/query the BTLs to get any size requirement and use that in determining locations. My preference would be to simply have the proc register locations with the orted for cleanup - keeps separations a little cleaner.
My bad - just realized you meant something other than the temporary directory location. I now grok what you mean, but am not sure how the orted would know where to put the tmp file. In this case, vader needs more space than the orted realizes. Doing the query might solve that problem - but I would still opt for the registration approach.
I'm quite sure we've talked about the registration idea before -- I think we didn't do it previously because:
Yeah, direct launch is a problem that may be a short-term problem. However, as PMIx continues to evolve and get rolled out, there is no reason why the registration wouldn't carry over into the direct launch scenario. We could have the PMIx server track the requests and do the cleanup when it sees the proc exit.
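To make the registration idea discussed above concrete, here is a minimal sketch in C. It is not Open MPI or PMIx code: `register_cleanup_path()` and `cleanup_registered_paths()` are hypothetical names, and the fixed-size in-process table stands in for whatever bookkeeping the orted or PMIx server would actually keep. It assumes a process reports each backing file path once, and that the daemon unlinks everything it was told about when it sees the process exit.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() */

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical fixed-size registry; a real daemon would track paths per process. */
#define MAX_REGISTERED 64

static char *registered[MAX_REGISTERED];
static int num_registered = 0;

/* Record a path for later cleanup.
 * Returns 0 on success, -1 if the table is full or memory is exhausted. */
int register_cleanup_path(const char *path)
{
    if (num_registered >= MAX_REGISTERED) {
        return -1;
    }
    char *copy = strdup(path);
    if (copy == NULL) {
        return -1;
    }
    registered[num_registered++] = copy;
    return 0;
}

/* Unlink every registered path and empty the table.
 * Returns the number of files actually removed; paths that are already
 * gone (for example, removed by the process itself) are skipped silently. */
int cleanup_registered_paths(void)
{
    int removed = 0;
    for (int i = 0; i < num_registered; i++) {
        if (unlink(registered[i]) == 0) {
            removed++;
        }
        free(registered[i]);
        registered[i] = NULL;
    }
    num_registered = 0;
    return removed;
}
```

The race condition mentioned earlier in the thread is still visible here: a file created but not yet registered when the process dies is never cleaned up.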
@jsquyres @ggouaillardet Please see #4606 - if it looks okay, can someone please add the vader registration and push it there?
@rhc54 and I are iterating on this (via pmix/RFCs#27); a few changes are forthcoming...
This was fixed by #4606
Götz Waschk (@LaHaine) cited in a thread starting here https://www.mail-archive.com/[email protected]/msg30820.html that he would see crashes with the vader and sm BTLs when he went above a certain number of processes in the job (1024, in his case).
Later in the thread, it was determined that the issue was that the /tmp filesystem where the session directory was located (and where the vader and SM BTLs put their memory-mapped files) was too small -- the job was crashing when we filled up the filesystem.
Open MPI shouldn't crash in this case. It would be 97% better if we emit an opal_show_help() message saying specifically what happened (i.e., that we effectively ran out of space in the session directory) here and gracefully die. But segv'ing -- or otherwise crashing -- feels like it should be an avoidable error, and doesn't help the user diagnose what went wrong / how to fix it.
Note that the stack traces cited on the email thread were from the v1.10.x series, and probably aren't useful for checking exactly where this is happening on master and more recent release series (indeed, the sm BTL was removed starting with Open MPI v3.0.x; I list the sm BTL here simply because we're still supporting the v2.1.x series). But it shouldn't be hard to duplicate this error in a controlled environment and find where exactly the "out of space" issue is causing vader to crash (and sm, if we care).
Hopefully, this will lead to a fairly easy fix of emitting a show_help message and killing the job in an orderly fashion.
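As an illustration of what "fail gracefully" could look like, the sketch below reserves the backing file's blocks up front, so a full filesystem shows up as an ordinary error return. This rests on an assumption not confirmed in the thread: that the crash comes from touching pages of a memory-mapped file whose blocks were never allocated, which on a full filesystem typically raises SIGBUS. `create_backing_file()` and `report_backing_file_error()` are hypothetical helpers, not Open MPI functions; in Open MPI the report would go through opal_show_help() instead of fprintf().

```c
#define _POSIX_C_SOURCE 200809L  /* for posix_fallocate() */

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a new file at `path` and reserve `size` bytes
 * for it immediately. Returns an open file descriptor on success, or -1 with
 * errno set (ENOSPC when the filesystem is too small). */
int create_backing_file(const char *path, off_t size)
{
    int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0) {
        return -1;
    }

    /* posix_fallocate() returns the error number directly and does not set
     * errno, so copy it over for the caller. */
    int rc = posix_fallocate(fd, 0, size);
    if (rc != 0) {
        close(fd);
        unlink(path);  /* do not leave a partial file behind */
        errno = rc;
        return -1;
    }
    return fd;
}

/* Hypothetical stand-in for a show_help-style message. */
void report_backing_file_error(const char *path, off_t size, int err)
{
    fprintf(stderr,
            "Could not create a %lld-byte shared memory backing file at %s: %s\n"
            "The filesystem holding the session directory may be too small.\n",
            (long long)size, path, strerror(err));
}
```

A caller would check for a return value of -1, pass errno to `report_backing_file_error()`, and then abort the job in an orderly way before ever calling mmap() on the file.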