
BTL vader (and sm) crash when filesystem of session directory is too small #4553

Closed
jsquyres opened this issue Nov 30, 2017 · 14 comments

@jsquyres
Member

jsquyres commented Nov 30, 2017

Götz Waschk (@LaHaine) reported in a thread starting at https://www.mail-archive.com/[email protected]/msg30820.html that he would see crashes with the vader and sm BTLs when he went above a certain number of processes in the job (1024, in his case).

Later in the thread, it was determined that the issue was that the /tmp filesystem where the session directory was located (and where the vader and SM BTLs put their memory-mapped files) was too small -- the job was crashing when we filled up the filesystem.

Open MPI shouldn't crash in this case. It would be 97% better if we emitted an opal_show_help() message saying specifically what happened (i.e., that we effectively ran out of space in the session directory) and then died gracefully. Segv'ing -- or otherwise crashing -- is an avoidable error, and it doesn't help the user diagnose what went wrong or how to fix it.

Note that the stack traces cited on the email thread were from the v1.10.x series, and probably aren't useful for checking exactly where this is happening on master and more recent release series (indeed, the sm BTL was removed starting with Open MPI v3.0.x; I list the sm BTL here simply because we're still supporting the v2.1.x series). But it shouldn't be hard to duplicate this error in a controlled environment and find where exactly the "out of space" issue is causing vader to crash (and sm, if we care).

Hopefully, this will lead to a fairly easy fix of emitting a show_help message and killing the job in an orderly fashion.
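
For illustration, the kind of graceful failure being asked for would look roughly like the sketch below. This is not the actual fix; the help file name, topic string, and helper function are hypothetical, and only the opal_show_help() call itself is the existing API.

```c
/* Hypothetical error path (sketch only): if growing the shared-memory backing
 * file fails, emit a descriptive show_help message and return an error so the
 * job can be torn down cleanly instead of segfaulting.  The help file
 * "help-btl-vader.txt" and topic "session-dir-out-of-space" are made-up names. */
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#include "opal/constants.h"
#include "opal/util/show_help.h"

static int backing_file_error(const char *path, size_t wanted, int err)
{
    opal_show_help("help-btl-vader.txt", "session-dir-out-of-space",
                   true /* want error header */,
                   path, (unsigned long) wanted, strerror(err));
    return OPAL_ERR_OUT_OF_RESOURCE;
}
```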

@ggouaillardet
Contributor

IIRC, a similar issue was previously reported.
What we found is that even if the filesystem is too small, ftruncate() succeeds, and so does mmap().
Ultimately, the application crashes when it accesses the mmap'ed memory.
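
To make the failure mode concrete: the standalone sketch below (not Open MPI code) shows that ftruncate() merely extends a sparse file, so the out-of-space condition only surfaces as SIGBUS on first touch of the mapping, whereas posix_fallocate() reserves the blocks and reports ENOSPC up front. The path and size are arbitrary, and posix_fallocate() support on tmpfs depends on the kernel.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 1UL << 30;                 /* 1 GiB backing file */
    int fd = open("/tmp/backing-file", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* posix_fallocate() returns the error number directly (not -1/errno). */
    int rc = posix_fallocate(fd, 0, (off_t) size);
    if (rc != 0) {
        /* Fails here with ENOSPC instead of SIGBUS later. */
        fprintf(stderr, "cannot reserve %zu bytes: %s\n", size, strerror(rc));
        unlink("/tmp/backing-file");
        return 1;
    }

    void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (MAP_FAILED == seg) { perror("mmap"); return 1; }
    memset(seg, 0, size);                          /* safe: the blocks exist */

    munmap(seg, size);
    close(fd);
    unlink("/tmp/backing-file");
    return 0;
}
```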

@LaHaine

LaHaine commented Dec 1, 2017

I had reported this as issue #3251; maybe that one should be reopened and merged.

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 5, 2017
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.

References open-mpi#4260, open-mpi#4553.

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn
Member

hjelmn commented Dec 5, 2017

Ack, oops. #4569 referenced the wrong issue. No effect on this bug.

@hjelmn
Member

hjelmn commented Dec 5, 2017

It might be worth just moving the backing files to /dev/shm and being done with it. The benefit of the session directory is that the orted will blow it away for us if something goes wrong. There is nothing similar for /dev/shm (psm2 leaves garbage there if you ctrl-c an MPI app).
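
For reference, the /dev/shm route amounts to the POSIX shared-memory calls sketched below (the segment name is made up; link with -lrt on older glibc). The cleanup concern is visible in the code: if the process dies before shm_unlink() runs, the file stays behind in /dev/shm.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/vader_seg_example";   /* hypothetical segment name */

    /* Shows up as /dev/shm/vader_seg_example on Linux. */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    if (ftruncate(fd, 4096) != 0) { perror("ftruncate"); return 1; }

    /* ... mmap() and use the segment ... */

    close(fd);
    shm_unlink(name);   /* if the process dies before this, the file lingers */
    return 0;
}
```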

@jsquyres
Member Author

jsquyres commented Dec 6, 2017

Do we have an MCA param that allows you to put the shmem files somewhere else? That would be a better solution.

The auto-cleanup of the session directory is a Good Thing for exactly this reason (i.e., removing shared memory files upon crash). Always putting the shared memory files in /dev/shm (where there's no auto-cleanup) is a Bad Thing. We should give the users a way to do this if they want to (e.g., via MCA param), but putting them there by default is asking for trouble (i.e., leaving lots of stale shared memory files around).

@rhc54
Contributor

rhc54 commented Dec 7, 2017

Or perhaps we could have a way for the proc to "register" its shmem backing file with the orted so it gets cleaned up? That would be trivial to do. I realize it leaves a small race condition, but perhaps that's better than nothing?

@ggouaillardet
Contributor

+1 on @rhc54's suggestion!

As far as I am concerned, Open MPI's auto-cleanup should be considered best effort; it does not replace the need for a hardened job epilog on a production system.

Another (non-portable?) option is to always have the orted create & open the temporary files and delete them right afterwards. The file descriptor can then be passed to the MPI tasks via a UNIX socket (for example, http://www.normalesup.org/~george/comp/libancillary/). To save some file descriptors, the fd can be closed once we know for sure it will not be required by other MPI tasks at a later point.
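
For anyone unfamiliar with the trick: an open descriptor is passed over a UNIX domain socket with an SCM_RIGHTS control message, which is roughly what libancillary wraps. A minimal sketch, assuming `sock` is an already-connected AF_UNIX socket:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd_to_send over the connected AF_UNIX socket `sock`. */
static int send_fd(int sock, int fd_to_send)
{
    char dummy = 'F';                                   /* must send >= 1 byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char cmsgbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { 0 };

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf;
    msg.msg_controllen = sizeof(cmsgbuf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;                       /* pass a file descriptor */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
}
```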

@rhc54
Contributor

rhc54 commented Dec 7, 2017

The orted already does that -- the location gets passed in the PMIx download during PMIx init. The only problem right now is that the shared memory file size forces vader to put the backing file somewhere other than under the orted's location, hence the idea of registering it. Alternatively, since the orted inits opal anyway, I suppose we could have the orted open/query the BTLs to get any size requirement and use that in determining locations.

My preference would be to simply have the proc register locations with the orted for cleanup -- that keeps the separation a little cleaner.
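
From the process side, such a registration could look roughly like the sketch below, assuming a PMIx version that defines PMIX_REGISTER_CLEANUP and routing the request through PMIx job control; this is illustrative only, not the actual patch, and attribute availability and target semantics depend on the PMIx version in use.

```c
#include <pmix.h>

/* Callback for the non-blocking job-control request; we only need to release
 * any server-provided data. */
static void jc_cbfunc(pmix_status_t status, pmix_info_t *info, size_t ninfo,
                      void *cbdata, pmix_release_cbfunc_t release, void *release_cbdata)
{
    (void) status; (void) info; (void) ninfo; (void) cbdata;
    if (NULL != release) {
        release(release_cbdata);
    }
}

/* Ask the local daemon to remove `path` when this process terminates.
 * The info must stay valid until the callback fires, hence `static` here. */
static void register_backing_file_for_cleanup(const char *path)
{
    static pmix_info_t info;
    PMIX_INFO_LOAD(&info, PMIX_REGISTER_CLEANUP, path, PMIX_STRING);
    /* NULL/0 targets: let the server apply its default scope for the request */
    PMIx_Job_control_nb(NULL, 0, &info, 1, jc_cbfunc, NULL);
}
```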

@rhc54
Contributor

rhc54 commented Dec 7, 2017

My bad - just realized you meant something other than the temporary directory location. I now grok what you mean, but am not sure how the orted would know where to put the tmp file. In this case, vader needs more space than the orted realizes. Doing the query might solve that problem - but I would still opt for the registration approach.

@jsquyres
Member Author

jsquyres commented Dec 7, 2017

I'm quite sure we've talked about the registration idea before -- I think we didn't do it previously because:

  1. Would have required a bunch of infrastructure that wasn't in place (maybe?) -- but that's probably moot / easy to do these days.
  2. Doesn't help with direct launch scenarios.
    • ...but maybe this is a problem regardless? I don't remember.

@rhc54
Contributor

rhc54 commented Dec 7, 2017

Yeah, direct launch is a problem, but it may only be a short-term one. As PMIx continues to evolve and get rolled out, there is no reason why the registration wouldn't carry over into the direct launch scenario. We could have the PMIx server track the requests and do the cleanup when it sees the proc exit.

@rhc54
Contributor

rhc54 commented Dec 12, 2017

@jsquyres @ggouaillardet Please see #4606 - if it looks okay, can someone please add the vader registration and push it there?

@jsquyres
Member Author

@rhc54 and I are iterating on this (via pmix/RFCs#27); a few changes are forthcoming...

@rhc54
Contributor

rhc54 commented Dec 19, 2017

This was fixed by #4606

rhc54 closed this as completed Dec 19, 2017