
BTL vader (and sm) crash when filesystem of session directory is too small #4553

Closed
jsquyres opened this issue Nov 30, 2017 · 14 comments

@jsquyres
Member

jsquyres commented Nov 30, 2017

Götz Waschk (@LaHaine) reported in a thread starting at https://www.mail-archive.com/[email protected]/msg30820.html that he would see crashes with the vader and sm BTLs when he went above a certain number of processes in the job (1024, in his case).

Later in the thread, it was determined that the issue was that the /tmp filesystem where the session directory was located (and where the vader and SM BTLs put their memory-mapped files) was too small -- the job was crashing when we filled up the filesystem.

Open MPI shouldn't crash in this case. It would be 97% better if we emitted an opal_show_help() message saying specifically what happened (i.e., that we effectively ran out of space in the session directory) and then died gracefully. Segv'ing -- or otherwise crashing -- is an avoidable error, and it doesn't help the user diagnose what went wrong or how to fix it.

Note that the stack traces cited on the email thread were from the v1.10.x series, and probably aren't useful for checking exactly where this is happening on master and more recent release series (indeed, the sm BTL was removed starting with Open MPI v3.0.x; I list the sm BTL here simply because we're still supporting the v2.1.x series). But it shouldn't be hard to duplicate this error in a controlled environment and find where exactly the "out of space" issue is causing vader to crash (and sm, if we care).

Hopefully, this will lead to a fairly easy fix of emitting a show_help message and killing the job in an orderly fashion.
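
For illustration, the kind of graceful failure being asked for would look roughly like the sketch below. This is not the actual fix; the help file name, topic string, and helper function are hypothetical, and only the opal_show_help() call itself is the existing API.

```c
/* Hypothetical error path (sketch only): if growing the shared-memory backing
 * file fails, emit a descriptive show_help message and return an error so the
 * job can be torn down cleanly instead of segfaulting.  The help file
 * "help-btl-vader.txt" and topic "session-dir-out-of-space" are made-up names. */
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#include "opal/constants.h"
#include "opal/util/show_help.h"

static int backing_file_error(const char *path, size_t wanted, int err)
{
    opal_show_help("help-btl-vader.txt", "session-dir-out-of-space",
                   true /* want error header */,
                   path, (unsigned long) wanted, strerror(err));
    return OPAL_ERR_OUT_OF_RESOURCE;
}
```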

@ggouaillardet
Contributor

IIRC, a similar issue was previously reported.
What we found is that even if the filesystem is too small, ftruncate() succeeds, and so does mmap().
Ultimately, the application crashes when it accesses the mmap'ed memory.
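
To make the failure mode concrete: the standalone sketch below (not Open MPI code) shows that ftruncate() merely extends a sparse file, so the out-of-space condition only surfaces as SIGBUS on first touch of the mapping, whereas posix_fallocate() reserves the blocks and reports ENOSPC up front. The path and size are arbitrary, and posix_fallocate() support on tmpfs depends on the kernel.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 1UL << 30;                 /* 1 GiB backing file */
    int fd = open("/tmp/backing-file", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* posix_fallocate() returns the error number directly (not -1/errno). */
    int rc = posix_fallocate(fd, 0, (off_t) size);
    if (rc != 0) {
        /* Fails here with ENOSPC instead of SIGBUS later. */
        fprintf(stderr, "cannot reserve %zu bytes: %s\n", size, strerror(rc));
        unlink("/tmp/backing-file");
        return 1;
    }

    void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (MAP_FAILED == seg) { perror("mmap"); return 1; }
    memset(seg, 0, size);                          /* safe: the blocks exist */

    munmap(seg, size);
    close(fd);
    unlink("/tmp/backing-file");
    return 0;
}
```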

@LaHaine

LaHaine commented Dec 1, 2017

I had reported this as issue #3251; maybe that one should be reopened and merged.

hjelmn added a commit to hjelmn/ompi that referenced this issue Dec 5, 2017
There were multiple paths that could lead to a fast box
allocation. One of them made little sense (in-place send) so it has
been removed to allow a rework of the fast-box send function. This
should fix a number of issues with hanging/crashing when using the
vader btl.

References open-mpi#4260, open-mpi#4553.

Signed-off-by: Nathan Hjelm <[email protected]>
@hjelmn
Member

hjelmn commented Dec 5, 2017

Ack, oops. #4569 referenced the wrong issue. No effect on this bug.

@hjelmn
Member

hjelmn commented Dec 5, 2017

It might be worth just moving the backing files to /dev/shm and being done with it. The benefit of the session directory is that the orted will blow it away for us if something goes wrong. There is nothing similar for /dev/shm (psm2 leaves garbage there if you ctrl-c an MPI app).
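
For reference, the /dev/shm route amounts to the POSIX shared-memory calls sketched below (the segment name is made up; link with -lrt on older glibc). The cleanup concern is visible in the code: if the process dies before shm_unlink() runs, the file stays behind in /dev/shm.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/vader_seg_example";   /* hypothetical segment name */

    /* Shows up as /dev/shm/vader_seg_example on Linux. */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    if (ftruncate(fd, 4096) != 0) { perror("ftruncate"); return 1; }

    /* ... mmap() and use the segment ... */

    close(fd);
    shm_unlink(name);   /* if the process dies before this, the file lingers */
    return 0;
}
```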

@jsquyres
Member Author

jsquyres commented Dec 6, 2017

Do we have an MCA param that allows you to put the shmem files somewhere else? That would be a better solution.

The auto-cleanup of the session directory is a Good Thing for exactly this reason (i.e., removing shared memory files upon crash). Always putting the shared memory files in /dev/shm (where there's no auto-cleanup) is a Bad Thing. We should give the users a way to do this if they want to (e.g., via MCA param), but putting them there by default is asking for trouble (i.e., leaving lots of stale shared memory files around).

@rhc54
Contributor

rhc54 commented Dec 7, 2017

Or perhaps we could have a way for the proc to "register" its shmem backing file with the orted so it gets cleaned up? That would be trivial to do. I realize it leaves a small race condition, but perhaps that's better than nothing?

@ggouaillardet
Contributor

+1 on @rhc54's suggestion!

As far as I am concerned, Open MPI's auto-cleanup should be considered best effort; it does not replace the need for a hardened job epilog on a production system.

Another (non-portable?) option is to always have the orted create & open the temporary files and delete them right afterwards. The file descriptor can then be passed to the MPI tasks via a UNIX socket (for example, http://www.normalesup.org/~george/comp/libancillary/). To save some file descriptors, the fd can be closed once we know for sure it will not be required by other MPI tasks at a later point.
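
For anyone unfamiliar with the trick: an open descriptor is passed over a UNIX domain socket with an SCM_RIGHTS control message, which is roughly what libancillary wraps. A minimal sketch, assuming `sock` is an already-connected AF_UNIX socket:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd_to_send over the connected AF_UNIX socket `sock`. */
static int send_fd(int sock, int fd_to_send)
{
    char dummy = 'F';                                   /* must send >= 1 byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char cmsgbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { 0 };

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cmsgbuf;
    msg.msg_controllen = sizeof(cmsgbuf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;                       /* pass a file descriptor */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
}
```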

@rhc54
Contributor

rhc54 commented Dec 7, 2017

The orted already does that -- the location gets passed in the PMIx download during PMIx init. The only problem right now is that the shared memory file size forces vader to put the backing file somewhere other than under the orted's location, hence the idea of registering it. Alternatively, since the orted inits opal anyway, I suppose we could have the orted open/query the BTLs to get any size requirement and use that in determining locations.

My preference would be to simply have the proc register locations with the orted for cleanup -- that keeps the separation a little cleaner.
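
From the process side, such a registration could look roughly like the sketch below, assuming a PMIx version that defines PMIX_REGISTER_CLEANUP and routing the request through PMIx job control; this is illustrative only, not the actual patch, and attribute availability and target semantics depend on the PMIx version in use.

```c
#include <pmix.h>

/* Callback for the non-blocking job-control request; we only need to release
 * any server-provided data. */
static void jc_cbfunc(pmix_status_t status, pmix_info_t *info, size_t ninfo,
                      void *cbdata, pmix_release_cbfunc_t release, void *release_cbdata)
{
    (void) status; (void) info; (void) ninfo; (void) cbdata;
    if (NULL != release) {
        release(release_cbdata);
    }
}

/* Ask the local daemon to remove `path` when this process terminates.
 * The info must stay valid until the callback fires, hence `static` here. */
static void register_backing_file_for_cleanup(const char *path)
{
    static pmix_info_t info;
    PMIX_INFO_LOAD(&info, PMIX_REGISTER_CLEANUP, path, PMIX_STRING);
    /* NULL/0 targets: let the server apply its default scope for the request */
    PMIx_Job_control_nb(NULL, 0, &info, 1, jc_cbfunc, NULL);
}
```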

@rhc54
Contributor

rhc54 commented Dec 7, 2017

My bad - just realized you meant something other than the temporary directory location. I now grok what you mean, but am not sure how the orted would know where to put the tmp file. In this case, vader needs more space than the orted realizes. Doing the query might solve that problem - but I would still opt for the registration approach.

@jsquyres
Member Author

jsquyres commented Dec 7, 2017

I'm quite sure we've talked about the registration idea before -- I think we didn't do it previously because:

  1. Would have required a bunch of infrastructure that wasn't in place (maybe?) -- but that's probably moot / easy to do these days.
  2. Doesn't help with direct launch scenarios.
    • ...but maybe this is a problem regardless? I don't remember.

@rhc54
Contributor

rhc54 commented Dec 7, 2017

Yeah, direct launch is a problem, but it may only be a short-term one. As PMIx continues to evolve and get rolled out, there is no reason why the registration wouldn't carry over into the direct launch scenario. We could have the PMIx server track the requests and do the cleanup when it sees the proc exit.

@rhc54
Contributor

rhc54 commented Dec 12, 2017

@jsquyres @ggouaillardet Please see #4606 - if it looks okay, can someone please add the vader registration and push it there?

@jsquyres
Member Author

@rhc54 and I are iterating on this (via pmix/RFCs#27); a few changes are forthcoming...

@rhc54
Contributor

rhc54 commented Dec 19, 2017

This was fixed by #4606

rhc54 closed this as completed Dec 19, 2017