-
Notifications
You must be signed in to change notification settings - Fork 880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update to PMIx v3.0 PR for cleanup registration #4606
Conversation
Errrr...the compiler crashed! bot:ibm:retest |
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/9b9a01ace59e2c90133ff1ceae45391f |
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/763bea641c8e9426774f16dd4d1c2e3b |
:retest: |
retest |
bot:retest |
I've updated our CI nodes to use PGI 17.10 and I'm hopeful it will not segv. |
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/aa81ca940c777214010112f95bdce4ad |
Doesn't help. |
It didn't pick up the new compiler install - Geoff is working on it. |
bot:retest |
3rd times the charm! |
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/bfa3fd68985416c1bb8542ca222e9999 |
Before you burn a ton more time on this, let me try just doing the PMIx update separately and see if it is something simply borked in this PR. |
I was able to cd into the build directory and do a simple pgcc -c -fpic gds_dstore_component.c -I. which didn't segv even at -O2 or -O3 (using pgi 17.10) Let me know what I can try next. |
Sounds like it's a compiler issue then. I don't think I can help much with it. |
hard to see how it can be a compiler problem since it works with everything else, but let's try it another way. I'm suspicious of some of the compiler directives in the latest dstore additions - may be something PGI doesn't like, though it begs the question of how it got thru the pmix testing. |
Well, @gpaulsen just said that he was able to compile exactly this source with this compiler. I think that if compiler works well it should be deterministic and have same results all the time. |
Okay, so this PR minus the new dstore code builds just fine - see #4622. Now will try a totally fresh PR with the dstore code, just to see if something in this PR got borked. |
FWIW: I believe the problem is the |
@rhc54, thanks for checking that. I'd be curious to see if this is an issue. |
Sure enough - confirmed that the dstore code is the problem with #4623. Will try removing those attributes. Just FYI: if you look elsewhere in the code, we always use something like |
We have it currently in OMPI master |
Hmmm...interesting. Well, let's see what changed here. |
Huh - well, it seems to me that those macro definitions are no longer of any value. They are just used in a simple set of functions. Let's try replacing them and see what happens. |
I'm going to do this to avoid using macros in dstore |
ok - let me know when you have it done and I'll update #4623 so we can retry |
FWIW: it looks like the compiler is segfaulting due to |
If we going to - let's keep this PR as is so they can debug |
Not worth it, IMO. As I said, we've encountered such things before, and nearly every compiler has some level of confusion that causes the optimizer to barf. It looks like @karasevb hit a sensitive point on PGI here, but it could as easily have bothered some other compiler at some particular version. Like I said, we have learned to keep the "return" calls simple. No reason to drive them into deep recursive macro calls. I'll keep this PR because I want to use it to test @karasevb's changes. Once we get the code cleaned up so it passes, I'll close this and create a proper update. |
The IBM CI (PGI Compiler) build failed! Please review the log, linked below. Gist: https://gist.github.com/8140ba90cde48dc6e2af657ebfd15d81 |
I will work to get this reported to PGI. Please keep this PR for them until they can clone it. |
I've reached the conclusion that this isn't a compiler bug - it now appears that the problem can be directly traced to use of "inlined" functions in a function pointer structure. Removing those directives allows things to build. Frankly, I'm a tad surprised this didn't cause problems elsewhere - I'm not sure what an "inlined" function pointer actually means in that context. Talking to my compiler guru now - will pass along more later. |
Good catch @rhc54! |
In any case - segfaulting compiler is a bug :) |
It doesn't look like the compiler is segfaulting any more - it appears to me that it is simply reporting an error and aborting. I'm still investigating |
Well, here is what my compiler guru saids - which I'll preface that any mistakes are introduced by me. Basically, inline is just a hint, and so implementation is really flexible and varies a great deal across compilers and even release versions within a compiler. Using an inline function in a function pointer is generally supported, BUT may require that you have an "extern inline" declaration somewhere so it can be found. Again, implementations vary, and some may simply reject it. In this case, the compiler actually was reporting the error. It clearly states that the problem was too deep a recursion in the pmix_sm_store function, which is where the function pointers were being accessed and had recursive calls into inline functions. So it likely was a combination of an inlined function pointer plus multiple recursion that was being flagged. The other issue he cited is that, due to all the flexibility, some compilers also take the hint to mean "keep the assembly for this block of code as a unit in the icache". It is the equivalent of the "register" directive for a variable. So if you flag large blocks of code, or have conditional jumps in them, it causes optimization issues due to cache thrashing, as @hjelmn noted. It doesn't look like my last change fixed the problem, so there is still something wrong in that code. Sigh. |
@rhc54 I see - I only remember the old errors. Curious to see the error output. As I mentioned before and @gpaulsen confirmed the intention - it would be good to keep a segfaulting instance for PGI guys so they can fix their compiler. Do you think you can re-create it in a separate branch and let it to hang until they will try? |
bot:ibm:retest |
My 2 cents - we should merge in the repaired PR so that if someone pick up OMPI and happens to have an impacted compiler then they don't see a build failure. Odds are they won't be in a position to upgrade their compiler. But I'd like to make sure we preserve the version of the branch that shows the compiler crash so @gpaulsen or someone can take it back to PGI for inspection. Ideally we would have a small reproducer, but I don't know if we (1) know enough about why this is happening or (2) have the time right now to do so. |
If available, have apps use registration capability to cleanup their session directories. Setup capability for vader to register its shared memory file location - let someone familiar with that code do so. Final cleanup to track uid/gid, update the opal/pmix API to pass flags for ignore and leave top directory alone Signed-off-by: Ralph Castain <[email protected]>
This commit moves the backing files to /dev/shm to avoid limitations that may be set on /tmp. The files are registered with pmix to ensure they are cleaned up after an erroneous exit. Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit 4810127)
We clearly need to re-look at our infrastructure as this is getting to be far too common from multiple testers:
|
@rhc54 the issue with |
If available, have apps use registration capability to cleanup their session directories. Setup capability for vader to register its shared memory file location - let someone familiar with that code do so.
Signed-off-by: Ralph Castain [email protected]