Signal forwarding is broken in ompi-release #2075

Closed
artpol84 opened this issue Sep 13, 2016 · 5 comments

@artpol84
Contributor

I'm seeing leftover (orphaned) processes on our Jenkins server. The cause is that the signal sent by timeout is not properly propagated to the application processes in v2.x. The following example hangs:

$ timeout -s SIGSEGV 2m /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to none -x SHMEM_SYMMETRIC_HEAP_SIZE=256M --mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm --mca rmaps_base_dist_hca mlx5_0:1 --mca sshmem_verbs_hca_name mlx5_0:1 --mca spml ucx -mca pml ucx taskset -c 12,13 sleep 10000
[jenkins03:02612] *** Process received signal ***
[jenkins03:02612] Signal: Segmentation fault (11)
[jenkins03:02612] Signal code:  (0)
[jenkins03:02612] Failing at address: 0x10af00000a33
[jenkins03:02612] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7ffff6898100]
[jenkins03:02612] [ 1] /usr/lib64/libc.so.6(epoll_wait+0x33)[0x7ffff65be7a3]
[jenkins03:02612] [ 2] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/lib/libopen-pal.so.20(+0x8ec93)[0x7ffff785ac93]
[jenkins03:02612] [ 3] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/lib/libopen-pal.so.20(opal_libevent2022_event_base_loop+0x170)[0x7ffff785e6e0]
[jenkins03:02612] [ 4] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun[0x40541a]
[jenkins03:02612] [ 5] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun[0x403730]
[jenkins03:02612] [ 6] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ffff64e9b15]
[jenkins03:02612] [ 7] /var/lib/jenkins/jobs/gh-ompi-release-pr/workspace/ompi_install1/bin/oshrun[0x403649]
[jenkins03:02612] *** End of error message ***
oshrun: Forwarding signal 18 to job

master is not affected:

/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/oshrun -np 8 --bind-to none -x SHMEM_SYMMETRIC_HEAP_SIZE=256M --report-state-on-timeout --get-stack-traces --timeout 20 --mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm --mca rmaps_base_dist_hca mlx5_0:1 --mca sshmem_verbs_hca_name mlx5_0:1 --mca spml ucx -mca pml ucx taskset -c 10,11 sleep 10000
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 20 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
DATA FOR JOB: [12390,0]
    Num apps: 1 Num procs: 1    JobState: ALL DAEMONS REPORTED  Abort: False
    Num launched: 0 Num reported: 1 Num terminated: 0

    Procs:
        Rank: 0 Node: jenkins03 PID: 2845   State: RUNNING  ExitCode 0

DATA FOR JOB: [12390,1]
    Num apps: 1 Num procs: 8    JobState: RUNNING   Abort: False
    Num launched: 8 Num reported: 0 Num terminated: 0

    Procs:
        Rank: 0 Node: jenkins03 PID: 2853   State: RUNNING  ExitCode 0
        Rank: 1 Node: jenkins03 PID: 2854   State: RUNNING  ExitCode 0
        Rank: 2 Node: jenkins03 PID: 2855   State: RUNNING  ExitCode 0
        Rank: 3 Node: jenkins03 PID: 2856   State: RUNNING  ExitCode 0
        Rank: 4 Node: jenkins03 PID: 2857   State: RUNNING  ExitCode 0
        Rank: 5 Node: jenkins03 PID: 2858   State: RUNNING  ExitCode 0
        Rank: 6 Node: jenkins03 PID: 2859   State: RUNNING  ExitCode 0
        Rank: 7 Node: jenkins03 PID: 2860   State: RUNNING  ExitCode 0

Waiting for stack traces (this may take a few moments)...
STACK TRACE FOR PROC [[12390,1],0] (jenkins03, PID 2853)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],1] (jenkins03, PID 2854)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],2] (jenkins03, PID 2855)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],3] (jenkins03, PID 2856)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],4] (jenkins03, PID 2857)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],5] (jenkins03, PID 2858)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],6] (jenkins03, PID 2859)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()

STACK TRACE FOR PROC [[12390,1],7] (jenkins03, PID 2860)
    #0  0x00007ffff7ad7400 in __nanosleep_nocancel () from /usr/lib64/libc.so.6
    #1  0x0000000000403e5f in rpl_nanosleep ()
    #2  0x0000000000403cc0 in xnanosleep ()
    #3  0x00000000004016cd in main ()
@artpol84
Contributor Author

Let me double-check this

@jsquyres
Member

I think there was some discussion about this (on the weekly call? on devel? ...I'm afraid I don't recall offhand) that this might not be correct because we have never forwarded SIGSEGV.

@artpol84 Were you able to double check this?
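
For readers unfamiliar with the term, "signal forwarding" here means that the launcher catches selected signals and relays them to the processes it started. The following is a minimal Python sketch of that idea, not Open MPI's implementation: the function name `run_with_forwarding` and the `FORWARDED` set are hypothetical and chosen for illustration only. It assumes a POSIX system and that it is called from the main thread. SIGSEGV is deliberately absent from the set, which mirrors the point above that a launcher need not forward every signal.

```python
import signal
import subprocess

# Hypothetical forwarding set, for illustration only; this is NOT
# Open MPI's actual list. SIGSEGV is deliberately left out.
FORWARDED = (signal.SIGTERM, signal.SIGINT, signal.SIGUSR1)


def run_with_forwarding(argv, forwarded=FORWARDED):
    """Run argv as a child process and relay the selected signals to it.

    Returns the child's return code (negative signal number if the
    child was terminated by a signal).
    """
    # Start the child first, so it keeps the default signal dispositions.
    child = subprocess.Popen(argv)

    def forward(signum, frame):
        # Relay the signal only while the child is still running.
        if child.poll() is None:
            child.send_signal(signum)

    # Install the relay handler and remember the previous handlers.
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        # wait() resumes automatically after a handler has run (PEP 475).
        return child.wait()
    finally:
        # Restore whatever handlers were installed before.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

With this sketch, sending SIGTERM to the wrapper while it runs `["sleep", "10000"]` ends the child as well, whereas a signal outside `FORWARDED` only affects the wrapper itself and can leave the child behind.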

@artpol84
Contributor Author

Not yet, Jeff.

2016-09-22 17:24 GMT+07:00 Jeff Squyres:

> I think there was some discussion about this (on the weekly call? on
> devel? ...I'm afraid I don't recall offhand) that this might not be correct
> because we have never forwarded SIGSEGV.
>
> @artpol84 Were you able to double check this?


Best regards, Artem Y. Polyakov

@rhc54
Contributor

rhc54 commented May 29, 2017

Let me know if this continues to be an issue

@rhc54 rhc54 closed this as completed May 29, 2017
@tgpfeiffer

I see the same problem with Open MPI 3.0.0: when I send a TERM signal to mpirun, it produces an endless stream of

(null): Forwarding signal 18 to job

messages on stdout until the signal is escalated to KILL. I don't think I have seen that behavior on 2.1.2.
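
The number in "Forwarding signal 18 to job" is a raw signal number, and its meaning depends on the platform. As a small aid for reading such logs, the sketch below decodes a number on the machine where it runs; the helper name `signal_name` is hypothetical. On x86-64 Linux, 18 corresponds to SIGCONT rather than SIGTERM or SIGSEGV, but other platforms number signals differently.

```python
import signal


def signal_name(number):
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number does not correspond to a signal known here.
        return f"unknown signal {number}"


# Decode the number that appears in the log message.
print(signal_name(18))
```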
