
Overlap test sometimes failing in Mellanox Jenkins with Yalla #1813

Closed
jsquyres opened this issue Jun 23, 2016 · 9 comments · Fixed by #1816
@jsquyres
Member

Mellanox Jenkins is showing a periodic problem with the overlap test using Yalla. There has been some discussion on open-mpi/ompi-release#1240 about what the actual root cause is, starting around here: open-mpi/ompi-release#1240 (comment).

Please continue the discussion here: @jladd-mlnx @miked-mellanox @artpol84 @nysal @hjelmn

We need to decide if this is a v2.0.0 or v2.0.1 issue. What is the scope of the problem, and how quickly can it be fixed? I'm tentatively marking this as v2.0.0, just so that it comes up in our daily discussions.

I'm opening a separate issue to track the discussion that was occurring on open-mpi/ompi-release#1240 (so that 1240 can go back to its regularly scheduled programming).

@artpol84
Contributor

artpol84 commented Jun 24, 2016

Here is the analysis and the solution for the problem:

$ git bisect good
b4ff061bd45705eb39a91976fad855315ba990cc is the first bad commit
commit b4ff061bd45705eb39a91976fad855315ba990cc
Author: Nathan Hjelm <[email protected]>
Date:   Wed May 25 15:42:53 2016 -0600

    pml/yalla: update for request changes

    This commit brings the pml/yalla component up to date with the request
    rework changes.

    Signed-off-by: Nathan Hjelm <[email protected]>

    (cherry picked from open-mpi/ompi@9d439664f074a690ff8c550f9ef489bd70520a0f)

    Signed-off-by: Nathan Hjelm <[email protected]>

:040000 040000 c8047a13774043cf31793730b23f9fcc3cfeb1d6 38e5a2e275229f11f92612f7f1bc541cd9147017 M      ompi

  • I guess that Mellanox Jenkins has been screaming about this since the end of May.
  • PR with the fix is proposed here: ompi request handling race condition fix (MT-case) #1815. I'm still verifying the code, but would like to hear comments from @hjelmn and @bosilca.
  • Problem analysis:
    In open-mpi/ompi-release@b4ff061 the ompi_request_lock locks around critical yalla regions were removed, which made the following race condition (leading to a deadlock) possible:
    • In multithreaded cases request completion can happen in one thread (call it T1) while MPI_Wait is called in another (T2).
    • In T1 we have the backtrace shown below.
    • In wait_sync_update we are not guaranteed that after decreasing count to 0 in line 79 we won't lose the CPU.
    • If that happens, T2 can quickly execute and release the sync object in line 348 before T1 gets to WAIT_SYNC_SIGNAL in line 87 (a simplified sketch of this interleaving follows the backtrace).
    • Before open-mpi/ompi-release@b4ff061, the lock in the yalla request free callback kept T2 from releasing the sync object before T1's signal.
#2  0x00007fffedec5de0 in wait_sync_update (sync=0x7fffffffd030, updates=1, status=0) at ../../../../opal/threads/wait_sync.h:91
#3  0x00007fffedec5ed3 in ompi_request_complete (request=0xb2c830, with_signal=true) at ../../../../ompi/request/request.h:423
#4  0x00007fffedec6ba0 in mca_pml_yalla_recv_completion_cb (context=0xb2c830) at pml_yalla_request.c:208
#5  0x00007fffedbb64e6 in mxm_proto_progress (context=0x782fc0) at mxm/core/mxm.c:89
#6  mxm_progress (context=0x782fc0) at mxm/core/mxm.c:347
#7  0x00007fffedec31a6 in mca_pml_yalla_progress () at pml_yalla.c:290
#8  0x00007ffff765b8c0 in opal_progress () at runtime/opal_progress.c:225
#9  0x00007fffedec3bc8 in mca_pml_yalla_recv (buf=0x0, count=0, datatype=0x601c40, src=1, tag=34532, comm=0x601640, status=0x0) at pml_yalla.c:383
#10 0x00007ffff7cfdd90 in PMPI_Recv (buf=0x0, count=0, type=0x601c40, source=1, tag=34532, comm=0x601640, status=0x0) at precv.c:77
#11 0x0000000000400c63 in threadfunc ()
#12 0x0000003d698079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003d690e8b6d in clone () from /lib64/libc.so.6
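
To make the interleaving concrete, here is a deliberately simplified C sketch of the pattern described above. The names (sync_sketch_t, sketch_update, sketch_wait_and_release) are invented for illustration; this is not the actual opal/threads/wait_sync.h code, just the shape of the race:

/*
 * Simplified illustration of the race -- NOT the real wait_sync code.
 */
#include <pthread.h>
#include <stdatomic.h>

typedef struct {
    atomic_int      count;      /* outstanding completions, like sync->count */
    pthread_mutex_t lock;
    pthread_cond_t  condition;
} sync_sketch_t;

/* T1: completion path (mxm progress -> yalla recv completion callback
 *     -> ompi_request_complete -> wait_sync_update). */
static void sketch_update(sync_sketch_t *sync)
{
    if (1 == atomic_fetch_sub(&sync->count, 1)) {
        /* count just dropped to 0.  Nothing prevents T1 from losing the CPU
         * right here.  T2 can then observe count == 0, return from its wait
         * and destroy the sync object, so the signal below operates on a
         * destroyed condition variable -- undefined behavior, observed as the
         * hang in the overlap test. */
        pthread_mutex_lock(&sync->lock);
        pthread_cond_signal(&sync->condition);
        pthread_mutex_unlock(&sync->lock);
    }
}

/* T2: MPI_Wait path -- waits for count to reach 0, then releases the sync. */
static void sketch_wait_and_release(sync_sketch_t *sync)
{
    pthread_mutex_lock(&sync->lock);
    while (atomic_load(&sync->count) > 0) {
        pthread_cond_wait(&sync->condition, &sync->lock);
    }
    pthread_mutex_unlock(&sync->lock);

    /* Nothing stops T2 from getting here while T1 is still between its
     * decrement and its signal.  Before open-mpi/ompi-release@b4ff061 the
     * lock held in the yalla request free path serialized this. */
    pthread_cond_destroy(&sync->condition);
    pthread_mutex_destroy(&sync->lock);
}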

P.S. @hjelmn, Yalla is not guilty; the facts were pointing at the new request code.

@artpol84
Contributor

artpol84 commented Jun 24, 2016

+ @karasevb

@jladd-mlnx
Member

@artpol84 Outstanding analysis. @hppritcha - LANL, we'll send you a bill for our services ☺️. @karasevb @artpol84 I think we had better ensure that this change hasn't negatively impacted our message rates. Please test this with osu_mbw_mr against 1.10.3. @gpaulsen - is IBM keeping track of this in Spectrum MPI?

@jladd-mlnx
Member

@artpol84 @karasevb - please check the UCX PML as well; the bug was probably introduced into that component too.

@rhc54
Contributor

rhc54 commented Jun 24, 2016

@artpol84 thanks - we knew Yalla wasn't guilty of having a new error, but clearly there is some interaction involving the way Yalla uses the request code vs. how the other PMLs do, and that really meant you folks needed to be involved. We'll definitely look forward to your patch.

@artpol84
Contributor

@jladd-mlnx this fix is general and addresses this kind of race condition for all PMLs (thanks to the centralized approach to locking).
Will check osu_mbw_mr on Monday.

@gpaulsen
Member

@nysal has been following the locking changes needed and should be able to comment. We've hit something similar in our PAMI PML and are just now reacting to it.

hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 24, 2016
This commit fixes a race condition discovered by @artpol84. The race
happens when a signalling thread decrements the sync count to 0 then
goes to sleep. If the waiting thread runs and detects the count == 0
before going to sleep on the condition variable it will destroy the
condition variable while the signalling thread is potentially still
processing the completion. The fix is to add a non-atomic member to
the sync structure that indicates another process is handling
completion. Since the member will only be set to false by the
initiating thread and the completing thread the variable does not need
to be protected. When destroying a condition variable the waiting
thread needs to wait until the signalling thread is finished.

Thanks to @artpol84 for tracking this down.

Fixes open-mpi#1813

Signed-off-by: Nathan Hjelm <[email protected]>
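
For context, here is a simplified, self-contained C sketch of the approach the commit message describes. The names (sync_sketch_t, sketch_init, sketch_update, sketch_wait_and_release) and the exact initialization of the flag are illustrative assumptions, not the real wait_sync API:

/*
 * Sketch of the fix: a non-atomic "signaling" member that only the waiting
 * and completing threads ever touch, so it needs no atomic protection.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int      count;
    volatile bool   signaling;   /* true while a completer may still signal */
    pthread_mutex_t lock;
    pthread_cond_t  condition;
} sync_sketch_t;

static void sketch_init(sync_sketch_t *sync, int count)
{
    atomic_init(&sync->count, count);
    sync->signaling = (count > 0);   /* assumption: a completer will signal */
    pthread_mutex_init(&sync->lock, NULL);
    pthread_cond_init(&sync->condition, NULL);
}

/* T1: completing thread. */
static void sketch_update(sync_sketch_t *sync)
{
    if (1 == atomic_fetch_sub(&sync->count, 1)) {
        pthread_mutex_lock(&sync->lock);
        pthread_cond_signal(&sync->condition);
        pthread_mutex_unlock(&sync->lock);
        sync->signaling = false;     /* last touch of the sync object by T1 */
    }
}

/* T2: waiting thread. */
static void sketch_wait_and_release(sync_sketch_t *sync)
{
    pthread_mutex_lock(&sync->lock);
    while (atomic_load(&sync->count) > 0) {
        pthread_cond_wait(&sync->condition, &sync->lock);
    }
    pthread_mutex_unlock(&sync->lock);

    /* The fix: do not tear the object down until T1 is completely done. */
    while (sync->signaling) { /* spin briefly */ }

    pthread_cond_destroy(&sync->condition);
    pthread_mutex_destroy(&sync->lock);
}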
hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 24, 2016
hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 25, 2016
artpol84 added a commit to artpol84/ompi that referenced this issue Jun 25, 2016
artpol84 added a commit to artpol84/ompi that referenced this issue Jun 25, 2016
artpol84 added a commit to artpol84/ompi that referenced this issue Jun 25, 2016
artpol84 added a commit to artpol84/ompi that referenced this issue Jun 25, 2016
@artpol84
Contributor

Currently we have 2 alternative PRs that are trying to solve the same race condition:

Let's discuss their pros and cons based on community experience with regard to atomics performance. There is no rush - this bug has been pending since May 24. We can spend some more time before merging.

IMO from the performance perspective they are pretty much the same, and in that case I believe the Mellanox patch should take precedence since it was opened first. However, I might be lacking experience with the performance details of the atomics, and I hope to acquire it from this discussion :). @bosilca @jladd-mlnx please advise and include others who might have the expertise.

Also, I believe that whichever PR we choose, a Mellanox copyright must be placed in opal/threads/wait_sync.h because, as is usual with race conditions, it took 95% of the work to pinpoint and describe the race and 5% to fix it; Mellanox also contributed to the LANL PR, helping to avoid a possible new race condition.

@rhc54
Contributor

rhc54 commented Jun 25, 2016

We generally take whichever has the best performance, followed by the minimal change, and if those are both the same, then we would use the one from the person who typically contributes to that code area. We don't usually get hung up over who proposed something first or get into arguments over who did what amount of work as that is highly subjective. Frankly, I've never seen a case where the two parties didn't just work it out between themselves.

Adding Mellanox copyright to the other PR is mostly a shrug - I've never heard of anyone really caring before as it doesn't signify anything. The lawyers just ask that we mark the files we "touch", and that is all the copyrights in the files are intended to do.

artpol84 added a commit to artpol84/ompi that referenced this issue Jun 26, 2016
hjelmn added a commit to hjelmn/ompi that referenced this issue Jun 27, 2016
artpol84 pushed a commit to artpol84/ompi that referenced this issue Jun 28, 2016
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Aug 23, 2016
(cherry picked from open-mpi/ompi@fb455f0)
bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 3, 2016