-
Notifications
You must be signed in to change notification settings - Fork 876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overlap test sometimes failing in Mellanox Jenkins with Yalla #1813
Comments
Here is the analysis and the solution for the problem:
P.S. @hjelmn, Yalla is not guilty and the facts were pointing on the new request code. |
@artpol84 Outstanding analysis. @hppritcha - LANL, we'll send you a bill for our services |
@artpol84 thanks - we knew Yalla wasn't guilty of having a new error, but clearly there is some interaction that involved the way Yalla used the request code vs how the other pml's etc do, and that really meant you folks needed to be involved. we'll definitely look forward to your patch. |
@jladd-mlnx this fix is general and fixes this kind of race conditions for all PML's (thanks to centralized approach to locking). |
@nysal - has been following the locking changes needed, and should be able to comment. We've hit something similar in our PAMI PML, and are just now reacting to it. |
This commit fixes a race condition discovered by @artpol84. The race happens when a signalling thread decrements the sync count to 0 then goes to sleep. If the waiting thread runs and detects the count == 0 before going to sleep on the condition variable it will destroy the condition variable while the signalling thread is potentially still processing the completion. The fix is to add a non-atomic member to the sync structure that indicates another process is handling completion. Since the member will only be set to false by the initiating thread and the completing thread the variable does not need to be protected. When destoying a condition variable the waiting thread needs to wait until the singalling thread is finished. Thanks to @artpol84 for tracking this down. Fixes open-mpi#1813 Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a race condition discovered by @artpol84. The race happens when a signalling thread decrements the sync count to 0 then goes to sleep. If the waiting thread runs and detects the count == 0 before going to sleep on the condition variable it will destroy the condition variable while the signalling thread is potentially still processing the completion. The fix is to add a non-atomic member to the sync structure that indicates another process is handling completion. Since the member will only be set to false by the initiating thread and the completing thread the variable does not need to be protected. When destoying a condition variable the waiting thread needs to wait until the singalling thread is finished. Thanks to @artpol84 for tracking this down. Fixes open-mpi#1813 Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a race condition discovered by @artpol84. The race happens when a signalling thread decrements the sync count to 0 then goes to sleep. If the waiting thread runs and detects the count == 0 before going to sleep on the condition variable it will destroy the condition variable while the signalling thread is potentially still processing the completion. The fix is to add a non-atomic member to the sync structure that indicates another process is handling completion. Since the member will only be set to false by the initiating thread and the completing thread the variable does not need to be protected. When destoying a condition variable the waiting thread needs to wait until the singalling thread is finished. Thanks to @artpol84 for tracking this down. Fixes open-mpi#1813 Signed-off-by: Nathan Hjelm <[email protected]>
Currently we have 2 alternative PR's that are trying to solve the same race condition:
Let's discuss their proc and cons based on community experience with regard to atomic performance. There is no rush - this bug was pending there since May 24. We can spend some more time before merging. IMO from the performance perspective they are pretty much the same and in this case I believe that Mellanox patch should have precedence since it was opened first. However I might be lacking of experience with performance details of the atomics and I hope to acquire them from this discussion :). @bosilca @jladd-mlnx please advise and include others who might have an expertise. Also I believe that either PR we will choose Mellanox copyright must be placed in the |
We generally take whichever has the best performance, followed by the minimal change, and if those are both the same, then we would use the one from the person who typically contributes to that code area. We don't usually get hung up over who proposed something first or get into arguments over who did what amount of work as that is highly subjective. Frankly, I've never seen a case where the two parties didn't just work it out between themselves. Adding Mellanox copyright to the other PR is mostly a shrug - I've never heard of anyone really caring before as it doesn't signify anything. The lawyers just ask that we mark the files we "touch", and that is all the copyrights in the files are intended to do. |
This commit fixes a race condition discovered by @artpol84. The race happens when a signalling thread decrements the sync count to 0 then goes to sleep. If the waiting thread runs and detects the count == 0 before going to sleep on the condition variable it will destroy the condition variable while the signalling thread is potentially still processing the completion. The fix is to add a non-atomic member to the sync structure that indicates another process is handling completion. Since the member will only be set to false by the initiating thread and the completing thread the variable does not need to be protected. When destoying a condition variable the waiting thread needs to wait until the singalling thread is finished. Thanks to @artpol84 for tracking this down. Fixes open-mpi#1813 Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a race condition discovered by @artpol84. The race happens when a signalling thread decrements the sync count to 0 then goes to sleep. If the waiting thread runs and detects the count == 0 before going to sleep on the condition variable it will destroy the condition variable while the signalling thread is potentially still processing the completion. The fix is to add a non-atomic member to the sync structure that indicates another process is handling completion. Since the member will only be set to false by the initiating thread and the completing thread the variable does not need to be protected. When destoying a condition variable the waiting thread needs to wait until the singalling thread is finished. Thanks to @artpol84 for tracking this down. Fixes open-mpi#1813 Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a race condition discovered by @artpol84. The race happens when a signalling thread decrements the sync count to 0 then goes to sleep. If the waiting thread runs and detects the count == 0 before going to sleep on the condition variable it will destroy the condition variable while the signalling thread is potentially still processing the completion. The fix is to add a non-atomic member to the sync structure that indicates another process is handling completion. Since the member will only be set to false by the initiating thread and the completing thread the variable does not need to be protected. When destoying a condition variable the waiting thread needs to wait until the singalling thread is finished. Thanks to @artpol84 for tracking this down. Fixes open-mpi#1813 Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from open-mpi/ompi@fb455f0) Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a race condition discovered by @artpol84. The race happens when a signalling thread decrements the sync count to 0 then goes to sleep. If the waiting thread runs and detects the count == 0 before going to sleep on the condition variable it will destroy the condition variable while the signalling thread is potentially still processing the completion. The fix is to add a non-atomic member to the sync structure that indicates another process is handling completion. Since the member will only be set to false by the initiating thread and the completing thread the variable does not need to be protected. When destoying a condition variable the waiting thread needs to wait until the singalling thread is finished. Thanks to @artpol84 for tracking this down. Fixes open-mpi#1813 Signed-off-by: Nathan Hjelm <[email protected]>
Mellanox Jenkins is showing a periodic problem with the overlap test using Yalla. There has been some discussion on open-mpi/ompi-release#1240 about what the actual root cause is, starting around here: open-mpi/ompi-release#1240 (comment).
Please continue the discussion here: @jladd-mlnx @miked-mellanox @artpol84 @nysal @hjelmn
We need to decide if this is a v2.0.0 or v2.0.1 issue. What is the scope of the problem, and how quickly can it be fixed? I'm tentatively marking this as v2.0.0, just so that it comes up on our daily discussions.
I'm opening a separate issue to track the discussion that was occurring on open-mpi/ompi-release#1240 (so that 1240 can go back to its regularly scheduled programming).
The text was updated successfully, but these errors were encountered: