-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to fix synchronization in SemaphoreStep
#258
Conversation
@@ -189,13 +189,16 @@ public static class Execution extends AbstractStepExecutionImpl { | |||
Object returnValue = null; | |||
Throwable error = null; | |||
boolean success = false, failure = false, sync = true; | |||
String c = Jenkins.XSTREAM.toXML(getContext()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just keeping this out of the synchronized block as before.
synchronized (s) { | ||
if (s.returnValues.containsKey(k)) { | ||
success = true; | ||
returnValue = s.returnValues.get(k); | ||
} else if (s.errors.containsKey(k)) { | ||
failure = true; | ||
error = s.errors.get(k); | ||
} else { | ||
s.contexts.put(k, c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix. Other threads now only ever see State
in one of two modes: either contexts
does not contain the relevant key, and returnValues
and errors
should be used and will be checked here, or contexts
does contain the key, and the StepContext
should be used instead. Previously, there was a case where contexts
did not contain the key yet but returnValues
and errors
had already been checked.
String c = Jenkins.XSTREAM.toXML(getContext()); | ||
synchronized (s) { | ||
s.contexts.put(k, c); | ||
} | ||
sync = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think there may be other pre-existing bugs here. What happens if another thread calls success
or failure
after the first synchronized
block above but before this completes? Those threads will call StepContext.onSuccess
/StepContext.onFailure
before StepExecution.start
has completed, which seems unusual, but maybe it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should change the overall approach, drop support for the synchronous completion cases, and have the step just poll continuously on a background thread looking for results. It would be slower and less efficient, but it also seems less prone to complex synchronization issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should change the overall approach
FWIW, I filed #259 with an example of what that could look like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those threads will call
StepContext.onSuccess
/StepContext.onFailure
beforeStepExecution.start
has completed
That would be normal if false
is returned. It would be weird if start
later returns true
, but I do not think it is harmful. TBH I do not recall the logic here in workflow-cps
very well. Anyway this seems unlikely in a normal test, because it should call wait
and then do some stuff before calling success
.
SemaphoreStep
Perhaps it would be easier to reason about if instead of three maps there were just one, with values being a union of Blocked | Success(Object) | Error(Throwable)? |
I saw the issue in a proprietary plugin, but indeed the logs for that test failure look the same:
|
Sorry, four maps, I forgot about |
Yes, but this pattern is a bit awkward in Java (at least pre-sealed classes/interfaces and future related improvements). I can file another PR with that approach for comparison. |
See #260. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a general preference for something like #260 as a refactoring, but OTOH this is fairly simple and looks like it may solve the practical issues, so you should go ahead and get something tested & released.
String c = Jenkins.XSTREAM.toXML(getContext()); | ||
synchronized (s) { | ||
s.contexts.put(k, c); | ||
} | ||
sync = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those threads will call
StepContext.onSuccess
/StepContext.onFailure
beforeStepExecution.start
has completed
That would be normal if false
is returned. It would be weird if start
later returns true
, but I do not think it is harmful. TBH I do not recall the logic here in workflow-cps
very well. Anyway this seems unlikely in a normal test, because it should call wait
and then do some stuff before calling success
.
@jglick From some PCT testing, this PR looks fine, but #260 has a few issues that would need to be addressed (#260 (comment)). I can fix them, but is it ok with you if we just go ahead and merge/release this for now? |
Feel free to release this PR if it seems to pass PCT. No need to waste days on this. |
Even after #235, I saw a test flake because a
semaphore
step hung forever even thoughSempahoreStep.success
was called. The logs indicate an order of operations that show a bug in the existing synchronization:Here is my hypothesis as to what happened. We have two threads: Thread 1 calls
SemaphoreStep.start
. Thread 2 callsSemaphoreStep.success
. They race for theState
monitor. I think that thread 1 acquired theState
monitor first and ran this entire block:workflow-support-plugin/src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java
Lines 192 to 200 in 1e84e32
Thread 2 then acquired the
State
monitor and ran this block:workflow-support-plugin/src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java
Lines 136 to 143 in 1e84e32
At this point,
s.contexts
does not containk
, soPlanning to unblock wait/1 as success
gets logged, and the monitor is released, but thread 1 already checkeds.returnValues
and will never check it again. Thread 1 in the meantime proceeded to thissynchronized
block and ran it once thread 2 released the monitor:workflow-support-plugin/src/test/java/org/jenkinsci/plugins/workflow/test/steps/SemaphoreStep.java
Lines 210 to 212 in 1e84e32
At this point though, this does nothing. Thread 2 already ran
SemaphoreStep.success
, and the context did not exist, so it put a value inState.returnValues
. Thread 1 had already checkedSemaphoreStep.returnValues
though, and will not check it again, so the step just hangs forever and the test times out.CC @jgreffe