[Glibc] Condition variable lost wakeups #56

Closed
mratsim opened this issue Dec 19, 2019 · 13 comments

@mratsim
Owner

mratsim commented Dec 19, 2019

So I'm in conflict:

  • In the red corner: formal verification, model checking, axioms, proofs, and the foundation of our comprehension of the universe.
  • In the blue corner: glibc, an industry standard which should be a reference implementation of pthreads.

As arbiter: the OSX C standard library.

These are example logs of the unfortunate fates of my runtime:

[three screenshots: example runtime logs]

Plenty of people on Stack Overflow suggested that you should use locks, that you should unlock before signaling, or that you should unlock after signaling.
However, the evidence seems damning: glibc sits in the dock. OSX does not exhibit the same behaviour; musl does, though. Formal verification explored almost 10 million possible state interleavings and did not find a deadlock or livelock in my event notifier:
[screenshot: model-checking output]

So what does Glibc have for its defence?

While waiting for a fix that may never come, especially in distros that upgrade very slowly, let's review our options:

  • Ask everyone to switch to Mac (yeah no, ...)
  • Don't back off at all: though saving power would still be nice.
  • Use exponential backoff, log-log-iterated backoff, or one of the other backoff techniques used for Wi-Fi/Bluetooth explained in the backoff readme: https://github.com/mratsim/weave/blob/1a458832/weave/channels/event_notifiers_and_backoff.md. Acceptable, but nanosleep is not a POSIX standard, microsecond-level sleeps are too long and also "approximative", it does not completely put the threads to sleep so some power use remains, and lastly it increases latency.
  • Try one of the wakeup ceremonies mentioned there: https://stackoverflow.com/a/9918764 (apparently used in the Linux kernel):
/* atomic op outside of mutex, and then: */

pthread_mutex_lock(&m);
pthread_mutex_unlock(&m);

pthread_cond_signal(&c);
  • Or condition signaling via mutex unlock? https://news.ycombinator.com/item?id=11894100 (but in the screenshot I did add the lock everywhere)
  • Last resort: implement my own futexes and condition variables from scratch on top of Linux, Mac and Windows primitives (a sketch of the Linux route follows this list) ...
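
For what it's worth, here is a minimal sketch of what that last option can look like on Linux. This is illustrative C only, not Weave's code, and the event_t / event_wait / event_signal names are made up: an auto-reset event built directly on the futex syscall, where the kernel re-checks the expected value atomically, so a signal arriving between the user-space check and the sleep is not lost.

/* Illustrative sketch only (hypothetical names, not Weave's code):
   a one-shot, auto-reset event built directly on the Linux futex syscall.
   glibc has no futex() wrapper, so we call it via syscall(2). */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct { atomic_int state; } event_t;   /* 0 = no event, 1 = signaled */

static long futex(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Consumer: block until the event is signaled, then consume it. */
static void event_wait(event_t *ev) {
    /* FUTEX_WAIT re-checks the value atomically in the kernel: if a
       signal lands between our exchange and the syscall, the wait
       returns immediately instead of sleeping, so nothing is lost. */
    while (atomic_exchange(&ev->state, 0) == 0)
        futex(&ev->state, FUTEX_WAIT_PRIVATE, 0);
}

/* Producer: publish the event, then wake one sleeping consumer. */
static void event_signal(event_t *ev) {
    atomic_store(&ev->state, 1);
    futex(&ev->state, FUTEX_WAKE_PRIVATE, 1);
}

Windows has a comparable primitive in WaitOnAddress / WakeByAddressSingle; macOS has no public futex-style API (its __ulock calls are private), so that platform would need a different fallback.
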
@mratsim
Owner Author

mratsim commented Dec 19, 2019

No comment ...
[screenshot]

@mratsim
Owner Author

mratsim commented Dec 21, 2019

No more condvar on Linux #58

@mratsim mratsim closed this as completed Dec 21, 2019
@mratsim
Owner Author

mratsim commented Dec 21, 2019

And for posterity, I confirm that there are no lost signals in Windows condition variables.

@mratsim
Owner Author

mratsim commented May 9, 2020

@mratsim
Owner Author

mratsim commented Mar 8, 2021

Seems like there is now an upstream issue affecting C#, Python and OCaml:

And someone reimplemented glibc's condition variables in TLA+ to show that they are buggy:

@richfelker

richfelker commented Mar 9, 2021

Someone pointed me to this issue, and it was interesting to me because I was one of the original participants in the discussion around the glibc bug this sounds like (https://sourceware.org/bugzilla/show_bug.cgi?id=13165), and apparently there is allegedly a new, similar glibc bug. However, I'm pretty sure that was not the issue, and that the problem here was not a glibc bug. Observing any bug of that form (stolen signals) requires the signal to take place with the lock held; otherwise, there is no ordering relationship between the signal and newly arriving waiters that might "steal" the signal, and thus no way to designate it stolen. In the code shown here, the signal takes place without the lock held.

I'm not sure if the new futex code you have is correct or not; with backoff and retry it probably works automatically, though. But the cond var code is almost certainly wrong. If you need to know that a signal will wake a waiter that arrived before the event you're signaling, the change to the state and the signal operation must take place together under a single instance of the lock being held.
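
As an illustration of that rule (a minimal pthread sketch with hypothetical names, not Weave's API): the state change and the signal share one lock acquisition, and the waiter re-checks the predicate under the same lock.

/* Sketch of the rule above (hypothetical names, not Weave's code):
   the predicate change and the signal happen under a single instance
   of the lock being held, and the waiter re-checks the predicate
   under that same lock before blocking. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static bool event_pending = false;   /* the predicate, protected by m */

void notify(void) {
    pthread_mutex_lock(&m);
    event_pending = true;            /* state change ...              */
    pthread_cond_signal(&c);         /* ... and signal, same lock     */
    pthread_mutex_unlock(&m);
}

void wait_for_event(void) {
    pthread_mutex_lock(&m);
    while (!event_pending)           /* re-check under the lock       */
        pthread_cond_wait(&c, &m);   /* tolerates spurious wakeups    */
    event_pending = false;           /* consume the event             */
    pthread_mutex_unlock(&m);
}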

@richfelker

As an extra data point, the new glibc bug https://sourceware.org/bugzilla/show_bug.cgi?id=25847 is not reproducible on musl (where the current condvar implementation, dating back to 2014, was specifically designed around getting the stolen wakes issue right, as well as self-synchronized destruction right), but as I understand it, you hit the problem described above with both glibc and musl.

@hmkemppainen

I'm that someone. I was at work at the time and didn't really have the time to study the code or the TLA+ model. I just sort of assumed you knew what you were doing and dropped a link to this bug on IRC for the musl folks to check out.

I've now taken a look at the PlusCal code, and it is clear that it's not simulating a condition variable correctly. Signal should be a no-op if nobody is waiting, but in your spec it simply sets a boolean variable which will eventually awaken a waiter, whether or not they were waiting at the time of the signal.

Try something like this and TLC will find a deadlock:

diff --git a/formal_verification/event_notifiers.tla b/formal_verification/event_notifiers.tla
index aaae45f..0ff4bd8 100644
--- a/formal_verification/event_notifiers.tla
+++ b/formal_verification/event_notifiers.tla
@@ -36,6 +36,7 @@ variables
     ticket = FALSE;                        \* a ticket to sleep in a phase
     signaled = FALSE;                      \* a flag for incoming condvar signal
 
+    condWait = 0;
     condVar = FALSE;                       \* Simulate a condition variable
     msgToParent = "None";                  \* Simulate a worksharing request to the parent
     signaledTerminate = FALSE;             \* Simulate a termination task
@@ -88,8 +89,10 @@ procedure park()
     begin
         NotSignaled2:  if ~signaled then
         StillValid:      if ticket = phase then
+        StartWait:         condWait := condWait + 1;
         Wait:              await condVar;    \* next line is atomic
                            condVar := FALSE; \* we don't model the spurious wakeups here
+                           condWait := condWait - 1;
                          end if;             \* but they are not a problem anyway
                        end if;
         Reset:         signaled := FALSE;
@@ -104,7 +107,9 @@ procedure notify()
                        end if;
         Notify:        signaled := TRUE;
         InvalidTicket: phase := ~phase;
-        Awaken:        condVar := TRUE;
+        Awaken:        if condWait > 0 then
+                          condVar := TRUE;
+                       end if;
         RET_Notify:    return;
 end procedure;

As @richfelker said, you really should use that lock. It's not there just for show, and you can't dance around it using atomics.

@mratsim
Owner Author

mratsim commented Mar 16, 2021

Thank you for the TLA+ condvar model.

As @richfelker said, you really should use that lock. It's not there just for show, and you can't dance around it using atomics.

I'm not sure what you mean by dancing around it with atomics; this also deadlocks despite using the lock:

[screenshot]

@hmkemppainen

hmkemppainen commented Mar 16, 2021

I'm not sure what you mean by dancing around it with atomics; this also deadlocks despite using the lock

I mean that the waiter must both check the state variable and call wait while holding the lock. If you don't, there's always a race between checking the state variable and entering wait (no matter how many atomics and fences you use). Of course, the signaller must modify that state variable only while holding the lock.

If you think about the condvar API, it wouldn't make any sense for it to require you to just wrap the wait/signal calls with a lock. If that were all it needed from you, it could very well grab the lock inside the library for you and you would never need to worry about it. But simply wrapping the call is not enough: you must also protect the controlling state variable with the lock. That is why it's your responsibility to use the lock.

I don't know if that explanation makes sense but consider this pseudocode:

park() {
  lock.acquire();
  if (shouldWakeUp) {
    shouldWakeUp = FALSE;
    lock.release();
    return;
  }
  // if we didn't have the lock here, shouldWakeUp could change to TRUE right now
  // and the notifying thread could signal before we enter wait().  That signal is effectively
  // lost.
  cond.wait(lock);
  lock.release();
}

notify() {
  lock.acquire();
  shouldWakeUp = TRUE;
  cond.signal(); // it should be safe to signal after the release too
  lock.release();
}

It is critical that the condition for waking up is checked while the lock is held, before entering wait. That's what the lock is there for.

@mratsim
Owner Author

mratsim commented Mar 16, 2021

The whole point of my "2-phase commit to sleep" protocol is to avoid holding the lock for a long time.

There is work to do in between the shouldWakeUp check and cond.wait(lock) that requires sending messages to other threads.
Your proposal might deadlock my runtime if threads are locking each of their event notifiers.

@richfelker

You still must check the predicate again with the lock held before waiting on the condvar, and there must be an invariant that other threads cannot modify the data the predicate depends on without holding the lock.

You don't have to "hold the lock for a long time". You should generally do nothing but take the lock, make sure nothing has changed that might require re-inspection, and then perform the condvar wait. Long-running operations can be done outside that.
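
A hedged sketch of that shape, reusing the hypothetical m, c and event_pending from the earlier sketch; announce_intent_to_sleep and do_long_running_work stand in for the runtime's own message-passing work and are not real Weave functions:

/* Hypothetical stand-ins for the runtime's own work; not Weave functions. */
void announce_intent_to_sleep(void);
void do_long_running_work(void);

void consumer_park(void) {
    announce_intent_to_sleep();      /* no lock held here ...          */
    do_long_running_work();          /* ... nor here                   */

    pthread_mutex_lock(&m);
    while (!event_pending)           /* re-check under the lock:       */
        pthread_cond_wait(&c, &m);   /* block only if nothing changed  */
    event_pending = false;           /* consume the event              */
    pthread_mutex_unlock(&m);
}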

@mratsim
Owner Author

mratsim commented Mar 16, 2021

Ah I see what you mean.

In the current code this is ensured by the phase and ticket system:

type
  EventNotifier* = object
    ## Multi Producers, Single Consumer event notification
    ## This can be seen as a wait-free condition variable for producers
    ## that avoids them spending time in expensive kernel land due to mutexes.
    ##
    ## This data structure should be associated with an MPSC channel
    ## to notify that an "event" happened in the channel.
    ## It avoids spurious polling of empty channels,
    ## and allows parking of threads to save on CPU power.
    ##
    ## See also: binary semaphores, eventcounts
    ## On Windows: ManuallyResetEvent and AutoResetEvent
    when supportsFutex:
      # ---- Consumer specific ----
      ticket{.align: WV_CacheLinePadding.}: uint8 # A ticket for the consumer to sleep in a phase
      # ---- Contention ---- no real need for padding as cache line should be reloaded in case of contention anyway
      futex: Futex                                # A Futex (atomic int32 that can put a thread to sleep)
      phase: Atomic[uint8]                        # A binary timestamp, toggles between 0 and 1 (but there is no atomic "not")
      signaled: Atomic[bool]                      # Signaling condition
    else:
      # ---- Consumer specific ----
      lock{.align: WV_CacheLinePadding.}: Lock    # The lock is never used, it's just there for the condition variable
      ticket: uint8                               # A ticket for the consumer to sleep in a phase
      # ---- Contention ---- no real need for padding as cache line should be reloaded in case of contention anyway
      cond: Cond                                  # Allows parking of threads
      phase: Atomic[uint8]                        # A binary timestamp, toggles between 0 and 1 (but there is no atomic "not")
      signaled: Atomic[bool]                      # Signaling condition

func prepareToPark*(en: var EventNotifier) {.inline.} =
  ## The consumer intends to sleep soon.
  ## This must be called before the formal notification
  ## via a channel.
  if not en.signaled.load(moRelaxed):
    en.ticket = en.phase.load(moRelaxed)

proc park*(en: var EventNotifier) {.inline.} =
  ## Wait until we are signaled of an event
  ## Thread is parked and does not consume CPU resources
  ## This may wake up spuriously.
  if not en.signaled.load(moRelaxed):
    if en.ticket == en.phase.load(moRelaxed):
      when supportsFutex:
        en.futex.wait(0)
      else:
        en.cond.wait(en.lock) # Spurious wakeups are not a problem
  en.signaled.store(false, moRelaxed)
  when supportsFutex:
    en.futex.initialize()
  # We still hold the lock but it's not used anyway.

This works fine with the futex code. I wonder why I didn't get any deadlocks with Windows or Mac condition variables. In particular, Windows was tested on the same machine, so it's possible that its signaling works under different assumptions than POSIX.
