fix deadlock on core state lock #10456
Conversation
I think you misread the comment. It says:
Although in practice we only use this function for the stateLock, it could be used for others, so I would probably rewrite this comment as:
Force-pushed from 8e283c1 to 49f5d3c
I refactored the function comment to make it more clear. It looks like what is being accomplished by the grabLockOrStop function is making sync.Mutex.Lock cancelable. A better way to accomplish this would be to switch c.stateLock to be a channel buffered to 1, but that'd be a much larger change.

Our vault enterprise deployment has been getting deadlocked because c.stateLock has an exclusive lock acquired but never released. I'm not sure if this issue caused it, but the original grabLockOrStop code is incorrect because it can return stopped=true and leave the lock held randomly, depending on the timing of when stopCh is closed. It doesn't look like usages of grabLockOrStop on c.stateLock in the open source code would cause this issue. I wonder if there's code in vault enterprise that's acquiring c.stateLock with grabLockOrStop and closing stopCh concurrently?
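For illustration, here is a minimal sketch of the buffered-channel alternative mentioned above. The names (chanLock, lockOrStop, unlock) are hypothetical, not Vault code; the point is that acquiring a channel-based lock is a send that can sit in a select next to stopCh, so cancellation falls out naturally:

    package main

    import "fmt"

    // chanLock is a mutex built from a channel buffered to 1. Sending into
    // the channel acquires the lock; receiving from it releases the lock.
    type chanLock chan struct{}

    func newChanLock() chanLock { return make(chanLock, 1) }

    // lockOrStop tries to acquire the lock but gives up if stopCh closes
    // first. It returns stopped=true only when the lock is NOT held, which
    // is exactly the invariant grabLockOrStop is supposed to maintain.
    func (l chanLock) lockOrStop(stopCh chan struct{}) (stopped bool) {
        select {
        case l <- struct{}{}: // acquired the single buffer slot
            return false
        case <-stopCh:
            return true
        }
    }

    func (l chanLock) unlock() { <-l }

    func main() {
        l := newChanLock()
        stopCh := make(chan struct{})
        if !l.lockOrStop(stopCh) {
            fmt.Println("lock acquired")
            l.unlock()
        }
    }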
Yeah, that's plausible - at least I can't rule it out yet. Do you by any chance have a goroutine dump you could share from when the deadlocking was happening?
This is a 33MiB file that contains stack traces from our deadlocked vault enterprise 1.5.5 server. Here are my main learnings from it.

There are thousands of goroutines blocked (for over 100 minutes) trying to acquire a read lock on c.stateLock (Line 69 in f5d1ddb). These are coming from a request forwarding HTTP handler that's checking if the vault instance is a performance standby. So every HTTP request that came in leaked a goroutine because of this deadlock.

There is one goroutine blocked (for over 100 minutes) trying to acquire a write lock on c.stateLock (Line 637 in f5d1ddb).

Because we have a bunch of goroutines blocked on c.stateLock.RLock() and one blocked on c.stateLock.Lock(), it implies that an exclusive lock is already held on c.stateLock. Because the goroutines are blocked for 100+ minutes, it implies a deadlock.

I also searched for stack traces stuck in critical sections while c.stateLock.Lock was held, to check for a deadlock caused by c.stateLock.Lock being called and then getting hung on something before it could unlock. I didn't find any stack traces hung in one of these critical sections, though. This makes it look like there was a c.stateLock.Lock that wasn't matched by a c.stateLock.Unlock.

The race condition with grabLockOrStop could cause a c.stateLock.Lock that's not matched by a c.stateLock.Unlock, if grabLockOrStop returns stopped=true after locking c.stateLock and not unlocking it (that's the bug). It doesn't look like any of the callers in the open source code would trigger this bug, which makes me wonder if there's something in the vault enterprise code. It'd also be good to look for a c.stateLock.Lock in the vault enterprise code to see if it's not matched with a c.stateLock.Unlock.
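A minimal sketch of the symptom in the dump, assuming only the documented behavior of Go's sync.RWMutex (a blocked writer excludes new readers): one leaked exclusive lock leaves a writer pending forever, and every subsequent RLock queues behind it.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        var stateLock sync.RWMutex

        stateLock.Lock() // acquired but never released (the suspected bug)

        go func() {
            stateLock.Lock() // pending writer, blocks forever
        }()

        time.Sleep(100 * time.Millisecond) // let the writer start waiting

        done := make(chan struct{})
        go func() {
            stateLock.RLock() // new readers queue behind the pending writer
            close(done)
        }()

        select {
        case <-done:
            fmt.Println("read lock acquired")
        case <-time.After(time.Second):
            fmt.Println("reader blocked: deadlock symptom reproduced")
        }
    }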
The bug is in the grabLockOrStop function. For specific concurrent executions the grabLockOrStop function can return stopped=true when the lock is still held. A comment in grabLockOrStop indicates that the function is only used when the stateLock is held, but grabLockOrStop is being used to acquire the stateLock. If there are concurrent goroutines using grabLockOrStop, then some concurrent executions result in stopped=true being returned when the lock is acquired.

The fix is to add a lock and some state around which the parent and child goroutine in the grabLockOrStop function can coordinate, so that the different concurrent executions can be handled. This change includes a non-deterministic unit test which reliably reproduces the problem before the fix.
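The race and the fix described here read roughly like the sketch below. This is a simplified reconstruction from the description above, not the exact Vault source; lockFunc/unlockFunc stand in for c.stateLock's methods.

    package main

    import "sync"

    // Racy version: if stopCh closes after the child goroutine has already
    // acquired the lock and closed lockedCh, the parent's select may still
    // pick stopCh and return stopped=true while the lock is held.
    func grabLockOrStopRacy(lockFunc, unlockFunc func(), stopCh chan struct{}) (stopped bool) {
        lockedCh := make(chan struct{})
        go func() {
            lockFunc()
            select {
            case <-stopCh:
                unlockFunc() // stopped before anyone saw the lock
            default:
                close(lockedCh)
            }
        }()
        select {
        case <-stopCh:
            return true // BUG: the lock may be held at this point
        case <-lockedCh:
            return false
        }
    }

    // Fixed version: parent and child share state under a small mutex, so
    // exactly one side is responsible for unlocking when the stop wins.
    func grabLockOrStopFixed(lockFunc, unlockFunc func(), stopCh chan struct{}) (stopped bool) {
        var coord sync.Mutex
        parentWaiting := true
        childLocked := false
        lockedCh := make(chan struct{})

        go func() {
            lockFunc()
            coord.Lock()
            defer coord.Unlock()
            if parentWaiting {
                childLocked = true
                close(lockedCh)
            } else {
                unlockFunc() // parent already gave up; release immediately
            }
        }()

        select {
        case <-lockedCh:
            return false
        case <-stopCh:
        }

        coord.Lock()
        defer coord.Unlock()
        parentWaiting = false
        if childLocked {
            unlockFunc() // child grabbed the lock after stop won the race
        }
        return true
    }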
Force-pushed from 57e9b29 to aafe181
vault/ha_test.go (Outdated)

    go func() {
        defer closerWg.Done()
        // Close the stop channel half the time.
        if time.Now().UnixNano()%2 == 0 {
When I tested this by adding an atomic to count how often the close path was taken, I found it was always taken. I suggest instead using something like rand.Int31().
Hi @sbunce, I've gone over the new code carefully and I agree that this should fix the issue you found. The test is good too, modulo a minor comment I added. Since this is such a critical func I'm going to have someone else look at it too. Thanks very much for the contribution!
Using time.Now().UnixNano()%2 ends up being system dependent because different operating systems and hardware have different clock resolution. A lower resolution will return the same unix time for a longer period of time. It is better to avoid this issue by using a random number generator. This change uses the rand package default random number generator. It's generally good to avoid using the default random number generator, because it creates extra lock contention. For a test it should be fine.
On my system the close is only done roughly 1/2 the time. It looks like what's happening is that time.Now().UnixNano() returns the same value on a bunch of subsequent calls depending on the resolution of the clock, which is dependent on operating system and hardware. Your idea to use a random number generator is definitely better because it does not have this problem. I switched to that and verified it worked by temporarily putting some counters in. Good catch!
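For reference, a self-contained sketch of what the revised closer goroutine presumably looks like, using rand.Int31()%2 as suggested (closerWg and stopCh are stand-ins for the test's variables; the surrounding test body is omitted):

    package main

    import (
        "math/rand"
        "sync"
    )

    func main() {
        var closerWg sync.WaitGroup
        stopCh := make(chan struct{})

        closerWg.Add(1)
        go func() {
            defer closerWg.Done()
            // Close the stop channel roughly half the time; rand avoids the
            // clock-resolution bias of time.Now().UnixNano()%2.
            if rand.Int31()%2 == 0 {
                close(stopCh)
            }
        }()
        closerWg.Wait()
    }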
The code changes look great to me, thanks for the in-depth description of the issue and tracking down this tricky deadlock.
It's nice that we don't have to worry about this restriction anymore, making the helper more useful overall: "If stopCh has been closed, which only happens while the stateLock is held"
fix deadlock on core state lock (#10456)
changelog/10456.txt
Co-authored-by: Seth Bunce <[email protected]>
Co-authored-by: Meggie <[email protected]>
The most important part of evaluating this change is to understand the invariant: if grabLockOrStop returns stopped=true, then the lock is not supposed to be held. The original code violated that invariant for some concurrent executions. The included unit test demonstrates that it was possible for grabLockOrStop to return stopped=true while the lock was held. You can see the unit test fail with the old code by copy/pasting the unit test into the old code and running it.
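A simplified sketch of the kind of test being described, not the actual vault/ha_test.go. It assumes the grabLockOrStopFixed reconstruction from the earlier sketch: run many racing iterations and, after each one, assert the invariant by checking that the lock can be re-acquired promptly.

    package main

    import (
        "math/rand"
        "sync"
        "testing"
        "time"
    )

    func TestGrabLockOrStop(t *testing.T) {
        var stateLock sync.RWMutex
        for i := 0; i < 1000; i++ {
            stopCh := make(chan struct{})

            // Racing goroutine that closes the stop channel half the time.
            var closerWg sync.WaitGroup
            closerWg.Add(1)
            go func() {
                defer closerWg.Done()
                if rand.Int31()%2 == 0 {
                    close(stopCh)
                }
            }()

            stopped := grabLockOrStopFixed(stateLock.Lock, stateLock.Unlock, stopCh)
            if !stopped {
                stateLock.Unlock() // we own the lock; release it ourselves
            }
            closerWg.Wait()

            // Invariant check: the lock must be free either way, so a
            // fresh Lock/Unlock must complete promptly.
            done := make(chan struct{})
            go func() {
                stateLock.Lock()
                stateLock.Unlock()
                close(done)
            }()
            select {
            case <-done:
            case <-time.After(time.Second):
                t.Fatal("lock still held after grabLockOrStop returned")
            }
        }
    }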
The first solution I tried was to add a sync.WaitGroup so the grabLockOrStop goroutine waits for the child goroutine to exit. This created an extra delay when the stop channel was closed, which upset a lot of unit tests. I mention this because it helps to explain the required behavior of the grabLockOrStop function, which is subtle.

The second solution is to introduce a lock and some locked state around which the parent and child can coordinate to handle the different concurrent executions.
Concurrent Executions: