fix deadlock on core state lock #10456
Conversation
I think you misread the comment. It says:
Although in practice we only use this function for the stateLock, it could be used for others, so I would probably rewrite this comment as:
Force-pushed from 8e283c1 to 49f5d3c
I refactored the function comment to make it more clear. It looks like what is being accomplished by the grabLockOrStop function is making sync.Mutex.Lock cancelable. A better way to accomplish this would be to switch c.stateLock to be a channel buffered to 1, but that'd be a much larger change.

Our vault enterprise deployment has been getting deadlocked because c.stateLock has an exclusive lock acquired but never released. I'm not sure if this issue caused it, but the original grabLockOrStop code is incorrect because it can return stopped=true and leave the lock held randomly, depending on the timing of when stopCh is closed. It doesn't look like usages of grabLockOrStop on c.stateLock in the open source code would cause this issue. I wonder if there's code in vault enterprise that's acquiring c.stateLock with grabLockOrStop and closing stopCh concurrently?
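For illustration, here is a minimal sketch of the buffered-channel alternative mentioned above. The names (chanLock, lockOrStop, unlock) are hypothetical, not Vault code; the point is that acquiring a channel-based lock is a send that can sit in a select next to stopCh, so cancellation falls out naturally:

    package main

    import "fmt"

    // chanLock is a mutex built from a channel buffered to 1. Sending into
    // the channel acquires the lock; receiving from it releases the lock.
    type chanLock chan struct{}

    func newChanLock() chanLock { return make(chanLock, 1) }

    // lockOrStop tries to acquire the lock but gives up if stopCh closes
    // first. It returns stopped=true only when the lock is NOT held, which
    // is exactly the invariant grabLockOrStop is supposed to maintain.
    func (l chanLock) lockOrStop(stopCh chan struct{}) (stopped bool) {
        select {
        case l <- struct{}{}: // acquired the single buffer slot
            return false
        case <-stopCh:
            return true
        }
    }

    func (l chanLock) unlock() { <-l }

    func main() {
        l := newChanLock()
        stopCh := make(chan struct{})
        if !l.lockOrStop(stopCh) {
            fmt.Println("lock acquired")
            l.unlock()
        }
    }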
Yeah, that's plausible - at least I can't rule it out yet. Do you by any chance have a goroutine dump you could share from when the deadlocking was happening?
This is a 33MiB file that contains stack traces from our deadlocked vault enterprise 1.5.5 server. Here are my main learnings from it.

There are thousands of goroutines blocked (for over 100 minutes) trying to acquire a read lock on c.stateLock (Line 69 in f5d1ddb). These are coming from a request forwarding HTTP handler that's checking if the vault instance is a performance standby. So every HTTP request that came in leaked a goroutine because of this deadlock.

There is one goroutine blocked (for over 100 minutes) trying to acquire a write lock on c.stateLock (Line 637 in f5d1ddb).

Because we have a bunch of goroutines blocked on c.stateLock.RLock() and one blocked on c.stateLock.Lock(), it implies that an exclusive lock is already held on c.stateLock. Because the goroutines are blocked for 100+ minutes, it implies a deadlock.

I also searched for stack traces stuck in critical sections while c.stateLock.Lock was held, to check for a deadlock caused by c.stateLock.Lock being called and then getting hung on something before it could unlock. I didn't find any stack traces hung in one of these critical sections, though. This makes it look like there was a c.stateLock.Lock that wasn't matched by a c.stateLock.Unlock.

The race condition with grabLockOrStop could cause a c.stateLock.Lock that's not matched by a c.stateLock.Unlock, if grabLockOrStop returns stopped=true after locking c.stateLock and not unlocking it (that's the bug). It doesn't look like any of the callers in the open source code would trigger this bug, which makes me wonder if there's something in the vault enterprise code. It'd also be good to look for a c.stateLock.Lock in the vault enterprise code to see if it's not matched with a c.stateLock.Unlock.
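A minimal sketch of the symptom in the dump, assuming only the documented behavior of Go's sync.RWMutex (a blocked writer excludes new readers): one leaked exclusive lock leaves a writer pending forever, and every subsequent RLock queues behind it.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    func main() {
        var stateLock sync.RWMutex

        stateLock.Lock() // acquired but never released (the suspected bug)

        go func() {
            stateLock.Lock() // pending writer, blocks forever
        }()

        time.Sleep(100 * time.Millisecond) // let the writer start waiting

        done := make(chan struct{})
        go func() {
            stateLock.RLock() // new readers queue behind the pending writer
            close(done)
        }()

        select {
        case <-done:
            fmt.Println("read lock acquired")
        case <-time.After(time.Second):
            fmt.Println("reader blocked: deadlock symptom reproduced")
        }
    }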
The bug is in the grabLockOrStop function. For specific concurrent executions the grabLockOrStop function can return stopped=true when the lock is still held. A comment in grabLockOrStop indicates that the function is only used when the stateLock is held, but grabLockOrStop is being used to acquire the stateLock. If there are concurrent goroutines using grabLockOrStop, then some concurrent executions result in stopped=true being returned when the lock is acquired.

The fix is to add a lock and some state around which the parent and child goroutine in the grabLockOrStop function can coordinate, so that the different concurrent executions can be handled. This change includes a non-deterministic unit test which reliably reproduces the problem before the fix.
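The race and the fix described here read roughly like the sketch below. This is a simplified reconstruction from the description above, not the exact Vault source; lockFunc/unlockFunc stand in for c.stateLock's methods.

    package main

    import "sync"

    // Racy version: if stopCh closes after the child goroutine has already
    // acquired the lock and closed lockedCh, the parent's select may still
    // pick stopCh and return stopped=true while the lock is held.
    func grabLockOrStopRacy(lockFunc, unlockFunc func(), stopCh chan struct{}) (stopped bool) {
        lockedCh := make(chan struct{})
        go func() {
            lockFunc()
            select {
            case <-stopCh:
                unlockFunc() // stopped before anyone saw the lock
            default:
                close(lockedCh)
            }
        }()
        select {
        case <-stopCh:
            return true // BUG: the lock may be held at this point
        case <-lockedCh:
            return false
        }
    }

    // Fixed version: parent and child share state under a small mutex, so
    // exactly one side is responsible for unlocking when the stop wins.
    func grabLockOrStopFixed(lockFunc, unlockFunc func(), stopCh chan struct{}) (stopped bool) {
        var coord sync.Mutex
        parentWaiting := true
        childLocked := false
        lockedCh := make(chan struct{})

        go func() {
            lockFunc()
            coord.Lock()
            defer coord.Unlock()
            if parentWaiting {
                childLocked = true
                close(lockedCh)
            } else {
                unlockFunc() // parent already gave up; release immediately
            }
        }()

        select {
        case <-lockedCh:
            return false
        case <-stopCh:
        }

        coord.Lock()
        defer coord.Unlock()
        parentWaiting = false
        if childLocked {
            unlockFunc() // child grabbed the lock after stop won the race
        }
        return true
    }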
Force-pushed from 57e9b29 to aafe181
vault/ha_test.go (Outdated)

    go func() {
        defer closerWg.Done()
        // Close the stop channel half the time.
        if time.Now().UnixNano()%2 == 0 {
When I tested this by adding an atomic to count how often the close path was taken, I found it was always taken. I suggest instead using something like rand.Int31().
Hi @sbunce, I've gone over the new code carefully and I agree that this should fix the issue you found. The test is good too, modulo a minor comment I added. Since this is such a critical func I'm going to have someone else look at it too. Thanks very much for the contribution!
Using time.Now().UnixNano()%2 ends up being system dependent because different operating systems and hardware have different clock resolution. A lower resolution will return the same unix time for a longer period of time. It is better to avoid this issue by using a random number generator. This change uses the rand package default random number generator. It's generally good to avoid using the default random number generator, because it creates extra lock contention. For a test it should be fine.
On my system the close is only done roughly 1/2 the time. It looks like what's happening is that time.Now().UnixNano() returns the same value on a bunch of subsequent calls depending on the resolution of the clock, which is dependent on operating system and hardware. Your idea to use a random number generator is definitely better because it does not have this problem. I switched to that and verified it worked by temporarily putting some counters in. Good catch!
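For reference, a self-contained sketch of what the revised closer goroutine presumably looks like, using rand.Int31()%2 as suggested (closerWg and stopCh are stand-ins for the test's variables; the surrounding test body is omitted):

    package main

    import (
        "math/rand"
        "sync"
    )

    func main() {
        var closerWg sync.WaitGroup
        stopCh := make(chan struct{})

        closerWg.Add(1)
        go func() {
            defer closerWg.Done()
            // Close the stop channel roughly half the time; rand avoids the
            // clock-resolution bias of time.Now().UnixNano()%2.
            if rand.Int31()%2 == 0 {
                close(stopCh)
            }
        }()
        closerWg.Wait()
    }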
The code changes look great to me, thanks for the in-depth description of the issue and tracking down this tricky deadlock.
It's nice that we don't have to worry about this restriction anymore, making the helper more useful overall: "If stopCh has been closed, which only happens while the stateLock is held"
fix deadlock on core state lock (#10456)
changelog/10456.txt
Co-authored-by: Seth Bunce <[email protected]>
Co-authored-by: Meggie <[email protected]>
The most important part of evaluating this change is to understand the invariant: if grabLockOrStop returns stopped=true, then the lock is not supposed to be held. The original code violated that invariant for some concurrent executions. The included unit test demonstrates that it was possible for grabLockOrStop to return stopped=true while the lock was held. You can see the unit test fail with the old code by copy/pasting the unit test into the old code and running it.
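A simplified sketch of the kind of test being described, not the actual vault/ha_test.go. It assumes the grabLockOrStopFixed reconstruction from the earlier sketch: run many racing iterations and, after each one, assert the invariant by checking that the lock can be re-acquired promptly.

    package main

    import (
        "math/rand"
        "sync"
        "testing"
        "time"
    )

    func TestGrabLockOrStop(t *testing.T) {
        var stateLock sync.RWMutex
        for i := 0; i < 1000; i++ {
            stopCh := make(chan struct{})

            // Racing goroutine that closes the stop channel half the time.
            var closerWg sync.WaitGroup
            closerWg.Add(1)
            go func() {
                defer closerWg.Done()
                if rand.Int31()%2 == 0 {
                    close(stopCh)
                }
            }()

            stopped := grabLockOrStopFixed(stateLock.Lock, stateLock.Unlock, stopCh)
            if !stopped {
                stateLock.Unlock() // we own the lock; release it ourselves
            }
            closerWg.Wait()

            // Invariant check: the lock must be free either way, so a
            // fresh Lock/Unlock must complete promptly.
            done := make(chan struct{})
            go func() {
                stateLock.Lock()
                stateLock.Unlock()
                close(done)
            }()
            select {
            case <-done:
            case <-time.After(time.Second):
                t.Fatal("lock still held after grabLockOrStop returned")
            }
        }
    }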
The first solution I tried was to add a sync.WaitGroup so the grabLockOrStop goroutine waits for the child goroutine to exit. This created an extra delay when the stop channel was closed, which upset a lot of unit tests. I mention this because it helps to explain the required behavior of the grabLockOrStop function, which is subtle.

The second solution is to introduce a lock and some locked state around which the parent and child can coordinate to handle the different concurrent executions.
Concurrent Executions: