Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Break grabLockOrStop into two pieces to facilitate investigating deadlocks #17187

Merged
merged 4 commits into from
Sep 20, 2022

Conversation

ncabatoff
Copy link
Collaborator

Without this change, the "grab" locking goroutine looks the same regardless of who was calling grabLockOrStop, so there's no way to identify one of the deadlock parties.

Example of the confusion we're trying to eliminate:

=== CONT  TestSecret_TokenAccessor/auth
POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 8799 lock 0xc000ccd298
../../init.go:230 vault.(*Core).Initialize { c.stateLock.Lock() } <<<<<
../../testing.go:341 vault.TestCoreInitClusterWrapperSetup { result, err := core.Initialize(context.Background(), initParams) }
../../testing.go:2037 vault.(*TestCluster).initCores { bKeys, rKeys, root := TestCoreInitClusterWrapperSetup(t, leader.Core, leader.Handler) }
../../testing.go:1759 vault.NewTestCluster { testCluster.initCores(t, opts, addAuditBackend) }
api_integration_test.go:61 api.testVaultServerCoreConfig { cluster := vault.NewTestCluster(benchhelpers.TBtoT(t), coreConfig, &vault.TestClusterOptions{ }
api_integration_test.go:36 api.testVaultServerUnseal { return testVaultServerCoreConfig(t, &vault.CoreConfig{ }
api_integration_test.go:27 api.testVaultServer { client, _, closer := testVaultServerUnseal(t) }
secret_test.go:1005 api.TestSecret_TokenPolicies.func3 { client, closer := testVaultServer(t) }

Have been trying to lock it again for more than 30s
goroutine 10172 lock 0xc000ccd298
../../ha.go:670 vault.grabLockOrStop.func1 { lockFunc() } <<<<<

…locks. Without this change, the "grab" goroutine looks the same regardless of who was calling grabLockOrStop, so there's no way to identify one of the deadlock parties.
@raskchanky
Copy link
Contributor

It looks like there's a test using the old way of checking grabLockOrStop() here. Does that need to be updated to be called the new way?

@mpalmi mpalmi added the core Issues and Pull-Requests specific to Vault Core label Sep 19, 2022
@ncabatoff
Copy link
Collaborator Author

It looks like there's a test using the old way of checking grabLockOrStop() here. Does that need to be updated to be called the new way?

No, I kept the old func around for now, in part so I don't break ent when we merge to it. I might get rid of it in another pass later.

@mpalmi mpalmi requested a review from a team September 20, 2022 13:06
Copy link
Contributor

@hghaf099 hghaf099 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I just posted a nit suggestion.

defer close(l.doneCh)
l.lockFunc()

// The parent goroutine may or may not be waiting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it worth adding godoc for this function explaining how this function should be used? Although all the examples usages show that, it is not clear maybe for a future developer as to why this function should be used in a separate go routine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, added.

@ncabatoff ncabatoff enabled auto-merge (squash) September 20, 2022 14:41
@ncabatoff ncabatoff disabled auto-merge September 20, 2022 15:02
@ncabatoff ncabatoff merged commit cbbf1a5 into main Sep 20, 2022
@ncabatoff ncabatoff deleted the split-grabLockOrStop branch September 20, 2022 15:03
@raskchanky
Copy link
Contributor

It looks like there's a test using the old way of checking grabLockOrStop() here. Does that need to be updated to be called the new way?

No, I kept the old func around for now, in part so I don't break ent when we merge to it. I might get rid of it in another pass later.

Right, I noticed that the function itself is still there, but the way the function is used has changed, no? Previously it was:

stopped := grabLockOrStop(c.stateLock.RLock, c.stateLock.RUnlock, stopCh)
if stopped {
    // something
}

And now it's:

l := newLockGrabber(c.stateLock.RLock, c.stateLock.RUnlock, stopCh)
go l.grab()
if stopped := l.lockOrStop(); stopped {
    // something
}

Everything in this PR uses the new way and that test still uses the old way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Issues and Pull-Requests specific to Vault Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants