Panic after upgrade to 1.2.4. #51

Closed
mberhault opened this issue Nov 12, 2019 · 6 comments · Fixed by #52

Comments

@mberhault

We just upgraded our 3-node cluster (backed by GCS) to 1.2.4 and received the following panic in the gcp secrets engine about 4 hours later:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1a4e7aa]

goroutine 1156032 [running]:
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault-plugin-secrets-gcp/plugin.(*backend).serviceAccountKeyRollback(0xc007514ff0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0x2c2da20, 0xc018109530, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault-plugin-secrets-gcp/plugin/rollback.go:113 +0x1fa
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault-plugin-secrets-gcp/plugin.(*backend).walRollback(0xc007514ff0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0xc00f170ea0, 0xb, 0x2c2da20, 0xc018109530, 0xc0181094d0, 0xc01c9c7c80)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault-plugin-secrets-gcp/plugin/rollback.go:31 +0x17f
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework.(*Backend).handleWALRollback(0xc00132eea0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0x2, 0x10, 0xc0002e08b8)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework/backend.go:508 +0x49b
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework.(*Backend).handleRollback(0xc00132eea0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0xc0002e0920, 0x5b7c97, 0xc0002e0a60)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework/backend.go:451 +0xc5
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework.(*Backend).HandleRequest(0xc00132eea0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0x0, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework/backend.go:187 +0x7b9
github.com/hashicorp/vault/builtin/plugin.(*PluginBackend).HandleRequest.func1(0x0, 0x28)
	/go/src/github.com/hashicorp/vault/builtin/plugin/backend.go:198 +0x5a
github.com/hashicorp/vault/builtin/plugin.(*PluginBackend).lazyLoadBackend(0xc000874780, 0x38bbb00, 0xc0045b9800, 0x38bc600, 0xc0006be140, 0xc01cfa9b70, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/builtin/plugin/backend.go:160 +0x8c
github.com/hashicorp/vault/builtin/plugin.(*PluginBackend).HandleRequest(0xc000874780, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0x0, 0xc000678db8, 0x4)
	/go/src/github.com/hashicorp/vault/builtin/plugin/backend.go:196 +0xbb
github.com/hashicorp/vault/vault.(*Router).routeCommon(0xc0005de320, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0xc0045b9800, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/vault/router.go:676 +0x919
github.com/hashicorp/vault/vault.(*Router).Route(...)
	/go/src/github.com/hashicorp/vault/vault/router.go:476
github.com/hashicorp/vault/vault.(*RollbackManager).attemptRollback(0xc0002f4d80, 0x38bbb40, 0xc017ddfe30, 0xc000678db8, 0x4, 0xc007a1d440, 0xc002f8ae01, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/vault/rollback.go:226 +0x36e
created by github.com/hashicorp/vault/vault.(*RollbackManager).startOrLookupRollback
	/go/src/github.com/hashicorp/vault/vault/rollback.go:162 +0x246

Prior to this, we were running 1.2.3 for about a week and 1.2.2 for two months.

The secrets engine has been there for quite some time. It currently has over 1K rolesets defined (this is intentional; we use a lot of projects and create two rolesets per project).

The last request to this node in the logs was 3 minutes before the crash on a completely different secrets engine.

@mberhault
Author

Glancing at the code, I see nothing to suggest this was caused by the 1.2.4 upgrade; it is merely the last major thing to happen to this cluster.

It seems that getRoleSet can return nil, nil, but this is not handled in serviceAccountKeyRollback. This doesn't seem new, but I can't speak to the underlying condition.
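
For illustration only, here is a minimal, self-contained Go sketch of that failure mode and the kind of nil check that would avoid it. The names mirror the functions in the stack trace, but the types and signatures are stand-ins, not the plugin's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// RoleSet is a stand-in for the plugin's roleset object.
type RoleSet struct {
	AccountID string
}

// getRoleSet mimics the lookup behavior described above: it returns
// (nil, nil) when the roleset simply doesn't exist, and a non-nil error
// only for real storage failures.
func getRoleSet(name string, store map[string]*RoleSet) (*RoleSet, error) {
	if store == nil {
		return nil, errors.New("storage unavailable")
	}
	return store[name], nil // nil, nil when the name is absent
}

// serviceAccountKeyRollback sketches the missing guard: a nil roleset means
// there is nothing left to roll back, so the WAL entry can simply be dropped
// instead of dereferencing a nil pointer.
func serviceAccountKeyRollback(name string, store map[string]*RoleSet) error {
	rs, err := getRoleSet(name, store)
	if err != nil {
		return err
	}
	if rs == nil {
		// Roleset was deleted after the WAL was written; treat the WAL as stale.
		return nil
	}
	fmt.Println("rolling back key for account", rs.AccountID) // safe: rs != nil
	return nil
}

func main() {
	store := map[string]*RoleSet{"present": {AccountID: "sa-123"}}
	fmt.Println(serviceAccountKeyRollback("present", store)) // <nil>
	fmt.Println(serviceAccountKeyRollback("deleted", store)) // <nil>, no panic
}
```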

kalafut pushed a commit that referenced this issue Nov 12, 2019
@mberhault
Author

@kalafut: thank you for the quick fix. Could you elaborate on this code path? Just trying to understand what happened.

@kalafut
Contributor

kalafut commented Nov 13, 2019

@mberhault Thanks for the bug report! This occurred in the WAL (write-ahead log) system, so it's a bit tough to say what the root cause was. Possibly some operation involving account keys crashed/panicked earlier (but after the WAL was written), and the roleset for that operation was deleted before the WAL rollback ran. Was there any other panic prior to this one? Another possibility is that after the WAL was written, a GCP-related error prevented the WAL from being removed.

@mberhault
Author

We had some GCS unavailability on Monday, then the minor upgrade yesterday, but I don't recall seeing a panic in quite some time.

@kalafut
Contributor

kalafut commented Nov 13, 2019

If your storage is on GCS then it could definitely be related, and entirely panic-free. For example, one "success" sequence could be:

  • Operation requested
  • WAL written
  • Operation succeeds
  • GCP becomes unavailable so WAL can't be deleted
  • GCP restored
  • Roleset is removed
  • Rollback operation runs, finds the WAL entry it wants to roll back, and panics (the bug you're seeing) because the associated roleset isn't present.

There are other ways too, but if GCS was having issues then this isn't unexpected. Normally the WALs would just get cleaned up, but the bug you found prevented it in this case.
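
To make that ordering concrete, here is a toy Go sketch of the sequence. It does not use Vault's actual WAL API; the store and its availability flag are stand-ins invented purely to simulate the GCS outage.

```go
package main

import (
	"errors"
	"fmt"
)

// walStore is a stand-in for the backend's storage; the "available" flag
// simulates the GCS outage described above.
type walStore struct {
	entries   map[string]string
	available bool
}

func (s *walStore) putWAL(id, roleset string) error {
	if !s.available {
		return errors.New("storage unavailable")
	}
	s.entries[id] = roleset
	return nil
}

func (s *walStore) deleteWAL(id string) error {
	if !s.available {
		return errors.New("storage unavailable")
	}
	delete(s.entries, id)
	return nil
}

func main() {
	s := &walStore{entries: map[string]string{}, available: true}

	// 1. Operation requested: a WAL entry is written before touching GCP.
	_ = s.putWAL("wal-1", "my-roleset")

	// 2. The operation itself succeeds (e.g. a service account key is created).

	// 3. Storage becomes unavailable, so the WAL entry can't be deleted.
	s.available = false
	if err := s.deleteWAL("wal-1"); err != nil {
		fmt.Println("WAL cleanup failed:", err) // the entry is now orphaned
	}

	// 4. Storage comes back, and the roleset is later removed.
	s.available = true

	// 5. The periodic rollback eventually finds "wal-1", looks up
	//    "my-roleset", gets nil back, and (before the fix) dereferenced
	//    it and panicked.
	fmt.Println("orphaned WAL entries:", s.entries)
}
```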

@mberhault
Author

Makes sense. Thank you for the explanation.
