Panic after upgrade to 1.2.4. #51

Closed
mberhault opened this issue Nov 12, 2019 · 6 comments · Fixed by #52

Comments

@mberhault

We just upgraded our 3-node cluster (backed by GCS) to 1.2.4 and received the following panic in the gcp secrets engine about 4 hours later:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1a4e7aa]

goroutine 1156032 [running]:
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault-plugin-secrets-gcp/plugin.(*backend).serviceAccountKeyRollback(0xc007514ff0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0x2c2da20, 0xc018109530, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault-plugin-secrets-gcp/plugin/rollback.go:113 +0x1fa
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault-plugin-secrets-gcp/plugin.(*backend).walRollback(0xc007514ff0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0xc00f170ea0, 0xb, 0x2c2da20, 0xc018109530, 0xc0181094d0, 0xc01c9c7c80)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault-plugin-secrets-gcp/plugin/rollback.go:31 +0x17f
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework.(*Backend).handleWALRollback(0xc00132eea0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0x2, 0x10, 0xc0002e08b8)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework/backend.go:508 +0x49b
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework.(*Backend).handleRollback(0xc00132eea0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0xc0002e0920, 0x5b7c97, 0xc0002e0a60)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework/backend.go:451 +0xc5
github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework.(*Backend).HandleRequest(0xc00132eea0, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0x0, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/vendor/github.com/hashicorp/vault/sdk/framework/backend.go:187 +0x7b9
github.com/hashicorp/vault/builtin/plugin.(*PluginBackend).HandleRequest.func1(0x0, 0x28)
	/go/src/github.com/hashicorp/vault/builtin/plugin/backend.go:198 +0x5a
github.com/hashicorp/vault/builtin/plugin.(*PluginBackend).lazyLoadBackend(0xc000874780, 0x38bbb00, 0xc0045b9800, 0x38bc600, 0xc0006be140, 0xc01cfa9b70, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/builtin/plugin/backend.go:160 +0x8c
github.com/hashicorp/vault/builtin/plugin.(*PluginBackend).HandleRequest(0xc000874780, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0x0, 0xc000678db8, 0x4)
	/go/src/github.com/hashicorp/vault/builtin/plugin/backend.go:196 +0xbb
github.com/hashicorp/vault/vault.(*Router).routeCommon(0xc0005de320, 0x38bbb00, 0xc0045b9800, 0xc004257400, 0xc0045b9800, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/vault/router.go:676 +0x919
github.com/hashicorp/vault/vault.(*Router).Route(...)
	/go/src/github.com/hashicorp/vault/vault/router.go:476
github.com/hashicorp/vault/vault.(*RollbackManager).attemptRollback(0xc0002f4d80, 0x38bbb40, 0xc017ddfe30, 0xc000678db8, 0x4, 0xc007a1d440, 0xc002f8ae01, 0x0, 0x0)
	/go/src/github.com/hashicorp/vault/vault/rollback.go:226 +0x36e
created by github.com/hashicorp/vault/vault.(*RollbackManager).startOrLookupRollback
	/go/src/github.com/hashicorp/vault/vault/rollback.go:162 +0x246

Prior to this, we were running 1.2.3 for about a week and 1.2.2 for two months.

The secrets engine has been there for quite some time. It currently has over 1K rolesets defined (this is intentional; we use a lot of projects and create two rolesets per project).

The last request to this node in the logs was 3 minutes before the crash on a completely different secrets engine.

@mberhault
Author

Glancing at the code, I see nothing to suggest this was caused by the 1.2.4 upgrade; it is merely the last major thing to happen to this cluster.

It seems that getRoleSet can return nil, nil, but this is not handled in serviceAccountKeyRollback. This doesn't seem new, but I can't speak to the underlying condition.
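
For illustration only, here is a minimal, self-contained Go sketch of that failure mode and the kind of nil check that would avoid it. The names mirror the functions in the stack trace, but the types and signatures are stand-ins, not the plugin's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// RoleSet is a stand-in for the plugin's roleset object.
type RoleSet struct {
	AccountID string
}

// getRoleSet mimics the lookup behavior described above: it returns
// (nil, nil) when the roleset simply doesn't exist, and a non-nil error
// only for real storage failures.
func getRoleSet(name string, store map[string]*RoleSet) (*RoleSet, error) {
	if store == nil {
		return nil, errors.New("storage unavailable")
	}
	return store[name], nil // nil, nil when the name is absent
}

// serviceAccountKeyRollback sketches the missing guard: a nil roleset means
// there is nothing left to roll back, so the WAL entry can simply be dropped
// instead of dereferencing a nil pointer.
func serviceAccountKeyRollback(name string, store map[string]*RoleSet) error {
	rs, err := getRoleSet(name, store)
	if err != nil {
		return err
	}
	if rs == nil {
		// Roleset was deleted after the WAL was written; treat the WAL as stale.
		return nil
	}
	fmt.Println("rolling back key for account", rs.AccountID) // safe: rs != nil
	return nil
}

func main() {
	store := map[string]*RoleSet{"present": {AccountID: "sa-123"}}
	fmt.Println(serviceAccountKeyRollback("present", store)) // <nil>
	fmt.Println(serviceAccountKeyRollback("deleted", store)) // <nil>, no panic
}
```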

kalafut pushed a commit that referenced this issue Nov 12, 2019
@mberhault
Author

@kalafut: thank you for the quick fix. Could you elaborate on this code path? Just trying to understand what happened.

@kalafut
Contributor

kalafut commented Nov 13, 2019

@mberhault Thanks for the bug report! This occurred in the WAL (write-ahead log) system, so it's a bit tough to say what the root cause was. Possibly some operation involving account keys crashed/panicked earlier (but after the WAL was written), and the roleset for that operation was deleted before the WAL rollback ran. Was there any other panic prior to this one? Another possibility is that after the WAL was written, a GCP-related error prevented the WAL from being removed.

@mberhault
Author

We had some GCS unavailability on Monday, then the minor upgrade yesterday, but I don't recall seeing a panic in quite some time.

@kalafut
Contributor

kalafut commented Nov 13, 2019

If your storage is on GCS then it could definitely be related, and entirely panic-free. For example, one "success" sequence could be:

  • Operation requested
  • WAL written
  • Operation succeeds
  • GCP becomes unavailable so WAL can't be deleted
  • GCP restored
  • Roleset is removed
  • Rollback operation runs, finds the WAL entry it wants to roll back, and panics (the bug you're seeing) because the associated roleset isn't present.

There are other ways too, but if GCS was having issues then this isn't unexpected. Normally the WALs would just get cleaned up, but the bug you found prevented it in this case.
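
To make that ordering concrete, here is a toy Go sketch of the sequence. It does not use Vault's actual WAL API; the store and its availability flag are stand-ins invented purely to simulate the GCS outage.

```go
package main

import (
	"errors"
	"fmt"
)

// walStore is a stand-in for the backend's storage; the "available" flag
// simulates the GCS outage described above.
type walStore struct {
	entries   map[string]string
	available bool
}

func (s *walStore) putWAL(id, roleset string) error {
	if !s.available {
		return errors.New("storage unavailable")
	}
	s.entries[id] = roleset
	return nil
}

func (s *walStore) deleteWAL(id string) error {
	if !s.available {
		return errors.New("storage unavailable")
	}
	delete(s.entries, id)
	return nil
}

func main() {
	s := &walStore{entries: map[string]string{}, available: true}

	// 1. Operation requested: a WAL entry is written before touching GCP.
	_ = s.putWAL("wal-1", "my-roleset")

	// 2. The operation itself succeeds (e.g. a service account key is created).

	// 3. Storage becomes unavailable, so the WAL entry can't be deleted.
	s.available = false
	if err := s.deleteWAL("wal-1"); err != nil {
		fmt.Println("WAL cleanup failed:", err) // the entry is now orphaned
	}

	// 4. Storage comes back, and the roleset is later removed.
	s.available = true

	// 5. The periodic rollback eventually finds "wal-1", looks up
	//    "my-roleset", gets nil back, and (before the fix) dereferenced
	//    it and panicked.
	fmt.Println("orphaned WAL entries:", s.entries)
}
```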

@mberhault
Author

Makes sense. Thank you for the explanation.
