-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
agent: tolerate partial restore failure from persistent cache #12718
Conversation
+```release-note:improvement | ||
+auth/kubernetes: validate JWT against the provided role on alias look ahead operations | ||
+``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing a small syntax error in the previous changelog
@@ -969,56 +970,68 @@ func (c *LeaseCache) Flush() error { | |||
// tokens first, since restoring a lease's renewal context and watcher requires | |||
// looking up the token in the cachememdb. | |||
func (c *LeaseCache) Restore(ctx context.Context, storage *cacheboltdb.BoltStorage) error { | |||
var errors *multierror.Error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be worth looking into down the road for how we define custom errors.
https://github.com/uber-go/guide/blob/master/style.md#error-types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which bit of the guide are you referring to? The only bit I see that refers to multiple errors is wrapping, and that's not appropriate here because the errors we're collecting at this level are "peers" rather than forming part of a stack. In the case that there are errors, we want to log a message about what failed and optionally continue (if exit_on_err
is false in the cache config).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One meta-comment that I want to leave here is that this could make it trickier for us to implement VAULT-1096 (ordered restores) since we could get into a state where the dependency graph is incomplete, say if we failed to restore a parent. For instance, what happens to the lifecycle of its child tokens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Good question - I've addressed this in #12765. If we fail to restore a parent, we need to ensure we don't block waiting for it forever, and when we process that child lease, we won't be able to restore it just as before. |
Agent will continue with the restore process after encountering any errors, and then log any errors. If
exit_on_err
is false it will continue after that.I've written some unit testing, but I haven't thought of a good way to introduce errors in an integration or manual testing environment yet. Although I think the unit testing does have decent coverage.