-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vault 6815/dont prune servers that are joining #22
Conversation
…t to become voters, during dead server pruning.
…d potentially be voters in the future
…s basd on node type
…eak things (more)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a handful of comments but one meta comment is that I found the control flow within pruneDeadServers
quite confusing with all the func closures and such. As this codes correctness is critical I think it would be worthwhile to have more verbose code that is easier to reason about its correctness.
reconcile.go
Outdated
return voters-1 >= int(minQuorum) | ||
} | ||
|
||
func removeStale(a *Autopilot) func(id raft.ServerID) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not define these as methods on Autopilot and get rid of the func returns.
You could pass a.removeStale to anything that would take a func(raft.ServerID) error
argument.
reconcile.go
Outdated
failureTolerance := getFailureTolerance(voterCountProvider()) | ||
|
||
for id, v := range s { | ||
if failureTolerance < 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If failure tolerance is less than 1, removing a read replica would still be fine. So the logic seems a little off here.
Co-authored-by: Matt Keeler <[email protected]>
…ure tolerance is too low
|
||
// FilterVoters can be used to return all servers that are currently voters, | ||
// or all servers that are not. | ||
func (s *RaftServerEligibility) FilterVoters(isCurrentVoter bool) RaftServerEligibility { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple things here that are a bit out of the ordinary. There's nothing wrong with this code, but here's how you could make it less surprising:
type RaftServerEligibility struct {
eligible map[raft.ServerID]*VoterEligibility
}
func (s *RaftServerEligibility) FilterVoters(isCurrentVoter bool) []raft.ServerID
return potentialVoters | ||
} | ||
|
||
// FailedServers is essentially a DTO to support the promoter interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think "DTO" is in most gopher's lexicons.
@@ -61,8 +61,8 @@ type Config struct { | |||
// be behind before being considered unhealthy. | |||
MaxTrailingLogs uint64 | |||
|
|||
// MinQuorum sets the minimum number of servers required in a cluster | |||
// before autopilot can prune dead servers. | |||
// MinQuorum set the minimum number of servers that should always be present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it "servers" or "voters"?
|
||
type CategorizedServers struct { | ||
// StaleNonVoters are the IDs of non-voting server nodes in the raft configuration | ||
// that are not present in the delegates view of the server nodes should be available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble parsing this sentence, where does "should be available" fit in?
|
||
// convertToFailedServers uses CategorizedServers to create the FailedServers | ||
// struct which can be used to maintain compatibility with the promoter interface | ||
func (s *CategorizedServers) convertToFailedServers(state *State) *FailedServers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid the need for conversion, and instead of introducing CategorizedServers just augment FailedServers? After all, the two structs are very similar. And it seems odd to make every field (stale nonvoters, stale voters, etc) contain the redundant information about eligibility. Couldn't you instead just have a single map from serverid->VoterEligibility that contains all nodes, instead of making each node list be a map?
Adjusted
raft-autopilot
to respect the minimum quorum setting regardless of what type of server it's dealing with (stale, failed, voters, non-voters).Servers will be allowed to be removed in order of increasing precedence until the threshold is reached (stale-non-voters => stale=> voters, failed-non-voters => failed-voters).
The following has been added to the
Promoter
interface:PotentialVoterPredicate(NodeType) bool
. This is to allow the application to make decisions about which types of node can be voters, as they may have added additional types toNodeType
.Finally, we should note that any downstream integrators of this repo may have work to do when moving to a new version due to the addition to the interface, however at the time of creation there were only HashiCorp repos consuming this and 2 external repos (https://pkg.go.dev/github.com/hashicorp/raft-autopilot?tab=importedby).
Related issues:
#17
hashicorp/vault#16919