Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vault 6815/dont prune servers that are joining #22

Closed

Conversation

peteski22
Copy link

Adjusted raft-autopilot to respect the minimum quorum setting regardless of what type of server it's dealing with (stale, failed, voters, non-voters).

Servers will be allowed to be removed in order of increasing precedence until the threshold is reached (stale-non-voters => stale=> voters, failed-non-voters => failed-voters).

The following has been added to the Promoter interface: PotentialVoterPredicate(NodeType) bool. This is to allow the application to make decisions about which types of node can be voters, as they may have added additional types to NodeType.

Finally, we should note that any downstream integrators of this repo may have work to do when moving to a new version due to the addition to the interface, however at the time of creation there were only HashiCorp repos consuming this and 2 external repos (https://pkg.go.dev/github.com/hashicorp/raft-autopilot?tab=importedby).

Related issues:

#17
hashicorp/vault#16919

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a handful of comments but one meta comment is that I found the control flow within pruneDeadServers quite confusing with all the func closures and such. As this codes correctness is critical I think it would be worthwhile to have more verbose code that is easier to reason about its correctness.

reconcile.go Outdated Show resolved Hide resolved
reconcile.go Show resolved Hide resolved
reconcile.go Outdated Show resolved Hide resolved
reconcile.go Outdated Show resolved Hide resolved
reconcile.go Outdated
return voters-1 >= int(minQuorum)
}

func removeStale(a *Autopilot) func(id raft.ServerID) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not define these as methods on Autopilot and get rid of the func returns.

You could pass a.removeStale to anything that would take a func(raft.ServerID) error argument.

reconcile.go Outdated Show resolved Hide resolved
reconcile.go Outdated
failureTolerance := getFailureTolerance(voterCountProvider())

for id, v := range s {
if failureTolerance < 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If failure tolerance is less than 1, removing a read replica would still be fine. So the logic seems a little off here.

state.go Outdated Show resolved Hide resolved

// FilterVoters can be used to return all servers that are currently voters,
// or all servers that are not.
func (s *RaftServerEligibility) FilterVoters(isCurrentVoter bool) RaftServerEligibility {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple things here that are a bit out of the ordinary. There's nothing wrong with this code, but here's how you could make it less surprising:

type RaftServerEligibility struct {
  eligible map[raft.ServerID]*VoterEligibility
}

func (s *RaftServerEligibility) FilterVoters(isCurrentVoter bool) []raft.ServerID

return potentialVoters
}

// FailedServers is essentially a DTO to support the promoter interface
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "DTO" is in most gopher's lexicons.

@@ -61,8 +61,8 @@ type Config struct {
// be behind before being considered unhealthy.
MaxTrailingLogs uint64

// MinQuorum sets the minimum number of servers required in a cluster
// before autopilot can prune dead servers.
// MinQuorum set the minimum number of servers that should always be present
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it "servers" or "voters"?


type CategorizedServers struct {
// StaleNonVoters are the IDs of non-voting server nodes in the raft configuration
// that are not present in the delegates view of the server nodes should be available
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble parsing this sentence, where does "should be available" fit in?


// convertToFailedServers uses CategorizedServers to create the FailedServers
// struct which can be used to maintain compatibility with the promoter interface
func (s *CategorizedServers) convertToFailedServers(state *State) *FailedServers {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid the need for conversion, and instead of introducing CategorizedServers just augment FailedServers? After all, the two structs are very similar. And it seems odd to make every field (stale nonvoters, stale voters, etc) contain the redundant information about eligibility. Couldn't you instead just have a single map from serverid->VoterEligibility that contains all nodes, instead of making each node list be a map?

@peteski22 peteski22 closed this Nov 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants