RoleManager will remove deleted nodes from the cluster membership #2551
Conversation
Force-pushed from 20a887e to 25f8a99
Codecov Report

@@            Coverage Diff            @@
##           master    #2551     +/-   ##
=========================================
+ Coverage   61.46%   61.57%     +0.1%
=========================================
  Files         134      133        -1
  Lines       21800    21789       -11
=========================================
+ Hits        13399    13416       +17
+ Misses       6958     6926       -32
- Partials     1443     1447        +4
Wondering if we should also prevent node removal if the demotion process hasn't completed for a manager (e.g. if the node is still in the raft cluster, we disallow the remove node API call and require that the user try again later).
I think yes. Would that mean we don't need the changes in the PR?
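For illustration only, a guard along those lines might look roughly like this in the node-removal handler (a hypothetical sketch, not the actual swarmkit control API code; the handler shape and field names are assumptions):

```go
// Hypothetical sketch: refuse to remove a node whose demotion has not finished,
// i.e. it still shows up in the raft member list, and ask the caller to retry.
if member := s.raft.GetMemberByNodeID(request.NodeID); member != nil {
	return nil, status.Errorf(codes.FailedPrecondition,
		"node %s is still a raft member; wait for demotion to complete and try again", request.NodeID)
}
```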
manager/role_manager.go
Outdated
// Run is roleManager's main loop.
// ctx is only used for logging.
func (rm *roleManager) Run(ctx context.Context) {
	defer close(rm.doneChan)

	var (
		nodes  []*api.Node
-		ticker *time.Ticker
+		ticker clock.Ticker
please add a comment on how this is used
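For reference, judging by the fragments shown elsewhere in this diff, the ticker is only armed while there is queued work. A minimal sketch of that pattern (names taken from the diff, simplified to the standard-library ticker rather than the injected clock):

```go
// Simplified sketch: the retry ticker is only created while either queue still
// has entries, and the loop selects on tickerCh to retry queued reconciliations;
// once both queues drain, the ticker is stopped so the loop stays idle.
var (
	ticker   *time.Ticker
	tickerCh <-chan time.Time
)
if len(rm.pending) != 0 || len(rm.removed) != 0 {
	ticker = time.NewTicker(roleReconcileInterval)
	tickerCh = ticker.C
}
```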
manager/role_manager.go
Outdated
	for _, removed := range rm.removed {
		rm.evictRemovedNode(ctx, removed)
	}
	if len(rm.pending) != 0 || len(rm.removed) != 0 {
Please add a comment.
manager/role_manager.go
Outdated
	if len(rm.pending) != 0 {
		ticker = time.NewTicker(roleReconcileInterval)
		tickerCh = ticker.C
	for _, removed := range rm.removed {
Please add a comment.
Could you clarify what kind of comment would make sense here?
manager/role_manager.go
Outdated
@@ -62,13 +82,29 @@ func (rm *roleManager) Run(ctx context.Context) {
	if err != nil {
		log.G(ctx).WithError(err).Error("failed to check nodes for role changes")
	} else {
		// Put all members in the removed list first - if the node object exists
		// for this member, then it is deleted from the removed list. This is safe
		// because a node object is created when a node first requests TLS certificates
This is safe because a node object is created when a node first requests TLS certificates to be able to communicate with other raft members, so the node object
Sorry, but I don't really follow what this means. Maybe clarify why we're doing this?
manager/role_manager.go
Outdated
	// removed contains raft members, indexed by node ID, whose nodes have since
	// been deleted - these members need to be removed from raft
	removed map[string]*api.RaftMember
Maybe rename to to_be_removed to clarify the meaning here?
Just add an explanation of the process, rather than butchering variables with verbs and prepositions. ;)
As far as I understand, this is a removal queue for nodes that have been "deleted" but now need to be evicted from the raft member list. Does this need to be persisted or should the code below handle the case where this gets interrupted?
If it were interrupted, we'd have to try again in another tick, so wouldn't it have to be persisted?
manager/role_manager.go
Outdated
@@ -24,29 +26,47 @@ type roleManager struct {
	// pending contains changed nodes that have not yet been reconciled in
	// the raft member list.
	pending map[string]*api.Node
unrelated: pending => pending_reconciliation to make it easier to understand.
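Putting the two queue fields from this diff side by side, with comments paraphrasing the discussion (the comment wording here is mine, not the PR's):

```go
type roleManager struct {
	// ... other fields elided ...

	// pending contains changed nodes (promotions/demotions) that have not yet
	// been reconciled against the raft member list.
	pending map[string]*api.Node

	// removed contains raft members, indexed by node ID, whose node objects
	// have since been deleted; these members still need to be removed from
	// the raft cluster membership.
	removed map[string]*api.RaftMember
}
```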
manager/role_manager.go
Outdated
@@ -62,13 +82,29 @@ func (rm *roleManager) Run(ctx context.Context) {
	if err != nil {
		log.G(ctx).WithError(err).Error("failed to check nodes for role changes")
	} else {
		// Put all members in the removed list first - if the node object exists
unrelated: Please also add a high level comment on what all of this method is trying to achieve.
manager/role_manager.go
Outdated
// removeMember removes a member from the raft cluster membership
func (rm *roleManager) removeMember(ctx context.Context, member *membership.Member) {
	// Quorum safeguard
	if !rm.raft.CanRemoveMember(member.RaftID) {
Why would this happen? Please add a comment.
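For context, the safeguard can trigger because removing this member could otherwise drop the cluster below a raft quorum. A sketch of removeMember based on the fragments in this diff (exact names and error handling may differ from the final code):

```go
// removeMember removes a member from the raft cluster membership (sketch).
func (rm *roleManager) removeMember(ctx context.Context, member *membership.Member) {
	// Quorum safeguard: if removing this member would leave the cluster without
	// a quorum of reachable managers, skip it for now. The entry stays queued,
	// so the removal is retried on a later tick.
	if !rm.raft.CanRemoveMember(member.RaftID) {
		return
	}

	// Bound the removal attempt so a stuck proposal cannot block the loop.
	rmCtx, rmCancel := context.WithTimeout(rm.ctx, removalTimeout)
	defer rmCancel()

	if err := rm.raft.RemoveMember(rmCtx, member.RaftID); err != nil {
		log.G(ctx).WithError(err).Errorf("failed to remove member %x from the raft cluster", member.RaftID)
	}
}
```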
manager/role_manager.go
Outdated
		return
	}

	rmCtx, rmCancel := context.WithTimeout(rm.ctx, 5*time.Second)
Please add a global for this timeout.
manager/role_manager.go
Outdated
	// Check if the member still exists in the membership
	member := rm.raft.GetMemberByNodeID(raftMember.NodeID)
	if member != nil {
		// We first try to remove the raft node from the raft cluster. On the next tick, if the node
Why can't we do it in the same tick?
Honestly I'm not sure, as I am not super familiar with this code either - this was just how the logic was previously for demotion. I think we probably can just update the node immediately, since if we remove the node, eventually https://github.com/docker/swarmkit/blob/master/manager/state/raft/raft.go#L1850 gets called, which seems to wait for the configuration change to be processed and applied to the server before returning. Does that sound right cc @aaronlehmann?
(As the code is written, this allows a node reconciliation to be retried if there's a logged error, but the code can be refactored. Not sure if we require that updating the node objects happen in the next tick on successful cluster membership removal, though)
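To make that retry behavior concrete, here is a sketch of the eviction step as I read the fragments above (assumed names; the real code may differ): an entry only leaves the removal queue once the member is actually gone from the raft membership, so a failed removal is naturally retried on the next tick.

```go
// evictRemovedNode (sketch): called for each entry in rm.removed on every tick.
func (rm *roleManager) evictRemovedNode(ctx context.Context, raftMember *api.RaftMember) {
	// Check whether the member still exists in the raft membership.
	if member := rm.raft.GetMemberByNodeID(raftMember.NodeID); member != nil {
		// Try to remove it from the raft cluster. The entry stays in rm.removed,
		// so if this fails (or has not been applied yet) it is retried later.
		rm.removeMember(ctx, member)
		return
	}
	// The member is no longer in the raft cluster; it is safe to forget it.
	delete(rm.removed, raftMember.NodeID)
}
```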
I'm not completely sure but it may just be to simplify error handling. If we did it in the same tick, we'd have to make sure it was actually removed before deleting it from the list. I don't think there's any downside to deferring the removal from the list.
@aaronlehmann Cool, thanks! I agree I think there's no downside. I just wanted to be able to add a comment as to why it is deferred until the next tick.
This is probably bikeshedding, but this makes the code somewhat harder to read and reason about. I agree that there might not be a downside to doing it this way, but I'm not sure that's motivation enough. That's my take, but I'll let you decide :) @cyli
@anshulpundir If we were 100% sure that error handling is the only reason, I'd be happy to make the change. As it is, I'm only fairly sure it can be updated in the same tick, so perhaps not making the change would be safer. I think I'd want more comprehensive tests to make sure that making that change wouldn't break a corner case I didn't understand.
I will see if I can figure out how to test that if removing a node errors, the removal will be retried later, though. I haven't added that case because I couldn't think of a good way to write that test yet. (Although I don't think these tests would be sufficient to make sure updating the node in the same tick wouldn't break something - it's possible, although I don't see it, that there may be a timing issue where waiting out another tick is safer :|)
If we were 100% sure that error handling is the only reason, I'd be happy to make the change.
I guess we're not, so I'm OK with leaving this as is for now.
Don't know enough about this code path, but I hope my comments are not too annoying @cyli
Force-pushed from 5219728 to 5e66778
I think the changes in the PR make the role reconciler more correct, and I think it'd be safer to have. Particularly if an existing cluster is already in this state, this may be able to remove the extra membership. But that would hopefully prevent this in the future?
Not at all - thanks for the requests for clarification. My context is not your context and vice versa, so it's helpful to know whether my comments are actually useful :)
Also I can't seem to replicate the error in CI - my changes were just (1) rebasing and (2) comments and renaming, and it was not these tests that failed. :|
That's a valid point, although I'm wondering if it's important enough to write this logic specifically to address existing clusters in this state. We have seen a few customers in this state, and in all of those cases we fell back to [...]. Are there any other opinions on this? @nishanttotla @dperny @aluzzardi
manager/role_manager.go
Outdated
-const roleReconcileInterval = 5 * time.Second
+const (
+	roleReconcileInterval = 5 * time.Second
+	removalTimeout        = 5 * time.Second
nit: please add a note on what these are?
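One way to address the nit, as a sketch (the comment wording is mine, not taken from the PR):

```go
const (
	// roleReconcileInterval is how often the reconciliation loop wakes up to
	// retry pending role changes and queued raft member removals.
	roleReconcileInterval = 5 * time.Second

	// removalTimeout bounds a single attempt to remove a member from the raft
	// cluster before it is abandoned and retried on a later tick.
	removalTimeout = 5 * time.Second
)
```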
BTW, we have been seeing problems recently where the node remains in the memberlist but is removed from the store, and there is no way to remove this member from raft. Thx @cyli for discovering and fixing it!
Force-pushed from 5e66778 to 042acc9
Is it also possible that there are ghost raft IDs in a cluster's membership that aren't currently causing problems, but could cause problems later if another node or two get demoted/removed? If so, this change would remove the ghost node before it becomes a problem and the user has to force a new cluster. Also, this is probably bikeshedding, but if we agree that this is the more correct behavior for node reconciliation, I favor fixing the misbehavior. It's true that updating the API to disallow removal before the role manager has successfully reconciled the role would prevent the bug from surfacing, but I'd consider these separate updates - one to fix the role manager misbehavior, and the other to provide a better user experience through better feedback. Especially since it is easier to test that the role manager now removes deleted nodes, but testing that changing the control API prevents this behavior is harder. The former is more of a unit test, the latter would be an integration test, and it's hard to get the timing right (I wouldn't be sure how to do it atm).
I am afraid I don't fully agree with this. The only reason we need this is because of the bug in node remove not waiting for node demote.
Agreed. But let me clarify my concern: it is leaving dead code in the product. At some point in the future we will have no customers on a version before this fix. At that point, this code is dead (assuming we also have the fix to disallow removing nodes while the reconciliation is in progress, or demotion is made synchronous).
Ah, I see. I had not considered node remove not waiting for demote a bug, just a change in behavior that would make a difference in user experience. From that perspective that makes sense. How would you test the controller behavior?
Force-pushed from 042acc9 to af81379
Code LGTM on first pass, but I'd like to do another pass before approving. I'm still not 100% sure about the new logic.
Thanks for this fix @cyli.
LGTM, my one API comment isn't a blocker.
manager/role_manager.go
Outdated
}

// newRoleManager creates a new roleManager.
-func newRoleManager(store *store.MemoryStore, raftNode *raft.Node) *roleManager {
+func newRoleManager(store *store.MemoryStore, raftNode *raft.Node, clocksource clock.Clock) *roleManager {
nit:
- newRoleManager is an unexported function, meaning we know that no external package, including external tests, depends on a role manager
- clocksource is only used for testing, and is passed nil in prod code.
Therefore, it would probably be a cleaner API to just set clocksource directly by accessing the unexported field before calling Run in the tests. That way, we don't expose complexity resulting from testing requirements if we later on decide to move the roleManager
to its own package or whatever.
Complicated explanation for a very small nit lol.
+1
Good point, thanks for the suggestion!
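A sketch of the suggested test-only injection, assuming newRoleManager keeps its original two-argument signature and the clock field stays unexported (memStore, raftNode, and fakeClock are placeholder names, not actual identifiers from the PR):

```go
// In a test inside the manager package: construct the role manager normally,
// then reach into the unexported field to swap in a fake clock before Run.
rm := newRoleManager(memStore, raftNode)
rm.clocksource = fakeClock
go rm.Run(ctx)
```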
manager/role_manager.go
Outdated
// be reconciled. As node updates come in, any promotions or demotions are also added to the
// reconciliation queue and reconciled.

// The removal queue is never added to beyond the startup of the loop, because
Thanks for adding this!
The removal queue is never added to beyond the startup of the loop, because before a node can be removed, it must be demoted, so the demotion should be sufficient to get it removed from the cluster membership.
Not sure I completely follow this part though :(
What do you think about:
"Nothing will be added to the removal queue after the loop startup. The removal queue consists of nodes that have been already removed from the cluster. Before a manager (and hence raft) node can be removed from a cluster via the control API, it must first be demoted. Because we watch for node updates, and we add demoted nodes to the reconciliation queue, the node's removal from the raft cluster should be handled upon reconciliation of that queue."
Nm, as per #2551 (comment), this comment won't apply any more anyway :|
manager/role_manager.go
Outdated
	// Check if the member still exists in the membership
	member := rm.raft.GetMemberByNodeID(raftMember.NodeID)
	if member != nil {
		// We first try to remove the raft node from the raft cluster. On the next tick, if the node
If we were 100% sure that error handling is the only reason, I'd be happy to make the change.
I guess we're not, so I'm OK with leaving this as is for now.
@@ -0,0 +1,213 @@
package manager |
How do we test the specific condition which this fix handles: start with a raft node in the memberlist but not in the store and make sure that role manager removes it when it starts up. @cyli
https://github.com/docker/swarmkit/pull/2551/files#diff-5ebf7b379f77f441db1a9548cabf2a89R170:
Because the role manager is being tested independently, it can be started at any time. I create a 3-node raft cluster, and update the store to reflect only 2 node objects. The membership should reflect 3 raft nodes, but the store says that there are 2 canonical node objects. Then I create a new role manager using one of the raft nodes, and I test that after a while one of the nodes has been kicked out.
I'll add another check that the correct raft node has been kicked out.
Huh, maybe we can do the same thing for the control api server.
Force-pushed from af81379 to e82922d
Sorry, I was looking through the code again and I may have found one more place where there could be a bug that is not covered by the changes in this PR. When a node is removed, if its desired role is manager and it's a member of the raft cluster, we require that it be demoted before removal. If it's not a member of the raft cluster, we allow removal, even if the desired role is manager. Possibly, if a node was promoted and then immediately removed from the cluster before it managed to join the raft cluster, or at least before the conf changes were applied, then we'd allow the node removal and right afterward the node would join the raft cluster membership. It would never be removed from the raft cluster membership until the next leadership election (because roleManager does not watch for node deletions). I can update the PR to also watch for node deletions and handle that case, or we can just wait for the leader election.
both on startup and while a node is demoted while it's running. Also add a failing test, where the roleManager should remove deleted nodes from the cluster membership on startup. These tests involve injecting a fake clock source into roleManager. Signed-off-by: cyli <[email protected]>
…nger in the raft store. Signed-off-by: cyli <[email protected]>
… node that is a manager can be removed if it hasn't joined the raft cluster yet. However, perhaps the raft conf change happens right after the check. Signed-off-by: Ying Li <[email protected]>
Force-pushed from e82922d to 36954b5
Updated to remove deleted nodes while running, as well.
@anshulpundir In light of our discussions in #2565 and how role manager should change, do you want me to make more changes in this PR, or make additional changes in a second PR (e.g. `RoleChangeInProgress`, monitoring raft cluster membership, etc.)?
I'd suggest follow-up PRs. I can also take up some of those work items btw. @cyli
@anshulpundir That'd be awesome, thank you!
Relevant changes:
- moby/swarmkit#2551 RoleManager will remove deleted nodes from the cluster membership
- moby/swarmkit#2574 Scheduler/TaskReaper: handle unassigned tasks marked for shutdown
- moby/swarmkit#2561 Avoid predefined error log
- moby/swarmkit#2557 Task reaper should delete tasks with removed slots that were not yet assigned
- moby/swarmkit#2587 [fips] Agent reports FIPS status
- moby/swarmkit#2603 Fix manager/state/store.timedMutex

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Fixes #2548 (please see description there), I think.