
RoleManager will remove deleted nodes from the cluster membership #2551

Merged
merged 3 commits into moby:master from role-reconciler-edge on Apr 4, 2018

Conversation

Contributor

@cyli cyli commented Mar 10, 2018

Fixes #2548 (please see description there), I think.

Wondering if we should also prevent node removal if the demotion process hasn't completed for a manager (e.g. if the node is still in the raft cluster, we disallow the remove node API call and require that the user try again later).
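For illustration, a rough sketch of what such a guard might look like in the control API's RemoveNode handler (hypothetical and not part of this PR; the raftNode/request names and the exact error code are assumptions):

// Sketch only: if the node is still present in the raft member list, its demotion
// has not finished reconciling, so refuse the removal and ask the caller to retry.
if member := raftNode.GetMemberByNodeID(request.NodeID); member != nil {
	return nil, grpc.Errorf(codes.FailedPrecondition,
		"node %s is still a raft member; wait for demotion to complete and retry the removal", request.NodeID)
}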

@cyli cyli force-pushed the role-reconciler-edge branch from 20a887e to 25f8a99 Compare March 10, 2018 00:40
@cyli cyli added the area/raft label Mar 10, 2018
@codecov

codecov bot commented Mar 10, 2018

Codecov Report

Merging #2551 into master will increase coverage by 0.1%.
The diff coverage is 73.33%.

@@            Coverage Diff            @@
##           master    #2551     +/-   ##
=========================================
+ Coverage   61.46%   61.57%   +0.1%     
=========================================
  Files         134      133      -1     
  Lines       21800    21789     -11     
=========================================
+ Hits        13399    13416     +17     
+ Misses       6958     6926     -32     
- Partials     1443     1447      +4

Contributor

@anshulpundir anshulpundir left a comment


> Wondering if we should also prevent node removal if the demotion process hasn't completed for a manager (e.g. if the node is still in the raft cluster, we disallow the remove node API call and require that the user try again later).

I think yes. Would that mean we don't need the changes in the PR?

// Run is roleManager's main loop.
// ctx is only used for logging.
func (rm *roleManager) Run(ctx context.Context) {
defer close(rm.doneChan)

var (
nodes []*api.Node
-	ticker *time.Ticker
+	ticker clock.Ticker

Contributor

please add a comment on how this is used
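For context, a commented sketch of how the injected ticker is presumably used in Run (only the names visible in the diff are taken from the code; the comments are mine):

var (
	nodes    []*api.Node
	// ticker drives the periodic reconciliation retries. It is a clock.Ticker rather
	// than a *time.Ticker so that tests can inject a fake clock and advance time manually.
	ticker   clock.Ticker
	// tickerCh stays nil until there is pending work, so the reconciliation loop only
	// wakes up when something actually needs to be retried.
	tickerCh <-chan time.Time
)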

for _, removed := range rm.removed {
rm.evictRemovedNode(ctx, removed)
}
if len(rm.pending) != 0 || len(rm.removed) != 0 {

Contributor

Please add a comment.

if len(rm.pending) != 0 {
ticker = time.NewTicker(roleReconcileInterval)
tickerCh = ticker.C
for _, removed := range rm.removed {

Contributor

Please add a comment.

Contributor Author

Could you clarify what kind of comment would make sense here?

@@ -62,13 +82,29 @@ func (rm *roleManager) Run(ctx context.Context) {
if err != nil {
log.G(ctx).WithError(err).Error("failed to check nodes for role changes")
} else {
// Put all members in the removed list first - if the node object exists
// for this member, then it is deleted from the removed list. This is safe
// because a node object is created when a node first requests TLS certificates

Contributor

> This is safe because a node object is created when a node first requests TLS certificates to be able to communicate with other raft members, so the node object

Sorry, but I don't really follow what this means. Maybe clarify why we're doing this?
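As I read it, the startup reconciliation described by that comment boils down to something like the following sketch (the removed/pending field names come from the diff; GetMemberlist is the raft accessor I'd expect here, so treat it as an assumption):

// Seed the removal queue with every current raft member, then delete the entries for
// which a node object still exists in the store. Whatever remains is a raft member
// whose node object has been deleted, and it needs to be evicted from the membership.
for _, member := range rm.raft.GetMemberlist() {
	rm.removed[member.NodeID] = member
}
for _, node := range nodes {
	delete(rm.removed, node.ID)
	// Role changes for surviving nodes still go through the normal pending queue.
}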


// removed contains raft members, indexed by node ID, whose nodes have since
// been deleted - these members need to be removed from raft
removed map[string]*api.RaftMember

Contributor

maybe rename to to_be_removed to clarify the meaning here?

Contributor

Just add an explanation of the process, rather than butchering variables with verbs and prepositions. ;)

As far as I understand, this is a removal queue for nodes that have been "deleted" but now need to be evicted from the raft member list. Does this need to be persisted or should the code below handle the case where this gets interrupted?

Contributor Author

If it were interrupted, we'd have to try again in another tick, so wouldn't it have to be persisted?

@@ -24,29 +26,47 @@ type roleManager struct {
// pending contains changed nodes that have not yet been reconciled in
// the raft member list.
pending map[string]*api.Node

Contributor

unrelated: pending => pending_reconciliation to make it easier to understand.

@@ -62,13 +82,29 @@ func (rm *roleManager) Run(ctx context.Context) {
if err != nil {
log.G(ctx).WithError(err).Error("failed to check nodes for role changes")
} else {
// Put all members in the removed list first - if the node object exists

Contributor

unrelated: Please also add a high level comment on what all of this method is trying to achieve.

// removeMember removes a member from the raft cluster membership
func (rm *roleManager) removeMember(ctx context.Context, member *membership.Member) {
// Quorum safeguard
if !rm.raft.CanRemoveMember(member.RaftID) {

Contributor

Why would this happen? Please add a comment.

return
}

rmCtx, rmCancel := context.WithTimeout(rm.ctx, 5*time.Second)

Contributor

Please add a global for this timeout.
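Putting both asks together (a comment on the quorum safeguard and a named constant for the timeout), the method could end up looking roughly like this sketch (removalTimeout matches the constant added later in the diff; the rest follows the snippet above and is not the literal final code):

// removeMember removes a member from the raft cluster membership.
func (rm *roleManager) removeMember(ctx context.Context, member *membership.Member) {
	// Quorum safeguard: CanRemoveMember refuses the removal if losing this member would
	// leave the cluster without a quorum of healthy managers, so we give up for now and
	// retry on a later reconciliation tick.
	if !rm.raft.CanRemoveMember(member.RaftID) {
		return
	}

	// Bound each removal attempt so a slow or partitioned cluster cannot block the
	// reconciliation loop indefinitely.
	rmCtx, rmCancel := context.WithTimeout(rm.ctx, removalTimeout)
	defer rmCancel()

	if err := rm.raft.RemoveMember(rmCtx, member.RaftID); err != nil {
		// Leave the entry queued; it will be retried on the next tick.
		log.G(ctx).WithError(err).Errorf("failed to remove member %x from the raft cluster membership", member.RaftID)
	}
}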

// Check if the member still exists in the membership
member := rm.raft.GetMemberByNodeID(raftMember.NodeID)
if member != nil {
// We first try to remove the raft node from the raft cluster. On the next tick, if the node

Contributor

Why can't we do it in the same tick?

Contributor Author

Honestly I'm not sure, as I am not super familiar with this code either - this was just how the logic was previously for demotion. I think we probably can just update the node immediately, since if we remove the node, eventually https://github.com/docker/swarmkit/blob/master/manager/state/raft/raft.go#L1850 gets called, which seems to wait for the configuration change to be processed and applied to the server before returning. Does that sound right cc @aaronlehmann?

Contributor Author

(As the code is written, this allows a node reconciliation to be retried if there's a logged error, but the code can be refactored. Not sure if we require that updating the node objects happen in the next tick on successful cluster membership removal, though)

Collaborator

I'm not completely sure but it may just be to simplify error handling. If we did it in the same tick, we'd have to make sure it was actually removed before deleting it from the list. I don't think there's any downside to deferring the removal from the list.

Contributor Author

@aaronlehmann Cool, thanks! I agree I think there's no downside. I just wanted to be able to add a comment as to why it is deferred until the next tick.

Contributor

This is probably bikeshedding, but this makes the code somewhat harder to read and reason about. I agree that there might not be a downside to doing it this way, but I'm not sure that's motivation enough. That's my take, but I'll let you decide :) @cyli

Contributor Author

@cyli cyli Mar 14, 2018

@anshulpundir If we were 100% sure that error handling is the only reason, I'd be happy to make the change. As it is, I'm only pretty sure it can be updated in the same tick, so perhaps not making the change would be safer. I think I'd want more comprehensive tests to make sure that making that change wouldn't break a corner case I didn't understand.

I will see if I can figure out how to test that, if removing a node errors, the removal will be retried later, though. I haven't added that case because I couldn't think of a good way to write that test yet. (Although I don't think these tests would be sufficient to make sure that updating the node in the same tick wouldn't break something - it's possible, although I don't see it, that there may be a timing issue where waiting out another tick is safer :|)

Contributor

> If we were 100% sure that error handling is the only reason, I'd be happy to make the change.

I guess we're not, so I'm OK with leaving this as is for now.

@anshulpundir
Contributor

Don't know enough about this code path, but I hope my comments are not too annoying @cyli

@cyli cyli force-pushed the role-reconciler-edge branch 2 times, most recently from 5219728 to 5e66778 Compare March 13, 2018 19:24
@cyli
Contributor Author

cyli commented Mar 13, 2018

@anshulpundir

> I think yes. Would that mean we don't need the changes in the PR?

I think the changes in the PR make the role reconciler more correct, and I think it'd be safer to have them. In particular, if an existing cluster is already in this state, this change may be able to remove the extra membership. But the API-level change would hopefully prevent this in the future?

> Don't know enough about this code path, but I hope my comments are not too annoying

Not at all - thanks for the requests for clarification. My context is not your context and vice versa, so it's helpful to know whether my comments are actually useful :)

@cyli
Contributor Author

cyli commented Mar 13, 2018

Also I can't seem to replicate the error in CI - my changes were just (1) rebasing and (2) comments and renaming, and it was not these tests that failed. :|

@anshulpundir
Contributor

> Particularly if an existing cluster is already in this state, this change may be able to remove the extra membership. But the API-level change would hopefully prevent this in the future?

That's a valid point, although I'm wondering if it's important enough to write this logic specifically to address existing clusters in this state. We have seen a few customers in this state, and in all of those cases we fell back to docker swarm init --force-new-cluster; my point being that customers will likely not sit around in this state waiting to upgrade. What do you think?

Are there any other opinions on this? @nishanttotla @dperny @aluzzardi

-const roleReconcileInterval = 5 * time.Second
+const (
+	roleReconcileInterval = 5 * time.Second
+	removalTimeout        = 5 * time.Second

Contributor

nit: please add a note on what these are?
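Something like this commented version of the const block would cover it (a sketch; the comment wording is mine, not the final code):

const (
	// roleReconcileInterval is how often the role manager retries reconciliation of
	// pending role changes and of members that still need to be removed from raft.
	roleReconcileInterval = 5 * time.Second
	// removalTimeout bounds a single attempt to remove a member from the raft
	// cluster membership.
	removalTimeout = 5 * time.Second
)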

@anshulpundir
Contributor

BTW, we have been seeing problems recently where the node remains in the memberlist but is removed from the store and there is no way to remove this member from raft. Thx @cyli for discovering and fixing it!

@cyli cyli force-pushed the role-reconciler-edge branch from 5e66778 to 042acc9 Compare March 15, 2018 17:30
@cyli
Contributor Author

cyli commented Mar 15, 2018

> That's a valid point, although I'm wondering if it's important enough to write this logic specifically to address existing clusters in this state. We have seen a few customers in this state, and in all of those cases we fell back to docker swarm init --force-new-cluster; my point being that customers will likely not sit around in this state waiting to upgrade. What do you think?

Is it also possible that there are ghost raft IDs in a cluster's membership that aren't currently causing problems, but could cause problems later if another node or two get demoted/removed? If so, this would remove the ghost node before it becomes a problem and they have to force a new cluster.

Also, this is probably bikeshedding, but if we agree that this is the more correct behavior for node reconciliation, I favor fixing the misbehavior. It's true that updating the API to disallow removal before the role manager has successfully reconciled the role would prevent the bug from surfacing, but I'd consider these separate updates - one to fix the role manager misbehavior, and the other to provide a better user experience through better feedback.

Especially since it's easier to test that the role manager now removes deleted nodes, while testing that changing the control API prevents this behavior is harder. The former is more of a unit test, the latter would be an integration test, and it's hard to get the timing right (I wouldn't be sure how to do it atm).

@anshulpundir
Contributor

> but if we agree that this is the more correct behavior for node reconciliation, I favor fixing the misbehavior

I am afraid I don't fully agree with this. The only reason we need this is because of the bug in node remove not waiting for node demote.

> Is it also possible that there are ghost raft IDs in a cluster's membership that aren't currently causing problems, but could cause problems later if another node or two get demoted/removed? If so, this would remove the ghost node before it becomes a problem and they have to force a new cluster.

Agreed.

But let me clarify my concern: it leaves dead code in the product. At some point in the future we will have no customers on a version before this fix. At that point, this code is dead (assuming we also have the fix to disallow removing nodes while reconciliation is in progress, or demotion is made synchronous).

@cyli
Contributor Author

cyli commented Mar 15, 2018

> The only reason we need this is because of the bug in node remove not waiting for node demote.

Ah, I see. I had not considered node remove not waiting for demote a bug, just a change in behavior that would make a difference in user experience. From that perspective that makes sense.

How would you test the controller behavior?

@cyli cyli force-pushed the role-reconciler-edge branch from 042acc9 to af81379 Compare March 15, 2018 18:51

Contributor

@nishanttotla nishanttotla left a comment

Code LGTM on first pass, but I'd like to do another pass before approving. I'm still not 100% sure about the new logic.

Thanks for this fix @cyli.

Collaborator

@dperny dperny left a comment

LGTM, my one API comment isn't a blocker.

}

// newRoleManager creates a new roleManager.
-func newRoleManager(store *store.MemoryStore, raftNode *raft.Node) *roleManager {
+func newRoleManager(store *store.MemoryStore, raftNode *raft.Node, clocksource clock.Clock) *roleManager {

Collaborator

nit:

  1. newRoleManager is an unexported function, meaning we know that no external package, including external tests, depends on a roleManager.
  2. clocksource is only used for testing, and is passed nil in prod code.

Therefore, it would probably be a cleaner API to just set clocksource directly by accessing the unexported field before calling Run in the tests. That way, we don't expose complexity resulting from testing requirements if we later on decide to move the roleManager to its own package or whatever.

Complicated explanation for a very small nit lol.

Contributor

+1

Contributor Author

Good point, thanks for the suggestion!
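A minimal sketch of what that looks like in a test, assuming the struct keeps an unexported clocksource field and reusing the fakeclock package already vendored for other swarmkit tests (the field name and setup details are assumptions):

// Construct the roleManager with the production signature, then inject a fake clock
// before starting the loop so tests control when reconciliation ticks happen.
fakeClock := fakeclock.NewFakeClock(time.Now())
rm := newRoleManager(s, raftNode)
rm.clocksource = fakeClock

go rm.Run(ctx)

// Advance the fake clock to trigger the next reconciliation tick deterministically.
fakeClock.Increment(roleReconcileInterval)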

// be reconciled. As node updates come in, any promotions or demotions are also added to the
// reconciliation queue and reconciled.

// The removal queue is never added to beyond the startup of the loop, because

Contributor

Thanks for adding this!

> The removal queue is never added to beyond the startup of the loop, because before a node can be removed, it must be demoted, so the demotion should be sufficient to get it removed from the cluster membership.

Not sure I completely follow this part though :(

Contributor Author

What do you think about:

"Nothing will be added to the removal queue after the loop startup. The removal queue consists of nodes that have been already removed from the cluster. Before a manager (and hence raft) node can be removed from a cluster via the control API, it must first be demoted. Because we watch for node updates, and we add demoted nodes to the reconciliation queue, the node's removal from the raft cluster should be handled upon reconciliation of that queue."

Contributor Author

Nm, as per #2551 (comment), this comment won't apply any more anyway :|


@@ -0,0 +1,213 @@
package manager

Contributor

How do we test the specific condition this fix handles: start with a raft node in the member list but not in the store, and make sure that the role manager removes it when it starts up. @cyli

Contributor Author

https://github.com/docker/swarmkit/pull/2551/files#diff-5ebf7b379f77f441db1a9548cabf2a89R170:

The role manager is tested independently, so it can be started at any time. I create a 3-node raft cluster and update the store to contain only 2 node objects. The membership then reflects 3 raft members, but the store says there are only 2 canonical node objects. Then I create a new role manager using one of the raft nodes and test that, after a while, one of the members has been kicked out.

I'll add another check that the correct raft node has been kicked out.
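In compressed form, the shape of that test is roughly the sketch below (startThreeNodeRaftCluster and pollMemberCount are purely hypothetical helpers standing in for the real raft test utilities; the actual test is in the new manager test file):

// Start with 3 raft members and 3 matching node objects in the store.
nodes := startThreeNodeRaftCluster(t)

// Delete one node object from the store while leaving its raft membership intact,
// reproducing the "member without a node object" state this PR fixes.
require.NoError(t, nodes[1].MemoryStore().Update(func(tx store.Tx) error {
	return store.DeleteNode(tx, nodes[3].SecurityConfig.ClientTLSCreds.NodeID())
}))

// Run a role manager against one of the remaining members with a fake clock; after a
// reconciliation tick it should evict the orphaned member from the cluster membership.
fakeClock := fakeclock.NewFakeClock(time.Now())
rm := newRoleManager(nodes[1].MemoryStore(), nodes[1].Node)
rm.clocksource = fakeClock
go rm.Run(ctx)
fakeClock.Increment(roleReconcileInterval)
pollMemberCount(t, nodes[1], 2)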

Contributor Author

Huh, maybe we can do the same thing for the control API server.

@cyli cyli force-pushed the role-reconciler-edge branch from af81379 to e82922d Compare March 20, 2018 17:22
@cyli
Contributor Author

cyli commented Mar 21, 2018

Sorry, I was looking through the code again and I may have found one more place where there could be a bug that is not covered by the changes in this PR.

When a node is removed, if its desired role is manager and it's a member of the raft cluster, we require that it be demoted before removal. If it's not a member of the raft cluster, we allow removal, even if the desired role is manager. So if a node was promoted and then immediately removed from the cluster before it managed to join the raft cluster, or at least before the conf changes were applied, we'd allow the node removal, and right afterward the node would join the raft cluster membership. It would then never be removed from the raft cluster membership until the next leader election (because roleManager does not watch for node deletions). I can update the PR to also watch for node deletions and handle that case, or we can just wait for the leader election.

cyli and others added 3 commits March 21, 2018 16:00
both on startup and while a node is demoted while it's running.  Also add
a failing test, where the roleManager should remove deleted nodes from
the cluster membership on startup.

These tests involve injecting a fake clock source into roleManager.

Signed-off-by: cyli <[email protected]>
… node that

is a manager can be removed if it hasn't joined the raft cluster yet.  However,
perhaps the raft conf change happens right after the check.

Signed-off-by: Ying Li <[email protected]>
@cyli cyli force-pushed the role-reconciler-edge branch from e82922d to 36954b5 Compare March 21, 2018 23:40
@cyli
Contributor Author

cyli commented Mar 22, 2018

Updated to remove deleted nodes while running, as well.

@cyli
Contributor Author

cyli commented Mar 27, 2018

@anshulpundir In light of our discussions in #2565 about how the role manager should change, do you want me to make more changes in this PR, or make additional changes in a second PR (e.g. `RoleChangeInProgress`, monitoring raft cluster membership, etc.)?

@anshulpundir
Contributor

I'd suggest follow-up PRs. I can also take up some of those work items, btw. @cyli

@cyli
Contributor Author

cyli commented Mar 27, 2018

@anshulpundir That'd be awesome, thank you!

@anshulpundir anshulpundir merged commit 15f2859 into moby:master Apr 4, 2018
@cyli cyli deleted the role-reconciler-edge branch April 4, 2018 20:15
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Apr 17, 2018
Relevant changes:

- moby/swarmkit#2551 RoleManager will remove deleted nodes from the cluster membership
- moby/swarmkit#2574 Scheduler/TaskReaper: handle unassigned tasks marked for shutdown
- moby/swarmkit#2561 Avoid predefined error log
- moby/swarmkit#2557 Task reaper should delete tasks with removed slots that were not yet assigned
- moby/swarmkit#2587 [fips] Agent reports FIPS status
- moby/swarmkit#2603 Fix manager/state/store.timedMutex

Signed-off-by: Sebastiaan van Stijn <[email protected]>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this pull request Apr 18, 2018
Relevant changes:

- moby/swarmkit#2551 RoleManager will remove deleted nodes from the cluster membership
- moby/swarmkit#2574 Scheduler/TaskReaper: handle unassigned tasks marked for shutdown
- moby/swarmkit#2561 Avoid predefined error log
- moby/swarmkit#2557 Task reaper should delete tasks with removed slots that were not yet assigned
- moby/swarmkit#2587 [fips] Agent reports FIPS status
- moby/swarmkit#2603 Fix manager/state/store.timedMutex

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Upstream-commit: 333b2f28fef4ba857905e7263e7b9bbbf7c522fc
Component: engine
thaJeztah added a commit to thaJeztah/docker-ce that referenced this pull request Apr 19, 2018
Relevant changes:

- moby/swarmkit#2551 RoleManager will remove deleted nodes from the cluster membership
- moby/swarmkit#2574 Scheduler/TaskReaper: handle unassigned tasks marked for shutdown
- moby/swarmkit#2561 Avoid predefined error log
- moby/swarmkit#2557 Task reaper should delete tasks with removed slots that were not yet assigned
- moby/swarmkit#2587 [fips] Agent reports FIPS status
- moby/swarmkit#2603 Fix manager/state/store.timedMutex

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 333b2f28fef4ba857905e7263e7b9bbbf7c522fc)
Signed-off-by: Sebastiaan van Stijn <[email protected]>