
RoleManager will remove deleted nodes from the cluster membership #2551

Merged
merged 3 commits into moby:master from role-reconciler-edge on Apr 4, 2018

Conversation

Contributor

@cyli cyli commented Mar 10, 2018

Fixes #2548 (please see description there), I think.

Wondering if we should also prevent node removal if the demotion process hasn't completed for a manager (e.g. if the node is still in the raft cluster, we disallow the remove node API call and require that the user try again later).
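For illustration, a rough sketch of what such a guard might look like in the control API's RemoveNode handler (hypothetical and not part of this PR; the raftNode/request names and the exact error code are assumptions):

// Sketch only: if the node is still present in the raft member list, its demotion
// has not finished reconciling, so refuse the removal and ask the caller to retry.
if member := raftNode.GetMemberByNodeID(request.NodeID); member != nil {
	return nil, grpc.Errorf(codes.FailedPrecondition,
		"node %s is still a raft member; wait for demotion to complete and retry the removal", request.NodeID)
}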

@cyli cyli force-pushed the role-reconciler-edge branch from 20a887e to 25f8a99 Compare March 10, 2018 00:40
@cyli cyli added the area/raft label Mar 10, 2018
@codecov

codecov bot commented Mar 10, 2018

Codecov Report

Merging #2551 into master will increase coverage by 0.1%.
The diff coverage is 73.33%.

@@            Coverage Diff            @@
##           master    #2551     +/-   ##
=========================================
+ Coverage   61.46%   61.57%   +0.1%     
=========================================
  Files         134      133      -1     
  Lines       21800    21789     -11     
=========================================
+ Hits        13399    13416     +17     
+ Misses       6958     6926     -32     
- Partials     1443     1447      +4

Contributor

@anshulpundir anshulpundir left a comment


> Wondering if we should also prevent node removal if the demotion process hasn't completed for a manager (e.g. if the node is still in the raft cluster, we disallow the remove node API call and require that the user try again later).

I think yes. Would that mean we don't need the changes in the PR?

// Run is roleManager's main loop.
// ctx is only used for logging.
func (rm *roleManager) Run(ctx context.Context) {
defer close(rm.doneChan)

var (
nodes []*api.Node
-	ticker *time.Ticker
+	ticker clock.Ticker

Contributor

please add a comment on how this is used
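For context, a commented sketch of how the injected ticker is presumably used in Run (only the names visible in the diff are taken from the code; the comments are mine):

var (
	nodes    []*api.Node
	// ticker drives the periodic reconciliation retries. It is a clock.Ticker rather
	// than a *time.Ticker so that tests can inject a fake clock and advance time manually.
	ticker   clock.Ticker
	// tickerCh stays nil until there is pending work, so the reconciliation loop only
	// wakes up when something actually needs to be retried.
	tickerCh <-chan time.Time
)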

for _, removed := range rm.removed {
rm.evictRemovedNode(ctx, removed)
}
if len(rm.pending) != 0 || len(rm.removed) != 0 {

Contributor

Please add a comment.

if len(rm.pending) != 0 {
ticker = time.NewTicker(roleReconcileInterval)
tickerCh = ticker.C
for _, removed := range rm.removed {

Contributor

Please add a comment.

Contributor Author

Could you clarify what kind of comment would make sense here?

@@ -62,13 +82,29 @@ func (rm *roleManager) Run(ctx context.Context) {
if err != nil {
log.G(ctx).WithError(err).Error("failed to check nodes for role changes")
} else {
// Put all members in the removed list first - if the node object exists
// for this member, then it is deleted from the removed list. This is safe
// because a node object is created when a node first requests TLS certificates

Contributor

> This is safe because a node object is created when a node first requests TLS certificates to be able to communicate with other raft members, so the node object

Sorry, but I don't really follow what this means. Maybe clarify why we're doing this?
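As I read it, the startup reconciliation described by that comment boils down to something like the following sketch (the removed/pending field names come from the diff; GetMemberlist is the raft accessor I'd expect here, so treat it as an assumption):

// Seed the removal queue with every current raft member, then delete the entries for
// which a node object still exists in the store. Whatever remains is a raft member
// whose node object has been deleted, and it needs to be evicted from the membership.
for _, member := range rm.raft.GetMemberlist() {
	rm.removed[member.NodeID] = member
}
for _, node := range nodes {
	delete(rm.removed, node.ID)
	// Role changes for surviving nodes still go through the normal pending queue.
}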


// removed contains raft members, indexed by node ID, whose nodes have since
// been deleted - these members need to be removed from raft
removed map[string]*api.RaftMember

Contributor

maybe rename to to_be_removed to clarify the meaning here?

Contributor

Just add an explanation of the process, rather than butchering variables with verbs and prepositions. ;)

As far as I understand, this is a removal queue for nodes that have been "deleted" but now need to be evicted from the raft member list. Does this need to be persisted or should the code below handle the case where this gets interrupted?

Contributor Author

If it were interrupted, we'd have to try again in another tick, so wouldn't it have to be persisted?

@@ -24,29 +26,47 @@ type roleManager struct {
// pending contains changed nodes that have not yet been reconciled in
// the raft member list.
pending map[string]*api.Node

Contributor

unrelated: pending => pending_reconciliation to make it easier to understand.

@@ -62,13 +82,29 @@ func (rm *roleManager) Run(ctx context.Context) {
if err != nil {
log.G(ctx).WithError(err).Error("failed to check nodes for role changes")
} else {
// Put all members in the removed list first - if the node object exists

Contributor

unrelated: Please also add a high level comment on what all of this method is trying to achieve.

// removeMember removes a member from the raft cluster membership
func (rm *roleManager) removeMember(ctx context.Context, member *membership.Member) {
// Quorum safeguard
if !rm.raft.CanRemoveMember(member.RaftID) {

Contributor

Why would this happen? Please add a comment.

return
}

rmCtx, rmCancel := context.WithTimeout(rm.ctx, 5*time.Second)

Contributor

Please add a global for this timeout.
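Putting both asks together (a comment on the quorum safeguard and a named constant for the timeout), the method could end up looking roughly like this sketch (removalTimeout matches the constant added later in the diff; the rest follows the snippet above and is not the literal final code):

// removeMember removes a member from the raft cluster membership.
func (rm *roleManager) removeMember(ctx context.Context, member *membership.Member) {
	// Quorum safeguard: CanRemoveMember refuses the removal if losing this member would
	// leave the cluster without a quorum of healthy managers, so we give up for now and
	// retry on a later reconciliation tick.
	if !rm.raft.CanRemoveMember(member.RaftID) {
		return
	}

	// Bound each removal attempt so a slow or partitioned cluster cannot block the
	// reconciliation loop indefinitely.
	rmCtx, rmCancel := context.WithTimeout(rm.ctx, removalTimeout)
	defer rmCancel()

	if err := rm.raft.RemoveMember(rmCtx, member.RaftID); err != nil {
		// Leave the entry queued; it will be retried on the next tick.
		log.G(ctx).WithError(err).Errorf("failed to remove member %x from the raft cluster membership", member.RaftID)
	}
}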

// Check if the member still exists in the membership
member := rm.raft.GetMemberByNodeID(raftMember.NodeID)
if member != nil {
// We first try to remove the raft node from the raft cluster. On the next tick, if the node

Contributor

Why can't we do it in the same tick?

Contributor Author

Honestly I'm not sure, as I am not super familiar with this code either - this was just how the logic was previously for demotion. I think we probably can just update the node immediately, since if we remove the node, eventually https://github.com/docker/swarmkit/blob/master/manager/state/raft/raft.go#L1850 gets called, which seems to wait for the configuration change to be processed and applied to the server before returning. Does that sound right cc @aaronlehmann?

Contributor Author

(As the code is written, this allows a node reconciliation to be retried if there's a logged error, but the code can be refactored. Not sure if we require that updating the node objects happen in the next tick on successful cluster membership removal, though)

Collaborator

I'm not completely sure but it may just be to simplify error handling. If we did it in the same tick, we'd have to make sure it was actually removed before deleting it from the list. I don't think there's any downside to deferring the removal from the list.

Contributor Author

@aaronlehmann Cool, thanks! I agree I think there's no downside. I just wanted to be able to add a comment as to why it is deferred until the next tick.

Contributor

This is probably bikeshedding, but this makes the code somewhat harder to read and reason about. I agree that there might not be a downside to doing it this way, but I'm not sure that's motivation enough. That's my take, but I'll let you decide :) @cyli

Contributor Author

@cyli cyli Mar 14, 2018

@anshulpundir If we were 100% sure that error handling is the only reason, I'd be happy to make the change. As it is, I'm only pretty sure it can be updated in the same tick, so perhaps not making the change would be safer. I think I'd want more comprehensive tests to make sure that making that change wouldn't break a corner case I didn't understand.

I will see if I can figure out how to test that, if removing a node errors, the removal will be retried later, though. I haven't added that case because I couldn't think of a good way to write that test yet. (Although I don't think these tests would be sufficient to make sure that updating the node in the same tick wouldn't break something - it's possible, although I don't see it, that there may be a timing issue where waiting out another tick is safer :|)

Contributor

> If we were 100% sure that error handling is the only reason, I'd be happy to make the change.

I guess we're not, so I'm OK with leaving this as is for now.

@anshulpundir
Contributor

Don't know enough about this code path, but I hope my comments are not too annoying @cyli

@cyli cyli force-pushed the role-reconciler-edge branch 2 times, most recently from 5219728 to 5e66778 Compare March 13, 2018 19:24
@cyli
Contributor Author

cyli commented Mar 13, 2018

@anshulpundir

> I think yes. Would that mean we don't need the changes in the PR?

I think the changes in the PR make the role reconciler more correct, and I think it'd be safer to have them. In particular, if an existing cluster is already in this state, this change may be able to remove the extra membership. But the API-level change would hopefully prevent this in the future?

> Don't know enough about this code path, but I hope my comments are not too annoying

Not at all - thanks for the requests for clarification. My context is not your context and vice versa, so it's helpful to know whether my comments are actually useful :)

@cyli
Contributor Author

cyli commented Mar 13, 2018

Also I can't seem to replicate the error in CI - my changes were just (1) rebasing and (2) comments and renaming, and it was not these tests that failed. :|

@anshulpundir
Contributor

> Particularly if an existing cluster is already in this state, this change may be able to remove the extra membership. But the API-level change would hopefully prevent this in the future?

That's a valid point, although I'm wondering if it's important enough to write this logic specifically to address existing clusters in this state. We have seen a few customers in this state, and in all of those cases we fell back to docker swarm init --force-new-cluster; my point being that customers will likely not sit around in this state waiting to upgrade. What do you think?

Are there any other opinions on this? @nishanttotla @dperny @aluzzardi

-const roleReconcileInterval = 5 * time.Second
+const (
+	roleReconcileInterval = 5 * time.Second
+	removalTimeout        = 5 * time.Second

Contributor

nit: please add a note on what these are?
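Something like this commented version of the const block would cover it (a sketch; the comment wording is mine, not the final code):

const (
	// roleReconcileInterval is how often the role manager retries reconciliation of
	// pending role changes and of members that still need to be removed from raft.
	roleReconcileInterval = 5 * time.Second
	// removalTimeout bounds a single attempt to remove a member from the raft
	// cluster membership.
	removalTimeout = 5 * time.Second
)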

@anshulpundir
Contributor

BTW, we have been seeing problems recently where the node remains in the memberlist but is removed from the store and there is no way to remove this member from raft. Thx @cyli for discovering and fixing it!

@cyli cyli force-pushed the role-reconciler-edge branch from 5e66778 to 042acc9 Compare March 15, 2018 17:30
@cyli
Contributor Author

cyli commented Mar 15, 2018

> That's a valid point, although I'm wondering if it's important enough to write this logic specifically to address existing clusters in this state. We have seen a few customers in this state, and in all of those cases we fell back to docker swarm init --force-new-cluster; my point being that customers will likely not sit around in this state waiting to upgrade. What do you think?

Is it also possible that there are ghost raft IDs in a cluster's membership that aren't currently causing problems, but could cause problems later if another node or two get demoted/removed? If so, this would remove the ghost node before it becomes a problem and they have to force a new cluster.

Also, this is probably bikeshedding, but if we agree that this is the more correct behavior for node reconciliation, I favor fixing the misbehavior. It's true that updating the API to disallow removal before the role manager has successfully reconciled the role would prevent the bug from surfacing, but I'd consider these separate updates - one to fix the role manager misbehavior, and the other to provide a better user experience through better feedback.

Especially since it's easier to test that the role manager now removes deleted nodes, while testing that changing the control API prevents this behavior is harder. The former is more of a unit test, the latter would be an integration test, and it's hard to get the timing right (I wouldn't be sure how to do it atm).

@anshulpundir
Contributor

> but if we agree that this is the more correct behavior for node reconciliation, I favor fixing the misbehavior

I am afraid I don't fully agree with this. The only reason we need this is because of the bug in node remove not waiting for node demote.

> Is it also possible that there are ghost raft IDs in a cluster's membership that aren't currently causing problems, but could cause problems later if another node or two get demoted/removed? If so, this would remove the ghost node before it becomes a problem and they have to force a new cluster.

Agreed.

But let me clarify my concern: it leaves dead code in the product. At some point in the future we will have no customers on a version before this fix. At that point, this code is dead (assuming we also have the fix to disallow removing nodes while reconciliation is in progress, or demotion is made synchronous).

@cyli
Contributor Author

cyli commented Mar 15, 2018

> The only reason we need this is because of the bug in node remove not waiting for node demote.

Ah, I see. I had not considered node remove not waiting for demote a bug, just a change in behavior that would make a difference in user experience. From that perspective that makes sense.

How would you test the controller behavior?

@cyli cyli force-pushed the role-reconciler-edge branch from 042acc9 to af81379 Compare March 15, 2018 18:51

Contributor

@nishanttotla nishanttotla left a comment

Code LGTM on first pass, but I'd like to do another pass before approving. I'm still not 100% sure about the new logic.

Thanks for this fix @cyli.

Collaborator

@dperny dperny left a comment

LGTM, my one API comment isn't a blocker.

}

// newRoleManager creates a new roleManager.
-func newRoleManager(store *store.MemoryStore, raftNode *raft.Node) *roleManager {
+func newRoleManager(store *store.MemoryStore, raftNode *raft.Node, clocksource clock.Clock) *roleManager {

Collaborator

nit:

  1. newRoleManager is an unexported function, meaning we know that no external package, including external tests, depends on a roleManager.
  2. clocksource is only used for testing, and is passed nil in prod code.

Therefore, it would probably be a cleaner API to just set clocksource directly by accessing the unexported field before calling Run in the tests. That way, we don't expose complexity resulting from testing requirements if we later on decide to move the roleManager to its own package or whatever.

Complicated explanation for a very small nit lol.

Contributor

+1

Contributor Author

Good point, thanks for the suggestion!
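A minimal sketch of what that looks like in a test, assuming the struct keeps an unexported clocksource field and reusing the fakeclock package already vendored for other swarmkit tests (the field name and setup details are assumptions):

// Construct the roleManager with the production signature, then inject a fake clock
// before starting the loop so tests control when reconciliation ticks happen.
fakeClock := fakeclock.NewFakeClock(time.Now())
rm := newRoleManager(s, raftNode)
rm.clocksource = fakeClock

go rm.Run(ctx)

// Advance the fake clock to trigger the next reconciliation tick deterministically.
fakeClock.Increment(roleReconcileInterval)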

// be reconciled. As node updates come in, any promotions or demotions are also added to the
// reconciliation queue and reconciled.

// The removal queue is never added to beyond the startup of the loop, because

Contributor

Thanks for adding this!

> The removal queue is never added to beyond the startup of the loop, because before a node can be removed, it must be demoted, so the demotion should be sufficient to get it removed from the cluster membership.

Not sure I completely follow this part though :(

Contributor Author

What do you think about:

"Nothing will be added to the removal queue after the loop startup. The removal queue consists of nodes that have been already removed from the cluster. Before a manager (and hence raft) node can be removed from a cluster via the control API, it must first be demoted. Because we watch for node updates, and we add demoted nodes to the reconciliation queue, the node's removal from the raft cluster should be handled upon reconciliation of that queue."

Contributor Author

Nm, as per #2551 (comment), this comment won't apply any more anyway :|


@@ -0,0 +1,213 @@
package manager

Contributor

How do we test the specific condition this fix handles: start with a raft node in the member list but not in the store, and make sure that the role manager removes it when it starts up. @cyli

Contributor Author

https://github.com/docker/swarmkit/pull/2551/files#diff-5ebf7b379f77f441db1a9548cabf2a89R170:

The role manager is tested independently, so it can be started at any time. I create a 3-node raft cluster and update the store to contain only 2 node objects. The membership then reflects 3 raft members, but the store says there are only 2 canonical node objects. Then I create a new role manager using one of the raft nodes and test that, after a while, one of the members has been kicked out.

I'll add another check that the correct raft node has been kicked out.
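In compressed form, the shape of that test is roughly the sketch below (startThreeNodeRaftCluster and pollMemberCount are purely hypothetical helpers standing in for the real raft test utilities; the actual test is in the new manager test file):

// Start with 3 raft members and 3 matching node objects in the store.
nodes := startThreeNodeRaftCluster(t)

// Delete one node object from the store while leaving its raft membership intact,
// reproducing the "member without a node object" state this PR fixes.
require.NoError(t, nodes[1].MemoryStore().Update(func(tx store.Tx) error {
	return store.DeleteNode(tx, nodes[3].SecurityConfig.ClientTLSCreds.NodeID())
}))

// Run a role manager against one of the remaining members with a fake clock; after a
// reconciliation tick it should evict the orphaned member from the cluster membership.
fakeClock := fakeclock.NewFakeClock(time.Now())
rm := newRoleManager(nodes[1].MemoryStore(), nodes[1].Node)
rm.clocksource = fakeClock
go rm.Run(ctx)
fakeClock.Increment(roleReconcileInterval)
pollMemberCount(t, nodes[1], 2)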

Contributor Author

Huh, maybe we can do the same thing for the control API server.

@cyli cyli force-pushed the role-reconciler-edge branch from af81379 to e82922d Compare March 20, 2018 17:22
@cyli
Contributor Author

cyli commented Mar 21, 2018

Sorry, I was looking through the code again and I may have found one more place where there could be a bug that is not covered by the changes in this PR.

When a node is removed, if its desired role is manager and it's a member of the raft cluster, we require that it be demoted before removal. If it's not a member of the raft cluster, we allow removal, even if the desired role is manager. So if a node was promoted and then immediately removed from the cluster before it managed to join the raft cluster, or at least before the conf changes were applied, we'd allow the node removal, and right afterward the node would join the raft cluster membership. It would then never be removed from the raft cluster membership until the next leader election (because roleManager does not watch for node deletions). I can update the PR to also watch for node deletions and handle that case, or we can just wait for the leader election.

cyli and others added 3 commits March 21, 2018 16:00
both on startup and while a node is demoted while it's running.  Also add
a failing test, where the roleManager should remove deleted nodes from
the cluster membership on startup.

These tests involve injecting a fake clock source into roleManager.

Signed-off-by: cyli <[email protected]>
… node that

is a manager can be removed if it hasn't joined the raft cluster yet.  However,
perhaps the raft conf change happens right after the check.

Signed-off-by: Ying Li <[email protected]>
@cyli cyli force-pushed the role-reconciler-edge branch from e82922d to 36954b5 Compare March 21, 2018 23:40
@cyli
Contributor Author

cyli commented Mar 22, 2018

Updated to remove deleted nodes while running, as well.

@cyli
Contributor Author

cyli commented Mar 27, 2018

@anshulpundir In light of our discussions in #2565 about how the role manager should change, do you want me to make more changes in this PR, or make additional changes in a second PR (e.g. `RoleChangeInProgress`, monitoring raft cluster membership, etc.)?

@anshulpundir
Contributor

I'd suggest follow-up PRs. I can also take up some of those work items, btw. @cyli

@cyli
Contributor Author

cyli commented Mar 27, 2018

@anshulpundir That'd be awesome, thank you!

@anshulpundir anshulpundir merged commit 15f2859 into moby:master Apr 4, 2018
@cyli cyli deleted the role-reconciler-edge branch April 4, 2018 20:15
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Apr 17, 2018
Relevant changes:

- moby/swarmkit#2551 RoleManager will remove deleted nodes from the cluster membership
- moby/swarmkit#2574 Scheduler/TaskReaper: handle unassigned tasks marked for shutdown
- moby/swarmkit#2561 Avoid predefined error log
- moby/swarmkit#2557 Task reaper should delete tasks with removed slots that were not yet assigned
- moby/swarmkit#2587 [fips] Agent reports FIPS status
- moby/swarmkit#2603 Fix manager/state/store.timedMutex

Signed-off-by: Sebastiaan van Stijn <[email protected]>
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this pull request Apr 18, 2018
Relevant changes:

- moby/swarmkit#2551 RoleManager will remove deleted nodes from the cluster membership
- moby/swarmkit#2574 Scheduler/TaskReaper: handle unassigned tasks marked for shutdown
- moby/swarmkit#2561 Avoid predefined error log
- moby/swarmkit#2557 Task reaper should delete tasks with removed slots that were not yet assigned
- moby/swarmkit#2587 [fips] Agent reports FIPS status
- moby/swarmkit#2603 Fix manager/state/store.timedMutex

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Upstream-commit: 333b2f28fef4ba857905e7263e7b9bbbf7c522fc
Component: engine
thaJeztah added a commit to thaJeztah/docker-ce that referenced this pull request Apr 19, 2018
Relevant changes:

- moby/swarmkit#2551 RoleManager will remove deleted nodes from the cluster membership
- moby/swarmkit#2574 Scheduler/TaskReaper: handle unassigned tasks marked for shutdown
- moby/swarmkit#2561 Avoid predefined error log
- moby/swarmkit#2557 Task reaper should delete tasks with removed slots that were not yet assigned
- moby/swarmkit#2587 [fips] Agent reports FIPS status
- moby/swarmkit#2603 Fix manager/state/store.timedMutex

Signed-off-by: Sebastiaan van Stijn <[email protected]>
(cherry picked from commit 333b2f28fef4ba857905e7263e7b9bbbf7c522fc)
Signed-off-by: Sebastiaan van Stijn <[email protected]>