
storage: fix disappearing RHS merge bug #31842

Merged: 1 commit merged into master from merge-bug on Oct 28, 2018
Conversation

@benesch (Contributor) commented Oct 24, 2018

The strategy used by the replica GC queue to determine whether a subsumed range can be GC'd is flawed. If a replica of the LHS was uninitialized at the time the merge committed, there is a small window during which the replica GC queue can think that it's safe to clean up an RHS replica, when in fact the uninitialized LHS replica could still initialize and apply a merge trigger that requires that RHS to be present.

Make the replica GC queue's strategy valid by requiring that all replicas of the LHS are initialized before beginning a merge transaction. This closes the window during which a replica of the RHS could be incorrectly GC'd, with a patch that is small enough to be backported to v2.1.1.

Fix #31719.

Release note: None
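
For context, here is a minimal, self-contained Go sketch of the invariant this change enforces. The names (replicaDesc, lhsReplicaInitialized, waitForLHSInit) are hypothetical stand-ins rather than the actual CockroachDB code: the merge coordinator simply refuses to begin the merge transaction until every replica of the LHS reports that it is initialized.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// replicaDesc is a stand-in for the real replica descriptor.
type replicaDesc struct{ NodeID, ReplicaID int }

// lhsReplicaInitialized is a hypothetical check; in the real patch this ends
// up as an RPC to the store that holds the replica.
func lhsReplicaInitialized(ctx context.Context, r replicaDesc) (bool, error) {
	return true, nil // pretend every replica is already initialized
}

// waitForLHSInit polls each LHS replica until it is initialized or the
// caller's context expires. Only after it returns nil does the coordinator
// start the merge transaction, which is what makes the replica GC queue's
// "is it safe to GC the RHS?" reasoning sound.
func waitForLHSInit(ctx context.Context, replicas []replicaDesc) error {
	for _, r := range replicas {
		for {
			ok, err := lhsReplicaInitialized(ctx, r)
			if err != nil {
				return err
			}
			if ok {
				break
			}
			select {
			case <-ctx.Done():
				return errors.New("gave up waiting for LHS replica init: " + ctx.Err().Error())
			case <-time.After(10 * time.Millisecond):
			}
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	lhs := []replicaDesc{{1, 1}, {2, 2}, {3, 3}}
	if err := waitForLHSInit(ctx, lhs); err != nil {
		fmt.Println("aborting merge:", err)
		return
	}
	fmt.Println("all LHS replicas initialized; safe to begin the merge txn")
}
```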

@benesch requested review from bdarnell, tbg and a team on October 24, 2018 23:28
@cockroach-teamcity (Member)

This change is Reviewable

@benesch added the do-not-merge label (bors won't merge a PR with this label) on Oct 24, 2018
@benesch changed the title from "[wip] storage: fix disappearing RHS merge bug" to "storage: fix disappearing RHS merge bug" on Oct 25, 2018
@benesch removed the do-not-merge label on Oct 25, 2018
@benesch (Contributor, Author) commented Oct 25, 2018

Ok, this is RFAL.

@tbg (Member) left a comment

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 630 at r1 (raw file):

	retryOpts := base.DefaultRetryOptions()
	retryOpts.MaxRetries = 5
retryLoop:

Feel free to dismiss this suggestion, but I always try to avoid labels because they're hard to follow. Here it seems that you could just drop the continue below and add a check if lastErr != nil { lastErr = nil; continue } before the return nil. You don't need to propagate lastErr outside of the retry loop because at that point you know that all replicas caught the same error, so you can just return one you manufacture right at the end.
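
A rough, self-contained sketch of the label-free structure being suggested, with hypothetical names (checkReplica, and a plain attempt counter standing in for the retry helper): the inner loop remembers the last error but keeps checking the remaining replicas, and the decision to retry is made once per pass rather than via a labeled continue.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkReplica is a hypothetical stand-in for "has this replica caught up?".
func checkReplica(ctx context.Context, id int) error {
	return nil
}

func waitForReplicas(ctx context.Context, ids []int) error {
	const maxRetries = 5
	for attempt := 0; attempt < maxRetries; attempt++ {
		var lastErr error
		for _, id := range ids {
			if err := checkReplica(ctx, id); err != nil {
				lastErr = err // remember the failure, but keep checking the rest
			}
		}
		if lastErr == nil {
			return nil // every replica passed on this pass
		}
		time.Sleep(10 * time.Millisecond) // crude backoff before the next pass
	}
	// All retries exhausted: manufacture one summary error here rather than
	// threading lastErr out of the loop.
	return errors.New("replicas did not catch up after 5 attempts")
}

func main() {
	fmt.Println(waitForReplicas(context.Background(), []int{1, 2, 3}))
}
```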


pkg/storage/replica_command.go, line 631 at r1 (raw file):

	retryOpts.MaxRetries = 5
retryLoop:
	for retrier := retry.StartWithCtx(ctx, retryOpts); retrier.Next(); {

Can we fail-fast if we're not the leader? Seems silly to spend a number of retries here and block merges that could actually proceed.


pkg/storage/replica_command.go, line 640 at r1 (raw file):

			// that the replicas are initialized, and the merge will (hopefully) be
			// retried on the leader.
			if raftStatus.Progress[uint64(r.ReplicaID)].Match == 0 {

raftStatus could be nil. Probably not in reality since you've already somehow made sure that it's initialized before getting here, but still.
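
One way to guard against the nil status, sketched with simplified stand-in types rather than the real Raft status from etcd/raft:

```go
// raftStatus and progress are simplified stand-ins for the real Raft types.
type progress struct{ Match uint64 }
type raftStatus struct{ Progress map[uint64]progress }

// replicaCaughtUp returns false (rather than panicking) when the status is
// nil, e.g. because this replica is not currently the Raft leader.
func replicaCaughtUp(status *raftStatus, replicaID uint64) bool {
	if status == nil {
		return false
	}
	pr, ok := status.Progress[replicaID]
	return ok && pr.Match > 0
}
```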

@bdarnell (Contributor) left a comment

:lgtm: with @tschottdorf's comments

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@benesch (Contributor, Author) left a comment

Tobi and I chatted and decided that it would be vastly simpler to just introduce a WaitForReplicaInit RPC into the stores server. Neither of us was fully confident that we wouldn't get stuck in a leader-but-not-leaseholder loop with the previous approach, and the code to transfer Raft leadership to match the leaseholdership is, uhm, hairy. (For example: do you want transferRaftLeader or transferRaftLeadership?)

PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)
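
Since the reworked code isn't quoted inline here, a hedged sketch of its shape, using golang.org/x/sync/errgroup in place of CockroachDB's ctxgroup and a stubbed-out waitForReplicaInit call: the coordinator fans out one wait per LHS replica under a single timeout and aborts the merge if any of them fails.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

type replicaDesc struct{ NodeID, StoreID, ReplicaID int }

// waitForReplicaInit stands in for the new RPC: it asks the store hosting the
// replica to block until that replica is initialized (or the context expires).
func waitForReplicaInit(ctx context.Context, r replicaDesc) error {
	return nil // hypothetical stub
}

func waitForAllLHSInit(ctx context.Context, replicas []replicaDesc) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	g, ctx := errgroup.WithContext(ctx)
	for _, repl := range replicas {
		repl := repl // capture a fresh copy for the goroutine below
		g.Go(func() error {
			return waitForReplicaInit(ctx, repl)
		})
	}
	return g.Wait() // non-nil => abort the merge instead of starting the txn
}

func main() {
	err := waitForAllLHSInit(context.Background(), []replicaDesc{{1, 1, 1}, {2, 2, 2}})
	fmt.Println("wait result:", err)
}
```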


pkg/storage/replica_command.go, line 630 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Feel free to dismiss this suggestion, but I always try to avoid labels because they're hard to follow. Here it seems that you could just drop the continue below and add a check if lastErr != nil { lastErr = nil; continue } before the return nil. You don't need to propagate lastErr outside of the retry loop because at that point you know that all replicas caught the same error, so you can just return one you manufacture right at the end.

No longer relevant.


pkg/storage/replica_command.go, line 631 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can we fail-fast if we're not the leader? Seems silly to spend a number of retries here and block merges that could actually proceed.

No longer relevant.


pkg/storage/replica_command.go, line 640 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

raftStatus could be nil. Probably not in reality since you've already somehow made sure that it's initialized before getting here, but still.

No longer relevant.

@benesch (Contributor, Author) left a comment

I am curious to know if this has implications for backporting. I don't love the idea of introducing a new RPC in a patch release.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@bdarnell (Contributor) left a comment

Adding this RPC in a patch release is fine since it'll only be used if the cluster setting is enabled, and that setting isn't changing in a patch release. Even if the user upgrades directly from 2.1.0 to 2.2.0 (which will flip the default), the loop that calls this RPC should just bail out and abort the merge if it gets an unknown RPC error.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)
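
A hedged sketch of the compatibility behavior described above, assuming the RPC travels over gRPC so an old-version node answers with codes.Unimplemented; the helper and error values here are illustrative, not the actual CockroachDB code.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errAbortMerge signals that this merge attempt should simply be abandoned
// because a remote node does not implement the new RPC yet.
var errAbortMerge = errors.New("remote store does not implement WaitForReplicaInit; aborting merge")

// callWaitForReplicaInit wraps the (hypothetical) RPC invocation: an
// Unimplemented response from an old-version node turns into a clean abort
// rather than being retried.
func callWaitForReplicaInit(ctx context.Context, call func(context.Context) error) error {
	err := call(ctx)
	if err != nil && status.Code(err) == codes.Unimplemented {
		return errAbortMerge
	}
	return err
}

func main() {
	oldNode := func(context.Context) error {
		return status.Error(codes.Unimplemented, "unknown method WaitForReplicaInit")
	}
	fmt.Println(callWaitForReplicaInit(context.Background(), oldNode))
}
```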


pkg/storage/stores_server.go, line 119 at r2 (raw file):

	resp := &WaitForReplicaInitResponse{}
	err := is.execStoreCommand(req.StoreRequestHeader, func(s *Store) error {
		retryOpts := retry.Options{InitialBackoff: 10 * time.Millisecond}

There should be an upper bound on this retry loop. Better to return an error and abort the merge than to hang indefinitely.

@benesch (Contributor, Author) commented Oct 25, 2018

There should be an upper bound on this retry loop. Better to return an error and abort the merge than to hang indefinitely.

The upper bound is enforced by context timeouts. (This is how WaitForApplication works too.) The client sets the timeout at 5s currently. Is that sufficient or do you want me to add a server-side upper bound too?
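
A compact sketch of that division of responsibility, with hypothetical names: the server-side loop has no retry cap of its own and spins until the request's context is done, while the client bounds the whole call with a 5-second timeout.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// replicaIsInitialized is a hypothetical server-side check.
func replicaIsInitialized() bool { return false }

// serverWaitForReplicaInit retries indefinitely on its own; the only upper
// bound is the request context, which the client cancels via its timeout.
func serverWaitForReplicaInit(ctx context.Context) error {
	for {
		if replicaIsInitialized() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // client's deadline expired; the merge is aborted
		case <-time.After(10 * time.Millisecond):
		}
	}
}

func main() {
	// Client side: the 5s timeout is what keeps the server loop from hanging.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println(serverWaitForReplicaInit(ctx)) // prints "context deadline exceeded"
}
```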

@tbg (Member) left a comment

:lgtm:

Reviewed 5 of 5 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica_command.go, line 633 at r2 (raw file):

	g := ctxgroup.WithContext(ctx)
	for _, repl := range desc.Replicas {
		repl := repl

Things would get really awkward (and incorrect) without this line. I hate that this is necessary but that's the Go way. Please add a comment.
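
For readers unfamiliar with the Go gotcha being referenced: prior to Go 1.22, the range variable is a single variable reused across iterations, so goroutines launched from the loop would otherwise all observe whatever value it holds when they run. A minimal illustration with plain goroutines:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	replicas := []string{"n1", "n2", "n3"}
	var wg sync.WaitGroup
	for _, repl := range replicas {
		repl := repl // shadow the loop variable so each goroutine gets its own copy
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("waiting for", repl) // without the copy, goroutines may all see "n3"
		}()
	}
	wg.Wait()
}
```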


pkg/storage/stores_server.go, line 119 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There should be an upper bound on this retry loop. Better to return an error and abort the merge than to hang indefinitely.

Having the caller use context cancellation works for me, but add a comment here.

@bdarnell (Contributor) replied:

The upper bound is enforced by context timeouts. (This is how WaitForApplication works too.) The client sets the timeout at 5s currently. Is that sufficient or do you want me to add a server-side upper bound too?

That's fine for now. I generally prefer limits set on both sides, but it doesn't need to change until we get around to doing a general review of all of our retry loops.

@benesch force-pushed the merge-bug branch 2 times, most recently from 06ef494 to 7095be2, on October 26, 2018 21:11
@benesch (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica_command.go, line 633 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Things would get really awkward (and incorrect) without this line. I hate that this is necessary but that's the Go way. Please add a comment.

Done.


pkg/storage/stores_server.go, line 119 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Having the caller use context cancellation works for me, but add a comment here.

Done.

@tbg (Member) left a comment

Wanna fix your imports and get this baby in?

Reviewed 3 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)

@benesch (Contributor, Author) commented Oct 28, 2018

bors r=tschottdorf,bdarnell

craig bot pushed a commit that referenced this pull request Oct 28, 2018
31842: storage: fix disappearing RHS merge bug r=tschottdorf,bdarnell a=benesch


Co-authored-by: Nikhil Benesch <[email protected]>
craig bot (Contributor) commented Oct 28, 2018

Build succeeded

craig bot merged commit a3b85db into master on Oct 28, 2018
@benesch deleted the merge-bug branch on October 28, 2018 05:10
@benesch (Contributor, Author) commented Oct 28, 2018

Whew, got it in for this night's test run. 🤞

Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: ccl/importccl: TestImportCSVStmt failed under stress
4 participants