
storage: fix disappearing RHS merge bug #31842

Merged: 1 commit merged into master from merge-bug on Oct 28, 2018
Conversation

@benesch (Contributor) commented Oct 24, 2018

The strategy used by the replica GC queue to determine whether a subsumed range can be GC'd is flawed. If a replica of the LHS was uninitialized at the time the merge committed, there is a small window during which the replica GC queue can think that it's safe to clean up an RHS replica, when in fact the uninitialized LHS replica could still initialize and apply a merge trigger that requires that RHS to be present.

Make the replica GC queue's strategy valid by requiring that all replicas of the LHS are initialized before beginning a merge transaction. This closes the window during which a replica of the RHS could be incorrectly GC'd, with a patch that is small enough to be backported to v2.1.1.

Fix #31719.

Release note: None
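
For context, here is a minimal, self-contained Go sketch of the invariant this change enforces. The names (replicaDesc, lhsReplicaInitialized, waitForLHSInit) are hypothetical stand-ins rather than the actual CockroachDB code: the merge coordinator simply refuses to begin the merge transaction until every replica of the LHS reports that it is initialized.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// replicaDesc is a stand-in for the real replica descriptor.
type replicaDesc struct{ NodeID, ReplicaID int }

// lhsReplicaInitialized is a hypothetical check; in the real patch this ends
// up as an RPC to the store that holds the replica.
func lhsReplicaInitialized(ctx context.Context, r replicaDesc) (bool, error) {
	return true, nil // pretend every replica is already initialized
}

// waitForLHSInit polls each LHS replica until it is initialized or the
// caller's context expires. Only after it returns nil does the coordinator
// start the merge transaction, which is what makes the replica GC queue's
// "is it safe to GC the RHS?" reasoning sound.
func waitForLHSInit(ctx context.Context, replicas []replicaDesc) error {
	for _, r := range replicas {
		for {
			ok, err := lhsReplicaInitialized(ctx, r)
			if err != nil {
				return err
			}
			if ok {
				break
			}
			select {
			case <-ctx.Done():
				return errors.New("gave up waiting for LHS replica init: " + ctx.Err().Error())
			case <-time.After(10 * time.Millisecond):
			}
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	lhs := []replicaDesc{{1, 1}, {2, 2}, {3, 3}}
	if err := waitForLHSInit(ctx, lhs); err != nil {
		fmt.Println("aborting merge:", err)
		return
	}
	fmt.Println("all LHS replicas initialized; safe to begin the merge txn")
}
```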

@benesch requested review from bdarnell, tbg and a team on October 24, 2018 23:28
@cockroach-teamcity (Member)

This change is Reviewable

@benesch added the do-not-merge label (bors won't merge a PR with this label) on Oct 24, 2018
@benesch changed the title from "[wip] storage: fix disappearing RHS merge bug" to "storage: fix disappearing RHS merge bug" on Oct 25, 2018
@benesch removed the do-not-merge label on Oct 25, 2018
@benesch (Contributor, Author) commented Oct 25, 2018

Ok, this is RFAL.

@tbg (Member) left a comment

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 630 at r1 (raw file):

	retryOpts := base.DefaultRetryOptions()
	retryOpts.MaxRetries = 5
retryLoop:

Feel free to dismiss this suggestion, but I always try to avoid labels because they're hard to follow. Here it seems that you could just drop the continue below and add a check if lastErr != nil { lastErr = nil; continue } before the return nil. You don't need to propagate lastErr outside of the retry loop because at that point you know that all replicas caught the same error, so you can just return one you manufacture right at the end.
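
A rough, self-contained sketch of the label-free structure being suggested, with hypothetical names (checkReplica, and a plain attempt counter standing in for the retry helper): the inner loop remembers the last error but keeps checking the remaining replicas, and the decision to retry is made once per pass rather than via a labeled continue.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkReplica is a hypothetical stand-in for "has this replica caught up?".
func checkReplica(ctx context.Context, id int) error {
	return nil
}

func waitForReplicas(ctx context.Context, ids []int) error {
	const maxRetries = 5
	for attempt := 0; attempt < maxRetries; attempt++ {
		var lastErr error
		for _, id := range ids {
			if err := checkReplica(ctx, id); err != nil {
				lastErr = err // remember the failure, but keep checking the rest
			}
		}
		if lastErr == nil {
			return nil // every replica passed on this pass
		}
		time.Sleep(10 * time.Millisecond) // crude backoff before the next pass
	}
	// All retries exhausted: manufacture one summary error here rather than
	// threading lastErr out of the loop.
	return errors.New("replicas did not catch up after 5 attempts")
}

func main() {
	fmt.Println(waitForReplicas(context.Background(), []int{1, 2, 3}))
}
```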


pkg/storage/replica_command.go, line 631 at r1 (raw file):

	retryOpts.MaxRetries = 5
retryLoop:
	for retrier := retry.StartWithCtx(ctx, retryOpts); retrier.Next(); {

Can we fail-fast if we're not the leader? Seems silly to spend a number of retries here and block merges that could actually proceed.


pkg/storage/replica_command.go, line 640 at r1 (raw file):

			// that the replicas are initialized, and the merge will (hopefully) be
			// retried on the leader.
			if raftStatus.Progress[uint64(r.ReplicaID)].Match == 0 {

raftStatus could be nil. Probably not in reality since you've already somehow made sure that it's initialized before getting here, but still.
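
One way to guard against the nil status, sketched with simplified stand-in types rather than the real Raft status from etcd/raft:

```go
// raftStatus and progress are simplified stand-ins for the real Raft types.
type progress struct{ Match uint64 }
type raftStatus struct{ Progress map[uint64]progress }

// replicaCaughtUp returns false (rather than panicking) when the status is
// nil, e.g. because this replica is not currently the Raft leader.
func replicaCaughtUp(status *raftStatus, replicaID uint64) bool {
	if status == nil {
		return false
	}
	pr, ok := status.Progress[replicaID]
	return ok && pr.Match > 0
}
```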

@bdarnell (Contributor) left a comment

:lgtm: with @tschottdorf's comments

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@benesch (Contributor, Author) left a comment

Tobi and I chatted and decided that it would be vastly simpler to just introduce a WaitForReplicaInit RPC into the stores server. Neither of us was fully confident that we wouldn't get stuck in a leader-but-not-leaseholder loop with the previous approach, and the code to transfer Raft leadership to match the leaseholdership is, uhm, hairy. (For example: do you want transferRaftLeader or transferRaftLeadership?)

PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)
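
Since the reworked code isn't quoted inline here, a hedged sketch of its shape, using golang.org/x/sync/errgroup in place of CockroachDB's ctxgroup and a stubbed-out waitForReplicaInit call: the coordinator fans out one wait per LHS replica under a single timeout and aborts the merge if any of them fails.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

type replicaDesc struct{ NodeID, StoreID, ReplicaID int }

// waitForReplicaInit stands in for the new RPC: it asks the store hosting the
// replica to block until that replica is initialized (or the context expires).
func waitForReplicaInit(ctx context.Context, r replicaDesc) error {
	return nil // hypothetical stub
}

func waitForAllLHSInit(ctx context.Context, replicas []replicaDesc) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	g, ctx := errgroup.WithContext(ctx)
	for _, repl := range replicas {
		repl := repl // capture a fresh copy for the goroutine below
		g.Go(func() error {
			return waitForReplicaInit(ctx, repl)
		})
	}
	return g.Wait() // non-nil => abort the merge instead of starting the txn
}

func main() {
	err := waitForAllLHSInit(context.Background(), []replicaDesc{{1, 1, 1}, {2, 2, 2}})
	fmt.Println("wait result:", err)
}
```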


pkg/storage/replica_command.go, line 630 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Feel free to dismiss this suggestion, but I always try to avoid labels because they're hard to follow. Here it seems that you could just drop the continue below and add a check if lastErr != nil { lastErr = nil; continue } before the return nil. You don't need to propagate lastErr outside of the retry loop because at that point you know that all replicas caught the same error, so you can just return one you manufacture right at the end.

No longer relevant.


pkg/storage/replica_command.go, line 631 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can we fail-fast if we're not the leader? Seems silly to spend a number of retries here and block merges that could actually proceed.

No longer relevant.


pkg/storage/replica_command.go, line 640 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

raftStatus could be nil. Probably not in reality since you've already somehow made sure that it's initialized before getting here, but still.

No longer relevant.

@benesch (Contributor, Author) left a comment

I am curious to know if this has implications for backporting. I don't love the idea of introducing a new RPC in a patch release.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@bdarnell (Contributor) left a comment

Adding this RPC in a patch release is fine since it'll only be used if the cluster setting is enabled, and that setting isn't changing in a patch release. Even if the user upgrades directly from 2.1.0 to 2.2.0 (which will flip the default), the loop that calls this RPC should just bail out and abort the merge if it gets an unknown RPC error.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)
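
A hedged sketch of the compatibility behavior described above, assuming the RPC travels over gRPC so an old-version node answers with codes.Unimplemented; the helper and error values here are illustrative, not the actual CockroachDB code.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errAbortMerge signals that this merge attempt should simply be abandoned
// because a remote node does not implement the new RPC yet.
var errAbortMerge = errors.New("remote store does not implement WaitForReplicaInit; aborting merge")

// callWaitForReplicaInit wraps the (hypothetical) RPC invocation: an
// Unimplemented response from an old-version node turns into a clean abort
// rather than being retried.
func callWaitForReplicaInit(ctx context.Context, call func(context.Context) error) error {
	err := call(ctx)
	if err != nil && status.Code(err) == codes.Unimplemented {
		return errAbortMerge
	}
	return err
}

func main() {
	oldNode := func(context.Context) error {
		return status.Error(codes.Unimplemented, "unknown method WaitForReplicaInit")
	}
	fmt.Println(callWaitForReplicaInit(context.Background(), oldNode))
}
```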


pkg/storage/stores_server.go, line 119 at r2 (raw file):

	resp := &WaitForReplicaInitResponse{}
	err := is.execStoreCommand(req.StoreRequestHeader, func(s *Store) error {
		retryOpts := retry.Options{InitialBackoff: 10 * time.Millisecond}

There should be an upper bound on this retry loop. Better to return an error and abort the merge than to hang indefinitely.

@benesch (Contributor, Author) commented Oct 25, 2018

There should be an upper bound on this retry loop. Better to return an error and abort the merge than to hang indefinitely.

The upper bound is enforced by context timeouts. (This is how WaitForApplication works too.) The client sets the timeout at 5s currently. Is that sufficient or do you want me to add a server-side upper bound too?
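
A compact sketch of that division of responsibility, with hypothetical names: the server-side loop has no retry cap of its own and spins until the request's context is done, while the client bounds the whole call with a 5-second timeout.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// replicaIsInitialized is a hypothetical server-side check.
func replicaIsInitialized() bool { return false }

// serverWaitForReplicaInit retries indefinitely on its own; the only upper
// bound is the request context, which the client cancels via its timeout.
func serverWaitForReplicaInit(ctx context.Context) error {
	for {
		if replicaIsInitialized() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // client's deadline expired; the merge is aborted
		case <-time.After(10 * time.Millisecond):
		}
	}
}

func main() {
	// Client side: the 5s timeout is what keeps the server loop from hanging.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println(serverWaitForReplicaInit(ctx)) // prints "context deadline exceeded"
}
```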

@tbg (Member) left a comment

:lgtm:

Reviewed 5 of 5 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica_command.go, line 633 at r2 (raw file):

	g := ctxgroup.WithContext(ctx)
	for _, repl := range desc.Replicas {
		repl := repl

Things would get really awkward (and incorrect) without this line. I hate that this is necessary but that's the Go way. Please add a comment.
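
For readers unfamiliar with the Go gotcha being referenced: prior to Go 1.22, the range variable is a single variable reused across iterations, so goroutines launched from the loop would otherwise all observe whatever value it holds when they run. A minimal illustration with plain goroutines:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	replicas := []string{"n1", "n2", "n3"}
	var wg sync.WaitGroup
	for _, repl := range replicas {
		repl := repl // shadow the loop variable so each goroutine gets its own copy
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("waiting for", repl) // without the copy, goroutines may all see "n3"
		}()
	}
	wg.Wait()
}
```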


pkg/storage/stores_server.go, line 119 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There should be an upper bound on this retry loop. Better to return an error and abort the merge than to hang indefinitely.

Having the caller use context cancellation works for me, but add a comment here.

@bdarnell (Contributor) replied:

The upper bound is enforced by context timeouts. (This is how WaitForApplication works too.) The client sets the timeout at 5s currently. Is that sufficient or do you want me to add a server-side upper bound too?

That's fine for now. I generally prefer limits set on both sides, but it doesn't need to change until we get around to doing a general review of all of our retry loops.

@benesch force-pushed the merge-bug branch 2 times, most recently from 06ef494 to 7095be2, on October 26, 2018 21:11
@benesch (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica_command.go, line 633 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Things would get really awkward (and incorrect) without this line. I hate that this is necessary but that's the Go way. Please add a comment.

Done.


pkg/storage/stores_server.go, line 119 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Having the caller use context cancellation works for me, but add a comment here.

Done.

@tbg (Member) left a comment

Wanna fix your imports and get this baby in?

Reviewed 3 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)

@benesch (Contributor, Author) commented Oct 28, 2018

bors r=tschottdorf,bdarnell

craig bot pushed a commit that referenced this pull request Oct 28, 2018
31842: storage: fix disappearing RHS merge bug r=tschottdorf,bdarnell a=benesch


Co-authored-by: Nikhil Benesch <[email protected]>
craig bot (Contributor) commented Oct 28, 2018

Build succeeded

craig bot merged commit a3b85db into master on Oct 28, 2018
@benesch deleted the merge-bug branch on October 28, 2018 05:10
@benesch (Contributor, Author) commented Oct 28, 2018

Whew, got it in for this night's test run. 🤞

Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: ccl/importccl: TestImportCSVStmt failed under stress
4 participants