storage: raft snapshot queue inefficiencies #31947

Closed
tbg opened this issue Oct 27, 2018 · 7 comments

Labels: A-kv-replication (Relating to Raft, consensus, and coordination.) · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · X-stale

Comments

@tbg (Member) commented Oct 27, 2018

When Raft wants a snapshot (MsgSnap), we (on the leader) call .Add to put the range into the Raft snapshot queue. However, there are various code paths that can lead to the range not receiving a snapshot "right away" (purgatory, max size limitation). If that happens, the Raft group is stuck in an awkward state until the scanner picks up the range again and hopefully does give it a snapshot.

This setup seems risky, and we may want to address the above points so that Raft-initiated snapshot requests have no chance of being lost.
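
For illustration, here's a minimal standalone sketch (invented names, not the actual queue code) of the failure mode described above: a bounded, best-effort queue whose add silently drops the range when it can't take it, so nothing reports back to Raft until the scanner re-adds the range on its next pass.

```go
// Hypothetical sketch, not CockroachDB code: a best-effort enqueue that
// silently drops when full. In that case the follower that needs the
// snapshot simply waits for the next scanner pass.
package main

import "fmt"

type snapQueue struct {
	maxSize int
	items   []int64 // range IDs, simplified
}

// add mimics a best-effort enqueue: it drops the request when the queue is full.
func (q *snapQueue) add(rangeID int64) bool {
	if len(q.items) >= q.maxSize {
		return false // dropped; Raft is never told
	}
	q.items = append(q.items, rangeID)
	return true
}

func main() {
	q := &snapQueue{maxSize: 2}
	for _, r := range []int64{1, 2, 3} {
		if !q.add(r) {
			fmt.Printf("r%d: snapshot request dropped; waits for next scanner pass\n", r)
		}
	}
}
```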

@tbg tbg added the A-kv-replication Relating to Raft, consensus, and coordination. label Oct 27, 2018
@tbg tbg added this to the 2.2 milestone Oct 27, 2018
@tbg tbg self-assigned this Oct 27, 2018
@tbg tbg changed the title storage: raft snapshot queue may not report back to raft in timely manner storage: raft snapshot queue inefficiencies Oct 27, 2018
@tbg (Member, Author) commented Oct 27, 2018

Another issue I noticed in #31732 (comment) is that Raft snapshots block on a semaphore on the receiver, which adds a component that can introduce latency in non-obvious ways.

The raft log queue allows a concurrency of four, so you'd have to send to four backed-up nodes at the same time to get a complete pipeline stall, but it did happen in my (admittedly pretty busted) experiments.

Not sure there's really something we can easily improve there. I guess we could change the queue semantics to have at most four "sending" snapshots at any given point (i.e. allow more goroutines to get queued into semaphores on various nodes).

The reasonable place to invest the energy is probably to make sure that we're not getting into situations in which there are hundreds of outstanding Raft snapshots.
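
To make the stall mode concrete, here's a small standalone sketch (names and numbers are illustrative, not the real implementation): a fixed number of sender slots all end up parked behind a receiver-side semaphore, so the remaining snapshots can't even start sending.

```go
// Illustrative sketch, not CockroachDB code: each receiver admits one
// snapshot at a time via a semaphore; with all sender slots parked on
// busy receivers, the whole snapshot pipeline stops.
package main

import (
	"fmt"
	"sync"
	"time"
)

// receiver admits one snapshot at a time via a one-slot semaphore.
type receiver struct{ sem chan struct{} }

func (r *receiver) applySnapshot(d time.Duration) {
	r.sem <- struct{}{} // blocks while another snapshot is being applied
	defer func() { <-r.sem }()
	time.Sleep(d)
}

func main() {
	const senderSlots = 4 // hypothetical sender-side concurrency limit
	slots := make(chan struct{}, senderSlots)
	busy := &receiver{sem: make(chan struct{}, 1)}

	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			slots <- struct{}{} // occupy one of the four sender slots
			defer func() { <-slots }()
			// All senders target the same backed-up receiver, so the sends
			// serialize and the remaining work queues up behind them.
			busy.applySnapshot(50 * time.Millisecond)
			fmt.Printf("snapshot %d finished after %v\n", i, time.Since(start).Round(10*time.Millisecond))
		}(i)
	}
	wg.Wait()
}
```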

@petermattis (Collaborator) commented:

> However, there are various code paths that can lead to the range not receiving a snapshot "right away" (purgatory, max size limitation).

Does the Raft snapshot queue have a purgatory? (I know that all queues have a purgatory, but actually placing replicas in purgatory requires an error that implements purgatoryErrorMarker.)

> If that happens, the Raft group is stuck in an awkward state until the scanner picks up the range again and hopefully does give it a snapshot.

s/awkward/fragile/g.

Note that we call Replica.reportSnapshotStatus on both success and failure. That calls into Raft, which will likely request another snapshot, placing the replica back in the queue immediately.

@tbg (Member, Author) commented Oct 27, 2018

> Does the Raft snapshot queue have a purgatory?

I doubt it, but it shouldn't have room for this kind of error.

> Note that we call Replica.reportSnapshotStatus on both success and failure.

No, we would never call this if the queue were full (limit is 10k). We only report a status if we actually tried to process a snapshot.

I'm not aware that there's anything fishy going on in the real world. I'm just uneasy that the queue code isn't really made for a kind of queue in which offering a replica requires an action to be taken.

@petermattis (Collaborator) commented:

> We only report a status if we actually tried to process a snapshot.

Oh, that's a good point.

> I'm not aware that there's anything fishy going on in the real world. I'm just uneasy that the queue code isn't really made for a kind of queue in which offering a replica requires an action to be taken.

Now you've made me uneasy. Perhaps we should bump the max size for this queue. On the other hand, can we realistically process 10k Raft snapshots (the current max size of the Raft snapshot queue) within one scanner interval?

@petermattis petermattis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 28, 2018
@tbg (Member, Author) commented Oct 30, 2018

Another issue (that I think I've observed) is that the raft snapshot queue uses the same priority for everything. It doesn't use this code:

```go
if diff := now.GoTime().Sub(last.GoTime()); diff >= minInterval {
	priority := float64(1)
	// If there's a non-zero last processed timestamp, adjust the
	// priority by a multiple of how long it's been since the last
	// time this replica was processed.
	if last != (hlc.Timestamp{}) {
		priority = float64(diff.Nanoseconds()) / float64(minInterval.Nanoseconds())
	}
	return true, priority
}
```

As a result, I think the priority queue will mostly behave LIFO, which means that a queued snapshot can be delayed arbitrarily long if a frequent enough stream of other snapshot requests keeps being pushed into the queue.
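
For illustration, a standalone Go demo (not CockroachDB code) of that behavior: with container/heap and identical priorities, the pop order is close to LIFO rather than FIFO, so the oldest queued item keeps losing to newer arrivals.

```go
// Self-contained demo showing why equal priorities starve old items:
// container/heap gives no FIFO guarantee, and with identical priorities
// the pop order is close to LIFO.
package main

import (
	"container/heap"
	"fmt"
)

type item struct {
	name     string
	priority float64
}

type pq []*item

func (q pq) Len() int            { return len(q) }
func (q pq) Less(i, j int) bool  { return q[i].priority > q[j].priority } // max-heap on priority
func (q pq) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *pq) Push(x interface{}) { *q = append(*q, x.(*item)) }
func (q *pq) Pop() interface{} {
	old := *q
	it := old[len(old)-1]
	*q = old[:len(old)-1]
	return it
}

func main() {
	q := &pq{}
	for _, name := range []string{"r1", "r2", "r3", "r4"} {
		heap.Push(q, &item{name: name, priority: 1}) // all equal, like the snapshot queue
	}
	for q.Len() > 0 {
		fmt.Print(heap.Pop(q).(*item).name, " ") // prints: r1 r4 r3 r2 -- not FIFO
	}
	fmt.Println()
}
```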

@tbg (Member, Author) commented Oct 30, 2018

I have to think about whether it's the right thing to do, but we could just use a priority of (morally) math.MaxInt64 - time.Now() to make the queue FIFO.
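
Roughly what that would look like as a sketch (illustrative only, not the actual change): derive the priority from the enqueue time so older requests sort first.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// fifoPriority is a sketch of the idea above (not the actual code): map an
// older enqueue time to a larger priority so a max-priority queue drains
// roughly in FIFO order. Note that float64 rounding can make nearby
// timestamps compare equal, which is one reason an explicit sequence
// number (see the next comment) is the cleaner fix.
func fifoPriority(enqueuedAt time.Time) float64 {
	return float64(math.MaxInt64 - enqueuedAt.UnixNano())
}

func main() {
	earlier := fifoPriority(time.Now())
	later := fifoPriority(time.Now().Add(time.Second))
	fmt.Println(earlier > later) // true: the request enqueued first sorts first
}
```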

@petermattis (Collaborator) commented:

I noticed the same weakness of the queue priorities in some other context. It seems worth fixing, as the starvation it can cause is unexpected. My thought was to add a sequence number of some sort so that priorities are never exactly equal.
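
A runnable sketch of that sequence-number idea (illustrative; not necessarily how #32053 below implements it): break priority ties with a monotonically increasing counter assigned at enqueue time, which restores FIFO order among equal-priority items.

```go
// Self-contained demo: with a sequence-number tiebreak, equal-priority
// items drain FIFO instead of near-LIFO.
package main

import (
	"container/heap"
	"fmt"
)

type item struct {
	name     string
	priority float64
	seq      uint64 // assigned at enqueue time from a per-queue counter
}

type pq []*item

func (q pq) Len() int      { return len(q) }
func (q pq) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q pq) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority > q[j].priority // higher priority first
	}
	return q[i].seq < q[j].seq // earlier arrival first on ties
}
func (q *pq) Push(x interface{}) { *q = append(*q, x.(*item)) }
func (q *pq) Pop() interface{} {
	old := *q
	it := old[len(old)-1]
	*q = old[:len(old)-1]
	return it
}

func main() {
	q := &pq{}
	var seq uint64
	for _, name := range []string{"r1", "r2", "r3", "r4"} {
		seq++
		heap.Push(q, &item{name: name, priority: 1, seq: seq})
	}
	for q.Len() > 0 {
		fmt.Print(heap.Pop(q).(*item).name, " ") // prints: r1 r2 r3 r4 -- FIFO restored
	}
	fmt.Println()
}
```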

tbg added a commit to tbg/cockroach that referenced this issue Oct 31, 2018
When priorities are equal, the priority queue naively performs close to
LIFO which is not what we want as it allows starvation of items backed
into the queue. This poses a big problem for the Raft snapshot queue
when put under pressure.

See
cockroachdb#31947 (comment).

Release note (bug fix): Avoid a stall in the processing of Raft
snapshots when many snapshots are requested at the same time.
craig bot pushed a commit that referenced this issue Oct 31, 2018
32053: storage: use FIFO order in queues on equal priorities r=petermattis a=tschottdorf

When priorities are equal, the priority queue naively performs close to
LIFO which is not what we want as it allows starvation of items backed into
the queue. This poses a big problem for the Raft snapshot queue when put
under pressure.

See
#31947 (comment).

Release note (bug fix): Avoid a stall in the processing of Raft snapshots
when many snapshots are requested at the same time.

Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
When priorities are equal, the priority queue naively performs close to
LIFO which is not what we want as it allows starvation of items backed
into the queue. This poses a big problem for the Raft snapshot queue
when put under pressure.

See
cockroachdb#31947 (comment).

Release note (bug fix): Avoid a stall in the processing of Raft
snapshots when many snapshots are requested at the same time.