-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: raft snapshot queue inefficiencies #31947
Comments
Another issue that I've noticed in #31732 (comment) is that Raft snapshots block on a semaphore on the receiver, which adds a component that can add latency in non-obvious ways. The raft log queue allows a concurrency of four, so you'd have to send to four backed-up nodes at the same time to get a complete pipeline stall, but it did happen in my (admittedly pretty busted) experiments. Not sure there's really something we can easily improve there. I guess we could change the queue semantics to have at most four "sending" snapshots at any given point (i.e. allow more goroutines to get queued into semaphores on various nodes). The reasonable place to invest the energy is probably to make sure that we're not getting into situations in which there are hundreds of outstanding Raft snapshots. |
Does the Raft snapshot queue have a purgatory? (I know that all queues have purgatory, but actually placing replicas in purgatory requires an error that implements
Note that we call |
I doubt it, but it shouldn't have room for this kind of error.
No, we would never call this if the queue were full (limit is 10k). We only report a status if we actually tried to process a snapshot. I'm not aware that there's anything fishy going on in the real world. I'm just uneasy that the queue code isn't really made for a kind of queue in which offering a replica requires an action to be taken. |
Oh, that's a good point.
Now you've made me uneasy. Perhaps we should bump the max size for this queue. On the other hand, can we realistically process 10k Raft snapshots (10k is the max size for the Raft snapshot queue) in the scanner interval? |
Another issue (that I think I've observed) is that the raft snapshot queue uses the same priority for everything. It doesn't use this code cockroach/pkg/storage/queue.go Lines 148 to 157 in 3cc9d57
As a result, I think that the priority queue will mostly be LIFO, which means that a snapshot that's queued up can be delayed arbitrarily if there's a frequent enough stream of other snapshot requests being pushed into the queue. |
Have to think about whether it's the right thing to do, but we could just use a priority of (morally) |
I noticed that same weakness of the queue priorities in some other context. Seems worth fixing as the starvation it can cause is unexpected. My thought was to add a sequence number of some sort so that priorities are never exactly equal. |
When priorities are equal, the priority queue naively performs close to LIFO which is not what we want as it allows starvation of items backed into the queue. This poses a big problem for the Raft snapshot queue when put under pressure. See cockroachdb#31947 (comment). Release note (bug fix): Avoid a stall in the processing of Raft snapshots when many snapshots are requested at the same time.
32053: storage: use FIFO order in queues on equal priorities r=petermattis a=tschottdorf When priorities are equal, the priority queue naively performs close to LIFO which is not what we want as it allows starvation of items backed into the queue. This poses a big problem for the Raft snapshot queue when put under pressure. See #31947 (comment). Release note (bug fix): Avoid a stall in the processing of Raft snapshots when many snapshots are requested at the same time. Co-authored-by: Tobias Schottdorf <[email protected]>
When priorities are equal, the priority queue naively performs close to LIFO which is not what we want as it allows starvation of items backed into the queue. This poses a big problem for the Raft snapshot queue when put under pressure. See cockroachdb#31947 (comment). Release note (bug fix): Avoid a stall in the processing of Raft snapshots when many snapshots are requested at the same time.
When Raft wants a snapshot (
MsgSnap
), we (on the leader).Add
the range to the raft snapshot queue. However, there are various code paths that can lead to the range not receiving a snapshot "right away" (purgatory, max size limitation). If that happens, the Raft group is stuck in an awkward state until the scanner picks up the range again and hopefully does give it a snapshot.This setup seems risky and we may want to address the above points so that the Raft-initiated snapshots have no chance of being lost.
The text was updated successfully, but these errors were encountered: