storage: raft snapshot queue inefficiencies #31947

Closed
tbg opened this issue Oct 27, 2018 · 7 comments

Labels: A-kv-replication (Relating to Raft, consensus, and coordination.) · C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.) · X-stale

Comments

@tbg (Member) commented Oct 27, 2018

When Raft wants a snapshot (MsgSnap), we (on the leader) call .Add to put the range into the Raft snapshot queue. However, there are various code paths that can lead to the range not receiving a snapshot "right away" (purgatory, max size limitation). If that happens, the Raft group is stuck in an awkward state until the scanner picks up the range again and hopefully does give it a snapshot.

This setup seems risky, and we may want to address the above points so that Raft-initiated snapshot requests have no chance of being lost.
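
For illustration, here's a minimal standalone sketch (invented names, not the actual queue code) of the failure mode described above: a bounded, best-effort queue whose add silently drops the range when it can't take it, so nothing reports back to Raft until the scanner re-adds the range on its next pass.

```go
// Hypothetical sketch, not CockroachDB code: a best-effort enqueue that
// silently drops when full. In that case the follower that needs the
// snapshot simply waits for the next scanner pass.
package main

import "fmt"

type snapQueue struct {
	maxSize int
	items   []int64 // range IDs, simplified
}

// add mimics a best-effort enqueue: it drops the request when the queue is full.
func (q *snapQueue) add(rangeID int64) bool {
	if len(q.items) >= q.maxSize {
		return false // dropped; Raft is never told
	}
	q.items = append(q.items, rangeID)
	return true
}

func main() {
	q := &snapQueue{maxSize: 2}
	for _, r := range []int64{1, 2, 3} {
		if !q.add(r) {
			fmt.Printf("r%d: snapshot request dropped; waits for next scanner pass\n", r)
		}
	}
}
```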

@tbg tbg added the A-kv-replication Relating to Raft, consensus, and coordination. label Oct 27, 2018
@tbg tbg added this to the 2.2 milestone Oct 27, 2018
@tbg tbg self-assigned this Oct 27, 2018
@tbg tbg changed the title storage: raft snapshot queue may not report back to raft in timely manner storage: raft snapshot queue inefficiencies Oct 27, 2018
@tbg (Member, Author) commented Oct 27, 2018

Another issue I noticed in #31732 (comment) is that Raft snapshots block on a semaphore on the receiver, which adds a component that can introduce latency in non-obvious ways.

The raft log queue allows a concurrency of four, so you'd have to send to four backed-up nodes at the same time to get a complete pipeline stall, but it did happen in my (admittedly pretty busted) experiments.

Not sure there's really something we can easily improve there. I guess we could change the queue semantics to have at most four "sending" snapshots at any given point (i.e. allow more goroutines to get queued into semaphores on various nodes).

The reasonable place to invest the energy is probably to make sure that we're not getting into situations in which there are hundreds of outstanding Raft snapshots.
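
To make the stall mode concrete, here's a small standalone sketch (names and numbers are illustrative, not the real implementation): a fixed number of sender slots all end up parked behind a receiver-side semaphore, so the remaining snapshots can't even start sending.

```go
// Illustrative sketch, not CockroachDB code: each receiver admits one
// snapshot at a time via a semaphore; with all sender slots parked on
// busy receivers, the whole snapshot pipeline stops.
package main

import (
	"fmt"
	"sync"
	"time"
)

// receiver admits one snapshot at a time via a one-slot semaphore.
type receiver struct{ sem chan struct{} }

func (r *receiver) applySnapshot(d time.Duration) {
	r.sem <- struct{}{} // blocks while another snapshot is being applied
	defer func() { <-r.sem }()
	time.Sleep(d)
}

func main() {
	const senderSlots = 4 // hypothetical sender-side concurrency limit
	slots := make(chan struct{}, senderSlots)
	busy := &receiver{sem: make(chan struct{}, 1)}

	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			slots <- struct{}{} // occupy one of the four sender slots
			defer func() { <-slots }()
			// All senders target the same backed-up receiver, so the sends
			// serialize and the remaining work queues up behind them.
			busy.applySnapshot(50 * time.Millisecond)
			fmt.Printf("snapshot %d finished after %v\n", i, time.Since(start).Round(10*time.Millisecond))
		}(i)
	}
	wg.Wait()
}
```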

@petermattis (Collaborator) commented:

> However, there are various code paths that can lead to the range not receiving a snapshot "right away" (purgatory, max size limitation).

Does the Raft snapshot queue have a purgatory? (I know that all queues have a purgatory, but actually placing replicas in purgatory requires an error that implements purgatoryErrorMarker.)

> If that happens, the Raft group is stuck in an awkward state until the scanner picks up the range again and hopefully does give it a snapshot.

s/awkward/fragile/g.

Note that we call Replica.reportSnapshotStatus on both success and failure. That calls into Raft, which will likely request another snapshot, placing the replica back in the queue immediately.

@tbg (Member, Author) commented Oct 27, 2018

> Does the Raft snapshot queue have a purgatory?

I doubt it, but it shouldn't have room for this kind of error.

> Note that we call Replica.reportSnapshotStatus on both success and failure.

No, we would never call this if the queue were full (limit is 10k). We only report a status if we actually tried to process a snapshot.

I'm not aware that there's anything fishy going on in the real world. I'm just uneasy that the queue code isn't really made for a kind of queue in which offering a replica requires an action to be taken.

@petermattis (Collaborator) commented:

> We only report a status if we actually tried to process a snapshot.

Oh, that's a good point.

> I'm not aware that there's anything fishy going on in the real world. I'm just uneasy that the queue code isn't really made for a kind of queue in which offering a replica requires an action to be taken.

Now you've made me uneasy. Perhaps we should bump the max size for this queue. On the other hand, can we realistically process 10k Raft snapshots (the current max size of the Raft snapshot queue) within one scanner interval?

@petermattis petermattis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 28, 2018
@tbg (Member, Author) commented Oct 30, 2018

Another issue (that I think I've observed) is that the raft snapshot queue uses the same priority for everything. It doesn't use this code:

```go
if diff := now.GoTime().Sub(last.GoTime()); diff >= minInterval {
	priority := float64(1)
	// If there's a non-zero last processed timestamp, adjust the
	// priority by a multiple of how long it's been since the last
	// time this replica was processed.
	if last != (hlc.Timestamp{}) {
		priority = float64(diff.Nanoseconds()) / float64(minInterval.Nanoseconds())
	}
	return true, priority
}
```

As a result, I think the priority queue will mostly behave LIFO, which means that a queued snapshot can be delayed arbitrarily long if a frequent enough stream of other snapshot requests keeps being pushed into the queue.
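
For illustration, a standalone Go demo (not CockroachDB code) of that behavior: with container/heap and identical priorities, the pop order is close to LIFO rather than FIFO, so the oldest queued item keeps losing to newer arrivals.

```go
// Self-contained demo showing why equal priorities starve old items:
// container/heap gives no FIFO guarantee, and with identical priorities
// the pop order is close to LIFO.
package main

import (
	"container/heap"
	"fmt"
)

type item struct {
	name     string
	priority float64
}

type pq []*item

func (q pq) Len() int            { return len(q) }
func (q pq) Less(i, j int) bool  { return q[i].priority > q[j].priority } // max-heap on priority
func (q pq) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *pq) Push(x interface{}) { *q = append(*q, x.(*item)) }
func (q *pq) Pop() interface{} {
	old := *q
	it := old[len(old)-1]
	*q = old[:len(old)-1]
	return it
}

func main() {
	q := &pq{}
	for _, name := range []string{"r1", "r2", "r3", "r4"} {
		heap.Push(q, &item{name: name, priority: 1}) // all equal, like the snapshot queue
	}
	for q.Len() > 0 {
		fmt.Print(heap.Pop(q).(*item).name, " ") // prints: r1 r4 r3 r2 -- not FIFO
	}
	fmt.Println()
}
```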

@tbg (Member, Author) commented Oct 30, 2018

I have to think about whether it's the right thing to do, but we could just use a priority of (morally) math.MaxInt64 - time.Now() to make the queue FIFO.
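
Roughly what that would look like as a sketch (illustrative only, not the actual change): derive the priority from the enqueue time so older requests sort first.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// fifoPriority is a sketch of the idea above (not the actual code): map an
// older enqueue time to a larger priority so a max-priority queue drains
// roughly in FIFO order. Note that float64 rounding can make nearby
// timestamps compare equal, which is one reason an explicit sequence
// number (see the next comment) is the cleaner fix.
func fifoPriority(enqueuedAt time.Time) float64 {
	return float64(math.MaxInt64 - enqueuedAt.UnixNano())
}

func main() {
	earlier := fifoPriority(time.Now())
	later := fifoPriority(time.Now().Add(time.Second))
	fmt.Println(earlier > later) // true: the request enqueued first sorts first
}
```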

@petermattis (Collaborator) commented:

I noticed the same weakness of the queue priorities in some other context. It seems worth fixing, as the starvation it can cause is unexpected. My thought was to add a sequence number of some sort so that priorities are never exactly equal.
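
A runnable sketch of that sequence-number idea (illustrative; not necessarily how #32053 below implements it): break priority ties with a monotonically increasing counter assigned at enqueue time, which restores FIFO order among equal-priority items.

```go
// Self-contained demo: with a sequence-number tiebreak, equal-priority
// items drain FIFO instead of near-LIFO.
package main

import (
	"container/heap"
	"fmt"
)

type item struct {
	name     string
	priority float64
	seq      uint64 // assigned at enqueue time from a per-queue counter
}

type pq []*item

func (q pq) Len() int      { return len(q) }
func (q pq) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q pq) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority > q[j].priority // higher priority first
	}
	return q[i].seq < q[j].seq // earlier arrival first on ties
}
func (q *pq) Push(x interface{}) { *q = append(*q, x.(*item)) }
func (q *pq) Pop() interface{} {
	old := *q
	it := old[len(old)-1]
	*q = old[:len(old)-1]
	return it
}

func main() {
	q := &pq{}
	var seq uint64
	for _, name := range []string{"r1", "r2", "r3", "r4"} {
		seq++
		heap.Push(q, &item{name: name, priority: 1, seq: seq})
	}
	for q.Len() > 0 {
		fmt.Print(heap.Pop(q).(*item).name, " ") // prints: r1 r2 r3 r4 -- FIFO restored
	}
	fmt.Println()
}
```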

tbg added a commit to tbg/cockroach that referenced this issue Oct 31, 2018
When priorities are equal, the priority queue naively performs close to
LIFO which is not what we want as it allows starvation of items backed
into the queue. This poses a big problem for the Raft snapshot queue
when put under pressure.

See
cockroachdb#31947 (comment).

Release note (bug fix): Avoid a stall in the processing of Raft
snapshots when many snapshots are requested at the same time.
craig bot pushed a commit that referenced this issue Oct 31, 2018
32053: storage: use FIFO order in queues on equal priorities r=petermattis a=tschottdorf

When priorities are equal, the priority queue naively performs close to
LIFO which is not what we want as it allows starvation of items backed into
the queue. This poses a big problem for the Raft snapshot queue when put
under pressure.

See
#31947 (comment).

Release note (bug fix): Avoid a stall in the processing of Raft snapshots
when many snapshots are requested at the same time.

Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Nov 16, 2018
When priorities are equal, the priority queue naively performs close to
LIFO which is not what we want as it allows starvation of items backed
into the queue. This poses a big problem for the Raft snapshot queue
when put under pressure.

See
cockroachdb#31947 (comment).

Release note (bug fix): Avoid a stall in the processing of Raft
snapshots when many snapshots are requested at the same time.