cmd/thanos/receive: reduce WAL replays at startup #1721

squat · 2019-11-05T19:15:17Z

Every time thanos receive is started, it has to replay the WAL three
times, namely:

1. open the TSDB;
2. close the TSDB; open the ReadOnly TSDB and Flush; and
3. open the TSDB

These WAL replays can take a very long time if the WAL has lots of data.
With the fix from #1654, the third time will be instantaneous because
the WAL will be empty. That still leaves two potentially long WAL
replays. We can cut this down to just one long replay if we do the
following operations instead:

1. with a closed TSDB, open the ReadOnly TSDB and Flush; and
2. open the TSDB

Now, the second step will be a fast replay because the WAL is empty,
leaving just one potentially expensive WAL replay.

This commit eliminates explicit opening of the writable TSDB during
startup, and instead opens it after flushing the read-only TSDB.
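The two startup sequences described above can be compared with a toy model. This is only a sketch: the `store` type and its methods are hypothetical stand-ins that count replays of a non-empty WAL, not the Prometheus TSDB API or the code in this PR. It assumes, as stated above, that the flush leaves the WAL empty.

```go
package main

import "fmt"

// store is a hypothetical stand-in for a TSDB directory. It only tracks
// how much data is in the WAL and how many replays had to read it.
type store struct {
	walSamples  int // unflushed samples still in the WAL
	longReplays int // replays that found a non-empty WAL
}

// replay models the WAL replay that happens on every open.
func (s *store) replay() {
	if s.walSamples > 0 {
		s.longReplays++
	}
}

// openWritable models opening the writable TSDB.
func (s *store) openWritable() { s.replay() }

// flushReadOnly models opening the read-only TSDB and flushing.
// Assumption: after the flush the WAL is empty.
func (s *store) flushReadOnly() {
	s.replay()
	s.walSamples = 0
}

// oldStartup follows the original three-step sequence.
func oldStartup(s *store) int {
	s.openWritable()  // 1. open the TSDB
	s.flushReadOnly() // 2. close it, then open read-only and flush
	s.openWritable()  // 3. open the TSDB again
	return s.longReplays
}

// newStartup follows the proposed two-step sequence.
func newStartup(s *store) int {
	s.flushReadOnly() // 1. open read-only and flush
	s.openWritable()  // 2. open the TSDB
	return s.longReplays
}

func main() {
	fmt.Println("old long replays:", oldStartup(&store{walSamples: 1000}))
	fmt.Println("new long replays:", newStartup(&store{walSamples: 1000}))
}
```

In this model the old sequence performs two long replays and the new one performs one, which matches the counts given in the description.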

Signed-off-by: Lucas Servén Marín [email protected]

cc @metalmatze @bwplotka @brancz

brancz · 2019-11-06T06:40:51Z

Not sure whether it's legitimate or not, but CI is killing the job because there has been nothing new in the logs for 10 minutes.

squat · 2019-11-06T07:54:25Z

Saw that. I am running locally and it’s happy. Can we re-kick CI?

brancz · 2019-11-06T08:24:37Z

Re-ran a few times now. It seems we just have to extend the timeout, if that's possible on CircleCI.

squat · 2019-11-06T09:13:33Z

I'll do some digging.

While debugging thanos-io#1721, I found that when thanos receive bails, there is a race in a select statement, where the non-returning branch may be chosen. This branch will deadlock if selected twice because the channel reader has already exited. The way to prevent this is by checking if we need to exit on every loop.

Signed-off-by: Lucas Servén Marín <[email protected]>
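The pattern described in this commit message can be sketched as follows. This is an illustration under assumptions, not the actual Thanos code: `run`, `updates`, `cancel`, and `out` are hypothetical names. When several cases of a Go `select` are ready, one is chosen pseudo-randomly, so a loop that has already been told to exit may still pick the branch that sends to a reader that is gone. Polling the exit channel on its own at the top of every iteration avoids that whenever the exit signal is already set.

```go
package main

import "fmt"

// run forwards updates to out until cancel is closed and returns the
// number of updates it forwarded. All names here are hypothetical.
func run(updates, cancel <-chan struct{}, out chan<- struct{}) int {
	handled := 0
	for {
		// Check for exit first on every loop. Without this, the select
		// below could pick the updates branch even though cancel is
		// already closed, and then block on a send nobody receives.
		select {
		case <-cancel:
			return handled
		default:
		}

		select {
		case <-cancel:
			return handled
		case <-updates:
			out <- struct{}{} // blocks forever if the reader has exited
			handled++
		}
	}
}

func main() {
	updates := make(chan struct{})
	close(updates) // a closed channel is always ready to receive
	cancel := make(chan struct{})
	close(cancel)               // exit was already requested
	out := make(chan struct{}) // unbuffered, and nobody is reading

	fmt.Println("handled before exit:", run(updates, cancel, out))
}
```

With both `updates` and `cancel` ready and no reader on `out`, the early check makes `run` return immediately instead of depending on which `select` case happens to be chosen.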

brancz · 2019-11-07T10:56:59Z

Nicely found!

squat force-pushed the reducewalreplays branch from 46097e9 to 6c82afa on November 5, 2019 19:42

squat force-pushed the reducewalreplays branch from 6c82afa to cd30590 on November 6, 2019 16:59

squat mentioned this pull request Nov 6, 2019

cmd/thanos/receive: avoid deadlock #1727

Merged

brancz approved these changes Nov 7, 2019

brancz merged commit 4b325bd into thanos-io:master Nov 7, 2019
