cmd/thanos/receive: reduce WAL replays at startup #1721

Merged
merged 1 commit into from
Nov 7, 2019

Conversation

squat
Member

@squat squat commented Nov 5, 2019

Every time thanos receive is started, it has to replay the WAL three
times, namely:

  1. open the TSDB;
  2. close the TSDB; open the ReadOnly TSDB and Flush; and
  3. open the TSDB

These WAL replays can take a very long time if the WAL has lots of data.
With the fix from #1654, the third time will be instantaneous because
the WAL will be empty. That still leaves two potentially long WAL
replays. We can cut this down to just one long replay if we do the
following operations instead:

  1. with a closed TSDB, open the ReadOnly TSDB and Flush; and
  2. open the TSDB

Now, the second step will be a fast replay because the WAL is empty,
leaving just one potentially expensive WAL replay.

This commit eliminates explicit opening of the writable TSDB during
startup, and instead opens it after flushing the read-only TSDB.
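
The replay counts described above can be checked with a toy model. This is only a sketch: `store`, `openWritable`, `flushReadOnly`, `startupBefore`, and `startupAfter` are hypothetical names invented for illustration, not the Prometheus TSDB or Thanos API, and the model assumes (as the description states for the fix from #1654) that a flush leaves the WAL empty.

```go
package main

import "fmt"

// store is a toy stand-in for a TSDB data directory. It only tracks how many
// samples remain in the WAL and how many replays happened. Hypothetical type.
type store struct {
	walSamples  int // samples a replay would have to read
	replays     int // total WAL replays
	longReplays int // replays that had to read a non-empty WAL
}

// replay models reading the whole WAL; it is only expensive when non-empty.
func (s *store) replay() {
	s.replays++
	if s.walSamples > 0 {
		s.longReplays++
	}
}

// openWritable models opening the writable TSDB, which replays the WAL.
func (s *store) openWritable() { s.replay() }

// flushReadOnly models opening the read-only TSDB and flushing the WAL to a
// block. Assumption: the WAL is empty afterwards.
func (s *store) flushReadOnly() {
	s.replay()
	s.walSamples = 0
}

// startupBefore follows the original three-step sequence.
func startupBefore(s *store) {
	s.openWritable()  // 1. open the TSDB
	s.flushReadOnly() // 2. close the TSDB; open the ReadOnly TSDB and Flush
	s.openWritable()  // 3. open the TSDB
}

// startupAfter follows the proposed two-step sequence.
func startupAfter(s *store) {
	s.flushReadOnly() // 1. with a closed TSDB, open the ReadOnly TSDB and Flush
	s.openWritable()  // 2. open the TSDB
}

func main() {
	before := &store{walSamples: 1000}
	startupBefore(before)

	after := &store{walSamples: 1000}
	startupAfter(after)

	fmt.Printf("before: %d replays, %d long\n", before.replays, before.longReplays)
	fmt.Printf("after:  %d replays, %d long\n", after.replays, after.longReplays)
}
```

In this model the original sequence performs three replays, two of them over a non-empty WAL, while the proposed sequence performs two replays with only the first being expensive.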

Signed-off-by: Lucas Servén Marín [email protected]

cc @metalmatze @bwplotka @brancz

@brancz
Member

brancz commented Nov 6, 2019

Not sure whether this is legitimate, but CI is killing the job because there has been no new log output for 10 minutes.

@squat
Member Author

squat commented Nov 6, 2019

Saw that. I am running it locally and it’s happy. Can we re-kick CI?

@brancz
Member

brancz commented Nov 6, 2019

Re-ran it a few times now; it seems we just have to extend the timeout, if that's possible on CircleCI.

@squat
Member Author

squat commented Nov 6, 2019

I'll do some digging.

squat added a commit to squat/thanos that referenced this pull request Nov 6, 2019
While debugging thanos-io#1721, I found that when thanos receive bails, there is
a race in a select statement, where the non-returning branch may be
chosen. This branch will deadlock if selected twice because the channel
reader has already exited. The way to prevent this is by checking if
we need to exit on every loop.
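
A minimal sketch of the "check if we need to exit on every loop" pattern follows. It is not the actual Thanos code: `runLoop`, `handledAfterCancel`, and the channel names are hypothetical. Go's `select` picks at random among ready cases, so a loop with a single `select` may keep choosing the non-returning branch after shutdown has been requested; a non-blocking check of the cancel channel at the top of each iteration removes that race.

```go
package main

import "fmt"

// runLoop is a hypothetical worker loop. It handles updates until cancel is
// closed. The first select is a non-blocking shutdown check that runs on
// every iteration, so a closed cancel channel always wins over pending
// updates.
func runLoop(cancel <-chan struct{}, updates <-chan int, handle func(int)) {
	for {
		// Exit check: only the cancel case can be ready here, otherwise
		// the default branch falls through immediately.
		select {
		case <-cancel:
			return
		default:
		}

		// Normal operation: block until there is work or a shutdown.
		select {
		case <-cancel:
			return
		case u := <-updates:
			handle(u)
		}
	}
}

// handledAfterCancel queues three updates, requests shutdown, and then
// reports how many updates the loop still handled.
func handledAfterCancel() int {
	cancel := make(chan struct{})
	updates := make(chan int, 3)
	for i := 0; i < 3; i++ {
		updates <- i
	}
	close(cancel)

	handled := 0
	runLoop(cancel, updates, func(int) { handled++ })
	return handled
}

func main() {
	fmt.Println("updates handled after cancel:", handledAfterCancel()) // prints 0
}
```

Without the first `select`, the number of updates handled after cancellation would vary from run to run; if handling an update involved sending to a reader that had already exited, the loop could block forever.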

Signed-off-by: Lucas Servén Marín <[email protected]>
brancz pushed a commit that referenced this pull request Nov 7, 2019
@brancz
Member

brancz commented Nov 7, 2019

Nicely found!

@brancz brancz merged commit 4b325bd into thanos-io:master Nov 7, 2019
IKSIN pushed a commit to monitoring-tools/thanos that referenced this pull request Nov 26, 2019
IKSIN pushed a commit to monitoring-tools/thanos that referenced this pull request Nov 26, 2019
IKSIN pushed a commit to monitoring-tools/thanos that referenced this pull request Nov 27, 2019
IKSIN pushed a commit to monitoring-tools/thanos that referenced this pull request Nov 27, 2019