-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/thanos/receive: reduce WAL replays at startup #1721
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
squat
force-pushed
the
reducewalreplays
branch
from
November 5, 2019 19:42
46097e9
to
6c82afa
Compare
Not sure if legitimately or not, but CI is killing the job because nothing new in logs for 10mins. |
Saw that. I am running locally and it’s happy. Can we re-kick CI? |
Re-ran a few times now, it seems we just have to extend the timeout if that's possible on circleci. |
Ill do some digging |
Every time thanos receive is started, it has to replay the WAL three times, namely: 1. open the TSDB; 2. close the TSDB; open the ReadOnly TSDB and Flush; and 3. open the TSDB These WAL replays can take a very long time if the WAL has lots of data. With the fix from thanos-io#1654, the third time will be instantaneous because the WAL will be empty. That still leaves two potentially long WAL replays. We can cut this down to just one long replay if we do the following operations instead: 1. with a closed TSDB, open the ReadOnly TSDB and Flush; and 2. open the TSDB Now, the second step will be a fast replay because the WAL is empty, leaving just one potentially expensive WAL replay. This commit eliminates explicit opening of the writable TSDB during startup, and instead opens it after flushing the read-only TSDB. Signed-off-by: Lucas Servén Marín <[email protected]>
squat
force-pushed
the
reducewalreplays
branch
from
November 6, 2019 16:59
6c82afa
to
cd30590
Compare
squat
added a commit
to squat/thanos
that referenced
this pull request
Nov 6, 2019
While debugging thanos-io#1721, I found that when thanos receive bails, there is a race in a select statement, where the non-returning branch may be chosen. This branch will deadlock if selected twice because the channel reader has already exited. The way to prevent this is by checking if we need to exit on every loop. Signed-off-by: Lucas Servén Marín <[email protected]>
brancz
pushed a commit
that referenced
this pull request
Nov 7, 2019
While debugging #1721, I found that when thanos receive bails, there is a race in a select statement, where the non-returning branch may be chosen. This branch will deadlock if selected twice because the channel reader has already exited. The way to prevent this is by checking if we need to exit on every loop. Signed-off-by: Lucas Servén Marín <[email protected]>
Nicely found! |
brancz
approved these changes
Nov 7, 2019
IKSIN
pushed a commit
to monitoring-tools/thanos
that referenced
this pull request
Nov 26, 2019
While debugging thanos-io#1721, I found that when thanos receive bails, there is a race in a select statement, where the non-returning branch may be chosen. This branch will deadlock if selected twice because the channel reader has already exited. The way to prevent this is by checking if we need to exit on every loop. Signed-off-by: Lucas Servén Marín <[email protected]> Signed-off-by: Aleksey Sin <[email protected]>
IKSIN
pushed a commit
to monitoring-tools/thanos
that referenced
this pull request
Nov 26, 2019
Every time thanos receive is started, it has to replay the WAL three times, namely: 1. open the TSDB; 2. close the TSDB; open the ReadOnly TSDB and Flush; and 3. open the TSDB These WAL replays can take a very long time if the WAL has lots of data. With the fix from thanos-io#1654, the third time will be instantaneous because the WAL will be empty. That still leaves two potentially long WAL replays. We can cut this down to just one long replay if we do the following operations instead: 1. with a closed TSDB, open the ReadOnly TSDB and Flush; and 2. open the TSDB Now, the second step will be a fast replay because the WAL is empty, leaving just one potentially expensive WAL replay. This commit eliminates explicit opening of the writable TSDB during startup, and instead opens it after flushing the read-only TSDB. Signed-off-by: Lucas Servén Marín <[email protected]> Signed-off-by: Aleksey Sin <[email protected]>
IKSIN
pushed a commit
to monitoring-tools/thanos
that referenced
this pull request
Nov 27, 2019
While debugging thanos-io#1721, I found that when thanos receive bails, there is a race in a select statement, where the non-returning branch may be chosen. This branch will deadlock if selected twice because the channel reader has already exited. The way to prevent this is by checking if we need to exit on every loop. Signed-off-by: Lucas Servén Marín <[email protected]> Signed-off-by: Aleksey Sin <[email protected]>
IKSIN
pushed a commit
to monitoring-tools/thanos
that referenced
this pull request
Nov 27, 2019
Every time thanos receive is started, it has to replay the WAL three times, namely: 1. open the TSDB; 2. close the TSDB; open the ReadOnly TSDB and Flush; and 3. open the TSDB These WAL replays can take a very long time if the WAL has lots of data. With the fix from thanos-io#1654, the third time will be instantaneous because the WAL will be empty. That still leaves two potentially long WAL replays. We can cut this down to just one long replay if we do the following operations instead: 1. with a closed TSDB, open the ReadOnly TSDB and Flush; and 2. open the TSDB Now, the second step will be a fast replay because the WAL is empty, leaving just one potentially expensive WAL replay. This commit eliminates explicit opening of the writable TSDB during startup, and instead opens it after flushing the read-only TSDB. Signed-off-by: Lucas Servén Marín <[email protected]> Signed-off-by: Aleksey Sin <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Every time thanos receive is started, it has to replay the WAL three
times, namely:
These WAL replays can take a very long time if the WAL has lots of data.
With the fix from #1654, the third time will be instantaneous because
the WAL will be empty. That still leaves two potentially long WAL
replays. We can cut this down to just one long replay if we do the
following operations instead:
Now, the second step will be a fast replay because the WAL is empty,
leaving just one potentially expensive WAL replay.
This commit eliminates explicit opening of the writable TSDB during
startup, and instead opens it after flushing the read-only TSDB.
Signed-off-by: Lucas Servén Marín [email protected]
cc @metalmatze @bwplotka @brancz