-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RECEIVE] Make out of bounds a non-recoverable error #5131
Closed
wiardvanrij
wants to merge
3
commits into
thanos-io:main
from
wiardvanrij:feature/4831-out-of-bounds
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something here, but my understanding was that Prometheus treats as recoverable only network errors,
5xx
responses and, if configured to,429
responses (see this line and below).It therefore should not make difference if we're responding with bad request (
400
) or conflict (409
), since conflict is already unrecoverable, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Prometheus, 400 = unrecoverable. If that's your question? Also, if that's untrue, than I'm happy to get pointed to some resource. However, I've tested this and the response now becomes unrecoverable for Prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understood, only
5xx
codes are recoverable for Prometheus (see the link in my comment). Meaning if until now we were returning409
for out of bounds error, it should have already been working as expected, i.e. there should not be a real difference between the codes. As long as we return any4xx
code, it should be treated as unrecoverable (with the possible exception of429
).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I understand you and the code (missed the link) now. Funny thing is though, that I did not experience this behaviour. Only after my change I actually see the non-recoverable errors in the Prometheus logs. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More on this: prometheus/prometheus#5649 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking forward to seeing this fixed. it's causing us pain in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wiardvanrij I did a small test today: I spun up a VM where I skewed the clock on purpose (made it run in the 'past' compared to my host machine), I brought up an instance of Prometheus, configured to remote write to Thanos instance running on my host.
I first run it against Thanos build from latest
main
. The result - Prometheus informs me this is a non-recoverable error:Built from your branch:
I haven't noticed any unexpected behavior on either side (Thanos or Prometheus). Prometheus treated it (in both cases) simply as unrecoverable error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @matej-g really appreciate the testing - Then I need more user input. @Kampe did you have any chance to test it or could you provide more info? Because if we now look at the code, it already should be correct. I would agree that we sometimes see odd behaviour but then we really need to know why that is happening.
If I have some more time, i'll try to figure things out as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I responded in the issue ticket #4831, but there definitely seems to be two issues happening with 500 level errors and 409's. Unsure if they are related, but it seems we can on atleast a monthly cadence (sometimes weekly) can produce them.
We have not had the chance to roll the changes to any of our environments, we don't have much in the way of tooling setup to be able to roll non registry hosted third party tools to even developer environments so testing without a prebuilt container for us is pretty intensive, however when we have a container we can reference I can get that rolled in a matter of minutes.