[RECEIVE] Make out of bounds a non-recoverable error #5131
Conversation
Signed-off-by: Wiard van Rij <[email protected]>
I appreciate the time you took to find this case!
```go
case errOutOfBounds:
	http.Error(w, err.Error(), http.StatusBadRequest)
```
Maybe I'm missing something here, but my understanding was that Prometheus treats as recoverable only network errors, 5xx responses and, if configured to, 429 responses (see this line and below). It therefore should not make a difference whether we respond with bad request (400) or conflict (409), since conflict is already unrecoverable, right?
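For reference, here is a minimal, self-contained sketch (not Prometheus' actual code; the function and parameter names are made up) of the classification described above: 5xx is retried, 429 only when the sender is configured to retry it, and every other status, 400 and 409 included, is dropped as non-recoverable.

```go
package main

import "fmt"

// recoverable is an illustrative stand-in for the remote-write sender's
// decision: retry on server-side errors, optionally on throttling, and
// give up on everything else.
func recoverable(statusCode int, retryOn429 bool) bool {
	switch {
	case statusCode/100 == 5: // 5xx: transient server-side problem, retry.
		return true
	case statusCode == 429: // Too Many Requests: retried only if configured.
		return retryOn429
	default: // Any other code (400, 409, ...): drop the samples for good.
		return false
	}
}

func main() {
	for _, code := range []int{400, 409, 429, 500, 503} {
		fmt.Printf("HTTP %d -> recoverable: %v\n", code, recoverable(code, false))
	}
}
```

Under that model, switching the out of bounds response from 409 to 400 should not change whether Prometheus retries, which is exactly the point being made here.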
For Prometheus, 400 = unrecoverable, if that's your question. Also, if that's untrue, then I'm happy to be pointed to some resource. However, I've tested this and the response now becomes unrecoverable for Prometheus.
As far as I understood, only 5xx codes are recoverable for Prometheus (see the link in my comment). Meaning that if until now we were returning 409 for the out of bounds error, it should have already been working as expected, i.e. there should not be a real difference between the codes. As long as we return any 4xx code, it should be treated as unrecoverable (with the possible exception of 429).
Right, I understand you and the code (I missed the link) now. The funny thing is, though, that I did not experience this behaviour. Only after my change do I actually see the non-recoverable errors in the Prometheus logs. 🤔
More on this: prometheus/prometheus#5649 (comment)
looking forward to seeing this fixed. it's causing us pain in production.
@wiardvanrij I did a small test today: I spun up a VM where I skewed the clock on purpose (made it run in the 'past' compared to my host machine), brought up an instance of Prometheus, and configured it to remote write to a Thanos instance running on my host.
I first ran it against Thanos built from the latest main. The result: Prometheus informs me this is a non-recoverable error:
ts=2022-02-12T12:26:39.089Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=9 exemplarCount=0 err="server returned HTTP status 409 Conflict: store locally for endpoint 127.0.0.1:10901: conflict"
Built from your branch:
ts=2022-02-12T12:34:32.266Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=63 exemplarCount=0 err="server returned HTTP status 400 Bad Request: store locally for endpoint 127.0.0.1:10901: out of bounds"
I haven't noticed any unexpected behavior on either side (Thanos or Prometheus). Prometheus treated it in both cases simply as an unrecoverable error.
Thanks @matej-g, I really appreciate the testing. Then I need more user input. @Kampe, did you have any chance to test it, or could you provide more info? Because if we look at the code now, it should already be correct. I would agree that we sometimes see odd behaviour, but then we really need to know why that is happening.
If I have some more time, I'll try to figure things out as well.
I responded in the issue ticket #4831, but there definitely seem to be two issues happening, with 500-level errors and 409s. Unsure if they are related, but it seems we can reproduce them on at least a monthly cadence (sometimes weekly).
We have not had the chance to roll the changes out to any of our environments. We don't have much tooling set up to roll non-registry-hosted third-party tools even to developer environments, so testing without a prebuilt container is pretty intensive for us. However, once we have a container we can reference, I can get it rolled out in a matter of minutes.
@wiardvanrij do you still need some testing for this feature? I think I experience this issue with a specific tenant, which causes the receiver to consume a lot more memory than usual during certain scenarios, to the point where it will eventually be OOM killed. Looking at the logs, the issue seems consistent with the description, as the Prometheus agents will continue to retry the same failed out of order requests over and over again.
Just wanted to add that I migrated to the new router->ingestor model and it solved most of my issues. I am still seeing a few out of bounds errors, but they are nowhere near where they were before.
Sorry for the lag.
Can you merge this, please? The message "unsupported value type" is very confusing and it took us a lot of time to figure out what was going on (and we still have a lot of 409s). Our solution: send fewer metrics...
No, this isn't the solution. Please follow up on the issue with input. |
Signed-off-by: Wiard van Rij [email protected]
Relates to: #4831
Prometheus' remote write receiver treats ErrOutOfOrder as non-recoverable. See prometheus/prometheus#5649 (comment).
Changes
Out of bounds would result in an HTTP 409, which causes Prometheus to retry, eventually causing an infinite loop that can only be solved by flushing the entire Prometheus data dir.
This change moves the out of bounds error out of ConflictErr into its own entity, which allows Prometheus to deal with it as non-recoverable (i.e. dropping it instead of retrying).
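A rough sketch of that idea, using assumed identifiers rather than Thanos' exact code: keeping out of bounds as its own sentinel error instead of folding it into the conflict error lets the HTTP handler answer 400 Bad Request, which Prometheus drops rather than retries.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sentinel errors; the names mirror the ones discussed in this PR but are
// assumptions for illustration only.
var (
	errConflict    = errors.New("conflict")
	errOutOfBounds = errors.New("out of bounds")
)

// statusFor maps an ingestion error to the HTTP status returned to the
// remote-write sender.
func statusFor(err error) int {
	switch {
	case errors.Is(err, errOutOfBounds):
		return http.StatusBadRequest // 400: non-recoverable, Prometheus drops the batch.
	case errors.Is(err, errConflict):
		return http.StatusConflict // 409: also non-recoverable for Prometheus.
	default:
		return http.StatusInternalServerError // 5xx: Prometheus will retry.
	}
}

func main() {
	fmt.Println(statusFor(errOutOfBounds)) // 400
	fmt.Println(statusFor(errConflict))    // 409
}
```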
Also made a small adjustment in the Debug.Log, as it could not print the sample, which caused a confusing debug message like sample="unsupported value type". Now it will print the sample value and timestamp instead.
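A hedged illustration of that logging tweak (the sample type, field names and logger setup here are assumptions, not the actual Thanos code): passing a struct straight to a logfmt logger is what produces the sample="unsupported value type" output, so the value and timestamp are logged as separate fields instead.

```go
package main

import (
	"os"

	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
)

// sample stands in for the remote-write sample type; the real type carries at
// least a value and a timestamp.
type sample struct {
	Value     float64
	Timestamp int64
}

func main() {
	logger := level.NewFilter(log.NewLogfmtLogger(os.Stderr), level.AllowDebug())
	s := sample{Value: 42, Timestamp: 1644668079089}

	// Logging "sample", s would make the logfmt encoder give up on the struct
	// and emit sample="unsupported value type". Logging the fields separately
	// keeps the message readable.
	level.Debug(logger).Log(
		"msg", "out of bounds metric",
		"value", s.Value,
		"timestamp", s.Timestamp,
	)
}
```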
Verification
I mean, I'll be honest: it was a pain in the a. to test and debug this. What I have checked is that Prometheus now handles this as non-recoverable and nothing will build up as pending anymore.
I have adjusted the test cases and added OutOfBounds to them.