[RECEIVE] Make out of bounds a non-recoverable error #5131

Closed
13 changes: 10 additions & 3 deletions pkg/receive/handler.go
@@ -59,8 +59,8 @@ const (

var (
    // errConflict is returned whenever an operation fails due to any conflict-type error.
-   errConflict = errors.New("conflict")
-
+   errConflict    = errors.New("conflict")
+   errOutOfBounds = storage.ErrOutOfBounds
    errBadReplica  = errors.New("request replica exceeds receiver replication factor")
    errNotReady    = errors.New("target not ready")
    errUnavailable = errors.New("target not available")
@@ -360,6 +360,8 @@ func (h *Handler) receiveHTTP(w http.ResponseWriter, r *http.Request) {
        http.Error(w, err.Error(), http.StatusServiceUnavailable)
    case errConflict:
        http.Error(w, err.Error(), http.StatusConflict)
+   case errOutOfBounds:
+       http.Error(w, err.Error(), http.StatusBadRequest)
Comment on lines +363 to +364
Collaborator

Maybe I'm missing something here, but my understanding was that Prometheus treats only network errors, 5xx responses and, if configured to do so, 429 responses as recoverable (see this line and below).

It therefore should not make a difference whether we respond with bad request (400) or conflict (409), since conflict is already unrecoverable, right?
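
For reference, that rule boils down to roughly the following classification (an illustrative sketch only, not the actual Prometheus source; the isRecoverable helper and its signature are invented for this example):

package main

import (
    "fmt"
    "net/http"
)

// isRecoverable mirrors the rule described above: 5xx responses are retried,
// 429 is retried only when retry-on-429 is enabled, and every other non-2xx
// status (400, 409, ...) is treated as non-recoverable.
func isRecoverable(statusCode int, retryOn429 bool) bool {
    if statusCode/100 == 5 {
        return true
    }
    if statusCode == http.StatusTooManyRequests {
        return retryOn429
    }
    return false
}

func main() {
    for _, code := range []int{http.StatusBadRequest, http.StatusConflict, http.StatusTooManyRequests, http.StatusInternalServerError} {
        fmt.Println(code, "recoverable:", isRecoverable(code, false))
    }
}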

Member Author

For Prometheus, 400 = unrecoverable, if that's your question. If that's untrue, I'm happy to be pointed to some resource. However, I've tested this and the response now becomes unrecoverable for Prometheus.

Collaborator

As far as I understood, only 5xx codes are recoverable for Prometheus (see the link in my comment). Meaning that if, until now, we were returning 409 for the out-of-bounds error, it should already have been working as expected, i.e. there should not be a real difference between the codes. As long as we return any 4xx code, it should be treated as unrecoverable (with the possible exception of 429).

Member Author

Right, I understand you and the code (I missed the link) now. The funny thing is, though, that I did not experience this behaviour. Only after my change do I actually see the non-recoverable errors in the Prometheus logs. 🤔


Looking forward to seeing this fixed. It's causing us pain in production.

Collaborator

@wiardvanrij I did a small test today: I spun up a VM where I skewed the clock on purpose (made it run in the 'past' compared to my host machine), brought up an instance of Prometheus, and configured it to remote write to a Thanos instance running on my host.

I first ran it against Thanos built from latest main. The result: Prometheus informs me this is a non-recoverable error:

ts=2022-02-12T12:26:39.089Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=9 exemplarCount=0 err="server returned HTTP status 409 Conflict: store locally for endpoint 127.0.0.1:10901: conflict"

Built from your branch:

ts=2022-02-12T12:34:32.266Z caller=dedupe.go:112 component=remote level=error remote_name=6c3063 url=http://192.168.56.1:19291/api/v1/receive msg="non-recoverable error" count=63 exemplarCount=0 err="server returned HTTP status 400 Bad Request: store locally for endpoint 127.0.0.1:10901: out of bounds"

I haven't noticed any unexpected behavior on either side (Thanos or Prometheus). Prometheus treated it (in both cases) simply as an unrecoverable error.
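
For context on why the skewed clock reproduces this: the TSDB head rejects appends whose timestamp falls below its current lower bound, which surfaces to Receive as storage.ErrOutOfBounds. A minimal sketch of that check (appendCheck and minValidTime are hypothetical stand-ins, not the real TSDB API):

package main

import (
    "fmt"

    "github.com/prometheus/prometheus/storage"
)

// appendCheck is a stand-in for the TSDB head's bounds check: a sample whose
// timestamp is older than the head's lower bound is rejected with
// ErrOutOfBounds, which is what Receive sees from its local TSDB when the
// remote-writing Prometheus runs with a clock skewed into the past.
func appendCheck(sampleTs, minValidTime int64) error {
    if sampleTs < minValidTime {
        return storage.ErrOutOfBounds
    }
    return nil
}

func main() {
    // Sample stamped well "in the past" relative to the head's bound.
    fmt.Println(appendCheck(1644665000000, 1644668600000)) // prints: out of bounds
}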

Member Author

Thanks @matej-g, really appreciate the testing. Then I need more user input. @Kampe, did you have any chance to test it, or could you provide more info? Because if we now look at the code, it should already be correct. I would agree that we sometimes see odd behaviour, but then we really need to know why that is happening.

If I have some more time, I'll try to figure things out as well.

@Kampe Feb 14, 2022

I responded in the issue ticket #4831, but there definitely seem to be two issues happening, with 500-level errors and 409s. Unsure if they are related, but it seems we can reproduce them on at least a monthly cadence (sometimes weekly).

We have not had the chance to roll the changes out to any of our environments. We don't have much tooling set up to roll non-registry-hosted third-party tools even to developer environments, so testing without a prebuilt container is pretty intensive for us. However, once we have a container we can reference, I can get it rolled out in a matter of minutes.

    case errBadReplica:
        http.Error(w, err.Error(), http.StatusBadRequest)
    default:
@@ -679,7 +681,6 @@ func isConflict(err error) bool {
    return err == errConflict ||
        err == storage.ErrDuplicateSampleForTimestamp ||
        err == storage.ErrOutOfOrderSample ||
-       err == storage.ErrOutOfBounds ||
        status.Code(err) == codes.AlreadyExists
}

@@ -696,6 +697,11 @@ func isUnavailable(err error) bool {
        status.Code(err) == codes.Unavailable
}

+func isOutOfBounds(err error) bool {
+   return err == storage.ErrOutOfBounds ||
+       status.Code(err) == codes.OutOfRange
+}

// retryState encapsulates the number of request attempt made against a peer and,
// next allowed time for the next attempt.
type retryState struct {
@@ -738,6 +744,7 @@ func determineWriteErrorCause(err error, threshold int) error {
        {err: errConflict, cause: isConflict},
        {err: errNotReady, cause: isNotReady},
        {err: errUnavailable, cause: isUnavailable},
+       {err: errOutOfBounds, cause: isOutOfBounds},
    }
    for _, exp := range expErrs {
        exp.count = 0
28 changes: 26 additions & 2 deletions pkg/receive/handler_test.go
@@ -331,7 +331,8 @@ func newTestHandlerHashring(appendables []*fakeAppendable, replicationFactor uin

func TestReceiveQuorum(t *testing.T) {
    appenderErrFn := func() error { return errors.New("failed to get appender") }
-   conflictErrFn := func() error { return storage.ErrOutOfBounds }
+   conflictErrFn := func() error { return storage.ErrOutOfOrderSample }
+   outOfBoundsErrFn := func() error { return storage.ErrOutOfBounds }
    commitErrFn := func() error { return errors.New("failed to commit") }
    wreq1 := &prompb.WriteRequest{
        Timeseries: []prompb.TimeSeries{
@@ -399,6 +400,17 @@ func TestReceiveQuorum(t *testing.T) {
                },
            },
        },
+       {
+           name: "size 1 outofbound",
+           status: http.StatusBadRequest,
+           replicationFactor: 1,
+           wreq: wreq1,
+           appendables: []*fakeAppendable{
+               {
+                   appender: newFakeAppender(outOfBoundsErrFn, nil, nil),
+               },
+           },
+       },
        {
            name: "size 2 success",
            status: http.StatusOK,
@@ -667,7 +679,8 @@ func TestReceiveQuorum(t *testing.T) {

func TestReceiveWithConsistencyDelay(t *testing.T) {
    appenderErrFn := func() error { return errors.New("failed to get appender") }
-   conflictErrFn := func() error { return storage.ErrOutOfBounds }
+   conflictErrFn := func() error { return storage.ErrOutOfOrderSample }
+   outOfBoundsErrFn := func() error { return storage.ErrOutOfBounds }
    commitErrFn := func() error { return errors.New("failed to commit") }
    wreq1 := &prompb.WriteRequest{
        Timeseries: []prompb.TimeSeries{
@@ -735,6 +748,17 @@ func TestReceiveWithConsistencyDelay(t *testing.T) {
                },
            },
        },
+       {
+           name: "size 1 outofbound",
+           status: http.StatusBadRequest,
+           replicationFactor: 1,
+           wreq: wreq1,
+           appendables: []*fakeAppendable{
+               {
+                   appender: newFakeAppender(outOfBoundsErrFn, nil, nil),
+               },
+           },
+       },
        {
            name: "size 2 success",
            status: http.StatusOK,
6 changes: 3 additions & 3 deletions pkg/receive/writer.go
@@ -85,13 +85,13 @@ func (r *Writer) Write(ctx context.Context, tenantID string, wreq *prompb.WriteR
        switch err {
        case storage.ErrOutOfOrderSample:
            numOutOfOrder++
-           level.Debug(r.logger).Log("msg", "Out of order sample", "lset", lset, "sample", s)
+           level.Debug(r.logger).Log("msg", "Out of order sample", "lset", lset, "sample value", s.Value, "sample timestamp", s.Timestamp)
        case storage.ErrDuplicateSampleForTimestamp:
            numDuplicates++
-           level.Debug(r.logger).Log("msg", "Duplicate sample for timestamp", "lset", lset, "sample", s)
+           level.Debug(r.logger).Log("msg", "Duplicate sample for timestamp", "lset", lset, "sample value", s.Value, "sample timestamp", s.Timestamp)
        case storage.ErrOutOfBounds:
            numOutOfBounds++
-           level.Debug(r.logger).Log("msg", "Out of bounds metric", "lset", lset, "sample", s)
+           level.Debug(r.logger).Log("msg", "Out of bounds metric", "lset", lset, "sample value", s.Value, "sample timestamp", s.Timestamp)
        }
    }

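A side note on the writer.go change above: logging the whole prompb.Sample as a single field relies on the struct's own formatting, while logging value and timestamp separately yields flat, parseable key-value pairs. A minimal sketch of the difference, using a plain go-kit JSON logger (the logger setup here is illustrative, not Thanos' actual wiring):

package main

import (
    "os"

    "github.com/go-kit/log"
    "github.com/go-kit/log/level"
    "github.com/prometheus/prometheus/prompb"
)

func main() {
    // JSON logger so multi-word keys such as "sample value" encode cleanly.
    logger := log.NewJSONLogger(os.Stderr)
    s := prompb.Sample{Value: 1.5, Timestamp: 1644668799000}

    // Old form: the whole sample struct is logged as one field.
    level.Debug(logger).Log("msg", "Out of bounds metric", "sample", s)

    // New form: value and timestamp are separate fields, easier to read and query.
    level.Debug(logger).Log("msg", "Out of bounds metric", "sample value", s.Value, "sample timestamp", s.Timestamp)
}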