store-gateway: add timeout to query gate wait #7777
Conversation
pkg/storegateway/bucket.go (Outdated)

    @@ -611,6 +611,11 @@ func (s *BucketStore) Series(req *storepb.SeriesRequest, srv storegatewaypb.Stor
        // but sometimes it can take minutes if the block isn't loaded and there is a surge in queries for unloaded blocks.
        done, err := s.limitConcurrentQueries(ctx, stats)
        if err != nil {
            if errors.Is(err, errGateTimeout) {
If a timeout occurs here, should we add an event to the trace span or log something?
It might also be useful to add a metric for the number of queries that were rejected due to the gate timing out - a high rate of this indicates either a performance problem or suggests scaling out store-gateways would be useful.
> It might also be useful to add a metric for the number of queries that were rejected due to the gate timing out - a high rate of this indicates either a performance problem or suggests scaling out store-gateways would be useful.
A metric has been added directly in dskit:
grafana/dskit#512
It doesn't say exactly why the gate didn't permit the request, though.
> If a timeout occurs here, should we add an event to the trace span or log something?
Good idea. I will add a span log.
> It doesn't say exactly why the gate didn't permit the request, though.

With the current setup a query can be "not_permitted" either because it was cancelled (`context.Canceled`) or because it timed out (`context.DeadlineExceeded`). I can add this distinction in the metrics. There are also the response metrics of the gRPC method calls, which would record `status_code="Canceled"` or `status_code="OK"`, but it might be clearer to have this separation in the dskit metrics.
pkg/storegateway/bucket.go (Outdated)

    @@ -611,6 +611,11 @@ func (s *BucketStore) Series(req *storepb.SeriesRequest, srv storegatewaypb.Stor
        // but sometimes it can take minutes if the block isn't loaded and there is a surge in queries for unloaded blocks.
        done, err := s.limitConcurrentQueries(ctx, stats)
        if err != nil {
            if errors.Is(err, errGateTimeout) {
                // If the gate timed out, then we behave in the same way as if the blocks aren't discovered yet.
                // The querier will try the blocks again on a different store-gateway replica.
How is this situation logged by queriers? (I'm wondering if we'll get a misleading or confusing message logged by queriers along the lines of "store-gateway hasn't discovered blocks", when the truth is that the store-gateway timed out the request.)
I have a similar concern here. Why not return the error? The querier will retry the request on other store-gateways in case of error, unless it's detected as a non-retriable error (provided we make sure the error we return here is detected as retryable).
> How is this situation logged by queriers?
This is the span log that the querier emits:

mimir/pkg/querier/blocks_store_queryable.go, line 596 in fce7b38:

    spanLog.DebugLog("msg", "couldn't query all blocks", "attempt", attempt, "missing blocks", strings.Join(convertULIDsToString(remainingBlocks), " "))
If all of the store-gateways cannot query a block, then this is the error that the querier generates:

mimir/pkg/querier/blocks_store_queryable.go, lines 600 to 601 in fce7b38:

    err = newStoreConsistencyCheckFailedError(remainingBlocks)
    level.Warn(util_log.WithContext(ctx, spanLog)).Log("msg", "failed consistency check after all attempts", "err", err)
> Why not return the error? The querier will retry the request on other store-gateways in case of error, unless it's detected as a non-retriable error (provided we make sure the error we return here is detected as retryable).
That's not the case unfortunately. If the store-gateway returns an error, then the querier will immediately abort the query:
mimir/pkg/querier/blocks_store_queryable.go, lines 569 to 574 in fce7b38:

    // Fetch series from stores. If an error occur we do not retry because retries
    // are only meant to cover missing blocks.
    queriedBlocks, err := queryF(clients, minT, maxT)
    if err != nil {
        return err
    }
I did consider returning a special error, but that seemed more complicated because of the response processing in `fetchSeriesFromStores`. If you still think it's a better idea to have a special error and handle it properly, then I can do the necessary refactoring; perhaps even adopt an error-handling pattern similar to the distributor-ingester ingestion error handling.
> That's not the case unfortunately. If the store-gateway returns an error, then the querier will immediately abort the query.
I'm not sure about that. I think the "magic" is done by `queryF`. For example, look at the `fetchSeriesFromStores()` implementation and its call to `shouldStopQueryFunc()`. I think the idea is that `queryF` only returns a fatal error, while swallowing errors that shouldn't stop the query execution.
You're right, there is `shouldStopQueryFunc`:

mimir/pkg/querier/blocks_store_queryable.go, lines 755 to 760 in fce7b38:

    if shouldStopQueryFunc(err) {
        return err
    }

    level.Warn(log).Log("msg", "failed to fetch series", "remote", c.RemoteAddress(), "err", err)
    return nil
Let me give it a try and add some error handling similar to the distributor-ingester errors.
This ended up being more complicated than I expected too. I opened a PR to prepare for this change: #7790. I could have copied some code instead, but wanted to move towards reducing code duplication.
Overall LGTM, thanks. I'm not convinced by not returning an error from the store-gateway: to me, it looks more natural to return an error.
    @@ -420,6 +421,27 @@ func (u *BucketStores) syncDirForUser(userID string) string {
        return filepath.Join(u.cfg.BucketStore.SyncDir, userID)
    }

    type timeoutGate struct {
[nit] Looks like something we could have done in dskit. Non-blocking.
I also realized this after submitting the PR. But dskit doesn't have `context.WithTimeoutCause`, and implementing that isn't as trivial as I expected. Let's keep it here, as the scope is very small anyway. Someone can move it to dskit when it's updated. I left a comment on `timeoutGate` in 7bf6b24.
Force-pushed from d57ac54 to 2ae4e2b.
Configuration documentation LGTM.
Force-pushed from 2ae4e2b to 13fad9e.
This PR adds a timeout to query gate waiting. The store-gateway returns an empty response. The querier treats the empty response as if the blocks weren't discovered by the store-gateway and tries them again on another store-gateway. Signed-off-by: Dimitar Dimitrov <[email protected]>
Force-pushed from 13fad9e to ed7012c.
Force-pushed from 816718d to df43e5f.
    "should not stop query on store-gateway instance limit": {
        err:      globalerror.NewErrorWithGRPCStatus(errors.New("instance limit"), codes.Aborted, &mimirpb.ErrorDetails{Cause: mimirpb.INSTANCE_LIMIT}),
        expected: false,
    },
    "should not stop query on store-gateway instance limit; shouldn't look at the gRPC code, only Mimir error cause": {
        err:      globalerror.NewErrorWithGRPCStatus(errors.New("instance limit"), codes.Internal, &mimirpb.ErrorDetails{Cause: mimirpb.INSTANCE_LIMIT}),
        expected: false,
    },
    "should not stop query on any other mimirpb error": {
        err:      globalerror.NewErrorWithGRPCStatus(errors.New("instance limit"), codes.Internal, &mimirpb.ErrorDetails{Cause: mimirpb.TOO_BUSY}),
        expected: false,
    },
    "should not stop query on any unknown error detail": {
        err: func() error {
            st, createErr := status.New(codes.Internal, "test").WithDetails(&hintspb.Block{Id: "123"})
            require.NoError(t, createErr)
            return st.Err()
        }(),
        expected: false,
    },
    "should not stop query on multiple error details": {
        err: func() error {
            st, createErr := status.New(codes.Internal, "test").WithDetails(&hintspb.Block{Id: "123"}, &mimirpb.ErrorDetails{Cause: mimirpb.INSTANCE_LIMIT})
            require.NoError(t, createErr)
            return st.Err()
        }(),
        expected: false,
    },
I didn't change the implementation, because that was the default behaviour anyway. But I decided to add tests because the current behaviour is implicit rather than explicit.
Very nice work, LGTM!
What this PR does
This PR adds a timeout to query gate waiting. The store-gateway returns an empty response. The querier treats the empty response as if the blocks weren't discovered by the store-gateway and tries them again on another store-gateway.
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`.
- `about-versioning.md` updated with experimental features.