Describe the bug
We had an incident that appeared as intermittent network flakiness for 1-2 hours. During the affected window, a number of the compactor cleanup loop's small object fetches from GCS failed with "context deadline exceeded" after one minute. Example:
ts=2024-04-10T18:33:29.26007212Z caller=blocks_cleaner.go:407 level=warn component=cleaner run_id=1712773933 task=clean_up_users user=redacted msg="failed blocks cleanup and maintenance" err="read bucket index: retry failed with context deadline exceeded; last error: Get \"https://storage.googleapis.com/bucket_redacted/tenant_id%2Fbucket-index.json.gz\": context deadline exceeded" duration=1m0.001110933s
I saw the mention of "retry failed" in the error and wondered why a retry would not paper over what looks like a transient network issue. It turns out that no retry actually occurred: the standard GCS client automatically retries certain classes of idempotent errors, but when the first request consumes the entire deadline before failing, there is no budget left and no retry is attempted. This was also reported in Google's tracker some time ago:
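Roughly, the failure mode looks like this. The sketch below is not the actual GCS client code, just a minimal stand-in for how context-aware retry loops generally behave: once the first attempt has burned through the caller's whole deadline, there is no context budget left, so the loop returns without ever retrying.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// doWithRetries is a hypothetical stand-in for how context-aware clients
// (the GCS client included) typically retry: they only retry while the
// caller's context still has budget left.
func doWithRetries(ctx context.Context, attempt func(context.Context) error) error {
	for {
		lastErr := attempt(ctx)
		if lastErr == nil {
			return nil
		}
		// If the caller's context is already done, there is nothing left to
		// retry with, so the client gives up after a single attempt.
		if ctx.Err() != nil {
			return fmt.Errorf("retry failed with %w; last error: %v", ctx.Err(), lastErr)
		}
		time.Sleep(100 * time.Millisecond) // backoff elided for brevity
	}
}

func main() {
	// A short overall deadline standing in for the 1-minute timeout in the log above.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	err := doWithRetries(ctx, func(ctx context.Context) error {
		<-ctx.Done() // simulate a fetch that stalls until the deadline expires
		return errors.New("context deadline exceeded")
	})
	fmt.Println(err) // "retry failed with context deadline exceeded; last error: ..."
}
```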
A tenant's cleanup loop will fail if any of its GCS operations times out. In our configuration (cleanup every 15 minutes, store-gateway only loads bucket indexes less than 1 hour old), a single tenant would have a partial query outage if four of their consecutive cleanup loops hit this kind of bad luck: four failures in a row means the bucket index goes more than an hour without being refreshed, so the store-gateway considers it too stale to use. Tenants with more blob store operations to perform at cleanup time are more exposed.
The store-gateway looks to be a victim of the same issue. When a store-gateway -> GCS fetch stalls in the same way (say, in the Series RPC), the store-gateway waits for the caller's context to expire (by default 2 minutes in the querier). What ends up happening is that the querier's request context is canceled, the querier cancels its fan-out requests to the other store-gateways, and an error is returned to the query-frontend, where a retry is finally performed. That's a lot of stuff riding on every single GCS fetch not timing out.
(I am assuming in this issue that exhausting a 1-minute timeout on a fetch of ~6KiB is a symptom of network instability and that a retry would help. Because networks are unreliable.)
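To make that cascade concrete, here is a rough, hypothetical sketch of a fan-out where one stalled call exhausting the shared deadline drags the sibling requests down with it. This is not the querier's actual code, just the general shape of the problem:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fanOut illustrates the cascade described above: one fan-out call failing
// (or its context expiring) cancels the shared context, which in turn
// cancels the sibling requests.
func fanOut(ctx context.Context, calls []func(context.Context) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	errCh := make(chan error, len(calls))
	for _, call := range calls {
		call := call
		go func() { errCh <- call(ctx) }()
	}

	var firstErr error
	for range calls {
		if err := <-errCh; err != nil && firstErr == nil {
			firstErr = err
			cancel() // cancel the remaining fan-out requests
		}
	}
	return firstErr
}

func main() {
	// Caller deadline standing in for the querier's 2-minute default,
	// shortened so the example terminates quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	stalled := func(ctx context.Context) error {
		<-ctx.Done() // a GCS fetch that never completes
		return errors.New("store-gateway: context deadline exceeded")
	}
	healthy := func(ctx context.Context) error {
		select {
		case <-time.After(10 * time.Second): // would have succeeded eventually
			return nil
		case <-ctx.Done():
			return fmt.Errorf("canceled alongside the stalled request: %w", ctx.Err())
		}
	}

	fmt.Println(fanOut(ctx, []func(context.Context) error{stalled, healthy}))
}
```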
My cursory read of the Thanos/minio S3 interactions is that they have the same behavior: certain HTTP errors are retried, but if the failure is the caller's context being canceled, of course no retry occurs.
Expected behavior
In the face of transient network blips, I expect individual blob storage operations to be retried.
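One way this could look in practice is a per-attempt timeout that is a fraction of the caller's overall deadline, so a stalled request fails fast and leaves budget for further attempts. A minimal sketch, assuming a hypothetical getWithRetries wrapper (the helper name, the per-attempt budget, and the retry policy are all illustrative, not existing behavior in this codebase as far as I can tell):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// getWithRetries gives each attempt its own short sub-deadline, so a single
// stalled request fails fast and leaves room in the caller's overall deadline
// for more attempts.
func getWithRetries(ctx context.Context, perAttempt time.Duration, fetch func(context.Context) error) error {
	var lastErr error
	for ctx.Err() == nil {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		lastErr = fetch(attemptCtx)
		cancel()
		if lastErr == nil {
			return nil
		}
		// A real implementation would also apply backoff and only retry
		// errors that look transient.
	}
	if lastErr == nil {
		lastErr = ctx.Err()
	}
	return lastErr
}

func main() {
	// Scaled-down stand-ins for, say, a 1-minute overall deadline with a
	// 10-second per-attempt cap.
	ctx, cancel := context.WithTimeout(context.Background(), 6*time.Second)
	defer cancel()

	attempts := 0
	err := getWithRetries(ctx, time.Second, func(ctx context.Context) error {
		attempts++
		if attempts < 3 {
			<-ctx.Done() // first two attempts stall until their sub-deadline
			return errors.New("transient stall")
		}
		return nil // third attempt succeeds with plenty of overall budget left
	})
	fmt.Println(attempts, err) // 3 <nil>
}
```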
Environment
Infrastructure: kubernetes