Retry GCS Resumable Upload on Error 410 #45963
Conversation
A resumable upload session can fail with a 410 error and should be retried in that case. I added two retries, resetting the given `InputStream` as the retry mechanism, since the same approach is already used by the AWS S3 SDK and relied upon by the S3 repository implementation. Related GCS documentation: https://cloud.google.com/storage/docs/json_api/v1/status-codes#410_Gone
Pinging @elastic/es-distributed
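For readers of this thread, a minimal sketch of the retry shape described above (names like `storage` and the helper structure are illustrative assumptions, not the exact PR code; three total attempts reflect the two retries mentioned):

```java
import static java.net.HttpURLConnection.HTTP_GONE;

import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.Channels;

class ResumableUploadRetrySketch {
    // Sketch only: retry a resumable upload on 410 Gone by rewinding the stream.
    static void writeBlobResumable(Storage storage, BlobInfo blobInfo, InputStream in) throws IOException {
        in.mark(Integer.MAX_VALUE);                      // remember the start so reset() can rewind (needs a mark-supporting stream)
        StorageException gone = null;
        for (int attempt = 0; attempt < 3; attempt++) {  // one initial try plus two retries
            try (WriteChannel channel = storage.writer(blobInfo);
                 OutputStream out = Channels.newOutputStream(channel)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                return;                                  // upload succeeded
            } catch (StorageException se) {
                if (se.getCode() != HTTP_GONE) {
                    throw se;                            // only 410 Gone marks a broken resumable session
                }
                if (gone == null) {
                    gone = se;                           // keep the first 410 as the primary failure
                } else {
                    gone.addSuppressed(se);              // attach later 410s as suppressed exceptions
                }
                in.reset();                              // rewind the stream and start a fresh session
            }
        }
        throw gone;                                      // all three attempts hit 410
    }
}
```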
```java
}
assert storageException != null;
throw new IOException(storageException);
```
why wrap this here in an `IOException` whereas we have `throw se;` in line 275? As far as I can see we rethrow `StorageException` everywhere here, so I would just do `throw storageException` for now.
Sure, I just liked it better when I could see where it's actually thrown, but here it doesn't really matter, so I killed the rethrow.
```java
if (errorCode == HTTP_GONE) {
    logger.warn("Retrying broken resumable upload session for blob {}", blobInfo);
    storageException = ExceptionsHelper.useOrSuppress(storageException, se);
    inputStream.reset();
```
where is `inputStream.mark()` called?
Nowhere, which was fine for all the streams we seem to be using. Added an explicit call now though :)
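For context: `reset()` only rewinds to the position recorded by a prior `mark()` call, and only on streams where `markSupported()` is true. A hypothetical guard along these lines (not the exact PR change) makes that precondition explicit:

```java
// Fail fast on streams that cannot rewind, then mark the start so
// reset() has a position to return to after a 410 retry.
if (inputStream.markSupported() == false) {
    throw new IllegalArgumentException("resumable upload retries require a mark-supporting InputStream");
}
inputStream.mark(Integer.MAX_VALUE); // large read limit: allow rewinding regardless of how much was uploaded
```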
```diff
@@ -256,6 +266,11 @@ public void setChunkSize(int chunkSize) {

+        @Override
+        public int write(ByteBuffer src) throws IOException {
+            // Only fail a blob once on a 410 error since the error is so unlikely in practice
+            if (simulated410s.add(blobInfo) && random.nextBoolean()) {
```
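The diff is truncated above; for illustration, the branch body could raise the simulated error roughly like this (`delegate` is an assumed wrapped channel, and this is a sketch rather than the actual `MockStorage` code; the `StorageException(int, String)` constructor is the SDK's public one):

```java
@Override
public int write(ByteBuffer src) throws IOException {
    // Only fail a blob once on a 410 error since the error is so unlikely in practice
    if (simulated410s.add(blobInfo) && random.nextBoolean()) {
        throw new StorageException(HTTP_GONE, "simulated broken resumable upload session");
    }
    return delegate.write(src); // otherwise behave like a normal write channel
}
```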
can we test this with @tlrx's recent HTTP mock of GCS?
That's the goal, but for now repository integration tests do not support GCS resumable uploads (as we don't generate large enough segment files).
I can add a GCS-like retries test at the blob store level like I did for S3 (see `S3BlobContainerRetriesTests`) and test this retry logic with it; a rough sketch of the shape follows below.
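A rough shape such a blob-store-level test could take with the JDK's built-in HTTP server. The context path, response body, and blob name are assumptions for illustration, not the actual test code:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

// Fail the first resumable-upload request with 410 Gone, then let the retry succeed.
HttpServer server = HttpServer.create(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), 0);
AtomicBoolean failedOnce = new AtomicBoolean();
server.createContext("/upload/storage/v1/b/bucket/o", exchange -> {
    try {
        if (failedOnce.compareAndSet(false, true)) {
            exchange.sendResponseHeaders(HttpURLConnection.HTTP_GONE, -1); // simulate the broken session
        } else {
            byte[] body = "{\"bucket\":\"bucket\",\"name\":\"write_large_blob\"}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
        }
    } finally {
        exchange.close();
    }
});
server.start();
// ... point the GCS client at server.getAddress(), write a blob large enough to
// trigger a resumable upload, and assert that the write eventually succeeds ...
```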
Left this for @tlrx now if that's ok :) No need for two people to mess with the API mock code concurrently I guess.
It took some time but I finally opened #46968, which adds some random 410-Gone errors when testing resumable uploads (see the `testWriteLargeBlob` test).
Sorry for the delay here @ywelsch, all points addressed now :)
Jenkins run elasticsearch-ci/1 (dep download failed)
LGTM
Thanks @ywelsch !
Similarly to what has been done for S3 in #45383, this commit adds unit tests that verify the behavior of the SDK client and blob container implementation for Google Storage when the remote service returns errors. The main purpose was to add an extra test for the specific retry logic for 410-Gone errors added in #45963. Relates #45963
A resumable upload session can fail with a 410 error and should be retried in that case. I added two retries, resetting the given `InputStream` as the retry mechanism, since the same approach is already used by the AWS S3 SDK and relied upon by the S3 repository implementation.
Related GCS documentation:
https://cloud.google.com/storage/docs/json_api/v1/status-codes#410_Gone
Note:
The retry count is intentionally not made configurable and is set at the small value of `2`. According to our interaction with Google support, running into a 410 error as the result of a broken resumable upload session is highly unlikely (empirically, the chance of hitting it in any given resumable upload session has been less than 1:100k in our testing so far).
=> Retrying more than twice (you could even argue that retrying once is enough) should never be needed.
=> Retrying more often would likely just mask some other issue and waste resources.
=> There is no need to make this configurable, since there are no Minio-like drop-in replacements for the GCS API at the moment, so supporting GCS according to its SLAs is all we need.
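For a sense of scale: if each attempt independently hit a broken session with probability around 1/100,000 (the rate observed above), the chance of the initial attempt and both retries all failing with 410 would be on the order of (10^-5)^3 = 10^-15, so two retries are already far beyond what the observed failure rate would ever require.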