
Retry GCS Resumable Upload on Error 410 #45963

Merged

Conversation

original-brownbear
Member

A resumable upload session can fail with a 410 error and should
be retried in that case. I added retrying twice, using resetting of
the given `InputStream` as the retry mechanism, since the same
approach is already used by the AWS S3 SDK and relied upon by the
S3 repository implementation.

Related GCS documentation:
https://cloud.google.com/storage/docs/json_api/v1/status-codes#410_Gone


Note:

The retry count is intentionally not made configurable and is set to the small
value of 2. According to our interactions with Google support, running into a 410 error as a result of a broken resumable upload session is highly unlikely (empirically, the chance of hitting one in any given resumable upload session has been less than 1 in 100k in our testing so far).
=> Retrying more than twice (you could even argue that retrying once is enough) should never be needed.
=> Retrying more often would likely just mask some other issue and waste resources.
=> There is no need to make this configurable, since there are no Minio-like drop-in replacements for the GCS API at the moment, so supporting retries according to GCS's SLAs is all we need.
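The retry approach described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Elasticsearch code: the `Uploader` interface, `GoneException`, and `writeWithRetries` are all hypothetical stand-ins for the real GCS client plumbing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Minimal sketch of the retry-on-410 idea: mark the stream up front,
// and on a simulated "410 Gone" reset it and retry the whole upload.
public class ResumableUploadRetrySketch {

    static final int MAX_RETRIES = 2; // intentionally small, see note above

    // Stands in for a single resumable upload attempt against GCS.
    interface Uploader {
        void writeOnce(InputStream in);
    }

    // Stands in for a StorageException carrying HTTP 410.
    static class GoneException extends RuntimeException {}

    static void writeWithRetries(Uploader uploader, InputStream in) {
        in.mark(Integer.MAX_VALUE); // explicit mark() so reset() is well-defined
        GoneException last = null;
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            try {
                uploader.writeOnce(in);
                return; // upload succeeded
            } catch (GoneException e) {
                last = e; // only a 410 means "broken session, safe to retry"
                try {
                    in.reset(); // rewind the stream for the next attempt
                } catch (IOException io) {
                    throw new UncheckedIOException(io);
                }
            }
        }
        throw last; // retries exhausted
    }

    public static void main(String[] args) {
        // Uploader that fails once with a 410 and then succeeds.
        int[] calls = {0};
        Uploader flaky = in -> {
            if (calls[0]++ == 0) throw new GoneException();
        };
        writeWithRetries(flaky, new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println("attempts=" + calls[0]); // prints "attempts=2"
    }
}
```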

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

}
assert storageException != null;
throw new IOException(storageException);
Contributor


why wrap this here in an IOException whereas we have `throw se;` in line 275? As far as I can see we rethrow StorageException everywhere here, so I would just do `throw storageException` for now.

Member Author

@original-brownbear Sep 9, 2019


Sure, I just liked it better when I can see where it's actually thrown, but here it doesn't really matter, so I killed the rethrow.

if (errorCode == HTTP_GONE) {
logger.warn("Retrying broken resumable upload session for blob {}", blobInfo);
storageException = ExceptionsHelper.useOrSuppress(storageException, se);
inputStream.reset();
Contributor


where is inputStream.mark() called?

Member Author


Nowhere, which was fine for all the streams we seem to be using. Added an explicit call now though :)
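For context on the exchange above: whether `reset()` works without a prior `mark()` depends entirely on the concrete stream type. A small JDK-only illustration (not Elasticsearch code) of the difference:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {

    // Reads one byte, resets, and reads again; returns the re-read byte,
    // or -1 if reset() failed for this stream.
    static int readResetRead(InputStream in) {
        try {
            in.read();
            in.reset();
            return in.read();
        } catch (IOException e) {
            return -1; // reset() not valid for this stream without mark()
        }
    }

    public static void main(String[] args) {
        // ByteArrayInputStream supports reset() even without mark(): it
        // rewinds to the position recorded at construction time.
        System.out.println(readResetRead(
                new ByteArrayInputStream(new byte[] {7, 8, 9}))); // prints 7

        // BufferedInputStream throws on reset() when mark() was never
        // called, which is why adding an explicit mark() makes the retry
        // safe for more stream types.
        System.out.println(readResetRead(new BufferedInputStream(
                new ByteArrayInputStream(new byte[] {7, 8, 9})))); // prints -1

        // With an explicit mark(), BufferedInputStream resets fine too.
        InputStream marked = new BufferedInputStream(
                new ByteArrayInputStream(new byte[] {7, 8, 9}));
        marked.mark(16);
        System.out.println(readResetRead(marked)); // prints 7
    }
}
```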

@@ -256,6 +266,11 @@ public void setChunkSize(int chunkSize) {

@Override
public int write(ByteBuffer src) throws IOException {
// Only fail a blob once on a 410 error since the error is so unlikely in practice
if (simulated410s.add(blobInfo) && random.nextBoolean()) {
Contributor


can we test this with @tlrx's recent HTTP mock of GCS?

Member


That's the goal, but for now repository integration tests do not support GCS resumable uploads (as we don't generate large enough segment files).

I can add a GCS retries test at the blob store level, like I did for S3 (see S3BlobContainerRetriesTests), and test this retry logic with it.

Member Author


Left this for @tlrx now if that's ok :) No need for two people to mess with the API mock code concurrently I guess.

Member


It took some time, but I finally opened #46968, which adds some random 410 Gone errors when testing resumable uploads (see the testWriteLargeBlob test).
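The one-failure-per-blob simulation from the diff above boils down to a set of already-seen keys plus a coin flip. A hypothetical standalone sketch (a String key stands in for BlobInfo, and the 410 itself is modeled as a boolean rather than a real HTTP response):

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Sketch of the "only fail a blob once" 410 simulation.
public class Simulated410s {

    private final Set<String> seenBlobs = new HashSet<>();
    private final Random random;

    Simulated410s(Random random) {
        this.random = random;
    }

    // Each blob is considered for a simulated 410 only on its first write:
    // Set.add returns false for repeat keys, so repeated writes of the same
    // blob never fail again, keeping the fixed retry count of 2 sufficient.
    boolean shouldFail(String blobName) {
        return seenBlobs.add(blobName) && random.nextBoolean();
    }

    public static void main(String[] args) {
        Simulated410s sim = new Simulated410s(new Random());
        boolean first = sim.shouldFail("blob-a");   // may or may not fail
        boolean second = sim.shouldFail("blob-a");  // always false
        System.out.println(first + " " + second);
    }
}
```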

@original-brownbear
Member Author

Sorry for the delay here @ywelsch , all points addressed now :)

@original-brownbear
Member Author

original-brownbear commented Sep 9, 2019

Jenkins run elasticsearch-ci/1

(dep download failed)

Contributor

@ywelsch left a comment


LGTM

@original-brownbear
Member Author

Thanks @ywelsch !

@original-brownbear original-brownbear merged commit 3f9e86b into elastic:master Sep 17, 2019
@original-brownbear original-brownbear deleted the add-gcs-upload-retry branch September 17, 2019 14:20
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 17, 2019
original-brownbear added a commit that referenced this pull request Sep 17, 2019
tlrx added a commit that referenced this pull request Sep 24, 2019
Similarly to what has been done for S3 in #45383, this commit 
adds unit tests that verify the behavior of the SDK client and 
blob container implementation for Google Storage when the 
remote service returns errors.

The main purpose was to add an extra test to the specific retry 
logic for 410-Gone errors added in #45963.

Relates #45963
tlrx added a commit that referenced this pull request Sep 24, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 24, 2020
original-brownbear added a commit that referenced this pull request Jan 24, 2020