Add blob container retries tests for Google Cloud Storage #46968

Merged
2 commits merged on Sep 24, 2019

Conversation

tlrx
Member

@tlrx tlrx commented Sep 23, 2019

Similarly to what has been done for S3 in #45383, this commit adds unit tests that verify the behavior of the SDK client and blob container implementation for Google Storage when the remote service returns errors.

The main purpose was to add an extra test for the specific retry logic for 410-Gone errors added in #45963, but while I was there I also added tests for the other read/write methods.

Relates #45963
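
For readers unfamiliar with the 410-Gone case, here is a minimal sketch of the kind of retry behavior such a test would exercise. It assumes that a 410 response invalidates the resumable upload, so the whole upload is restarted, while any other error propagates at once. `GoneRetrySketch`, `StatusException`, `UploadAttempt`, `uploadWithRetries` and `simulate` are hypothetical names for illustration only; they are not the Elasticsearch or Google SDK API.

```java
/** Hypothetical sketch: restart a whole upload when the service answers 410 Gone. */
final class GoneRetrySketch {

    static final int HTTP_GONE = 410;

    /** Hypothetical stand-in for an SDK exception that carries an HTTP status code. */
    static final class StatusException extends RuntimeException {
        final int status;

        StatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    /** One full upload attempt: open a new session, then send every chunk. */
    interface UploadAttempt {
        void run();
    }

    /**
     * Runs the upload up to maxAttempts times. Only 410 responses trigger a restart;
     * any other failure propagates immediately. Returns the number of attempts used.
     */
    static int uploadWithRetries(UploadAttempt attempt, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        StatusException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.run();
                return i;
            } catch (StatusException e) {
                if (e.status != HTTP_GONE) {
                    throw e; // not a 410: do not retry
                }
                last = e; // 410: start over with a fresh attempt
            }
        }
        throw last; // every attempt ended with 410
    }

    /** Simulates a service that answers 410 a fixed number of times, then accepts the upload. */
    static int simulate(int goneResponses, int maxAttempts) {
        int[] remaining = {goneResponses};
        return uploadWithRetries(() -> {
            if (remaining[0] > 0) {
                remaining[0]--;
                throw new StatusException(HTTP_GONE);
            }
        }, maxAttempts);
    }

    public static void main(String[] args) {
        System.out.println("attempts used: " + simulate(2, 3)); // prints "attempts used: 3"
    }
}
```

A test built on this idea makes the fake service return 410 a known number of times and then asserts on the number of attempts, or on the final exception once the retry budget is exhausted.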

@tlrx tlrx added the >test (issues or PRs that are addressing/adding tests), :Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), v8.0.0, and v7.5.0 labels Sep 23, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Member

@original-brownbear original-brownbear left a comment

Thanks @tlrx :) Just a few random NITs and the timeout issue I commented on (I think we should do something here ... or maybe just use longer timeouts since it's a rarely() thing). Let me know what you think there :)


public void testWriteLargeBlob() throws IOException {
    final boolean useTimeout = rarely();
    final TimeValue readTimeout = useTimeout ? TimeValue.timeValueMillis(randomIntBetween(100, 500)) : null;
Member

This has me a little worried stability-wise. It seems all it takes for this test to fail is a GC pause with unlucky timing?
Can we harden the test against this scenario somehow, like we did in the S3 tests?

Member Author

Thanks for this comment. I've thought a bit about what you said and suggested, and I agree this test can fail if a GC pause hits at the wrong time. It can also be quite hard to investigate, because with low read timeout values a request timeout could be caused either by a GC pause or by the test itself.

To mitigate this, I've changed the test to use a higher value for the read timeout client setting (I picked 3s) and to simulate read timeouts only for the resumable upload session init and for the first chunk upload. This way we still test that read timeouts work for the 2 types of resumable requests, but we only fail once for each.

This allows us to use a higher read timeout value while keeping the test under the 10-15 seconds execution time.

Please let me know what you think!
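
The "fail once per request type" idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: `TimeoutOnceSketch`, `RequestType`, `shouldSimulateTimeout` and `worstCaseWaitMillis` are hypothetical names, and the time estimate ignores any retry backoff and the time spent on the actual upload.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a "time out once per request type" test fixture. */
final class TimeoutOnceSketch {

    /** The two kinds of resumable upload requests mentioned in the discussion. */
    enum RequestType { SESSION_INIT, CHUNK_UPLOAD }

    // Request types that have already been made to time out once.
    private final Set<RequestType> alreadyTimedOut = ConcurrentHashMap.newKeySet();

    /**
     * Returns true only for the first request of each type. A fake server would
     * then withhold its response so that the client hits its read timeout.
     */
    boolean shouldSimulateTimeout(RequestType type) {
        return alreadyTimedOut.add(type);
    }

    /** Upper bound on time spent waiting for simulated timeouts: one per request type. */
    static long worstCaseWaitMillis(long readTimeoutMillis) {
        return readTimeoutMillis * RequestType.values().length;
    }

    public static void main(String[] args) {
        TimeoutOnceSketch fixture = new TimeoutOnceSketch();
        System.out.println(fixture.shouldSimulateTimeout(RequestType.SESSION_INIT)); // true
        System.out.println(fixture.shouldSimulateTimeout(RequestType.SESSION_INIT)); // false
        System.out.println(fixture.shouldSimulateTimeout(RequestType.CHUNK_UPLOAD)); // true
        System.out.println(worstCaseWaitMillis(3000)); // 6000
    }
}
```

With a 3s read timeout and two request types, the simulated timeouts add at most about 6 seconds of waiting, which leaves room within the 10-15 seconds budget.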

Member

@original-brownbear original-brownbear left a comment

LGTM :)

Let's try with the longer timeout. We both know what's going on here and can jump in if it turns out to be unstable. The chance of that might in fact be very low, since the GC pause has to hit right on the physical read call (sort of, since it's async IO), so I'm optimistic :)

@tlrx tlrx merged commit 6061912 into elastic:master Sep 24, 2019
@tlrx tlrx deleted the add-retries-tests-for-gcs branch September 24, 2019 06:57
Member Author

tlrx commented Sep 24, 2019

Thanks @original-brownbear !

tlrx added a commit that referenced this pull request Sep 24, 2019
Similarly to what has been done for S3 in #45383, this commit 
adds unit tests that verify the behavior of the SDK client and 
blob container implementation for Google Storage when the 
remote service returns errors.

The main purpose was to add an extra test to the specific retry 
logic for 410-Gone errors added in #45963.

Relates #45963
Labels
:Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), >test (issues or PRs that are addressing/adding tests), v7.5.0, v8.0.0-alpha1
4 participants