Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wrap s3 calls with retry to avoid race in s3 bucket creation calls #891

Merged
merged 5 commits into from
Aug 31, 2017
Merged

wrap s3 calls with retry to avoid race in s3 bucket creation calls #891

merged 5 commits into from
Aug 31, 2017

Conversation

xchapter7x
Copy link
Contributor

issue: [https://github.com//issues/877]

@radeksimko radeksimko added the bug Addresses a defect in current functionality. label Jun 19, 2017
@xchapter7x
Copy link
Contributor Author

tested locally with:

TF_ACC=1 go test ./aws -run TestAccAWSS3MultiBucket_withTags -v

@xchapter7x xchapter7x changed the title [WIP] add waituntilbucketexists to avoid race in s3 bucket creation calls add waituntilbucketexists to avoid race in s3 bucket creation calls Jun 19, 2017
@abbyachau
Copy link

hi @radeksimko, could you please give us an indication of when this pr will be reviewed? thanks!

@radeksimko
Copy link
Member

@xchapter7x thank you for raising the PR & issue.

While I believe there are some eventual consistency issues that need addressing in S3 I'm in doubts whether this is the right place as I'm not sure the error in #877 is coming from that specific place.

I did notice we get a bunch of test failures once in a while in our nightly acceptance tests, but the error usually came from a followup API call, e.g. setting policy, CORS, website, versioning, ...

Would you mind providing debug logs we can go through to understand the nature of the problem better (i.e. find out which API request is actually returning the error)?

TL;DR I believe that this is a real problem which needs fixing, but I'm not convinced this PR is actually fixing it. Maybe it is, but we need more data before moving forward.

Thanks.

@radeksimko radeksimko added the waiting-response Maintainers are waiting on response from community or contributor. label Jun 25, 2017
@xchapter7x
Copy link
Contributor Author

@radeksimko
You are correct about the errors coming from those followup API calls. We are seeing the same behavior (intermittently). This was the reasoning behind putting a wait before that follow on call, so it will block until the bucket resource is actually available for the following calls, avoiding any race.

Let me spin up an environment to replicate the issue and get you the requested logs.
Stay tuned.

@radeksimko
Copy link
Member

This was the reasoning behind putting a wait before that follow on call, so it will block until the bucket resource is actually available for the following calls, avoiding any race.

I'm afraid we'll have to add retries to each followup call as S3 is eventually consistent and the API doesn't do client pinning, AFAIK - which means that 1st success (200 OK) does not mean the change (new bucket) was propagated fully. I think we might still get NoSuchBucket from any API call which follows one that succeeded. At least this is a behaviour I did observe number of times in IAM and KMS.

In fact KMS was so badly affected by this that I had to implement "guessing" into the retry logic:
hashicorp/terraform#4447

btw. all this behaviour is pretty much undocumented, it's just what I remember from my past experience with the AWS APIs and it's hard to test. 😄

I hope that's not too discouraging - I'd love to get such intermittent issues fixed and I understand it's tricky to do so.

Regarding the code:

  • we have helpers (resource.Retry, resource.StateChangeConf) for retries which are used throughout most of the codebase and we'd probably prefer those over anything else, unless there's a good reason to use AWS SDK's ones
  • we tend to use a simple fmt.Sprintf instead of templates, unless there's a good reason to use templates

@xchapter7x
Copy link
Contributor Author

xchapter7x commented Jun 26, 2017

i have a large debug output log, tfplan & tfstate all reproducing the failure.
there might be some passwords and things, so i'd rather not post here. What's the best way to get you the files (email, secure gist, slack...)?

In regards to the feedback I'll take a look at implementing resource.Retry and others suggested instead of direct sdk calls.

I decided to use the template instead of Sprintf, b/c it was a lot larger than the other use cases in the tests that i saw, and made the implementation a bit more readable/manageable. Happy to switch it to use sprintf, if that is a more consistent implementation.

@radeksimko
Copy link
Member

@xchapter7x feel free to send me the debug log to [email protected] You can also GPG encrypt it using my pubkey published at https://keybase.io/radeksimko

Re templating - Admittedly the nice thing about templates is that we get a crash straight away if we can't compile the template (e.g. forgotten placeholder/variable), whereas we get invalid tf config with %d or %!(EXTRA string=variable) when using Sprintf. That said I'd be worried about potential unwanted HTML escaping and other behaviour as this is supposed to be used for HTML templates. How about we use text/template then? It's still one extra import, but I can see the benefit as I did run into invalid templates caused by Sprintf in the past. We could decouple this into a function in a separate file perhaps - to avoid that extra import.

@xchapter7x
Copy link
Contributor Author

I sent everything over in keybase. Let me know if you're able to find anything.

Im going to switch this PR back to WIP and, taking into account your feedback, make some changes. Ill let you know when thats done.

Thanks for your continued support in working through this.

@xchapter7x xchapter7x changed the title add waituntilbucketexists to avoid race in s3 bucket creation calls [WIP] add waituntilbucketexists to avoid race in s3 bucket creation calls Jun 27, 2017
@radeksimko
Copy link
Member

Thanks for providing the debug log, that's very useful. 👌

On first sight there seems to be quite a lot of 404s there, even s3/HeadBucket. 😞
I'll see if I can make a head and tail of this with some help of unix filtering tools, so we just get a sequence of API calls and error codes, ideally.

Based on the log it seems worse (404 appearing more often) than anything I've seen so far. On the positive side the worse the situation is the easier it should be to fix it. 😄

@xchapter7x
Copy link
Contributor Author

xchapter7x commented Jul 17, 2017

hi @radeksimko sorry for the long silence on this. I got pulled away. I found some time to revisit this today and I have a question around the resource.Retry functions currently in use that you suggested.

If I'm understanding the use case properly, this resource.Retry function is intended to wrap a call to the s3 client and provide retry logic in the event of an error.

However, the client calls currently being wrapped provide a request.Retryer in the client struct. This built in request.Retryer seems intended to be used in similar scenarios as the resource.Retry wrapper used throughout the terraform codebase.

Im curious why one would choose to use resource.Retry over the provided request.Retryer interface?

Specific to this PR, would it be an acceptable implementation to leverage request.Retryer instead? I ask b/c it doesn't seem to fit with the current implementation and wouldn't want to invest effort in a divergent, undesired path.

Your thoughts and feedback are greatly appreciated. Thanks.

@xchapter7x xchapter7x changed the title [WIP] add waituntilbucketexists to avoid race in s3 bucket creation calls wrap s3 calls with retry to avoid race in s3 bucket creation calls Aug 10, 2017
@xchapter7x
Copy link
Contributor Author

hi @radeksimko , sorry for the long delay. This should be ready to be reviewed. Let me know your thoughts and feedback. Thanks.

@Ninir Ninir requested a review from radeksimko August 30, 2017 12:08
Copy link
Member

@radeksimko radeksimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the patience.

@radeksimko radeksimko merged commit dba3893 into hashicorp:master Aug 31, 2017
@abbyachau
Copy link

Thanks @radeksimko very much. Do you know which release this fix will be included on please?

nbaztec pushed a commit to nbaztec/terraform-provider-aws that referenced this pull request Sep 26, 2017
wrap s3 calls with retry to avoid race in s3 bucket creation calls
@esilva-everbridge
Copy link

Have these changes been released? I'm seeing this issue with 0.11.1

@ghost
Copy link

ghost commented Apr 10, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 10, 2020
@breathingdust breathingdust removed the waiting-response Maintainers are waiting on response from community or contributor. label Sep 17, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants