-
Notifications
You must be signed in to change notification settings - Fork 781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consul template retries #1224
Comments
Hey @gowthamsubbu, thanks for filing the issue. I have a few questions to clarify things. In the configuration you've provided, in the vault.retry section it is configured to start retries at 250ms with a maximum of 12 retries with a max backoff of 1m. The debug output you have matches that configuration as it starts at 250ms and doubles each time after, presumably continuing on for 12 times and maxing out at a 1 minute delay. What did you expect to see and why (ie. what is "the right interval")? You mention 2 sets of secrets, aws and dbs, and that the retries work correctly for one and not the other. Are they both in the same Vault? Are there any other differences between these 2 sets you could highlight to help explain why is one working and the other not? Thanks. |
@eikenb If you look at the time frame between retries in the logs it didn't seem like a 250ms, 500ms interval since AWS STSFederation tokens and AssumeRole are not renewable. As a result, it sleeps for 1/3 of TTL before the next retry. However, the real issue being not able differentiate between failure (in this case, time out) connection with Vault while retrieving AWS secrets and sleeping again is the concern. https://github.com/hashicorp/consul-template/blob/master/dependency/vault_read.go#L95
Yes, both are in the same Vault. It is working as expected for DB secrets since those are renewable secrets and there is no sleep involved and retries work as expected within the backoff and max backoff interval https://github.com/hashicorp/consul-template/blob/master/dependency/vault_read.go#L64 |
Thanks for the response. I see what you are talking about with those log entries. They say they are going to retry in 250ms, etc. but the timestamps for the next try definitely doesn't line up with that. It sounds like this should be a more generalized issue with tokens that fail to renew, where there are errors acquiring them. That it getting an error doesn't change the behavior, that it still waits the same whether it is successful in getting the token or not. Instead it should be using the exponential backoff, retry algorithm. Sound right? [edit: changed confusing use of 'non-renewable tokens' to 'tokens that fail to renew' which is what I meant] |
Exactly. It should be using exponential backoff to retry instead of waiting for retry until sleep duration. |
The code works the way it does currently is because...
So 2-3 seem to be good compromises to keep the secret renewed without being to demanding of the server while allowing for retries in case of issues. Where things go astray is at 4, when the Lease has expired. This is where it seems like it should switch to the same behavior as if it never had a secret. [edit: changing to not be about renewable issues.] |
I think just re-arranging the code might do it. It currently checks if it has a secret, and if so checks to see if it should try to renew or just sleep. Then after that it will try to get a new secret. If I just invert that, so it first tries to get the secret then, if it got one, stores and and then checks for renew or sleep. This works to fix this as the call to read the new secret will error when vault isn't available and return before getting to the renew/sleep code. By returning it will fall back to the normal retry system. I've tried it and it works. The code passes all the tests as well. The only downside I see is that it, for renewable secrets, it will immediately renew it as the renew code (in vault's api) renews first, then sleeps. I could add a sleep before starting the renewer, but I'm going to check with the vault people about this. |
Found the caveat with that fix... the old/current logic has it return immediately after getting the secret the first time. With the changed logic it sleeps even that first time. That would delay the initial template rendering by that sleep time. That won't work. Going to need to think about either a fix for that or another way to achieve the fix. |
The original problem was that for non-renewable vault secrets that it was having trouble fetching, it would wait the standard exponential backoff time plus the configured sleep time (like it does between successful fetches). When what it should do is use the sleep time between successful fetches and exponential backoff on failures. While fixing this I cleaned up the code to make the logic more clear. The issue existed in both vault_read and vault_write, and they shared a common chunk of renew logic between them and with vault_token. So I refactored that out into a common function. Fixes #1224
The original problem was that for non-renewable vault secrets that it was having trouble fetching, it would wait the standard exponential backoff time plus the configured sleep time (like it does between successful fetches). When what it should do is use the sleep time between successful fetches and exponential backoff on failures. While fixing this I cleaned up the code to make the logic clear. The issue existed in both vault_read and vault_write, and they shared a common chunk of renew logic between them and with vault_token. So I refactored that out into a common function. Fixes #1224
The original problem was that for non-renewable vault secrets that it was having trouble fetching, it would wait the standard exponential backoff time plus the configured sleep time (like it does between successful fetches). When what it should do is use the sleep time between successful fetches and exponential backoff on failures. While fixing this I cleaned up the code to make the logic clear. The issue existed in both vault_read and vault_write, and they shared a common chunk of renew logic between them and with vault_token. So I refactored that out into a common function. Fixes #1224
The original problem was that for non-renewable vault secrets that it was having trouble fetching, it would wait the standard exponential backoff time plus the configured sleep time (like it does between successful fetches). When what it should do is use the sleep time between successful fetches and exponential backoff on failures. While fixing this I cleaned up the code to make the logic clear. The issue existed in both vault_read and vault_write, and they shared a common chunk of renew logic between them and with vault_token. So I refactored that out into a common function. Fixes #1224
Consul template retries for Vault are not working as expected while renewing/re-acquiring AWS secrets after first retrieval when Vault is not in a working state.
PS: Retries are working as expected for DB engines secrets
Consul Template version
Run
consul-template -v
to show the version. If you are notrunning the latest version, please upgrade before submitting an
issue.
consul-template v0.19.5 (57b6c71)
Configuration
Debug output
Provide a link to a GitHub Gist containing the complete debug
output by running with
-log-level=trace
.2019/06/28 19:37:09.469556 [WARN] (view) vault.read(test-aws/creds/testrole): vault.read(test-aws/creds/testrole): Get https://vault.secrets:8200/v1/test-aws/creds/testrole: dial tcp 172.20.53.25:8200: i/o timeout (retry attempt 1 after "250ms")
2019/06/28 19:56:46.993714 [WARN] (view) vault.read(test-aws/creds/testrole): vault.read(test-aws/creds/testrole): Get https://vault.secrets:8200/v1/test-aws/creds/testrole: dial tcp 172.20.53.25:8200: i/o timeout (retry attempt 2 after "500ms")
2019/06/28 20:09:09.985364 [WARN] (view) vault.read(test-aws/creds/testrole): vault.read(test-aws/creds/testrole): Get https://vault.secrets:8200/v1/test-aws/creds/testrole: dial tcp 172.20.53.25:8200: i/o timeout (retry attempt 3 after "1s")
2019/06/28 20:28:58.213988 [WARN] (view) vault.read(test-aws/creds/testrole): vault.read(test-aws/creds/testrole): Get https://vault.secrets:8200/v1/test-aws/creds/testrole: dial tcp 172.20.53.25:8200: i/o timeout (retry attempt 4 after "2s")
Expected behavior
What should have happened?
Retries should have happened at the right interval
Actual behavior
What actually happened?
Retries are happening based on the sleep time calculated for AWS secrets renewal/re-acquisition
Steps to reproduce
The text was updated successfully, but these errors were encountered: