Consul template retries #1224

Closed
gowthamsubbu opened this issue Jun 28, 2019 · 7 comments · Fixed by #1269

@gowthamsubbu

Consul Template retries against Vault do not work as expected when renewing or re-acquiring AWS secrets after the first retrieval while Vault is unavailable.

PS: Retries work as expected for DB engine secrets.

Consul Template version

Run consul-template -v to show the version. If you are not
running the latest version, please upgrade before submitting an
issue.

consul-template v0.19.5 (57b6c71)

Configuration

# This is the signal to listen for to trigger a reload event. The default
# value is shown below. Setting this value to the empty string will cause CT
# to not listen for any reload signals.
reload_signal = "SIGHUP"

# This is the signal to listen for to trigger a graceful stop. The default
# value is shown below. Setting this value to the empty string will cause CT
# to not listen for any graceful stop signals.
kill_signal = "SIGINT"

# This is the maximum interval to allow "stale" data. By default, only the
# Consul leader will respond to queries; any requests to a follower will
# forward to the leader. In large clusters with many requests, this is not as
# scalable, so this option allows any follower to respond to a query, so long
# as the last-replicated data is within these bounds. Higher values result in
# less cluster load, but are more likely to have outdated data.
max_stale = "10m"

# This is the log level. If you find a bug in Consul Template, please enable
# debug logs so we can help identify the issue. This is also available as a
# command line flag.
log_level = "warn"

# This is the path to store a PID file which will contain the process ID of the
# Consul Template process. This is useful if you plan to send custom signals
# to the process.
#pid_file = "/root/consul-template.pid"

# These are the quiescence timers; they define the minimum and maximum amount of
# time to wait for the cluster to reach a consistent state before rendering a
# template. This is useful to enable in systems that have a lot of flapping,
# because it will reduce the number of times a template is rendered.
wait {
  min = "5s"
  max = "10s"
}

# This denotes the start of the configuration section for Vault. All values
# contained in this section pertain to Vault.
vault {
  # This is the address of the Vault leader. The protocol (http(s)) portion
  # of the address is required.
  address = "https://vault.secrets:8200"

  # This is the grace period between lease renewal of periodic secrets and secret
  # re-acquisition. When renewing a secret, if the remaining lease is less than or
  # equal to the configured grace, Consul Template will request a new credential.
  # This prevents Vault from revoking the credential at expiration and Consul
  # Template having a stale credential.
  #
  # Note: If you set this to a value that is higher than your default TTL or
  # max TTL, Consul Template will always read a new secret!
  #
  # This should also be less than or around 1/3 of your TTL for a predictable
  # behaviour. See https://github.com/hashicorp/vault/issues/3414
  grace = "5m"

  # This is the token to use when communicating with the Vault server.
  # Like other tools that integrate with Vault, Consul Template makes the
  # assumption that you provide it with a Vault token; it does not have the
  # incorporated logic to generate tokens via Vault's auth methods.
  #
  # This value can also be specified via the environment variable VAULT_TOKEN.
  token = "{{vault_token}}"

  # This tells Consul Template that the provided token is actually a wrapped
  # token that should be unwrapped using Vault's cubbyhole response wrapping
  # before being used. Please see Vault's cubbyhole response wrapping
  # documentation for more information.
  unwrap_token = false

  # This option tells Consul Template to automatically renew the Vault token
  # given. If you are unfamiliar with Vault's architecture, Vault requires
  # tokens be renewed at some regular interval or they will be revoked. Consul
  # Template will automatically renew the token at half the lease duration of
  # the token. The default value is true, but this option can be disabled if
  # you want to renew the Vault token using an out-of-band process.
  #
  # Note that secrets specified in a template (using {{secret}} for example)
  # are always renewed, even if this option is set to false. This option only
  # applies to the top-level Vault token itself.
  renew_token = true

  # This section details the retry options for connecting to Vault. Please see
  # the retry options in the Consul section for more information (they are the
  # same).
  retry {
    # This enables retries. Retries are enabled by default, so this is
    # redundant.
    enabled = true

    # This specifies the number of attempts to make before giving up. Each
    # attempt adds the exponential backoff sleep time. Setting this to
    # zero will implement an unlimited number of retries.
    attempts = 12

    # This is the base amount of time to sleep between retry attempts. Each
    # retry sleeps for an exponent of 2 longer than this base. For 5 retries,
    # the sleep times would be: 250ms, 500ms, 1s, 2s, then 4s.
    backoff = "250ms"

    # This is the maximum amount of time to sleep between retry attempts.
    # When max_backoff is set to zero, there is no upper limit to the
    # exponential sleep between retry attempts.
    # If max_backoff is set to 10s and backoff is set to 1s, sleep times
    # would be: 1s, 2s, 4s, 8s, 10s, 10s, ...
    max_backoff = "1m"
  }

  # This block configures the SSL options for connecting to the Vault server.
  ssl {
    # This enables SSL. Specifying any option for SSL will also enable it.
    enabled = true

    # This enables SSL peer verification. The default value is "true", which
    # will check the global CA chain to make sure the given certificates are
    # valid. If you are using a self-signed certificate that you have not added
    # to the CA chain, you may want to disable SSL verification. However, please
    # understand this is a potential security vulnerability.
    verify = true

    # This is the path to the certificate to use to authenticate. If just a
    # certificate is provided, it is assumed to contain both the certificate and
    # the key to convert to an X509 certificate. If both the certificate and
    # key are specified, Consul Template will automatically combine them into an
    # X509 certificate for you.
    #cert = "/root/vault-ca.crt"
    #key  = "/path/to/client/key"

    # This is the path to the certificate authority to use as a CA. This is
    # useful for self-signed certificates or for organizations using their own
    # internal certificate authority.
    #ca_cert = "/consul-template/files/vault-ca.crt"
    ca_cert = "/etc/vaultca/ca_crt"

    # This is the path to a directory of PEM-encoded CA cert files. If both
    # `ca_cert` and `ca_path` are specified, `ca_cert` is preferred.
    #ca_path = "path/to/certs/"

    # This sets the SNI server name to use for validation.
    #server_name = "my-server.com"
  }
}

# This block defines the configuration for connecting to a syslog server for
# logging.
syslog {
  # This enables syslog logging. Specifying any other option also enables
  # syslog logging.
  #enabled = true

  # This is the name of the syslog facility to log to.
  #facility = "LOCAL5"
}

# This block defines the configuration for de-duplication mode. Please see the
# de-duplication mode documentation later in the README for more information
# on how de-duplication mode operates.
deduplicate {
  # This enables de-duplication mode. Specifying any other options also enables
  # de-duplication mode.
  #enabled = true

  # This is the prefix to the path in Consul's KV store where de-duplication
  # templates will be pre-rendered and stored.
  #prefix = "consul-template/dedup/"
}

# This block defines the configuration for exec mode. Please see the exec mode
# documentation at the bottom of this README for more information on how exec
# mode operates and the caveats of this mode.
exec {
  # This is the command to exec as a child process. There can be only one
  # command per Consul Template process.
  #command = "/usr/bin/app"

  # This is a random splay to wait before killing the command. The default
  # value is 0 (no wait), but large clusters should consider setting a splay
  # value to prevent all child processes from reloading at the same time when
  # data changes occur. When this value is set to non-zero, Consul Template
  # will wait a random period of time up to the splay value before reloading
  # or killing the child process. This can be used to prevent the thundering
  # herd problem on applications that do not gracefully reload.
  #splay = "5s"

  env {
    # This specifies if the child process should not inherit the parent
    # process's environment. By default, the child will have full access to the
    # environment variables of the parent. Setting this to true will send only
    # the values specified in `custom_env` to the child process.
    #pristine = false

    # This specifies additional custom environment variables in the form shown
    # below to inject into the child's runtime environment. If a custom
    # environment variable shares its name with a system environment variable,
    # the custom environment variable takes precedence. Even if pristine,
    # whitelist, or blacklist is specified, all values in this option
    # are given to the child process.
    #custom = ["PATH=$PATH:/etc/myapp/bin"]

    # This specifies a list of environment variables to exclusively include in
    # the list of environment variables exposed to the child process. If
    # specified, only those environment variables matching the given patterns
    # are exposed to the child process. These strings are matched using Go's
    # glob function, so wildcards are permitted.
    #whitelist = ["CONSUL_*"]

    # This specifies a list of environment variables to exclusively prohibit in
    # the list of environment variables exposed to the child process. If
    # specified, any environment variables matching the given patterns will not
    # be exposed to the child process, even if they are whitelisted. The values
    # in this option take precedence over the values in the whitelist.
    # These strings are matched using Go's glob function, so wildcards are
    # permitted.
    #blacklist = ["VAULT_*"]
  }

  # This defines the signal that will be sent to the child process when a
  # change occurs in a watched template. The signal will only be sent after the
  # process is started, and the process will only be started after all
  # dependent templates have been rendered at least once. The default value is
  # nil, which tells Consul Template to stop the child process and spawn a new
  # one instead of sending it a signal. This is useful for legacy applications
  # or applications that cannot properly reload their configuration without a
  # full reload.
  #reload_signal = ""

  # This defines the signal sent to the child process when Consul Template is
  # gracefully shutting down. The application should begin a graceful cleanup.
  # If the application does not terminate before the `kill_timeout`, it will
  # be terminated (effectively "kill -9"). The default value is "SIGTERM".
  #kill_signal = "SIGINT"

  # This defines the amount of time to wait for the child process to gracefully
  # terminate when Consul Template exits. After this specified time, the child
  # process will be force-killed (effectively "kill -9"). The default value is
  # "30s".
  #kill_timeout = "2s"
}

# This block defines the configuration for a template. Unlike other blocks,
# this block may be specified multiple times to configure multiple templates.
# It is also possible to configure templates via the CLI directly.
template {
  # This is the source file on disk to use as the input template. This is often
  # called the "Consul Template template". This option is required if not using
  # the `contents` option.
  source = "/config/template/template.ctmpl"

  # This is the destination path on disk where the source template will render.
  # If the parent directories do not exist, Consul Template will attempt to
  # create them, unless create_dest_dirs is false.
  destination = "/config/secrets"

  # This option tells Consul Template to create the parent directories of the
  # destination path if they do not exist. The default value is true.
  create_dest_dirs = true

  # This option allows embedding the contents of a template in the configuration
  # file rather than supplying the `source` path to the template file. This is
  # useful for short templates. This option is mutually exclusive with the
  # `source` option.
  #contents = "{{ keyOrDefault \"service/redis/maxconns@east-aws\" \"5\" }}"

  # This is the optional command to run when the template is rendered. The
  # command will only run if the resulting template changes. The command must
  # return within 30s (configurable), and it must have a successful exit code.
  # Consul Template is not a replacement for a process monitor or init system.
  #command = "restart service foo"

  # This is the maximum amount of time to wait for the optional command to
  # return. Default is 30s.
  #command_timeout = "60s"

  # Exit with an error when accessing a struct or map field/key that does not
  # exist. The default behavior will print "<no value>" when accessing a field
  # that does not exist. It is highly recommended you set this to "true" when
  # retrieving secrets from Vault.
  error_on_missing_key = true

  # This is the permission to render the file. If this option is left
  # unspecified, Consul Template will attempt to match the permissions of the
  # file that already exists at the destination path. If no file exists at that
  # path, the permissions are 0644.
  # Commenting out Permissions will allow reading from any container in the pod
  #perms = 0600

  # This option backs up the previously rendered template at the destination
  # path before writing a new one. It keeps exactly one backup. This option is
  # useful for preventing accidental changes to the data without having a
  # rollback strategy.
  backup = true

  # These are the delimiters to use in the template. The default is "{{" and
  # "}}", but for some templates, it may be easier to use a different delimiter
  # that does not conflict with the output file itself.
  left_delimiter  = "{{"
  right_delimiter = "}}"

  # This is the `minimum(:maximum)` to wait before rendering a new template to
  # disk and triggering a command, separated by a colon (`:`). If the optional
  # maximum value is omitted, it is assumed to be 4x the required minimum value.
  # This is a numeric time with a unit suffix ("5s"). There is no default value.
  # The wait value for a template takes precedence over any globally-configured
  # wait.
  wait {
    min = "2s"
    max = "10s"
  }
}
# Copy-paste your Consul Template template here
{
  "test-aws": {{with secret "test-aws/creds/testrole"}} {
    "aws_access_key": "{{.Data.access_key}}",
    "aws_secret_key": "{{.Data.secret_key}}",
    "aws_security_token": "{{.Data.security_token}}",
    {{end}}
    "type": "AWS"},
  "Project_Name": "vault-testing-microservice"
}

Debug output

Provide a link to a GitHub Gist containing the complete debug
output by running with -log-level=trace.

2019/06/28 19:37:09.469556 [WARN] (view) vault.read(test-aws/creds/testrole): vault.read(test-aws/creds/testrole): Get https://vault.secrets:8200/v1/test-aws/creds/testrole: dial tcp 172.20.53.25:8200: i/o timeout (retry attempt 1 after "250ms")
2019/06/28 19:56:46.993714 [WARN] (view) vault.read(test-aws/creds/testrole): vault.read(test-aws/creds/testrole): Get https://vault.secrets:8200/v1/test-aws/creds/testrole: dial tcp 172.20.53.25:8200: i/o timeout (retry attempt 2 after "500ms")
2019/06/28 20:09:09.985364 [WARN] (view) vault.read(test-aws/creds/testrole): vault.read(test-aws/creds/testrole): Get https://vault.secrets:8200/v1/test-aws/creds/testrole: dial tcp 172.20.53.25:8200: i/o timeout (retry attempt 3 after "1s")
2019/06/28 20:28:58.213988 [WARN] (view) vault.read(test-aws/creds/testrole): vault.read(test-aws/creds/testrole): Get https://vault.secrets:8200/v1/test-aws/creds/testrole: dial tcp 172.20.53.25:8200: i/o timeout (retry attempt 4 after "2s")

Expected behavior

What should have happened?

Retries should have happened at the right interval

Actual behavior

What actually happened?

Retries are happening based on the sleep time calculated for AWS secrets renewal/re-acquisition

Steps to reproduce

  1. Deploy consul-template in a Kubernetes cluster to retrieve AWS secrets from Vault
  2. Consul-template connects to Vault and retrieves the AWS secrets for the first time
  3. Take Vault out of service and check consul-template's retry behaviour for AWS secret renewal
@eikenb
Contributor

eikenb commented Jul 8, 2019

Hey @gowthamsubbu, thanks for filing the issue.

I have a few questions to clarify things.

In the configuration you've provided, the vault.retry section is configured to start retries at 250ms, with a maximum of 12 attempts and a max backoff of 1m. The debug output you have matches that configuration: it starts at 250ms and doubles each time after, presumably continuing for 12 attempts and maxing out at a 1 minute delay. What did you expect to see, and why (i.e., what is "the right interval")?
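
For reference, the schedule this configuration implies can be worked out with a minimal sketch. It assumes the doubling-and-cap rule described in the config comments (each sleep is `backoff` times a power of two, capped at `max_backoff`, with zero meaning no cap) and that 12 attempts means 12 sleeps. `backoff_schedule` is a hypothetical helper written for illustration, not Consul Template code.

```python
from datetime import timedelta


def backoff_schedule(attempts, backoff, max_backoff):
    """Return the sleep before each retry: backoff * 2**n, capped at max_backoff."""
    sleeps = []
    for n in range(attempts):
        sleep = backoff * (2 ** n)
        # Assumption taken from the config comments: a zero max_backoff
        # means there is no upper limit on the sleep.
        if max_backoff > timedelta(0) and sleep > max_backoff:
            sleep = max_backoff
        sleeps.append(sleep)
    return sleeps


# Values from the vault.retry block in the configuration above.
schedule = backoff_schedule(12, timedelta(milliseconds=250), timedelta(minutes=1))
for attempt, sleep in enumerate(schedule, start=1):
    print(f"retry attempt {attempt} after {sleep.total_seconds()}s")
print("total:", sum(schedule, timedelta(0)))
```

Under those assumptions the sleeps are 0.25s, 0.5s, 1s, 2s, 4s, 8s, 16s, 32s and then 60s four times, so the whole retry budget is a little over five minutes.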

You mention 2 sets of secrets, aws and dbs, and that the retries work correctly for one and not the other. Are they both in the same Vault? Are there any other differences between these 2 sets you could highlight to help explain why one is working and the other is not?

Thanks.

@gowthamsubbu
Author

gowthamsubbu commented Jul 8, 2019

@eikenb If you look at the timestamps between retries in the logs, the gaps are nowhere near 250ms or 500ms. That is because AWS STS federation token and AssumeRole credentials are not renewable, so Consul Template sleeps for 1/3 of the TTL before the next retry. The real concern is that it does not differentiate a failed connection to Vault (in this case, a timeout) while retrieving AWS secrets from a successful fetch, and sleeps again either way.

https://github.com/hashicorp/consul-template/blob/master/dependency/vault_read.go#L95

You mention 2 sets of secrets, aws and dbs, and that the retries work correctly for one and not the other. Are they both in the same Vault? Are there any other differences between these 2 sets you could highlight to help explain why one is working and the other is not?

Yes, both are in the same Vault. It works as expected for DB secrets because those are renewable: there is no sleep involved, so retries happen within the configured backoff and max backoff intervals.

https://github.com/hashicorp/consul-template/blob/master/dependency/vault_read.go#L64

@eikenb
Contributor

eikenb commented Aug 2, 2019

Thanks for the response.

I see what you are talking about with those log entries. They say they are going to retry in 250ms, etc., but the timestamps for the next try definitely don't line up with that.
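
To put numbers on that mismatch, the following sketch parses the four timestamps from the debug output above and compares each advertised retry delay with the time that actually elapsed before the next log line. `observed_gaps` is a hypothetical helper written for this illustration.

```python
from datetime import datetime, timedelta

# Timestamps and advertised retry delays copied from the debug output above.
entries = [
    ("2019/06/28 19:37:09.469556", timedelta(milliseconds=250)),
    ("2019/06/28 19:56:46.993714", timedelta(milliseconds=500)),
    ("2019/06/28 20:09:09.985364", timedelta(seconds=1)),
    ("2019/06/28 20:28:58.213988", timedelta(seconds=2)),
]


def observed_gaps(entries):
    """Pair each advertised delay with the real time until the next log line."""
    times = [datetime.strptime(ts, "%Y/%m/%d %H:%M:%S.%f") for ts, _ in entries]
    return [
        (advertised, later - earlier)
        for (_, advertised), earlier, later in zip(entries, times, times[1:])
    ]


for advertised, actual in observed_gaps(entries):
    print(f"advertised {advertised.total_seconds()}s, observed {actual}")
```

The observed gaps come out at roughly 19m38s, 12m23s and 19m48s, against advertised delays of 250ms, 500ms and 1s.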

It sounds like this is a more general issue with tokens that fail to renew, where there are errors acquiring them: getting an error doesn't change the behavior, and it still waits the same amount of time whether or not it succeeded in getting the token. Instead, it should use the exponential backoff retry algorithm. Sound right?

[edit: changed confusing use of 'non-renewable tokens' to 'tokens that fail to renew' which is what I meant]

eikenb added the bug label Aug 2, 2019

@gowthamsubbu
Author

Exactly. It should retry with exponential backoff instead of waiting out the full sleep duration before retrying.

@eikenb
Contributor

eikenb commented Aug 26, 2019

The code currently works the way it does because...

  1. CT polls to get the secret at around 90% of the secret's TTL (default 5 minutes if no ttl)
  2. it is designed to keep using the current secret if vault is down (could be still good)
  3. it has no code to change behaviors if it fails/vault is down

So 2-3 seem to be good compromises to keep the secret renewed without being too demanding of the server while allowing for retries in case of issues.

Where things go astray is a fourth case: when the lease has expired. This is where it seems like it should switch to the same behavior as if it never had a secret.

[edit: changing to not be about renewable issues.]

@eikenb
Contributor

eikenb commented Aug 26, 2019

I think just re-arranging the code might do it. It currently checks if it has a secret, and if so checks to see if it should try to renew or just sleep. Then after that it will try to get a new secret. I could invert that, so it first tries to get the secret and then, if it got one, stores it and checks for renew or sleep. This fixes the issue because the call to read the new secret will error when Vault isn't available and return before getting to the renew/sleep code. By returning, it will fall back to the normal retry system.

I've tried it and it works. The code passes all the tests as well. The only downside I see is that, for renewable secrets, it will immediately renew the secret, because the renew code (in Vault's API) renews first, then sleeps. I could add a sleep before starting the renewer, but I'm going to check with the Vault people about this.

@eikenb
Contributor

eikenb commented Aug 26, 2019

Found the caveat with that fix... the old/current logic has it return immediately after getting the secret the first time. With the changed logic it sleeps even that first time. That would delay the initial template rendering by that sleep time. That won't work. Going to need to think about either a fix for that or another way to achieve the fix.
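
The inversion and its caveat, as described in the two comments above, can be sketched roughly as follows. This is an illustration only, not the Consul Template source: `fetch_then_sleep`, `VaultUnavailable`, the injected `read_secret` and `sleep` callables, and the 300 second sleep are all hypothetical.

```python
class VaultUnavailable(Exception):
    """Stand-in for a failed Vault read (hypothetical name)."""


def fetch_then_sleep(read_secret, sleep, sleep_s):
    """Inverted order: read first, then sleep only if the read succeeded.

    If read_secret raises, the exception propagates before sleep is called,
    so the caller's normal retry/backoff logic can take over.
    """
    secret = read_secret()
    sleep(sleep_s)  # the caveat: this also runs after the very first read
    return secret


slept = []  # records requested sleeps instead of really sleeping


def failing_read():
    raise VaultUnavailable("dial tcp: i/o timeout")


try:
    fetch_then_sleep(failing_read, slept.append, 300)
except VaultUnavailable:
    print("read failed, slept:", slept)

print(fetch_then_sleep(lambda: "secret", slept.append, 300), "slept:", slept)
```

The failed read never reaches the sleep, which is the desired effect. The successful read does sleep before returning, which is why the initial template render would be delayed.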

eikenb added a commit that referenced this issue Aug 27, 2019
When Consul Template had trouble fetching a non-renewable Vault secret,
it would wait the standard exponential backoff time plus the configured
sleep time (as it does between successful fetches). What it should do is
use the sleep time between successful fetches and exponential backoff on
failures.

While fixing this I cleaned up the code to make the logic clearer.
The issue existed in both vault_read and vault_write, and they shared a
common chunk of renew logic with each other and with vault_token, so I
refactored that out into a common function.

Fixes #1224
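
The intended behavior stated in this commit message (normal sleep between successful fetches, exponential backoff on failures) can be modeled with a small sketch. It is a simplified model under stated assumptions, not the refactored Consul Template function: `poll_delays` and its parameters are hypothetical, and the 1200 second sleep is an arbitrary illustrative value.

```python
def poll_delays(outcomes, sleep_s, backoff_s, max_backoff_s):
    """Return the wait chosen after each fetch attempt.

    outcomes: iterable of booleans, True for a successful fetch.
    After a success, wait the normal sleep time and reset the backoff.
    After a failure, wait only the exponential backoff, not the sleep time.
    """
    delays = []
    failures = 0
    for ok in outcomes:
        if ok:
            failures = 0
            delays.append(sleep_s)
        else:
            delays.append(min(backoff_s * 2 ** failures, max_backoff_s))
            failures += 1
    return delays


# Hypothetical run: one good fetch, Vault down for four attempts, then recovery.
outcomes = [True, False, False, False, False, True]
print(poll_delays(outcomes, sleep_s=1200.0, backoff_s=0.25, max_backoff_s=60.0))
# prints [1200.0, 0.25, 0.5, 1.0, 2.0, 1200.0]
```

In this model a failure never adds the long sleep on top of the backoff, which is the behavior the issue asked for.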
eikenb added three more commits that referenced this issue (one on Aug 27, 2019 and two on Aug 30, 2019), each carrying essentially the same commit message as above.