Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vault agent DoS's the vault server if template command fails #12566

Closed
nvx opened this issue Sep 16, 2021 · 11 comments · Fixed by #25497
Closed

Vault agent DoS's the vault server if template command fails #12566

nvx opened this issue Sep 16, 2021 · 11 comments · Fixed by #25497
Labels
agent bug Used to indicate a potential bug performance

Comments

@nvx
Copy link
Contributor

nvx commented Sep 16, 2021

Describe the bug
Running a vault agent with a command that returns a non-zero error code results in the template being restarted immediately with no back-off or retry delay. This results in many requests to the Vault server in quick succession which can cause a Vault outage due to resource exhaustion.

To Reproduce

  1. Setup Vault agent with a template stanza containing a command that fails
  2. Watch the Vault Agent logs to see it repeatedly restarting the template without any back-off
  3. Look at the Vault Server logs and observe that not only is it wasting local resources, it's also wasting resources on the Vault server too

Expected behavior
Other areas of the Vault Agent such as the auto auth implement a back-off mechanism on failure to avoid excessive resource consumption (either locally or from the Vault server). A similar back-off when restarting the template would resolve this issue.

Environment:
Vault v1.8.2 (aca76f6)

Additional context
#9200 appears to have implemented the auto-restarting behaviour. This might be a location where such a back-off could be added.
#9059 may or may not be related to this, I'm not sure enough of consul-templates internals to know if setting that retry would fix this issue or not.

Setting this config option causes Vault Agent to quit on error rather than retry which mitigates this issue, but then relies on something else to restart the Vault Agent (hopefully with an appropriate back-off time):

template_config {
  exit_on_retry_failure = true
}

Vault Agent Logs

==> Vault agent started! Log data will stream in below:
==> Vault agent configuration:
           Api Address 1: unix:///var/run/vault-agent.sock
           Api Address 2: http://127.0.0.1:8200
                     Cgo: disabled
               Log Level: info
                 Version: Vault v1.8.2
             Version Sha: aca76f63357041a43b49f3e8c11d67358496959f
2021-09-16T11:25:18.343+1000 [INFO]  sink.file: creating file sink
2021-09-16T11:25:18.343+1000 [INFO]  sink.file: file sink configured: path=/etc/vault-agent/token mode=-rw-r-----
2021-09-16T11:25:18.345+1000 [INFO]  template.server: starting template server
2021-09-16T11:25:18.345+1000 [INFO]  sink.server: starting sink server
2021-09-16T11:25:18.345+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:18.345+1000 [INFO]  auth.handler: starting auth handler
2021-09-16T11:25:18.345+1000 [INFO]  auth.handler: authenticating
2021-09-16T11:25:18.346+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:18.509+1000 [INFO]  auth.handler: authentication successful, sending token to sinks
2021-09-16T11:25:18.509+1000 [INFO]  auth.handler: starting renewal process
2021-09-16T11:25:18.509+1000 [INFO]  template.server: template server received new token
2021-09-16T11:25:18.509+1000 [INFO] (runner) stopping
2021-09-16T11:25:18.509+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:18.510+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:18.510+1000 [INFO] (runner) starting
2021-09-16T11:25:18.510+1000 [INFO]  sink.file: token written: path=/etc/vault-agent/token
2021-09-16T11:25:18.624+1000 [INFO]  auth.handler: renewed auth token
2021-09-16T11:25:18.703+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/tls/ca.crt"
2021-09-16T11:25:18.705+1000 [INFO] (runner) executing command "/bin/sh -c 'service consul reload || true'" from "(dynamic)" => "/etc/consul.d/tls/ca.crt"
2021-09-16T11:25:18.705+1000 [INFO] (child) spawning: /bin/sh -c service consul reload || true
Redirecting to /bin/systemctl reload consul.service
consul.service is not active, cannot reload.
2021-09-16T11:25:18.793+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/tls/consul.crt"
2021-09-16T11:25:18.794+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/tls/consul.key"
2021-09-16T11:25:18.795+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/consul_encrypt.hcl"
2021-09-16T11:25:18.796+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:18.796+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/clientkeystore.pem"
2021-09-16T11:25:18.796+1000 [INFO] (runner) executing command "/bin/sh -c 'service consul reload || true'" from "(dynamic)" => "/etc/consul.d/tls/consul.crt"
2021-09-16T11:25:18.796+1000 [INFO] (child) spawning: /bin/sh -c service consul reload || true
Redirecting to /bin/systemctl reload consul.service
consul.service is not active, cannot reload.
2021-09-16T11:25:18.827+1000 [INFO] (runner) executing command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:18.827+1000 [INFO] (child) spawning: /etc/vault-agent/some-script.sh
[stderr from script]
2021-09-16T11:25:18.937+1000 [ERROR] template.server: template server error:
  error=
  | 1 error occurred:
  |         * failed to execute command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem": child: command exited with a non-zero exit status:
  |
  |     /etc/vault-agent/some-script.sh
  |
  | This is assumed to be a failure. Please ensure the command
  | exits with a zero exit status.
  |

2021-09-16T11:25:18.937+1000 [INFO] (runner) stopping
2021-09-16T11:25:18.937+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:18.937+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:18.938+1000 [INFO] (runner) starting
2021-09-16T11:25:19.061+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/tls/consul.crt"
2021-09-16T11:25:19.062+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/tls/consul.key"
2021-09-16T11:25:19.063+1000 [INFO] (runner) executing command "/bin/sh -c 'service consul reload || true'" from "(dynamic)" => "/etc/consul.d/tls/consul.crt"
2021-09-16T11:25:19.063+1000 [INFO] (child) spawning: /bin/sh -c service consul reload || true
Redirecting to /bin/systemctl reload consul.service
consul.service is not active, cannot reload.
2021-09-16T11:25:19.101+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.102+1000 [INFO] (runner) executing command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.102+1000 [INFO] (child) spawning: /etc/vault-agent/some-script.sh
[stderr from script]
2021-09-16T11:25:19.114+1000 [ERROR] template.server: template server error:
  error=
  | 1 error occurred:
  |         * failed to execute command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem": child: command exited with a non-zero exit status:
  |
  |     /etc/vault-agent/some-script.sh
  |
  | This is assumed to be a failure. Please ensure the command
  | exits with a zero exit status.
  |

2021-09-16T11:25:19.114+1000 [INFO] (runner) stopping
2021-09-16T11:25:19.115+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:19.115+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:19.115+1000 [INFO] (runner) starting
2021-09-16T11:25:19.234+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.234+1000 [INFO] (runner) executing command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.234+1000 [INFO] (child) spawning: /etc/vault-agent/some-script.sh
[stderr from script]
2021-09-16T11:25:19.249+1000 [ERROR] template.server: template server error:
  error=
  | 1 error occurred:
  |         * failed to execute command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem": child: command exited with a non-zero exit status:
  |
  |     /etc/vault-agent/some-script.sh
  |
  | This is assumed to be a failure. Please ensure the command
  | exits with a zero exit status.
  |

2021-09-16T11:25:19.249+1000 [INFO] (runner) stopping
2021-09-16T11:25:19.250+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:19.250+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:19.250+1000 [INFO] (runner) starting
2021-09-16T11:25:19.421+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/tls/consul.crt"
2021-09-16T11:25:19.422+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/tls/consul.key"
2021-09-16T11:25:19.423+1000 [INFO] (runner) executing command "/bin/sh -c 'service consul reload || true'" from "(dynamic)" => "/etc/consul.d/tls/consul.crt"
2021-09-16T11:25:19.423+1000 [INFO] (child) spawning: /bin/sh -c service consul reload || true
Redirecting to /bin/systemctl reload consul.service
consul.service is not active, cannot reload.
2021-09-16T11:25:19.459+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.460+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/clientkeystore.pem"
2021-09-16T11:25:19.460+1000 [INFO] (runner) executing command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.460+1000 [INFO] (child) spawning: /etc/vault-agent/some-script.sh
[stderr from script]
2021-09-16T11:25:19.471+1000 [ERROR] template.server: template server error:
  error=
  | 1 error occurred:
  |         * failed to execute command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem": child: command exited with a non-zero exit status:
  |
  |     /etc/vault-agent/some-script.sh
  |
  | This is assumed to be a failure. Please ensure the command
  | exits with a zero exit status.
  |

2021-09-16T11:25:19.471+1000 [INFO] (runner) stopping
2021-09-16T11:25:19.471+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:19.471+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:19.471+1000 [INFO] (runner) starting
2021-09-16T11:25:19.617+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.618+1000 [INFO] (runner) executing command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.618+1000 [INFO] (child) spawning: /etc/vault-agent/some-script.sh
[stderr from script]
2021-09-16T11:25:19.630+1000 [ERROR] template.server: template server error:
  error=
  | 1 error occurred:
  |         * failed to execute command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem": child: command exited with a non-zero exit status:
  |
  |     /etc/vault-agent/some-script.sh
  |
  | This is assumed to be a failure. Please ensure the command
  | exits with a zero exit status.
  |

2021-09-16T11:25:19.630+1000 [INFO] (runner) stopping
2021-09-16T11:25:19.631+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:19.631+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:19.631+1000 [INFO] (runner) starting
2021-09-16T11:25:19.750+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/clientkeystore.pem"
2021-09-16T11:25:19.750+1000 [INFO] (runner) executing command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/clientkeystore.pem"
2021-09-16T11:25:19.750+1000 [INFO] (child) spawning: /etc/vault-agent/some-script.sh
[stderr from script]
2021-09-16T11:25:19.761+1000 [ERROR] template.server: template server error:
  error=
  | 1 error occurred:
  |         * failed to execute command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/clientkeystore.pem": child: command exited with a non-zero exit status:
  |
  |     /etc/vault-agent/some-script.sh
  |
  | This is assumed to be a failure. Please ensure the command
  | exits with a zero exit status.
  |

2021-09-16T11:25:19.761+1000 [INFO] (runner) stopping
2021-09-16T11:25:19.761+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:19.762+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:19.762+1000 [INFO] (runner) starting
2021-09-16T11:25:19.920+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/tls/consul.crt"
2021-09-16T11:25:19.921+1000 [INFO] (runner) rendered "(dynamic)" => "/etc/consul.d/tls/consul.key"
2021-09-16T11:25:19.922+1000 [INFO] (runner) executing command "/bin/sh -c 'service consul reload || true'" from "(dynamic)" => "/etc/consul.d/tls/consul.crt"
2021-09-16T11:25:19.922+1000 [INFO] (child) spawning: /bin/sh -c service consul reload || true
Redirecting to /bin/systemctl reload consul.service
consul.service is not active, cannot reload.
2021-09-16T11:25:19.958+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.958+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/clientkeystore.pem"
2021-09-16T11:25:19.958+1000 [INFO] (runner) executing command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:19.958+1000 [INFO] (child) spawning: /etc/vault-agent/some-script.sh
[stderr from script]
2021-09-16T11:25:19.972+1000 [ERROR] template.server: template server error:
  error=
  | 1 error occurred:
  |         * failed to execute command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem": child: command exited with a non-zero exit status:
  |
  |     /etc/vault-agent/some-script.sh
  |
  | This is assumed to be a failure. Please ensure the command
  | exits with a zero exit status.
  |

2021-09-16T11:25:19.972+1000 [INFO] (runner) stopping
2021-09-16T11:25:19.972+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:19.972+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:19.972+1000 [INFO] (runner) starting
2021-09-16T11:25:20.089+1000 [INFO] (runner) rendered "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:20.089+1000 [INFO] (runner) executing command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem"
2021-09-16T11:25:20.089+1000 [INFO] (child) spawning: /etc/vault-agent/some-script.sh
[stderr from script]
2021-09-16T11:25:20.106+1000 [ERROR] template.server: template server error:
  error=
  | 1 error occurred:
  |         * failed to execute command "/etc/vault-agent/some-script.sh" from "(dynamic)" => "/secrets/keystore.pem": child: command exited with a non-zero exit status:
  |
  |     /etc/vault-agent/some-script.sh
  |
  | This is assumed to be a failure. Please ensure the command
  | exits with a zero exit status.
  |

2021-09-16T11:25:20.106+1000 [INFO] (runner) stopping
2021-09-16T11:25:20.106+1000 [INFO] (runner) creating new runner (dry: false, once: false)
2021-09-16T11:25:20.106+1000 [INFO] (runner) creating watcher
2021-09-16T11:25:20.106+1000 [INFO] (runner) starting
@pmmukh pmmukh added bug Used to indicate a potential bug core Issues and Pull-Requests specific to Vault Core performance agent labels Sep 16, 2021
@pmmukh
Copy link
Contributor

pmmukh commented Dec 21, 2021

Thanks for submitting the issue, @nvx ! Just wanted to swing by and drop an update that I've been able to reproduce this issue, while looking at a separate one, and can confirm the retries in a tight loop is indeed the case when some command from the template file fails.

@pmmukh pmmukh added ecosystem and removed core Issues and Pull-Requests specific to Vault Core labels Jan 4, 2022
@calvn
Copy link
Contributor

calvn commented Jan 6, 2022

@nvx What happens if you also set template.error_on_missing_key to true?

@nvx
Copy link
Contributor Author

nvx commented Jan 13, 2022

@nvx What happens if you also set template.error_on_missing_key to true?

Looking at the configuration, this option was already set.

@VioletHynes
Copy link
Contributor

VioletHynes commented Sep 1, 2022

Hi there! I believe this will be fixed by #16970 - if you have caching in the Vault Agent config, currently, Agent will ignore the retry configuration and perform immediate infinite retries in a loop with no back-off. As a workaround, an empty template_config stanza should work, by setting retry to the default config (12), though obviously that's not perfect as you might not want 12.

The fix should be coming in 1.12.

@VioletHynes
Copy link
Contributor

Hi there! I'm going to close this issue as I just merged #16970 which should fix this. This should release in 1.12.

Thanks for the bug report!

@celesteking
Copy link

celesteking commented Feb 6, 2024

It isn't fixed. If there's an error, it spins like crazy.

2024-02-06T13:31:15.365Z [DEBUG] agent: (runner) checking template ea719e25596e1dd1659265c8ec95f87b
2024-02-06T13:31:15.365Z [ERROR] agent.template.server: template server error: error="(dynamic): parse: template: :1: malformed character constant: 'pki/issue/xx'"
2024-02-06T13:31:15.365Z [INFO] agent: (runner) stopping

Then at 2024-02-06T13:31:15.366Z , .367Z, .368Z, etc...each millisecond. crazy.

@VioletHynes
Copy link
Contributor

VioletHynes commented Feb 6, 2024

Hey there @celesteking ! I apologize if we conflated two issues. The PR linked above definitely fixed an issue related to retries with the cache, but it's possible that we mixed up some issues that seemed related.

To make sure I understand the issue properly, the issue here is that when the Agent template server crashes, the server restarts and immediately retries, and since it crashed as opposed to errored, it doesn't know to retry. Does that sound right? The other issues linked to this one were related to it erroring (but the server not crashing), and it looks like that got confused.

There's kind of a sub issue here, in that the template server probably shouldn't crash in some of these cases.

Do you have a configuration you can share that reproduces the issue, or any more details you can share?

@VioletHynes VioletHynes reopened this Feb 6, 2024
@celesteking
Copy link

Pretty easy to reproduce with config template { contents = "{{ ' }}". I'm able to calm it down with template_config { exit_on_retry_failure = true } but then it just exits altogether. Is there a retry backoff timer?

@VioletHynes
Copy link
Contributor

Thank you! I appreciate the reproduction too. This is definitely a bug and you're right that it wasn't addressed by the previous PR.

I'll raise a ticket in our internal bug tracker for this and try and get someone's eyes on it soon. There are retry backoff timers for failures, but not for anything that crashes the template server entirely.

I can't promise anything with regards to timelines but I'm eager to get a fix in soon. I'll link the GitHub issue so you'll know when we're done.

@celesteking
Copy link

That ain't a problem, no rush. I'm still learning about this thing.

@VioletHynes
Copy link
Contributor

In either case, I appreciate you letting us know this is still a problem. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agent bug Used to indicate a potential bug performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants