Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching of ACL error responses causes longer start times of applications on Kubernetes #11704

Open
ishustava opened this issue Dec 1, 2021 · 0 comments
Labels
theme/acls ACL and token generation type/bug Feature does not function as expected

Comments

@ishustava
Copy link
Contributor

Overview of the Issue

We have recently come across an issue with Consul on Kubernetes where Consul caches negative responses (e.g. ACL not found) and that causes apps on Kubernetes take too long to start because it needs to wait for the negative cache entry to expire.

This happens because on k8s we do the following before application starts:

  1. Call consul login with the Kubernetes auth method to get a token
  2. Call v1/agent/services?filter=... to list services for this pod

Because call 1 and 2 happen immediately after one another, it's possible that a call to get services hits a follower that doesn't have the token we've created during call 1 replicated yet. As a result call 2 results in an "ACL not found" error which the client then caches for the acl_token_ttl duration which defaults to 30seconds. Please also see hashicorp/consul-k8s#862 for the description of the original issue.

We have now found a workaround where we do another call between calls 1 and 2 to stale read the token (hashicorp/consul-k8s#887) and this workaround covers the majority of cases, however the question still remains what would be the ideal solution in Consul.

Reproduction Steps

This could probably be reproduced outside of k8s, but the most obvious way to reproduce is using a Kubernetes cluster like GKE (kind will not work out-of-the box for this since we need to run a cluster with more than 1 server)

  1. Create a Kubernetes cluster (e.g. on google cloud) with 3 nodes
  2. Deploy consul helm chart with connect-inject and ACLs enabled helm install test --set connectInject.enabled=true --set global.acls.manageSystemACLs=true hashicorp/consul (Please see these docs on how to add the hashicorp helm repo. Note you need consul Helm chart version 0.37.0 or lower (i.e. a version that doesn't have the workaround I mentioned above).
  3. Deploy a test service with, e.g static-server service from these docs
  4. Look at the logs of the init container:
$ kubectl get logs -f deploy/static-server -c consul-connect-inject-init
2021-12-01T19:29:00.464Z [INFO]  Consul login complete
2021-12-01T19:29:00.467Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:01.469Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:02.471Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:03.472Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:04.473Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:05.474Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:06.476Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:07.478Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:08.479Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:09.480Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:10.481Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:11.526Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:12.528Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:13.530Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:14.532Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:15.534Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:16.536Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:17.537Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:18.538Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:19.629Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:20.630Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:21.631Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:22.634Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:23.635Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:24.637Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:25.639Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:26.640Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:27.641Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:28.644Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:29.646Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:30.651Z [INFO]  Registered service has been detected: service=static-server
2021-12-01T19:29:30.651Z [INFO]  Registered service has been detected: service=static-server-sidecar-proxy
2021-12-01T19:29:30.651Z [INFO]  Connect initialization completed

You can see in the logs above that the ACL not found error persists for ~30seconds.

You might need to do this last step a few times to see the errors as it won't happen every time (e.g. if the call to resolve token when getting services goes to a leader instead of a follower).

@Amier3 Amier3 added theme/acls ACL and token generation type/bug Feature does not function as expected labels Dec 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/acls ACL and token generation type/bug Feature does not function as expected
Projects
None yet
Development

No branches or pull requests

2 participants