You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We have recently come across an issue with Consul on Kubernetes where Consul caches negative responses (e.g. ACL not found) and that causes apps on Kubernetes take too long to start because it needs to wait for the negative cache entry to expire.
This happens because on k8s we do the following before application starts:
Call consul login with the Kubernetes auth method to get a token
Call v1/agent/services?filter=... to list services for this pod
Because call 1 and 2 happen immediately after one another, it's possible that a call to get services hits a follower that doesn't have the token we've created during call 1 replicated yet. As a result call 2 results in an "ACL not found" error which the client then caches for the acl_token_ttl duration which defaults to 30seconds. Please also see hashicorp/consul-k8s#862 for the description of the original issue.
We have now found a workaround where we do another call between calls 1 and 2 to stale read the token (hashicorp/consul-k8s#887) and this workaround covers the majority of cases, however the question still remains what would be the ideal solution in Consul.
Reproduction Steps
This could probably be reproduced outside of k8s, but the most obvious way to reproduce is using a Kubernetes cluster like GKE (kind will not work out-of-the box for this since we need to run a cluster with more than 1 server)
Create a Kubernetes cluster (e.g. on google cloud) with 3 nodes
Deploy consul helm chart with connect-inject and ACLs enabled helm install test --set connectInject.enabled=true --set global.acls.manageSystemACLs=true hashicorp/consul (Please see these docs on how to add the hashicorp helm repo. Note you need consul Helm chart version 0.37.0 or lower (i.e. a version that doesn't have the workaround I mentioned above).
Deploy a test service with, e.g static-server service from these docs
Look at the logs of the init container:
$ kubectl get logs -f deploy/static-server -c consul-connect-inject-init
2021-12-01T19:29:00.464Z [INFO] Consul login complete
2021-12-01T19:29:00.467Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:01.469Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:02.471Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:03.472Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:04.473Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:05.474Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:06.476Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:07.478Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:08.479Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:09.480Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:10.481Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:11.526Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:12.528Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:13.530Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:14.532Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:15.534Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:16.536Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:17.537Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:18.538Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:19.629Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:20.630Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:21.631Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:22.634Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:23.635Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:24.637Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:25.639Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:26.640Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:27.641Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:28.644Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:29.646Z [ERROR] Unable to get Agent services: error="Unexpected response code: 403 (ACL not found)"
2021-12-01T19:29:30.651Z [INFO] Registered service has been detected: service=static-server
2021-12-01T19:29:30.651Z [INFO] Registered service has been detected: service=static-server-sidecar-proxy
2021-12-01T19:29:30.651Z [INFO] Connect initialization completed
You can see in the logs above that the ACL not found error persists for ~30seconds.
You might need to do this last step a few times to see the errors as it won't happen every time (e.g. if the call to resolve token when getting services goes to a leader instead of a follower).
The text was updated successfully, but these errors were encountered:
Overview of the Issue
We have recently come across an issue with Consul on Kubernetes where Consul caches negative responses (e.g. ACL not found) and that causes apps on Kubernetes take too long to start because it needs to wait for the negative cache entry to expire.
This happens because on k8s we do the following before application starts:
consul login
with the Kubernetes auth method to get a tokenv1/agent/services?filter=...
to list services for this podBecause call 1 and 2 happen immediately after one another, it's possible that a call to get services hits a follower that doesn't have the token we've created during call 1 replicated yet. As a result call 2 results in an "ACL not found" error which the client then caches for the
acl_token_ttl
duration which defaults to 30seconds. Please also see hashicorp/consul-k8s#862 for the description of the original issue.We have now found a workaround where we do another call between calls 1 and 2 to stale read the token (hashicorp/consul-k8s#887) and this workaround covers the majority of cases, however the question still remains what would be the ideal solution in Consul.
Reproduction Steps
This could probably be reproduced outside of k8s, but the most obvious way to reproduce is using a Kubernetes cluster like GKE (kind will not work out-of-the box for this since we need to run a cluster with more than 1 server)
helm install test --set connectInject.enabled=true --set global.acls.manageSystemACLs=true hashicorp/consul
(Please see these docs on how to add the hashicorp helm repo. Note you need consul Helm chart version 0.37.0 or lower (i.e. a version that doesn't have the workaround I mentioned above).You can see in the logs above that the ACL not found error persists for ~30seconds.
You might need to do this last step a few times to see the errors as it won't happen every time (e.g. if the call to resolve token when getting services goes to a leader instead of a follower).
The text was updated successfully, but these errors were encountered: