connect: leaf cert rotation is not reflected on non blocking api queries #10871
Comments
I'm also seeing this behaviour in the http api on consul 1.9.8 (no envoy, just calling the endpoints) - it seems surprisingly easy to reproduce. What I mean by that is, I'm surprised it's not causing more problems for people. |
I've been looking at this in the background. While I am still new to the code base, I think the issue is mostly isolated to the background caching in the API endpoint. Still investigating how we can make this right. Today, once a leaf cert is loaded up into the cache, IIUC, no non-blocking query can get the callee a:
Blocking queries will give back the expected result:
|
I think this issue and #9862 are the same. |
Thanks for looking at this. I updated the cluster to 1.9.8 (following one of the linked issues that said some changes in that area might have sorted it in a just-released version) back in the summer and the issue may be solved. Since then we've had one case of this error, and there was enough other activity going on for it not to be very obvious what the cause could have been. We are increasing usage of it gradually while we build up confidence that it's definitely gone. |
As expected, making a comment like that was enough to trigger a new occurrence in our uat environment. It's hard to say anything definitive, but looking at the logs it appears that the agent in question had trouble connecting to the cluster. Either it wasn't able to renew the certificate before it expired and the certificate was then stuck in the expired state, or something about failing to connect caused it to stop trying to refresh it. I can see log entries like:
2021-11-18T00:13:53.972Z [WARN] agent.cache: handling error in Cache.Notify: cache-type=connect-ca-root error="rpc error making call: EOF" index=57794105
2021-11-20T00:45:20.713Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=xxx:8300 error="rpc error making call: rpc error getting client: failed to get conn: dial tcp xxx:0->yyy:8300: connect: connection refused"
Making the request with the suffix ?index= forced the agent to get a new certificate and it was then fine. |
Thank you for the details @mr-miles! We believe we have confirmed that #9862 (making a request with index=0) can cause this bug because the cache-entry is never updated. How are you using this endpoint? Are you not using Envoy at all? How frequently would your applications make requests to this endpoint? I believe as long as there is at least one process making requests using blocking queries ( After 72 hours of no requests the cache entry should expire, and the next request would properly generate a leaf cert again (if a new one is required). So maybe during normal operation requests are made less frequently than once every 72 hours, but when you went to confirm the issue again you made some extra requests which kept the cache-entry around? |
Correct - we aren't using envoy at all. We are using the leaf certificate endpoints directly: services request their leaf certificate and can then use that to log in to vault and get cloud or db credentials. Works well for our setup and gives us minimal credentials to manage: the environment is some way off evolving into a full service mesh, hence no envoy. I think what is happening is that we only start requesting a new certificate 6 or 12 hours before the old one expires, so I guess we're keeping the cache entry alive with that request. My question is - if I always make the request with a constant hard-coded index, e.g. index=3, when the index is known to be way higher than that, does that fix the problem everywhere? Or do I need to track and use the right modifyIndex? |
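For readers following along, the client-side pattern described in this comment (only asking for a new certificate a few hours before the current one expires) could look roughly like the sketch below. It is an illustration under assumptions, not the commenter's actual code: `parse_consul_time` and `needs_refresh` are hypothetical helper names, the 6-hour window is taken from the comment, and the `ValidBefore` field is assumed to be an RFC 3339 UTC timestamp with no fractional seconds.

```python
from datetime import datetime, timedelta, timezone


def parse_consul_time(value):
    # Assumption: an RFC 3339 UTC timestamp such as "2021-11-20T00:45:20Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def needs_refresh(cert, now=None, window=timedelta(hours=6)):
    """Return True when the cert expires within `window` or has already expired."""
    now = now or datetime.now(timezone.utc)
    return parse_consul_time(cert["ValidBefore"]) - now <= window
```

A check like this only decides when to ask for a certificate. With the behaviour reported in this issue, the request itself may still need to be a blocking query to avoid getting the cached certificate back.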
hey @danielehc thanks for digging into this and for providing detailed info. We had a chance to fix this with #11693. The fix will be in 1.11 and backported to latest 1.9 and 1.10. @mr-miles -- thanks for your engagement here. Let me try to answer some of your questions:
No, in general index ~ min query index. So if the current index is 54, queries with index=3 will return the leaf cert with a modify index higher than 3. Let's say the one with 54 in this case. Without upgrading to any of the new releases above, you can still issue blocking queries to make sure you get an updated leaf cert. But
yes, you will need to keep track of that. With the fix in, non blocking queries will always revalidate the leaf cert and if it's not good (expired or ca has been rotated), a new one will be generated and returned. I'll leave this issue open for a while to offer folks a chance to follow up. |
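As a rough illustration of the blocking-query workaround and index tracking discussed above, the sketch below builds the leaf endpoint URL with an `index` parameter and returns the index to use on the next call. The agent address, the service name `web`, the 5-minute `wait`, and the helper names `leaf_url` and `fetch_leaf` are assumptions made for the example; the `index` and `wait` query parameters, the `X-Consul-Index` response header, and the `ModifyIndex` field are part of Consul's HTTP API.

```python
import json
import urllib.parse
import urllib.request


def leaf_url(agent_addr, service, index=None, wait="5m"):
    """Build the leaf-cert URL; passing an index makes it a blocking query."""
    path = "/v1/agent/connect/ca/leaf/" + urllib.parse.quote(service, safe="")
    url = agent_addr.rstrip("/") + path
    if index is not None:
        url += "?" + urllib.parse.urlencode({"index": index, "wait": wait})
    return url


def fetch_leaf(agent_addr, service, index=None, opener=urllib.request.urlopen):
    """Fetch the leaf cert and return (cert_dict, index_for_next_request)."""
    with opener(leaf_url(agent_addr, service, index)) as resp:
        cert = json.load(resp)
        header = resp.headers.get("X-Consul-Index")
    # Prefer the response header; fall back to ModifyIndex from the body.
    next_index = int(header) if header else int(cert["ModifyIndex"])
    return cert, next_index
```

A caller would keep the returned index and pass it to the next `fetch_leaf` call, for example `cert, idx = fetch_leaf("http://127.0.0.1:8500", "web", index=idx)`, so that each request blocks until the certificate changes or the wait time elapses.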
Closing for now. Feel free to open as needed! |
Overview of the Issue
Consul datacenter with Connect enabled and Vault used as the Connect CA.
The /agent/connect/ca/leaf/:service endpoint cache never gets invalidated and keeps returning old certificates.
Apart from this behavior, the Consul datacenter seems to behave properly and all other functionality works as expected.
Reproduction Steps
Create a cluster with 2 client nodes and 3 server nodes with Connect enabled
Check service leaf certificate:
Configure Vault as Connect CA
Configuration:
Command:
Logs:
Server:
Client:
Check for certificate (after waiting for the rotation period or any amount of time)
In the example the cache shows an age of 77527 seconds, that is ~21.5 h.
Envoy dashboard shows the new certificates properly
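The cache age quoted above can be converted to hours with a one-line calculation. The helper name `age_hours` is hypothetical, and the sketch assumes the age is reported in whole seconds (for example in an HTTP `Age` response header, which is an assumption since the original command output is not shown here).

```python
def age_hours(age_seconds):
    """Convert a cache age in seconds (string or int) to hours."""
    return int(age_seconds) / 3600


print(round(age_hours("77527"), 1))  # prints 21.5
```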
The scenario used for the test can be reproduced using the script at https://github.com/danielehc/consul-docker
Operating system and Environment details
1.10.1
1.18.3
Expected results
The API endpoint should show the new certificates when they are present.
Workaround
Use Envoy admin UI to check the certificates
Send the API request to a different Consul node: the cache on that node is not used, so the new certificate is shown properly