ACL token rotation causes check status inconsistency #4372
Comments
Let me rephrase this: Restarting the consul client updates and synchronises health at the cluster level to reflect the real values. |
Thanks for the report @danaps and @burdandrei. Are you positive this is a change in behaviour with 1.2.0? Do you have any way to run 1.0.7 on your staging cluster for a day or so to confirm the behaviour goes away? I ask not to dismiss it but to try to establish if it's a new issue - as far as I know nothing specifically changed in the code path related to agent's updating statuses with the servers. The one major change is the new UI became default and may present the data in a way that makes something that always happened more noticeable (wild speculation but an example of why we should clear up the link between version and bug). Can you also provide some additional information to help us debug:
Thanks! |
For reference for folks who aren't aware, the local agent state and global cluster state are kept in sync by a process we call "anti entropy" syncing. You can read more here: https://www.consul.io/docs/internals/anti-entropy.html#anti-entropy-1 Check status is in general only eventually consistent which is why I asked for details on whether it fixes itself although I assume not as it wasn't mentioned. The issue is likely to be related to the AE sync mechanism failing which might be a bug but also might be caused by network issues, ACL token issues and several other factors some noted in docs. How long it takes to sync can be impacted by cluster size, number of services and checks running on local agent, available network connectivity etc. |
Thanks for the answer @banks
Our production is really not happy now =( |
and why 1.0.7? we had 1.1.0 running for a pretty long time in prod. |
For logs, the most useful ones would be on the agent where the checks are failing - I suspect the issue is that it's not able to sync its state with the servers, and it should log why.
All checks start in the critical state when registered and stay there until they succeed for the first time, so it's possible the initial registration is working but subsequent updates are not
Interesting, I ask that because anti-entropy requires that the token the application registers with remains valid otherwise it breaks anti-entropy. It might not be the cause but it would def show in logs if it is. We don't do a great job of documenting this or how to work with rotating tokens currently - it's on the near-term list to rework. If you can post client agent (with "failing" checks) logs for the period in which they appeared as failing that would be great. Also, do the checks ever show as passing in the Catalog? i.e. do they show as critical the entire life once registered or are they passing for some time and then flip to critical and never recover? |
I just ran over the cluster and updated all the cluster to 1.2.0 =) |
Just checked, and I saw that the machines with an agent/server discrepancy were fixed after a couple of minutes. |
Interesting. Keep us posted. Thanks for the compliment @burdandrei ❤️Always nice to get good feedback 😄. |
Hi again! =) After running some
Interesting moment: normally in our cluster Nomad is the one that registers/deregisters the services in Consul WITHOUT a token (using the anonymous token). These services and checks were registered with resec - a Redis sidecar that runs HA Redis based on Consul session locks. Because it has to acquire a lock and write to KV, it runs with the following ACL:
The token that it receives is temporary and generated by Vault. So maybe there is a correlation with what you said about anti-entropy. |
UPD: after updating ALL the instances in the cluster to 1.2.0 and then draining and terminating all the instances that had not synced services checks I can say that for the last 3 hours everything looks OK. |
Thanks @burdandrei. Sounds like my ACL hunch was right. This is a known issue (at least I hypothesised it existed several months ago, first time we've seen it bite in the wild) and it has been this way since forever - the 1.2.0 upgrade must have been bad timing. The issue is that the Vault backend expires tokens (i.e. deletes them from Consul) but the agent that you registered with doesn't know anything about that currently and so it just stops being able to perform anti-entropy. The only thing it can do here is log that error. The "fix" if you want to use Vault correctly (which is not documented anywhere and one of the reasons we are going to be doing a lot of work on ACLs and Vault integration soon) is that whenever Vault expires your ACL token, your app (or whatever registered your app in the catalog) needs to re-register itself with the new token. @danaps I'm going to rename this issue to reflect the real problem @burdandrei has since there has now been a lot of context about his case. Can you confirm whether ACLs being deleted is also the cause of the issue for you? If not, do you mind opening a separate issue with the same info (including why you think it's not ACLs and the logs from the agent that is "failing") so we can keep track of what you are seeing please? |
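To make the failure mode described above concrete, here is a toy model of it. It is only a sketch: `ToyServer`, `ToyAgent` and the token names are invented for illustration and are not Consul's implementation. It assumes, as the comments in this thread describe, that the agent remembers the token used at registration and reuses it for every anti-entropy update.

```python
class AclNotFound(Exception):
    """Raised by the toy server when a token no longer exists."""


class ToyServer:
    def __init__(self):
        self.valid_tokens = set()
        self.catalog = {}  # check_id -> status as seen cluster-wide

    def update_check(self, token, check_id, status):
        if token not in self.valid_tokens:
            raise AclNotFound("ACL not found")
        self.catalog[check_id] = status


class ToyAgent:
    def __init__(self, server):
        self.server = server
        self.local = {}  # check_id -> (status, token used at registration)

    def register_check(self, check_id, token):
        # New checks start as critical; the registering token is remembered.
        self.local[check_id] = ("critical", token)

    def set_status(self, check_id, status):
        _, token = self.local[check_id]
        self.local[check_id] = (status, token)

    def sync(self):
        """One anti-entropy pass; returns the errors it could only log."""
        errors = []
        for check_id, (status, token) in self.local.items():
            try:
                self.server.update_check(token, check_id, status)
            except AclNotFound as exc:
                errors.append(f"{check_id}: {exc}")
        return errors


def run_scenario():
    server = ToyServer()
    agent = ToyAgent(server)

    server.valid_tokens.add("token-1")
    agent.register_check("redis", "token-1")
    agent.sync()                              # catalog now shows "critical"

    server.valid_tokens.discard("token-1")    # lease expires, token is deleted
    agent.set_status("redis", "passing")
    errors = agent.sync()                     # update rejected, only logged
    stale = server.catalog["redis"]           # catalog still holds the old status

    server.valid_tokens.add("token-2")
    agent.register_check("redis", "token-2")  # re-register with the new token
    agent.set_status("redis", "passing")
    agent.sync()
    return errors, stale, server.catalog["redis"]
```

In this model the catalog stays stuck on the last status that was synced while the token was valid, and only a fresh registration with a valid token lets updates flow again.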
Sorry for necroposting @banks, but should this still be an issue in |
@burdandrei the underlying issue here is the same in 1.4. There's no real way to fix it at a fundamental level - if we do something magic and auto-rotate tokens or something so that it never fails then it effectively defeats the security benefit you get from rotating, since an attacker could use the same mechanism to prolong the life of their stolen token too. So we prefer to keep it explicit. One thing we've not done yet is fix the docs to make this more clear. There is work going on due to be released in the next month or two that has the potential to help with this issue, but ultimately it still needs the application (or some other tooling) to actively manage the rotation, which means re-registering with the new token when it rotates. In some common cases we hope to eventually provide tooling that can be used to do that so you don't have to solve the issue by hand, but it will necessarily be one opinionated way to run things and won't fit everyone's workflow. Better docs on how to do this yourself will come as part of that though. The short answer is that 1.4.0 ACLs don't really change this: you still need to have something re-register the application instance with its local agent with a new token before the old one is revoked. |
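As a rough illustration of "re-register the application instance with its local agent with a new token", the sketch below builds a registration request against the agent HTTP API (`PUT /v1/agent/service/register` with the `X-Consul-Token` header). The agent address, the service payload and the helper names `build_register_request` and `reregister` are assumptions made for the example, not anything prescribed in this thread.

```python
import json
import urllib.request


def build_register_request(service, token, agent_addr="http://127.0.0.1:8500"):
    """Build (but do not send) a PUT to the agent's service registration endpoint."""
    return urllib.request.Request(
        url=f"{agent_addr}/v1/agent/service/register",
        data=json.dumps(service).encode("utf-8"),
        method="PUT",
        headers={"X-Consul-Token": token, "Content-Type": "application/json"},
    )


def reregister(service, new_token, agent_addr="http://127.0.0.1:8500"):
    """Send the registration again so it is associated with the new token."""
    req = build_register_request(service, new_token, agent_addr)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

Something like `reregister({"ID": "redis-1", "Name": "redis", "Port": 6379}, new_token)` would need to run each time a new token is issued, and before the old one is revoked.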
@banks Thanks for the answer. Is this something you and the Nomad team are working on improving as well? Based on what @burdandrei told me, Nomad does not handle this - either at all, or not nicely / seamlessly |
Thanks for answer @banks, we're just thinking of the right way to distribute consul tokens to nomad clients. |
Nomad currently only registers services at the start of a task and removes them
during termination. That means if the token rotates in the interim, Nomad
won't re-register for you currently leaving the Consul agent unable to make
further changes to the registration.
The good news is that we are working with the Nomad team to improve the
overall integration and the registration and token UX will be part of that.
…On Mon, Feb 18, 2019 at 11:28 AM Andrei Burd ***@***.***> wrote:
Thanks for answer @banks <https://github.com/banks>, we're just thinking
of the right way to distribute consul tokens to nomad clients.
Do I understand right that if we generate the token *per nomad client*
and it's renewed during all the instance lifecycle we're in a good place,
cause the token is scoped to the consul agent where it registers the
services?
|
I'm facing the very same issue after I integrated Vault to generate on the fly Consul ACL tokens. In our case, we update the
Restarting the consul agent seems to solve the issue until our out-of-band process generates new Consul ACL tokens via Vault. Update: Restarting the consul agent is not sufficient in this case... we have to restart the nomad client as well. |
I've done some tests to validate the Nomad behaviour once a new ACL token is injected to the Consul agent and the previous one is revoked and I noticed the following:
So, it seems that Nomad only tries to refresh its "ACL cache" after the first failure and is able to get things sorted after that. Does anyone else see the same behaviour in your environments? @banks would that be a possible nomad behaviour in this scenario? UPDATE: That seems to apply to each Nomad job / Service though. |
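The behaviour observed above (fail once with *ACL not found*, then recover) is what a refresh-on-failure pattern would produce. The sketch below shows that pattern in isolation; it is not Nomad's code, and `RefreshingClient`, `AclNotFoundError` and the `fetch_token` callable are hypothetical names.

```python
class AclNotFoundError(Exception):
    """Stands in for the 'ACL not found' error returned by Consul."""


class RefreshingClient:
    def __init__(self, fetch_token):
        # fetch_token: hypothetical callable returning the current token,
        # e.g. by re-reading a file that an out-of-band process rewrites.
        self._fetch_token = fetch_token
        self._token = fetch_token()

    def call(self, operation):
        """Run operation(token); on 'ACL not found', refresh once and retry."""
        try:
            return operation(self._token)
        except AclNotFoundError:
            self._token = self._fetch_token()
            return operation(self._token)
```

With this shape, the first call after a rotation fails internally, the cached token is replaced, and later calls succeed until the next rotation.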
@banks <https://github.com/banks> would that be a possible nomad
behaviour in this scenario?
Did you mean a possible Consul behaviour? I'll assume so as I think what
you describe is accurate for Nomad to best of my knowledge.
It's more subtle than that in Consul sadly.
In Nomad case, Nomad itself is the "application" that is authenticated to
Vault and can fetch and renew ACL tokens for Consul actively or passively
on failures as you described.
But Consul can't just renew tokens magically with Vault since it is not
directly authenticated to Vault - the application is and is pushing in the
token.
The only correct way to handle this currently is that the *application* (or
tooling around it like Nomad/another scheduler/helper daemon etc) needs to
actively manage the token in vault and renew it and then re-register with
Consul.
As I mentioned we have plans to expand the new Auth method system in 1.5.0
to provide a full alternative workflow that works differently to sidestep
this problem and allows rotation etc to make this much easier over the next
few months but I don't think there are any quick fixes to the existing ACL
flow that will make this just work - it needs to be driven by the
application externally to Consul that is interacting with Vault in the
current model.
For example one *could* write some external tooling - perhaps even using
consul-template - that manages the token via vault actively on behalf of
your app and then pushes the updated registration with the new token on
each change. That could be deployed as a "side car" alongside app instances
and make this all work today. The eventual solution will probably look a
lot like this anyway (possibly built into consul binary though not separate
tooling) although will simultaneously solve the issue of managing tokens
and policies more simply even without Vault.
Does that help?
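A minimal sketch of the ordering such a side car would need to respect: obtain the new token, re-register with it, and only then let go of the old one. The three callables are hypothetical stand-ins for whatever talks to Vault and to the local Consul agent; nothing here is an existing tool.

```python
def rotate_once(fetch_new_token, register, revoke, old_token):
    """Re-register with a fresh token, then revoke the old one.

    fetch_new_token() -> str   obtains a new token (e.g. from Vault)
    register(token)            pushes the registration using that token
    revoke(token)              gives up the old token
    """
    new_token = fetch_new_token()
    register(new_token)        # if this raises, the old token is left untouched
    if old_token is not None:
        revoke(old_token)
    return new_token
```

Running this on a schedule shorter than the token TTL keeps the registration tied to a live token. If registration fails, the old token is not revoked, so the existing registration keeps working until that token expires on its own.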
…On Mon, May 13, 2019 at 7:18 PM Daniel Santos ***@***.***> wrote:
I've done some tests to validate the Nomad behaviour once a new ACL token
is injected to the Consul agent and the previous one is revoked and I
noticed the following:
- After the ACL tokens are recycled, the first call for updates by
Nomad fails with the *ACL not found* errors.
- The only side-effects remaining after are errors for [ERR] consul:
"Catalog.Deregister" RPC failed to server 1.2.3.4:8300: rpc error
making call: rpc error making call: Unknown check
'82501dafc1680911b77851756cc2e6aab4fa084c'
- The *ACL not found* errors will only happen when Vault recycles the
Consul ACL token again.
So, it seems that Nomad only tries to refresh its "ACL cache" after the
first failure and is able to get things sorted after that.
Does anyone else see the same behaviour in your environments? @banks
<https://github.com/banks> would that be a possible nomad behaviour in
this scenario?
|
Sure does... looking forward to the enhancements to come. Right now we have a single systemd job to handle these issues in a scheduled way, so we should be covered for now. Thanks for the detailed feedback. |
@banks any news about the documentation of proper usage? We have several clusters, which are experiencing this issue occasionally. Our setup consists of the cluster with vault, consul, and nomad servers. Nomad client nodes are accompanied by vault agent and consul agent. Vault agent templates nomad and consul configuration with ACL tokens. Vault agent actively renews tokens, which seems to work OK. Occasionally it stops syncing service instances to the consul catalog. Nomad client restart seems to fix this, however, I would like to be able to prevent such an issue. EDIT: added component versions |
Hi @elcomtik, Nomad engineer here 👋 Just to clarify: are you saying that Vault agent re-templates Nomad's agent configuration file with a new Consul Token, SIGHUP's Nomad, and sometimes Nomad fails to use the new token when registering Consul services? If so this sounds like a bug with Nomad's SIGHUP handling. Please file an issue over at: https://github.com/hashicorp/nomad/issues/new?assignees=&labels=type%2Fbug&template=bug_report.md |
@elcomtik correct me if I'm wrong. |
Yes, the vault agent is providing an ACL token for the nomad agent. I suppose that it goes wrong at max TTL or failed renewal. Then it needs to be rotated by a vault agent and reloaded by a nomad agent. Vault agent templates it and does restart by
Systemd service is inspired by https://learn.hashicorp.com/tutorials/nomad/production-deployment-guide-vm-with-consul with a little customization. So currently I'm sending SIGINT and restarting, @schmichael mentioned it should be SIGHUPed, you @burdandrei are recommending USR1 or USR2 signals. Are they equivalent, or is one of them better? |
@elcomtik USR1 or USR2 are not implemented in nomad right now. Here's how I see the flow for this:
|
OK, should I try changing systemctl restart to systemctl reload? I might test it and give you feedback when it's done. |
@burdandrei few questions emerged:
|
@burdandrei I'm just testing this approach with the SIGHUP signal. Unfortunately, I ran into this bug hashicorp/nomad#3885. I'm reloading it from the vault-agent service, so I wanted to patch this by defining the systemd service order. However, this is useless because the systemd unit for nomad uses Type=simple, which doesn't wait for service startup. I tried to use the types forking and notify; however, nomad doesn't support them either. There is an issue that is aiming for a similar feature, hashicorp/nomad#4466, but it seems to have been stale for a long time. I will patch this with some sleep before starting vault-agent; however, I will raise another issue in the nomad repo to address this. |
For anyone interested, I resolved this issue with Nomad not supporting systemd service types forking or notify by using configuration Restart=always instead of Restart=on-failure. Not clean, but it works for me. |
hi, any updates ? |
#16097 was included in Consul 1.15, which is released. That change helps with failed deregistrations of checks and services, which now use the However, that change doesn't help with token expirations causing failed registrations and updates of checks and services. The agent remembers the token used to register a check (or service). After that token expires, subsequent anti-entropy updates to the Consul servers for that check (or service) will fail due to the expired token. (The local Consul agent will know the state of the check, but that won't be reflected in the Consul catalog.) |
Edit(banks)
Note that the thread concludes a root cause that may not apply to this original issue related to ACLs. See #4372 (comment) for a summary of the issue this represents now
Hi folks,
We run Consul 1.2.0 and encountered the following issue:
In the Consul UI, when looking at a node's health check, we see it is failing.
`curl http://localhost:8500/v1/health/node/ip-10-1-7-170
{
"Node": "ip-10-1-7-170",
"CheckID": "c49af2eedf67fdbf7135045051aad72c5f1d8a4c",
"Name": "Nomad Client HTTP Check",
"Status": "critical",
"Notes": "",
"Output": "",
"ServiceID": "_nomad-client-vsvkabaz4izafp6d5ejwyhrurtl2pdeq",
"ServiceName": "nomad-client",
"ServiceTags": [
"http"
],
"Definition": {},
"CreateIndex": 170787117,
"ModifyIndex": 170787117
},
`
After checking the underlying API call /v1/agent/checks on the node, the check is actually passing.
` curl http://localhost:8500/v1/agent/checks
"c49af2eedf67fdbf7135045051aad72c5f1d8a4c": {
"Node": "ip-10-1-7-170",
"CheckID": "c49af2eedf67fdbf7135045051aad72c5f1d8a4c",
"Name": "Nomad Client HTTP Check",
"Status": "passing",
"Notes": "",
"Output": "HTTP GET http://0.0.0.0:4646/v1/agent/health?type=client: 200 OK Output: {"client":{"message":"ok","ok":true}}",
"ServiceID": "_nomad-client-vsvkabaz4izafp6d5ejwyhrurtl2pdeq",
"ServiceName": "nomad-client",
"ServiceTags": [
"http"
],
"Definition": {},
"CreateIndex": 0,
"ModifyIndex": 0
},
`
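The comparison done by hand above (agent view versus catalog view) can be scripted. The sketch below assumes the response shapes shown in the two outputs: `/v1/agent/checks` returns an object keyed by check ID and `/v1/health/node/<node>` returns a list of checks. `find_drift` and `fetch_json` are hypothetical helper names, and no ACL token is sent.

```python
import json
import urllib.request


def find_drift(agent_checks, catalog_checks):
    """Return {check_id: (agent_status, catalog_status)} for mismatched checks.

    agent_checks: dict keyed by CheckID, as returned by /v1/agent/checks.
    catalog_checks: list of check objects, as returned by /v1/health/node/<node>.
    """
    catalog_status = {c["CheckID"]: c["Status"] for c in catalog_checks}
    drift = {}
    for check_id, check in agent_checks.items():
        remote = catalog_status.get(check_id)  # None if the catalog lacks the check
        if remote != check["Status"]:
            drift[check_id] = (check["Status"], remote)
    return drift


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

On the node from this report, one would call `find_drift(fetch_json("http://localhost:8500/v1/agent/checks"), fetch_json("http://localhost:8500/v1/health/node/ip-10-1-7-170"))`. A check that stays in the result across several sync intervals suggests anti-entropy is failing rather than just lagging.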