consul connect is extremely unreliable at startup #9307
Comments
Hi @kneufeld sorry you're having trouble with this. Seeing the |
This is the only thing I can find that looks interesting.
I'd say there is a race condition between groups (I'm actually pretty positive there is) but this error happens with my redis task which is at the "bottom" on which everything else has a proxy connect to. ie: redis is the "server" proxy and does not have any "client" proxy |
Can you take a look at the Nomad agent logs when this happens? There should be a log line with the
|
Found these lines
|
Hi @kneufeld , the |
that's Is there a way to restart nomad on a client without restarting the docker containers? I don't want to restart nomad if that will cause the containers to migrate because that will cause an outage on any machine not running v1. And that's assuming the problem is fixed. Thanks for the quick turn around time and follow up. |
I migrated my clients off of nomad and into their own vms and started with docker-compose. This allowed me to upgrade my west datacenter to v1.0-beta3. My east dc still has clients on 0.12 but the servers over there have been upgraded. First tests looked good but then I started getting the same bootstrap errors. I may have found another piece of the puzzle: evaluations are insta-failing.
Then in
followed by a never ending stream of:
and
but I kinda call BS on Is it possible that my cluster is just corrupt somehow? If so then how can nomad be trusted? My final hope is to migrate the rest of my clients off nomad in the east dc and re-bootstrap the entire cluster. I have zero doubts that will work at first but long term? ps. I've |
@shoenig I got the rest of my jobs off the cluster and rebuilt it. My quick test works, tasks scale etc. I really don't know what to think... |
Hi @kneufeld sorry this has caused so much trouble. The Nomad v0.12 release was centered around major underlying network architecture changes, and we've been squashing followup bugs in the point releases since. Though we hope for a smooth transition, the changes were made before Nomad v1.0 specifically so we could provide the stability guarantees of a 1.0 release going forward. It sounds like the rebuilt cluster is operating okay? I suspect the evaluation failures may have been caused by the problem described in #9356. The fix for that isn't present in the |
@shoenig I'm only running a single job but I never saw any issues until I tried to run many jobs. I do think the cluster was somehow corrupted, which is its own very troubling thought. I'm opening another bug that I saw pre-rebuild and post-rebuild with only a single job running. |
@shoenig I just saw the same insta-blocked deployments/evaluations with only a single job in the cluster. I shut down the leader to force an election and once a new leader was chosen I could successfully deploy a job. This was after stopping all servers and clients, deleting |
I'm getting the above behavior when attempting to use Connect on an existing cluster (Consul 1.9.1, Nomad 1.0.2, ACLs and TLS enabled). No errors in log about bootstrap error for Connect though, so maybe it's a similar manifestation of something else. FWIW I can't get Connect to work at all (using the Counter Dashboard example) even when making sure node is warm before submitting a single job. Nothing stands out in logs for consul or nomad, or the envoy containers, apart from |
I've seen this error in particular alongside the generated Nomad ACL tokens. Namely, what seems to happen is that as part of creating a new allocation, Nomad spins-up a service Consul token for the sidecar, and then immediately tries to use it. This token hasn't propagated quite yet into Consul, so I see in the Nomad logs some 403 ACL token not found errors, then eventually when the alloc keeps retrying, as it keeps using the same service token, it eventually becomes valid and the allocation succeeds. So, it seems like the sidecar needs to wait a little bit more or have some retries built-in at an earlier phase rather than failing the whole allocation immediately and relying on the general allocation restart behaviour to sort it out. |
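The "retries built-in at an earlier phase" idea from the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Nomad's actual code: `ACLNotFound`, `retry_with_backoff`, and the `step` callable are hypothetical names standing in for whatever call first uses the freshly created token.

```python
import time


class ACLNotFound(Exception):
    """Hypothetical error standing in for a 403 'ACL token not found' response."""


def retry_with_backoff(step, attempts=5, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call `step` until it succeeds, retrying only on ACLNotFound.

    The delay doubles after each failed attempt, capped at `max_delay`.
    The last failure is re-raised so the caller can still fail the allocation.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except ACLNotFound:
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The point of retrying at this level is that a short propagation delay costs a few hundred milliseconds rather than a full allocation restart. Note that it only helps if a failed attempt does not leave anything behind that makes later attempts fail as well.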
Sounds different from what I am experiencing here (though I think I've absolutely seen what you mention as well before). All checks passing, services considered healthy, no errors in either consul or nomad |
We're getting the same error, without using ACL tokens, connect fails sometimes causing the tasks to fail.
|
Hello, small follow-up: in our case, the cause was the service name, which contained UPPERCASE characters. After converting them to lowercase, the problem was solved. |
Got hit by this again now when updating an existing job (the only thing changed: no-op change inside a generated template) Stopping the job and starting it again:
Tried the following, does not resolve it:
Same sidecar proxy (there are 4 in total on the group) keeps failing with the same errors. Consul logs around the same time:
|
Stopped all other jobs related to the failing sidecar container and performed a rolling restart of all consul servers and nomad servers and rescheduled - still same. Observation: When draining the node, the 3 other sidecar tasks for the group with the failing one are stuck in running despite the main tasks being successfully killed. First time I see a drain taking more than a few seconds. Only doing a manual stop successfully stops the job fully. In the "zombie" sidecar tasks I see in stderr after the drain (apart from all the deprecation warnings):
|
The error seems to trigger here: https://github.com/hashicorp/consul/blob/8f3223a98456d3fb816897bdf5eb4fbf1c9f4109/command/connect/envoy/envoy.go#L289 |
Arbitrarily changing the service name (amending intentions accordingly) made it come back up again. So seems to be the exact same thing @flytocolors was having (the casing for them is most likely coincidental though as that was not a factor here) |
@Legogris are you using Ingress Gateways, and on which version of Nomad? We fixed a bug that would trigger that error in Nomad v1.0.4. If we're still running into that error with just normal sidecars we can investigate further. In particular this log line (with DEBUG enabled) should shed some light on what's happening. |
Not to "me too" an issue, but I'm getting this sporadically as well. We're on consul 1.11.2 and nomad 1.2.4, using ACLs for both. Our cluster is running just fine and I can even stop/start/restart existing jobs, but I have a new job that is refusing to start right now. This job has been running off and on for the past week as I tweak things for our environment, but as of this morning it is consistently failing with this exact scenario. We're testing a 2 node keycloak cluster in Nomad/Consul Connect with a Traefik frontend. The goal is to have 3 ports - service, health check, and jgroups (for the JBoss clustering). The jgroups port is outside connect, and the health check is a manual 'expose' because it runs on a different port than the service. This has worked fine for going on 2 weeks, but this morning started in with the same issue described here. I don't know if some facet of what we're doing adds any new information, but I figured it couldn't hurt. I've attached our job file along with the excerpts of the consul and nomad logs as well if it can help. |
As an update to my addition, I was able to get past the issue by moving my non-connect'ed service (for jgroups) from the group level into the task level. |
Well, I will intentionally leave a "me too" here - Note the following error message: |
Leaving a note here based on a conversation I had with @schmichael and some folks from Consul recently. We may be running into the issue fixed for the Consul K8s plugin here: hashicorp/consul-k8s#887 Effectively we're creating the token and then the agent may have a stale view of the world when we use that token. Unfortunately this PR isn't something we can just drop into the Nomad code base because the architecture is a good bit different, but that's a promising lead for when this issue gets revisited next. |
I've got an early prototype branch Internal ref: https://hashicorp.atlassian.net/browse/NET-10051 |
Nomad creates a Consul ACL token for each service for registering it in Consul or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always talks to the local Consul agent and never directly to the Consul servers. But the local Consul agent talks to the Consul servers in stale consistency mode to reduce load on the servers. This can result in the Nomad client making the Envoy bootstrap request with a token that has not yet replicated to the follower that the local client is connected to. This request gets a 404 on the ACL token and that negative entry gets cached, preventing any retries from succeeding. To work around this, we'll use a method described by our friends over on `consul-k8s` where after creating the service token we try to read the token from the local agent in stale consistency mode (which prevents a failed read from being cached). This cannot completely eliminate this source of error because it's possible that Consul cluster replication is unhealthy at the time we need it, but this should make Envoy bootstrap significantly more robust. In this changeset, we add the preflight check after we log in via Workload Identity and in the function we use to derive tokens in the legacy workflow. We've made the timeouts configurable via node metadata rather than the usual static configuration because for most cases, users should not need to touch or even know these values are configurable; the configuration is mostly available for testing. Fixes: #9307 Fixes: #20516 Fixes: #10451 Ref: hashicorp/consul-k8s#887 Ref: https://hashicorp.atlassian.net/browse/NET-10051
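To make the mechanism in the description above concrete, here is a toy simulation of the negative-cache problem and the preflight check. It is a sketch under loud assumptions: `FakeLocalAgent`, `ManualClock`, `preflight_token`, and all method names are hypothetical, nothing here calls the real Consul or Nomad APIs, and the timeout and interval values are placeholders rather than Nomad's defaults.

```python
class TokenNotFound(Exception):
    """Raised when the simulated agent does not know the token yet."""


class ManualClock:
    """Deterministic clock so the simulation does not depend on wall time."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


class FakeLocalAgent:
    """Toy stand-in for a local agent whose upstream follower lags behind.

    The token becomes visible `replication_delay` seconds after creation.
    A miss on the bootstrap path is cached; a miss on a stale read is not.
    """

    def __init__(self, replication_delay, clock):
        self._clock = clock
        self._visible_at = clock.now() + replication_delay
        self._negative_cache = set()

    def _replicated(self):
        return self._clock.now() >= self._visible_at

    def read_token_stale(self, token):
        # Stale read: a failure leaves no cache entry, so polling is safe.
        if not self._replicated():
            raise TokenNotFound(token)
        return {"token": token}

    def bootstrap_envoy(self, token):
        # Bootstrap path: a miss is remembered and poisons later attempts.
        if token in self._negative_cache:
            raise TokenNotFound(token)
        if not self._replicated():
            self._negative_cache.add(token)
            raise TokenNotFound(token)
        return "bootstrapped"


def preflight_token(agent, token, clock, timeout=5.0, interval=0.25):
    """Poll with stale reads until the token is visible or the timeout expires."""
    deadline = clock.now() + timeout
    while True:
        try:
            agent.read_token_stale(token)
            return True
        except TokenNotFound:
            if clock.now() >= deadline:
                return False
            clock.sleep(interval)
```

In this model, calling `bootstrap_envoy` right after token creation fails and keeps failing even once replication catches up, which matches the "retries never succeed" symptom. Running `preflight_token` first waits out the replication delay without ever populating the negative cache, so the later bootstrap call succeeds. If replication takes longer than the timeout, the preflight returns `False`, which corresponds to the caveat that the check cannot eliminate the error entirely.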
I've just merged #23381 which hopefully should close out this issue. That's planned for release in Nomad 1.8.2 (with backports to Nomad Enterprise 1.7.x and 1.6.x). Once you've deployed that, if you're still running into issues with Envoy bootstrap, please report a new issue after having gone through the troubleshooting guides in Nomad's Service Mesh troubleshooting and Resolving Common Errors in Envoy Proxy. That'll help us separate out additional problems above-and-beyond the ones we've identified here. |
I've just had multiple multi-hour outages in the past two days because consul connect proxies would not start.
That's it, no logs because it never actually started. Granted this is a complicated job with 5 groups and proxies between them but it used to work. It seems as soon as I had multiple jobs (built from a template) the proxies started to act up. Proxy names are all unique based on job name and group.
All I can do is just repeatedly hit restart until the gods smile on me and the proxy finally starts.
Nomad version
Started at 0.11.6 and desperately tried to upgrade to 0.12.7 to no avail.
Operating system and Environment details
Ubuntu 18.04 and Ubuntu 20.04
Reproduction steps
Make a job with multiple groups and multiple connect sidecars between them (but not circular). The first instance of that job will probably work. Make 10 more.
After investing a lot of time and energy in nomad I'm afraid I'll have to drop it for... I have no idea what I'm going to do.