API: Healthcheck Output field delay #9113
Comments
Hi @ndhanushkodi 👋 The UI uses different catalog endpoints to get healthcheck updates, not the agent endpoint:
I ran through your walkthrough and there is indeed a difference between the agent endpoint and the health endpoint (which reports health according to the catalog). Since this is a difference in the API, I'm guessing it's an issue with Consul itself, not necessarily the UI. ~7 mins does seem like an awfully long time! Maybe @freddygv or @rboyer can help with some info here? For now I'll remove the UI label. Thanks for the issue report!
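To make the agent-versus-catalog difference concrete, here is a minimal sketch that compares the `Output` field of the two responses. It assumes `/v1/agent/checks` returns a JSON object keyed by check ID and that a catalog-backed health endpoint such as `/v1/health/checks/<service>` returns a JSON list of checks with `CheckID` and `Output` fields. The function name `stale_outputs` and the sample data are hypothetical, not taken from this report.

```python
def stale_outputs(agent_checks, catalog_checks):
    """Return {check_id: (agent_output, catalog_output)} where the two differ.

    agent_checks:   dict keyed by check ID (assumed shape of /v1/agent/checks)
    catalog_checks: list of check dicts (assumed shape of /v1/health/checks/<service>)
    """
    catalog_by_id = {check["CheckID"]: check for check in catalog_checks}
    stale = {}
    for check_id, agent_check in agent_checks.items():
        catalog_check = catalog_by_id.get(check_id)
        if catalog_check is None:
            continue  # check is not in the catalog yet, nothing to compare
        agent_output = agent_check.get("Output", "")
        catalog_output = catalog_check.get("Output", "")
        if agent_output != catalog_output:
            stale[check_id] = (agent_output, catalog_output)
    return stale


# Hypothetical sample data mirroring the symptom described in this issue:
# the agent already has the error, the catalog still has an empty Output.
agent_sample = {
    "service:redis1": {
        "CheckID": "service:redis1",
        "Status": "critical",
        "Output": "dial tcp [::1]:8888: connect: connection refused",
    }
}
catalog_sample = [
    {"CheckID": "service:redis1", "Status": "critical", "Output": ""}
]

print(stale_outputs(agent_sample, catalog_sample))
```

Feeding it the decoded JSON from both endpoints on a live agent would show which checks the catalog has not caught up on yet.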
Hi @ndhanushkodi, this is due to an optimization around syncing check updates to servers where only the output has changed. There have been issues in the past where a check's output would churn while the check status remained the same. This would trigger frequent syncs to the servers, and just as many writes to Raft, purely because of a changing output. The way we get around this is that whenever there's an output change but no status change, we set a 5 minute timer with some jitter to defer syncing the latest check output. The reason why you're running into this is that checks are initially registered as

I think to make that update faster we would need to somehow distinguish the first update that comes from actually running the check. I will leave that decision up to @hashicorp/consul-foundations.
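The deferral rule described in this comment can be sketched roughly as follows. This is not Consul's actual code: the class `CheckSyncSketch`, its method names, the 10% jitter fraction, and the initial `critical` status with empty output used in the example are all assumptions for illustration. Only the "status change syncs immediately, output-only change waits about 5 minutes plus jitter" behaviour comes from the comment.

```python
import random


class CheckSyncSketch:
    """Toy model: status changes sync now, output-only changes are deferred."""

    def __init__(self, defer_seconds=300.0, jitter_fraction=0.1, rng=random.random):
        self.defer_seconds = defer_seconds      # the "5 minute timer"
        self.jitter_fraction = jitter_fraction  # assumed amount of jitter
        self.rng = rng                          # injectable for deterministic tests
        self.synced = {}    # check_id -> (status, output) last sent to servers
        self.deferred = {}  # check_id -> (deadline, status, output)

    def update(self, check_id, status, output, now):
        prev = self.synced.get(check_id)
        if prev is None or prev[0] != status:
            # New check or status change: sync to the servers immediately.
            self.synced[check_id] = (status, output)
            self.deferred.pop(check_id, None)
            return "synced"
        if prev[1] == output:
            # Nothing differs from what the servers already have.
            self.deferred.pop(check_id, None)
            return "unchanged"
        # Output-only change: keep an existing deadline, or start a new timer.
        if check_id in self.deferred:
            deadline = self.deferred[check_id][0]
        else:
            jitter = self.jitter_fraction * self.rng()
            deadline = now + self.defer_seconds * (1.0 + jitter)
        self.deferred[check_id] = (deadline, status, output)
        return "deferred"

    def tick(self, now):
        """Flush every deferred update whose timer has expired."""
        flushed = []
        for check_id, (deadline, status, output) in list(self.deferred.items()):
            if now >= deadline:
                self.synced[check_id] = (status, output)
                del self.deferred[check_id]
                flushed.append(check_id)
        return flushed


sync = CheckSyncSketch()
# Assumed initial registration: failing status, empty output.
print(sync.update("service:redis1", "critical", "", now=0))
# First real check run: same status, new output, so the sync is deferred.
print(sync.update("service:redis1", "critical", "connection refused", now=1))
print(sync.tick(now=60))   # timer has not expired yet
print(sync.tick(now=400))  # past 5 minutes plus the maximum assumed jitter
```

In this model the servers keep seeing the empty output until the timer fires, which matches the multi-minute gap reported in the issue.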
@johncowen, thank you for taking a look! @freddygv Ahh that makes sense, thanks for the info on how it works! Yeah it's not super high priority, just something me and a few others on the team noticed when running some manual tests for healthchecks on consul-k8s.
Overview of the Issue
There's a ~7 min delay in showing the service health check output in the UI. When I register a service with a failing healthcheck and look at the Consul UI, the Output field is blank for 7 min before it gets populated with the failure. When I run `curl localhost:8500/v1/agent/checks`, I see the Output error message without delay, so I'd expect to see it sooner in the UI.

May be related issue: #8225
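A service definition with a failing TCP check of the kind described might look like the following sketch. The ID `redis1` and the check target port 8888 are taken from the agent log and error message in this report. The service name, service port, and check interval are assumptions for illustration, and the helper `failing_tcp_service` is hypothetical.

```python
import json


def failing_tcp_service(service_id, name, port, check_target, interval="10s"):
    """Build a service registration body with a single TCP check.

    The check fails whenever nothing is listening on check_target.
    """
    return {
        "ID": service_id,
        "Name": name,
        "Port": port,
        "Check": {
            "TCP": check_target,   # address the agent tries to connect to
            "Interval": interval,  # how often the agent runs the check
        },
    }


# Assumed values: only "redis1" and port 8888 appear in the report itself.
payload = failing_tcp_service("redis1", "redis", 8000, "localhost:8888")
print(json.dumps(payload, indent=2))
```

Saving the printed JSON to a file gives a body that can be sent with a `PUT` to `/v1/agent/service/register`.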
Reproduction Steps
Steps to reproduce this issue:
1. `make dev` to put binary in GOPATH
2. `consul agent -dev`
3. Open the UI at http://localhost:8500/ui/dc1/services
4. Create `payload.json` with a service definition that has a failing healthcheck
5. `curl --request PUT --data @payload.json http://localhost:8500/v1/agent/service/register`
6. The expected healthcheck Output is `dial tcp [::1]:8888: connect: connection refused`
7. Run `curl localhost:8500/v1/agent/checks`: you will see the healthcheck Output field has been set, but it is not showing up in the UI until 7 min later.
8. If you run `curl --request PUT --data @payload.json http://localhost:8500/v1/agent/service/register` again within the 7 min, the Output field gets populated.

Consul info for both Client and Server
Only ran one consul agent, using `consul agent -dev` locally.

Starting Consul agent
```
Version: '1.9.0-dev'
Node ID: 'a14f00c2-c789-a951-31f7-bf9c3003940f'
Node name: 'dhanushkodi.local'
Datacenter: 'dc1' (Segment: '')
Server: true (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
```
Server info (`consul info`)
Operating system and Environment details
OS, Architecture, and any other information you can provide about the environment.
Running locally on a macbook pro.
Log Fragments
Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist of the log instead of posting it in the issue. Use `-log-level=TRACE` on the client and server to capture the maximum log detail.

The agent logs show:
2020-11-05T12:16:42.412-0800 [WARN] agent: Check socket connection failed: check=service:redis1 error="dial tcp [::1]:8888: connect: connection refused"
far before the UI shows the Output error message.