Watches being triggered during central config entries changes #16663
@radykal-com, thanks for reporting this. To help us reproduce this issue, could you please share the checks watchers for your server and client agents?
Sure @huikang

- server.json
- /server-watches.py
- client.json
- /client-watches.py
Hi again, @huikang. I've been doing some more research on this issue. The new setup of our watches is as follows: watch scripts (Python) configured ONLY in the Consul servers (3 nodes). The code of those scripts:
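(The actual script contents are not reproduced here. For illustration only, here is a minimal, hypothetical sketch of what such a critical-checks watch handler could look like; Consul pipes the watch result to the handler as JSON on stdin.)

```python
#!/usr/bin/env python3
# Hypothetical sketch only, NOT the actual attached script. A handler for a
# Consul "checks" watch: Consul pipes the watch result (a JSON list of check
# objects) to the handler on stdin each time the watch fires.
import json
import sys


def mark_node_unhealthy(check):
    # Placeholder for the autoscaling-group integration described in this issue.
    print(f"critical check {check.get('CheckID')} on node {check.get('Node')}",
          file=sys.stderr)


def main():
    checks = json.load(sys.stdin) or []
    for check in checks:
        if check.get("Status") == "critical":
            mark_node_unhealthy(check)


if __name__ == "__main__":
    main()
```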
Every time we apply any Consul central config change, the sidecar proxy (Envoy) checks flip to critical in all of the services. I'm not sure for how long, whether it's just a moment or some seconds, but it happens to all services after the config change. This graph shows the critical status triggering for 2 different Consul central config changes (8:12 and 8:29). Each color is one of the services.
Ah, it looks like you ran into the same issue that we discovered just last week. There seems to be a regression introduced by f34703fc63c (added to v1.10), which can cause the sidecar proxy health checks to flap, and subsequently cause an increase of `Catalog.Register` RPC calls.

We are rolling out this fix at the moment:

```diff
diff --git a/agent/agent.go b/agent/agent.go
index c3b8570c13..4a666e8d13 100644
--- a/agent/agent.go
+++ b/agent/agent.go
@@ -530,8 +530,7 @@ func (a *Agent) Start(ctx context.Context) error {
 	}
 
 	// Load checks/services/metadata.
-	emptyCheckSnapshot := map[structs.CheckID]*structs.HealthCheck{}
-	if err := a.loadServices(c, emptyCheckSnapshot); err != nil {
+	if err := a.loadServices(c, nil); err != nil {
 		return err
 	}
 	if err := a.loadChecks(c, nil); err != nil {
@@ -3648,7 +3647,7 @@ func (a *Agent) reloadConfigInternal(newCfg *config.RuntimeConfig) error {
 	}
 
 	// Reload service/check definitions and metadata.
-	if err := a.loadServices(newCfg, snap); err != nil {
+	if err := a.loadServices(newCfg, nil); err != nil {
 		return fmt.Errorf("Failed reloading services: %s", err)
 	}
 	if err := a.loadChecks(newCfg, snap); err != nil {
```

We plan to send a pull request next week.
I'll try that patch and come back with news. Thank you @sayap
Unfortunately the issue isn't fixed with the patch.
I see. We are still on 1.10.12 and the patch seems to work fine. We haven't tried it on 1.14 / 1.15 yet, will take a look.
Just verified that the patch works fine for me on 1.15.2 as well. Let me elaborate a bit on what the patch does, based on my incomplete understanding...

In agent/service_manager.go, the agent makes a blocking query to the Consul servers to watch the central config for its services. However, if there has been any config change within the 10 minutes (even if the changes are totally irrelevant to the sidecar proxy), the function `handleUpdate` is triggered. Due to the regression introduced by commit f34703f, this call is no longer a no-op; instead, it resets the sidecar proxy's health checks back to critical.
This health check flapping causes INFO log lines that look like these:
The checks run every 10 seconds, so they will revert to their original passing status within 10 seconds, which then causes more INFO log lines:
"sidecar-proxy:1" is a TCP check on the Envoy's public listener port (i.e. Connect Sidecar Listening), and a TCP check has an initial pause time. Thus, it has a first "Synced check" that sends the critical status to the Consul servers, followed by a second "Synced check" a few seconds later that sends the passing status to the Consul servers. This creates unnecessary network traffic and also unnecessary disk writes on the Consul servers. In your case, it also impacts the watcher script. "sidecar-proxy:2" is an Alias check for the actual service, and an Alias check has no initial pause time. So, while the regression causes the check to be in the critical status, the check will run immediately and go back to the passing status. Thus, it only has one "Synced check" log line that sends the passing status to the Consul servers. This create unnecessary network traffic, but shouldn't create any disk write on the Consul servers, since the status remains the same from the perspective of the Consul servers. With the patch, these "Synced check" log lines shouldn't happen anymore after some irrelevant config change. I am quite curious why the issue still persist after applying the patch, maybe you can share the log from one of the patched agent? (the patch is only needed for Consul agents, not the Consul servers) |
Hey @sayap, you are right. Yesterday I applied the patch to just a subset of the platform and I couldn't see a real impact on the number of checks triggered during the changes. Today I've applied it to the whole platform and, effectively, there are no checks triggering on config changes.

I'm a bit worried there could be side effects or undesired behavior from this change that may lead to new issues or problems. Anyway, thank you very much; I was starting to think the problem was not in Consul but in our deployment.
That's great 😁 A side effect is that the agents will do a tiny bit more work, as now it has to do …
Commit f34703f introduced a regression, such that it passes an empty map as the check snapshot on startup:

```go
emptyCheckSnapshot := map[structs.CheckID]*structs.HealthCheck{}
if err := a.loadServices(c, emptyCheckSnapshot); err != nil {
```

If there is no `consul reload`, this empty map will be used for comparison indefinitely, such that the following if-block will never be executed:

```go
// Restore the fields from the snapshot.
prev, ok := req.checkStateSnapshot[cid]
if ok {
	check.Output = prev.Output
	check.Status = prev.Status
}
```

Subsequently, every time `handleUpdate` in agent/service_manager.go is triggered, the health checks for sidecar services will flap to critical for several seconds.

If there has been a `consul reload`, and it happened while the health checks were still critical (e.g. shortly after agent startup), then the flapping will also happen, as the check snapshot with critical status will be used for comparison indefinitely.

This commit fixes the issue by always passing nil as the check snapshot.

Note that `handleUpdate` can happen once every 10 minutes even though there is no config change for a specific service, as long as there is any config entries modification within the 10 minutes. As such, on systems with Consul agents >= 1.10, after applying this fix, there should be a significant decrease in `Synced check` log lines, `Catalog.Register` RPC calls, and also disk writes on Consul servers.

Fixes hashicorp#16663
We also just ran into this after performing a broad config change. This is the rate of … (graph omitted).
A fix would be nice, to avoid having to maintain a patched version.
Overview of the Issue
Similar to #7446, but it's happening when applying any change to central config entries using the Consul CLI (config write, config delete).
We have checks watchers configured in the Consul servers and clients:

- Servers: they query the checks in critical state (`/v1/health/state/critical`) and, if they find any, they mark the node as unhealthy so it can be replaced by the autoscaling group.
- Clients: they query their own checks (`/v1/agent/checks?filter=Status!=passing`) and, if any is in a non-passing status, they mark themselves as an unhealthy node so it can be replaced by the autoscaling group (a minimal, hypothetical sketch of such a client-side script is shown below).

Every time we apply a new config entry with the Consul CLI (consul config write or consul config delete), the watchers are triggered in servers and clients, and the watcher scripts find checks in critical status, leading to the replacement of all of our nodes.
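For illustration only (this is not the actual client-watches.py), a minimal sketch of such a client-side script, assuming the default local agent address 127.0.0.1:8500 and a placeholder where the autoscaling-group integration would go:

```python
#!/usr/bin/env python3
# Hypothetical sketch of the client-side behavior described above: ask the local
# agent for its own non-passing checks and, if any exist, mark this node as
# unhealthy. Assumes the agent HTTP API on 127.0.0.1:8500.
import json
import urllib.parse
import urllib.request

AGENT = "http://127.0.0.1:8500"  # assumption: default local agent address
FILTER = urllib.parse.quote('Status != "passing"')


def mark_self_unhealthy(checks):
    # Placeholder for the autoscaling-group integration described above.
    for check_id, check in checks.items():
        print(f"non-passing check: {check_id} ({check['Status']})")


def main():
    url = f"{AGENT}/v1/agent/checks?filter={FILTER}"
    with urllib.request.urlopen(url) as resp:
        checks = json.loads(resp.read())
    if checks:
        mark_self_unhealthy(checks)


if __name__ == "__main__":
    main()
```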
If we don't make any Consul config change, everything works fine: unhealthy nodes are replaced when they need to be, new nodes join without problems, etc.
Reproduction Steps
Consul info for both Client and Server
Client info
Server info
Operating system and Environment details
3 consul servers and ~80 consul clients
OS: Ubuntu 20.04
Arch: x86_64
Reproduced with consul versions:
Central config entries:
`envoy_dogstatsd_url` config

Log Fragments