-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Heartbeat] Log error on dupe monitor ID instead of strict req #29041
Conversation
Currently Heartbeat will not start > 1 monitor with the same ID. This can create unintuitive scenarios in situations where a user changes a monitor by adding a new entry via some reload mechanism (say k8s autodiscover) before the old one is deleted. This PR changes the behavior to only log errors in this situation. Only one monitor will be active, but it will be the newest monitor.
Pinging @elastic/uptime (Team:Uptime) |
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
💚 Flaky test reportTests succeeded. 🤖 GitHub commentsTo re-run your PR in the CI, just comment with:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I'm noticing in my smoke testing of this patch is that, while it does let me run a config with multiple monitors sharing an ID, it doesn't seem to respect the usual CTRL+C
I'd
use to kill the process.
When attempting to run the same multi-shared-ID config on master
I got the error message/no-op that is typical of HB up until now.
In the case of different monitors with the same ID it did seem to log them, per the included screenshot:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes LGTM. Added small comments.
heartbeat/monitors/dedup.go
Outdated
|
||
closed := um.stopUnsafe(m) | ||
if closed { | ||
logp.Warn("monitor ID %s is configured for multiple monitors! IDs should be unique values, last seen config will win", m.stdFields.ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we log like stopping previous monitor with same monitor id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if that adds anything, because it's hard to tell them apart. Should I change the log message maybe?
heartbeat/monitors/monitor_test.go
Outdated
m2.stdFields.Name = "MON2!!!" | ||
// This used to trigger an error, but shouldn't any longer, we just log | ||
// the error, and ensure the last monitor wins | ||
require.NoError(t, m2Err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also check if previous monitor is stopped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do check for this at the end of the test by checking the number of stops invoked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments, mainly about avoiding the global objects if possible.
heartbeat/monitors/monitor.go
Outdated
func (e ErrDuplicateMonitorID) Error() string { | ||
return fmt.Sprintf("monitor ID %s is configured for multiple monitors! IDs must be unique values.", e.ID) | ||
} | ||
var globalDedup = newDedup() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be part of the RunnerFactory
to avoid needing a global object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, done!
heartbeat/monitors/monitor.go
Outdated
@@ -151,6 +139,10 @@ func newMonitorUnsafe( | |||
m.stdFields.ID = fmt.Sprintf("auto-%s-%#X", m.stdFields.Type, hash) | |||
} | |||
|
|||
// De-duplicate monitors with identical IDs | |||
// last write wins | |||
globalDedup.register(m) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newMonitor
is used by CheckConfig
. Checking configs shouldn't stop monitors existing monitors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this, it is now the case that newMonitor
is side effect free
@justinkambic there was a deadlock issue that I've now fixed, that was blocking clean shutdowns, ctrl-c should now work, that's important! |
var runners []cfgfile.Runner | ||
for _, cfg := range bt.config.Monitors { | ||
created, err := factory.Create(b.Publisher, cfg) | ||
created, err := bt.dynamicFactory.Create(b.Publisher, cfg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleans everything up so that we only use a single factory instance in heartbeat
@@ -135,14 +131,10 @@ func newMonitorUnsafe( | |||
internalsMtx: sync.Mutex{}, | |||
config: config, | |||
stats: pluginFactory.Stats, | |||
state: MON_INIT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prevents weird situations with dual stops, makes stop()
idempotent.
OK, this should be ready for review again, it's largely rewritten, now using the factory for all state, and generally cleaning up how all the lifecycle stuff is handled. Thanks for the great input @jsoriano @justinkambic @vigneshshanmugam Races and deadlocks should all be gone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if it looks good to uptime 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
} | ||
|
||
// Stop stops the Monitor's execution in its configured scheduler. | ||
// This is safe to call even if the Monitor was never started. | ||
// Stop stops the monitor without freeing it in global dedup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Stop stops the monitor without freeing it in global dedup | |
// Stop the monitor without freeing it in global dedup |
I know the function name is a proper noun here but it feels kinda redundant.
(cherry picked from commit dbca099)
… (#29083) (cherry picked from commit dbca099) Co-authored-by: Andrew Cholakian <[email protected]>
…ead of strict req (#29082) * Remove watch poll feature (#27166) * Remove watch poll feature * Changelog * Update YML * [Heartbeat] Log error on dupe monitor ID instead of strict req (#29041) Co-authored-by: Andrew Cholakian <[email protected]>
…ws-on-file-changes * upstream/master: override host on statsd metricset (elastic#29103) Skip config check in autodiscover for duplicated configurations (elastic#29048) Change "filebeat.config.modules.enabled" to "true" (elastic#28769) Remove deprecated spool queue from Beats (elastic#28869) Add `beat` field back to beat.stats (elastic#29094) Revert "Move labels and annotations under kubernetes.namespace. (elastic#27917)" (elastic#29069) heartbeat: remove w2008 in the CI (elastic#29093) Remove deprecated `--template` and `--index-policy` flags (elastic#28870) Fix parsing of apache trace log levels (elastic#28717) [Elastic-Agent] IUse itnernal port for local fleet server (elastic#28993) [Heartbeat] Log error on dupe monitor ID instead of strict req (elastic#29041) Enable pprof for elastic-agent and beats (elastic#28983)
Currently Heartbeat will not start > 1 monitor with the same ID. This can create unintuitive scenarios in situations where a user
changes a monitor by adding a new entry via some reload mechanism (say k8s autodiscover) before the old one is deleted.
This PR changes the behavior to only log errors in this situation. Only one monitor will be active, but it will be the newest monitor.
There may be other fixes we should do in other areas, but this is probably a better failure method than what we currently have.
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.