Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Heartbeat] Log error on dupe monitor ID instead of strict req #29041

Merged
merged 14 commits into from
Nov 22, 2021

Conversation

andrewvc
Copy link
Contributor

@andrewvc andrewvc commented Nov 18, 2021

Currently Heartbeat will not start > 1 monitor with the same ID. This can create unintuitive scenarios in situations where a user
changes a monitor by adding a new entry via some reload mechanism (say k8s autodiscover) before the old one is deleted.

This PR changes the behavior to only log errors in this situation. Only one monitor will be active, but it will be the newest monitor.

There may be other fixes we should do in other areas, but this is probably a better failure method than what we currently have.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Currently Heartbeat will not start > 1 monitor with the same ID.
This can create unintuitive scenarios in situations where a user
changes a monitor by adding a new entry via some reload mechanism
(say k8s autodiscover) before the old one is deleted.

This PR changes the behavior to only log errors in this situation.
Only one monitor will be active, but it will be the newest monitor.
@andrewvc andrewvc added bug Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v7.16.0 backport-v8.0.0 Automated backport with mergify backport-v7.16.0 Automated backport with mergify labels Nov 18, 2021
@andrewvc andrewvc self-assigned this Nov 18, 2021
@andrewvc andrewvc requested a review from a team as a code owner November 18, 2021 22:53
@elasticmachine
Copy link
Collaborator

Pinging @elastic/uptime (Team:Uptime)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Nov 18, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented Nov 18, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2021-11-20T01:56:31.605+0000

  • Duration: 70 min 45 sec

  • Commit: fad0683

Test stats 🧪

Test Results
Failed 0
Passed 3589
Skipped 80
Total 3669

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Copy link
Contributor

@justinkambic justinkambic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'm noticing in my smoke testing of this patch is that, while it does let me run a config with multiple monitors sharing an ID, it doesn't seem to respect the usual CTRL+C I'd
use to kill the process.

When attempting to run the same multi-shared-ID config on master I got the error message/no-op that is typical of HB up until now.

In the case of different monitors with the same ID it did seem to log them, per the included screenshot:

image

Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes LGTM. Added small comments.


closed := um.stopUnsafe(m)
if closed {
logp.Warn("monitor ID %s is configured for multiple monitors! IDs should be unique values, last seen config will win", m.stdFields.ID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we log like stopping previous monitor with same monitor id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if that adds anything, because it's hard to tell them apart. Should I change the log message maybe?

m2.stdFields.Name = "MON2!!!"
// This used to trigger an error, but shouldn't any longer, we just log
// the error, and ensure the last monitor wins
require.NoError(t, m2Err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also check if previous monitor is stopped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do check for this at the end of the test by checking the number of stops invoked

Copy link
Member

@jsoriano jsoriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments, mainly about avoiding the global objects if possible.

heartbeat/monitors/dedup.go Outdated Show resolved Hide resolved
func (e ErrDuplicateMonitorID) Error() string {
return fmt.Sprintf("monitor ID %s is configured for multiple monitors! IDs must be unique values.", e.ID)
}
var globalDedup = newDedup()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be part of the RunnerFactory to avoid needing a global object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, done!

@@ -151,6 +139,10 @@ func newMonitorUnsafe(
m.stdFields.ID = fmt.Sprintf("auto-%s-%#X", m.stdFields.Type, hash)
}

// De-duplicate monitors with identical IDs
// last write wins
globalDedup.register(m)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newMonitor is used by CheckConfig. Checking configs shouldn't stop monitors existing monitors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this, it is now the case that newMonitor is side effect free

@andrewvc
Copy link
Contributor Author

@justinkambic there was a deadlock issue that I've now fixed, that was blocking clean shutdowns, ctrl-c should now work, that's important!

var runners []cfgfile.Runner
for _, cfg := range bt.config.Monitors {
created, err := factory.Create(b.Publisher, cfg)
created, err := bt.dynamicFactory.Create(b.Publisher, cfg)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleans everything up so that we only use a single factory instance in heartbeat

@@ -135,14 +131,10 @@ func newMonitorUnsafe(
internalsMtx: sync.Mutex{},
config: config,
stats: pluginFactory.Stats,
state: MON_INIT,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prevents weird situations with dual stops, makes stop() idempotent.

@andrewvc
Copy link
Contributor Author

OK, this should be ready for review again, it's largely rewritten, now using the factory for all state, and generally cleaning up how all the lifecycle stuff is handled. Thanks for the great input @jsoriano @justinkambic @vigneshshanmugam

Races and deadlocks should all be gone.

Copy link
Member

@jsoriano jsoriano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if it looks good to uptime 🙂

heartbeat/monitors/factory.go Show resolved Hide resolved
Copy link
Contributor

@justinkambic justinkambic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

}

// Stop stops the Monitor's execution in its configured scheduler.
// This is safe to call even if the Monitor was never started.
// Stop stops the monitor without freeing it in global dedup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Stop stops the monitor without freeing it in global dedup
// Stop the monitor without freeing it in global dedup

I know the function name is a proper noun here but it feels kinda redundant.

@andrewvc andrewvc merged commit dbca099 into elastic:master Nov 22, 2021
@andrewvc andrewvc deleted the warn-on-dup branch November 22, 2021 21:46
mergify bot pushed a commit that referenced this pull request Nov 22, 2021
andrewvc added a commit that referenced this pull request Nov 23, 2021
… (#29083)

(cherry picked from commit dbca099)

Co-authored-by: Andrew Cholakian <[email protected]>
vigneshshanmugam pushed a commit that referenced this pull request Nov 23, 2021
…ead of strict req (#29082)

* Remove watch poll feature (#27166)

* Remove watch poll feature

* Changelog

* Update YML

* [Heartbeat] Log error on dupe monitor ID instead of strict req (#29041)

Co-authored-by: Andrew Cholakian <[email protected]>
@awahab07 awahab07 assigned awahab07 and unassigned andrewvc Nov 23, 2021
@awahab07
Copy link

Post FF Testing

LGTM

  • Before the merge, heartbeat would error out adding a monitor with duplicate ID as an existing monitor.
  • After the merge and rebuild, only warning was logged and the new monitor with duplicate ID overruled the previous one.

Before:
Screenshot 2021-11-23 at 15 11 51

After:
image

{"log.level":"warn","@timestamp":"2021-11-23T15:17:56.412+0100","log.logger":"monitor-factory","log.origin":{"file.name":"monitors/factory.go","file.line":125},"message":"monitor ID alibaba-monitor is configured for multiple monitors! IDs should be unique values, last seen config will win","service.name":"heartbeat","ecs.version":"1.6.0"}
Previous monitor list List after duplicate monitor is configured
Screenshot 2021-11-23 at 15 22 49 Screenshot 2021-11-23 at 15 21 43

@awahab07 awahab07 removed their assignment Nov 23, 2021
v1v added a commit to v1v/beats that referenced this pull request Nov 24, 2021
…ws-on-file-changes

* upstream/master:
  override host on statsd metricset (elastic#29103)
  Skip config check in autodiscover for duplicated configurations (elastic#29048)
  Change "filebeat.config.modules.enabled" to "true" (elastic#28769)
  Remove deprecated spool queue from Beats (elastic#28869)
  Add `beat` field back to beat.stats (elastic#29094)
  Revert "Move labels and annotations under kubernetes.namespace. (elastic#27917)" (elastic#29069)
  heartbeat: remove w2008 in the CI (elastic#29093)
  Remove deprecated `--template` and `--index-policy` flags (elastic#28870)
  Fix parsing of apache trace log levels (elastic#28717)
  [Elastic-Agent] IUse itnernal port for local fleet server (elastic#28993)
  [Heartbeat] Log error on dupe monitor ID instead of strict req (elastic#29041)
  Enable pprof for elastic-agent and beats (elastic#28983)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v7.16.0 Automated backport with mergify backport-v8.0.0 Automated backport with mergify bug Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v7.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants