-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add leader election for autodiscover #20281
Conversation
Signed-off-by: chrismark <[email protected]>
❕ Build Aborted
Expand to view the summary
Build stats
Test stats 🧪
Log outputExpand to view the last 100 lines of log output
|
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Tested this on a 3-node cluster on GKE. Seems to work:
I will open it for review by now. |
Pinging @elastic/integrations-platforms (Team:Platforms) |
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Thanks for the review and the (offline) suggestions @jsoriano! |
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
Signed-off-by: chrismark <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, only some small details.
|
||
// Start for EventManager interface. | ||
func (p *leaderElectionManager) Start() { | ||
ctx, cancel := context.WithCancel(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit.
ctx, cancel := context.WithCancel(context.Background()) | |
ctx, cancel := context.WithCancel(context.TODO()) |
func (p *leaderElectionManager) startLeaderElector(ctx context.Context, lec leaderelection.LeaderElectionConfig) { | ||
le, err := leaderelection.NewLeaderElector(lec) | ||
if err != nil { | ||
p.logger.Errorf("leader election lock GAINED, id %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look like an error 🙂 Log it at the info or debug level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually an error..but with wrong error message. This piece is "copied" from RunOrDie
(https://github.com/kubernetes/client-go/blob/36233866f1c7c0ad3bdac1fc466cb5de3746cfa2/tools/leaderelection/leaderelection.go#L214). See the watchdog related comment for more info.
if err != nil { | ||
p.logger.Errorf("leader election lock GAINED, id %v", err) | ||
} | ||
if lec.WatchDog != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this watchdog? we are never setting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was mostly because I tried to "copy" the RunOrDie
of https://github.com/kubernetes/client-go/blob/36233866f1c7c0ad3bdac1fc466cb5de3746cfa2/tools/leaderelection/leaderelection.go#L219, because of an async problem described at kubernetes/client-go#837 .
Good catch I will remove it, it's redundant.
CHANGELOG.next.asciidoc
Outdated
@@ -367,6 +367,7 @@ https://github.com/elastic/beats/compare/v7.0.0-alpha2...master[Check the HEAD d | |||
- Added the `max_cached_sessions` option to the script processor. {pull}19562[19562] | |||
- Add support for DNS over TLS for the dns_processor. {pull}19321[19321] | |||
- Set index.max_docvalue_fields_search in index template to increase value to 200 fields. {issue}20215[20215] | |||
- Add leader election for autodiscover. {pull}20281[20281] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Add leader election for autodiscover. {pull}20281[20281] | |
- Add leader election for Kubernetes autodiscover. {pull}20281[20281] |
Signed-off-by: chrismark <[email protected]>
jenkins run the tests please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but let's wait to have a greener build.
💔 Tests FailedExpand to view the summary
Build stats
Test stats 🧪
Test errorsExpand to view the tests failures
Steps errorsExpand to view the steps failures
Log outputExpand to view the last 100 lines of log output
|
@jsoriano I think the only issue with the tests is https://travis-ci.org/github/elastic/beats/jobs/716456468#L1590 which is also failing in general currently. I think we can proceed with merging, wdyt? |
Yes, I think the issue in CI is not related. |
(cherry picked from commit 9ab9b97)
After PR elastic#20281 was merged, the signature of the `autodiscover.ProviderBuilder` changed to accept an additional name (string) parameter.
What does this PR do?
Implements leader election as part of kubernetes autodiscover provider. With this, when the beat is elected as leader it will start cluster scope metricsets.
Under the hood, all leader candidates will try to gain the lock on Lease object (https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#lease-v1-coordination-k8s-io). When the lock is gained then the configured cluster wide mericsets will get started.
Why is it important?
To get rid of
Deployment
procedure for the singleton Metricbeat instance which is needed for the cluster scope metricsets.Sample configuration that will start
state_node
metricset when a leadership is gained:Then we can monitor the
lease
to check which provider holds the lock:Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
It is recommended to testing this along with #20512 for easiness of configuration/setup etc
Make sure that events for
state_node
are being collected from only one metricbeat instance. You can the in the logs and verify that only one metricbeat has gained the lock so far.Monitor the
lease
object to keep track of the lock holder:watch kubectl describe lease beats-cluster-leader
Stop the Metricbeat instance that currently holds the lock and make sure that another one takes over and we keep getting events from
state-node
. Make sure that thelease
changed holder (logs will also let us know that a new metricset is getting started)Also test without setting
identifier
and check that it takes the default value properly likemetricbeat-cluster-leader
.Check with old autodiscover (template based/hints based) for possible regressions.
Related issues