Beat readiness probe #3197

Open
sebgl opened this issue Jun 8, 2020 · 6 comments
Labels
>enhancement Enhancement of existing functionality

Comments

@sebgl
Contributor

sebgl commented Jun 8, 2020

We probably want to introduce a readiness probe for Beats.
It's a bit surprising right now to see filebeat "ready" while Elasticsearch is unavailable.

It looks like we could execute a filebeat test output command. To investigate.

@sebgl sebgl added the >enhancement Enhancement of existing functionality label Jun 8, 2020
@david-kow
Contributor

What should "ready" indicate, though? If a Beat can start taking logs/metrics in, I'd consider it ready even if the output itself is not ready. I'd think that's what the output's own readiness (ES, for instance) is for.

@anyasabo
Contributor

anyasabo commented Jun 8, 2020

For filebeat, filebeat test output is at least what the helm chart uses:
https://github.com/elastic/helm-charts/blob/master/filebeat/values.yaml#L72

@anyasabo
Contributor

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-readiness-probe

If you'd like to start sending traffic to a Pod only when a probe succeeds, specify a readiness probe. In this case, the readiness probe might be the same as the liveness probe, but the existence of the readiness probe in the spec means that the Pod will start without receiving any traffic and only start receiving traffic after the probe starts succeeding. If your Container needs to work on loading large data, configuration files, or migrations during startup, specify a readiness probe.

If you want your Container to be able to take itself down for maintenance, you can specify a readiness probe that checks an endpoint specific to readiness that is different from the liveness probe.

The main reason I can think of for defining a readiness probe is if you were using beats to monitor your other beats. In that case I think you would want to know if the beat was up but the output was down (and so it should be ready even if the output is down).

"Is the output responding" seems more of a question of health in the beats status. I'm not sure there's a good way for ECK to retrieve that though. We currently define beat health as:

const (
	// BeatRedHealth means that the health is neither yellow nor green.
	BeatRedHealth BeatHealth = "red"

	// BeatYellowHealth means that:
	// 1) at least one Pod is Ready, and
	// 2) association is not configured, or configured and established
	BeatYellowHealth BeatHealth = "yellow"

	// BeatGreenHealth means that:
	// 1) all Pods are Ready, and
	// 2) association is not configured, or configured and established
	BeatGreenHealth BeatHealth = "green"
)
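
To make the quoted definitions concrete, here is a minimal Go sketch of how the three states could be derived from Pod readiness and association status. The function name calculateHealth and its parameters are hypothetical and are not taken from the ECK code base; treating zero ready Pods as red is an assumption as well. Note that the output's responsiveness is not an input at all, which is the gap discussed in this comment.

```go
package main

// BeatHealth mirrors the string type used by the constants quoted above.
type BeatHealth string

const (
	BeatRedHealth    BeatHealth = "red"
	BeatYellowHealth BeatHealth = "yellow"
	BeatGreenHealth  BeatHealth = "green"
)

// calculateHealth is a hypothetical helper (not ECK's actual function).
// It applies the rules from the comments on the constants:
//   - green:  all Pods are Ready and the association is fine
//   - yellow: at least one Pod is Ready and the association is fine
//   - red:    anything else
//
// "The association is fine" means it is either not configured,
// or configured and established.
func calculateHealth(readyPods, expectedPods int, assocConfigured, assocEstablished bool) BeatHealth {
	associationOK := !assocConfigured || assocEstablished

	// Assumption: with no ready Pod at all, health is red.
	if !associationOK || readyPods < 1 {
		return BeatRedHealth
	}
	if readyPods >= expectedPods {
		return BeatGreenHealth
	}
	return BeatYellowHealth
}
```

Under this sketch a Beat whose Pods are all Ready stays green while Elasticsearch is unreachable, because nothing in the inputs reflects the output.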

@david-kow
Contributor

In that case I think you would want to know if the beat was up but the output was down (and so it should be ready even if the output is down).

I'm not sure I understand what you mean here. If we have:

ES    <----    Metricbeat    --(monitoring)-->    Filebeat    --(shipping logs for)-->    Pod

Then we can have the following (main) failure cases:

  1. Pod is down - Metricbeat and Filebeat are ready
  2. Filebeat is down - the fact that Filebeat is down is reported by Metricbeat, but Metricbeat itself is ready
  3. ES is down - Metricbeat can't output, but it's running (and caches the data) so it's ready
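
The cases above can be sketched in Go as two separate signals: readiness that depends only on the Beat's own process, and an output status that is tracked independently. The beatState type and its fields are hypothetical names used for illustration only; nothing here reflects how Beats or ECK actually model this.

```go
package main

// beatState is a hypothetical, simplified view of a single Beat.
type beatState struct {
	processRunning  bool // the Beat itself is up and collecting data
	outputReachable bool // its output (for example Elasticsearch) responds
}

// ready follows the view argued in the list above: a Beat is ready
// as long as it is running, even if it cannot currently ship data.
func (s beatState) ready() bool {
	return s.processRunning
}

// outputHealthy is the separate signal that readiness deliberately ignores.
func (s beatState) outputHealthy() bool {
	return s.processRunning && s.outputReachable
}
```

In case 3 (ES is down), beatState{processRunning: true, outputReachable: false} is ready but reports an unhealthy output, which is the distinction the thread is circling around.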

For "Is the output responding" I agree it's difficult; I think we would only know from the logs that there is an issue.

@anyasabo
Contributor

anyasabo commented Jul 21, 2020

I'm not sure I understand what you mean here.

Because I did a poor job of explaining it :D What I meant was that I think we want to leave it as is, for the reasons you described in your comment. If we want to do anything, it would be exposing the output status in the Beats CR, but I'm not sure we can do that simply (maybe the beats state/status endpoint exposes the info?).

@david-kow david-kow removed the :beats label Jul 28, 2020
@pebrc
Collaborator

pebrc commented Aug 10, 2020

We should probably close this in favour of another issue that will update the status of the Beats resource with some information about the output status.

As an aside, since filebeat test output was mentioned: it returns an error despite a working configuration, because of a DNS check it performs:

[root@gke-pebrc-dev-cluster-default-pool-0ce0f2c1-nl52 filebeat]# filebeat test output
elasticsearch: http://elasticsearch:9200...
  parse url... OK
  connection...
    parse host... OK
    dns lookup... ERROR lookup elasticsearch on 10.73.16.10:53: no such host
