Elasticsearch readiness probe might fail if a single node is stuck #2248
Comments
We changed this in #1748. The intention was to bind pod readiness to a node-local request, so that we can evaluate pod readiness independently of overall cluster health and cluster membership: we are only interested in the status of the single node being probed, and want to know whether it is in principle ready to enter operation. From what you are saying [Update] actually we had the health check at
So do you know which versions those are? I think we saw it on 6.x clusters, IIRC.
It changed in 7.0.0, see elastic/elasticsearch#29045.
Maybe, but note that this will consider a running node to be ready even if it cannot contact the master of the cluster, and arguably a node without a master isn't "ready" in any meaningful sense since it can't really serve any useful requests.
In light of #1748 maybe we do indeed want

I will admit I don't fully understand the surrounding context -- deleting temporarily-unhealthy pods so enthusiastically seems like it might cause other problems. For instance, a node could stop responding even to `GET /`.
I'm encountering this readiness problem.
#1748 is a bit outdated in that regard. We have since rewritten the logic that deletes pods during upgrades, and we no longer delete pods that become temporarily unhealthy. It is also worth mentioning that the k8s readiness probe by default retries three times at 5-second intervals before marking a pod as not ready.

The purpose of this readiness check, as I see it, is only to verify that a node has booted successfully to the point where it is ready to accept API calls. This readiness state then determines whether or not the node is included in the k8s Service, so that clients can reach it.
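For context, here is a minimal Go sketch of a probe spec with those retry semantics, using the `k8s.io/api/core/v1` types. The 3-failure/5-second values mirror the behaviour described above; the path, port, and scheme are illustrative assumptions, not ECK's actual configuration:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// readinessProbe sketches the retry semantics described above: the
// kubelet probes every 5 seconds and only marks the pod not ready
// after 3 consecutive failures. Path, port, and scheme are assumed
// for illustration.
func readinessProbe() corev1.Probe {
	return corev1.Probe{
		PeriodSeconds:    5, // probe interval
		FailureThreshold: 3, // consecutive failures before "not ready"
		TimeoutSeconds:   3, // per-attempt timeout
		SuccessThreshold: 1, // one success flips the pod back to ready
		// Note: this embedded field is renamed ProbeHandler in newer
		// k8s.io/api releases.
		Handler: corev1.Handler{
			HTTPGet: &corev1.HTTPGetAction{
				Path:   "/",
				Port:   intstr.FromInt(9200),
				Scheme: corev1.URISchemeHTTPS,
			},
		},
	}
}

func main() {
	_ = readinessProbe()
}
```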
These are two different purposes. A node without a master can technically accept API calls, but can't really handle most of them, so I think it's debatable whether a node without a master should be included in the service so that clients can reach it.
If I'm understanding elastic/elasticsearch#29045 correctly, we might want to:
Does that make sense?
We discussed offline and are going to implement this as proposed in #2248 (comment).
We made this an issue on the Elasticsearch repo: elastic/elasticsearch#50187.
* Add port 9200 to master statefulset to simplify curling the cluster when only master is started
* Change readiness probes, see elastic/cloud-on-k8s#2248
* Disable PVC and use emptyDir
* Upgrade all images to test 7.5.1
It looks like we run `GET _cat/nodes?local` in order to check a pod's readiness: https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/elasticsearch/nodespec/readiness_probe.go#L32

This API call does two things: it gets a list of nodes from the local cluster state, and then it reaches out to all the nodes in the cluster and obtains some extra info from them. If a single node in the cluster fails to respond in this second step within a few seconds, then I think this means we consider all the pods to be failing their readiness checks.
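To make the failure mode concrete, here is a rough Go equivalent of that check (the real probe is a command baked into the pod spec; the endpoint URL and timeout below are assumptions). Even with `?local`, `_cat/nodes` fans out to every node for the second step, so one stuck node can push this call past the timeout on every pod at once:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// Rough equivalent of the current readiness check. The ?local flag only
// affects the first step (reading the node list from local cluster
// state); the second step still contacts every node, so one
// unresponsive node can make this exceed the timeout on all pods.
func main() {
	client := &http.Client{Timeout: 3 * time.Second} // assumed probe timeout
	resp, err := client.Get("http://localhost:9200/_cat/nodes?local")
	if err != nil {
		fmt.Fprintln(os.Stderr, "probe failed:", err)
		os.Exit(1) // kubelet counts this toward the failure threshold
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "probe failed, status:", resp.Status)
		os.Exit(1)
	}
	fmt.Println("ready")
}
```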
I think it's a bug to accept `?local` on this API at all given this gotcha (see elastic/elasticsearch#50088), but I also think we should be using a different API for our readiness checks. For instance, `GET _cluster/health?timeout=0s` returns `200 OK` iff the node is part of a cluster with a master, and `GET /` returns (something) if the node is actually alive and responding. (I say "something" because in certain versions `GET /` returns `503 Service Unavailable` if there is no master, but more recent versions always return `200 OK` here.)
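Here is a minimal sketch of that split, assuming plain HTTP on localhost:9200 is reachable from the probe: `_cluster/health?timeout=0s` answers "does this node have a master?" for readiness, while `GET /` answers the weaker "is this node alive at all?" question:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// check performs one probe call and reports whether it returned 200 OK.
func check(client *http.Client, url string) bool {
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	client := &http.Client{Timeout: 3 * time.Second}

	// Readiness: 200 OK iff this node is part of a cluster with an
	// elected master; timeout=0s means "answer immediately, don't wait".
	ready := check(client, "http://localhost:9200/_cluster/health?timeout=0s")

	// Liveness: GET / only asks whether the node is alive and responding.
	// (As noted above, some older versions return 503 here without a
	// master; more recent versions always return 200 OK.)
	alive := check(client, "http://localhost:9200/")

	fmt.Printf("alive=%v ready=%v\n", alive, ready)
	if !ready {
		os.Exit(1)
	}
}
```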