Separate liveness and readiness checks #5048

ukutaht · 2025-02-05T13:41:41Z

Changes

Adds two new endpoints:

1. /api/system/health/live does nothing more than return 200 OK with a JSON body {"ok": "true"}
2. /api/system/health/ready checks all the necessary dependencies for accepting traffic (identical to the old /api/health behaviour):
   2.1 Postgres reachable
   2.2 ClickHouse reachable
   2.3 Ingestion caches ready
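
For readers who want the split spelled out, here is a minimal, language-agnostic sketch in Python. It is not the Elixir code in this PR. The function names, the shape of the `checks` mapping, and the 503 status for a failed readiness check are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# A dependency check is any zero-argument callable returning True when healthy.
Check = Callable[[], bool]


def live() -> Tuple[int, dict]:
    """Liveness: touches no dependencies, only proves the process can answer."""
    return 200, {"ok": "true"}


def ready(checks: Dict[str, Check]) -> Tuple[int, dict]:
    """Readiness: every dependency must report healthy before traffic is accepted."""
    results = {name: bool(check()) for name, check in checks.items()}
    # Assumption: 503 signals "not ready"; the PR text does not state the status code.
    status = 200 if all(results.values()) else 503
    return status, results
```

With this shape, a Postgres outage flips only the readiness result, while the liveness result stays at 200.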

The old /api/health endpoint remains available so that external checks have time to migrate.

Technically, ingestion can keep working while Postgres is unreachable, as long as the caches are loaded. A possible future improvement is to separate the ingestion readiness check from the app readiness check.

lib/plausible_web/controllers/api/system_controller.ex

lib/plausible_web/router.ex

CHANGELOG.md

aerosol · 2025-02-10T06:47:33Z

What decision(s) are we willing to make based on HTTP availability alone? IMHO it's the cache(s) readiness check that should be separated.

BTW, omission on my part: only two caches are currently checked. We should include user agents, sessions, country rules, page rules and hostname rules too.

Checking cache size is not mandatory, but it is a convenient way of making sure that, at the very least, no exit is thrown out of a key lookup. Unless those are marked green, we should not expose any node via the LB.
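
A rough sketch of that idea, in Python rather than the project's Elixir: a cache counts as green if asking for its size does not blow up. The cache names below are hypothetical identifiers derived from the list above, and treating an empty cache as ready is an assumption.

```python
from typing import Callable, Dict

# Hypothetical identifiers for the caches named in this comment.
CACHE_NAMES = ["user_agents", "sessions", "country_rules", "page_rules", "hostname_rules"]


def cache_ready(size: Callable[[], int]) -> bool:
    """A cache is considered ready if its size lookup completes without raising."""
    try:
        size()  # Assumption: the value itself is ignored, an empty cache still counts.
        return True
    except Exception:
        return False


def caches_ready(sizes: Dict[str, Callable[[], int]]) -> Dict[str, bool]:
    """Map each cache name to its readiness so a failing cache is easy to spot."""
    return {name: cache_ready(fn) for name, fn in sizes.items()}
```

A node would then be exposed via the LB only when every value in the returned mapping is true.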

ukutaht · 2025-02-10T08:17:23Z

> What decision(s) are we willing to make based on HTTP availability alone?

The readiness probe is used to decide whether to route traffic to the container or not.
The liveness probe is used by Kubernetes to decide whether to restart a container or not.

As I understand it, the liveness probe should only fail in fatal scenarios that the application cannot recover from by itself. Database downtime or BEAM process crashes, for example, should in theory be recoverable without a restart, which is the main idea behind this separation.
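
As a simplified illustration of those two decisions, sketched in Python: the names `Decision` and `decide` are made up for this example, and real Kubernetes probes also apply periods and failure thresholds that are left out here.

```python
from typing import NamedTuple


class Decision(NamedTuple):
    restart_container: bool
    route_traffic: bool


def decide(live_ok: bool, ready_ok: bool) -> Decision:
    """Liveness drives restarts; readiness drives traffic routing."""
    return Decision(restart_container=not live_ok, route_traffic=ready_ok)
```

For example, during database downtime the app still answers the liveness probe, so `decide(True, False)` leaves the container running but takes it out of rotation.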

> BTW, omission on my part: only two caches are currently checked. We should include user agents, sessions, country rules, page rules and hostname rules too.
> Checking cache size is not mandatory, but it is a convenient way of making sure that, at the very least, no exit is thrown out of a key lookup. Unless those are marked green, we should not expose any node via the LB.

I'll add all the caches to the readiness probe.

ukutaht · 2025-02-10T08:22:32Z

@aerosol, from your Basecamp comment:

> Happy to work on any /api/health improvements allowing us to achieve better orchestration. Basically my take is we should start by refining our restart strategy to "let it crash more before we take it down", to have more meaningful traces of any underlying conditions.

This is exactly what the separate liveness probe does. The app will only be restarted when the container is unreachable, or when it is reachable but cannot even respond with 200 OK.

aerosol · 2025-02-10T08:27:43Z

@ukutaht yes, but given we let the traffic in once caches are initially ready, which /correct me if I'm wrong/ is currently not enforced? cc @cnkk

ukutaht · 2025-02-10T12:55:39Z

Sorry I don't understand that sentence @aerosol

lib/plausible_web/controllers/api/system_controller.ex

aerosol · 2025-02-10T13:08:03Z

> Sorry I don't understand that sentence

Sorry, I hope it's all cleared up in Basecamp. I think we might give it a go. Please consider merging #5057; it's not critical, mostly cosmetic.