Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate liveness and readiness checks #5048

Merged
merged 5 commits into from
Feb 10, 2025
Merged

Separate liveness and readiness checks #5048

merged 5 commits into from
Feb 10, 2025

Conversation

ukutaht
Copy link
Contributor

@ukutaht ukutaht commented Feb 5, 2025

Changes

Adds two new endpoints:

  1. /api/system/health/live does nothing more than return 200 OK with a JSON body {"ok": "true"}
  2. /api/system/health/ready checks all the necessary dependencies for accepting traffic (identical to old /api/health behaviour):
    2.1 Postgres reachable
    2.2 Clickhouse reachable
    2.3 Ingestion caches ready

The old /api/health endpoint is still usable to give time to migrate external checks.

Technically ingestion can keep working as long as caches are loaded but postgres is unreachable. Future possibility is separating ingestion readiness check from app readiness.

CHANGELOG.md Show resolved Hide resolved
@aerosol
Copy link
Member

aerosol commented Feb 10, 2025

What decision(s) are we willing to make based on HTTP availability alone? IMHO it's the cache(s) readiness check that should be separated.

BTW, omission on my part, only two caches are currently checked against. We should include user agents, sessions, country rules, page rules, hostname rules too.

Checking cache size is not mandatory, but convenient way of making sure there's no exit thrown out of a key lookup, at least. Without those marked green we should not expose any node via LB.

@ukutaht
Copy link
Contributor Author

ukutaht commented Feb 10, 2025

What decision(s) are we willing to make based on HTTP availability alone?

The readiness probe is used to decide whether to route traffic to the container or not.
The liveness probe is used by kubernetes to decide whether to restart a container or not.

As I understand it the liveness probe should only fail in fatal scenarios that the application cannot recover by itself. Database downtime or BEAM process crashes, for example, should in theory be recovered without a restart which is the main idea with this separation.

BTW, omission on my part, only two caches are currently checked against. We should include user agents, sessions, country rules, page rules, hostname rules too.
Checking cache size is not mandatory, but convenient way of making sure there's no exit thrown out of a key lookup, at least. Without those marked green we should not expose any node via LB.

I'll add all the caches to the readiness probe.

@ukutaht
Copy link
Contributor Author

ukutaht commented Feb 10, 2025

@aerosol from your basecamp comment

Happy to work on any /api/health improvements allowing us to achieve better orchestration. Basically my take is we should start by refining our restart strategy to „let it crash more before we take it down”, to have more meaningful traces of any underlying conditions.

This is exactly what the separate liveness probe does. The app will only be restarted when the container is unreachable or is reachable and cannot even respond with 200 OK.

@aerosol
Copy link
Member

aerosol commented Feb 10, 2025

@ukutaht yes, but given we let the traffic in once caches are initially ready, which /correct me if I'm wrong/ is currently not enforced? cc @cnkk

@ukutaht
Copy link
Contributor Author

ukutaht commented Feb 10, 2025

Sorry I don't understand that sentence @aerosol

@aerosol
Copy link
Member

aerosol commented Feb 10, 2025

Sorry I don't understand that sentence

Sorry, hope all cleared up in basecamp. I think we might give it a go. Please consider merging #5057 - not critical, mostly cosmetic though.

@ukutaht ukutaht added this pull request to the merge queue Feb 10, 2025
Merged via the queue into master with commit d762795 Feb 10, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants