Separate liveness and readiness checks #5048
What decision(s) are we willing to make based on HTTP availability alone? IMHO it's the cache readiness check that should be separated. BTW, an omission on my part: only two caches are currently checked. We should include user agents, sessions, country rules, page rules and hostname rules too. Checking cache size is not mandatory, but it is a convenient way of making sure that, at the very least, a key lookup does not throw an exit. Until all of those are marked green, we should not expose any node via the LB.
The readiness probe is used to decide whether to route traffic to the container or not. As I understand it, the liveness probe should only fail in fatal scenarios that the application cannot recover from by itself. Database downtime or BEAM process crashes, for example, should in theory be recovered from without a restart, which is the main idea behind this separation.
I'll add all the caches to the readiness probe.
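As a rough illustration of the cache check discussed above, here is a language-agnostic sketch in Python (the project itself runs on the BEAM, so this is not its actual code). The cache names and the `size_of` callables are hypothetical stand-ins; the only idea taken from the thread is that a size lookup which throws should mark the cache as not ready.

```python
from typing import Callable, Dict


def cache_ready(size_of: Callable[[], int]) -> bool:
    """Treat a cache as ready if asking for its size does not raise.

    The size value itself is ignored: the lookup is only used to prove
    that reading from the cache does not blow up.
    """
    try:
        size_of()
        return True
    except Exception:
        # A failing lookup means "not ready"; it must not crash the probe.
        return False


def caches_ready(size_lookups: Dict[str, Callable[[], int]]) -> Dict[str, bool]:
    """Run the check for every cache and report one result per cache name."""
    return {name: cache_ready(lookup) for name, lookup in size_lookups.items()}


def all_caches_ready(size_lookups: Dict[str, Callable[[], int]]) -> bool:
    """The node should only be exposed when every cache is green."""
    return all(caches_ready(size_lookups).values())
```

Reporting one result per cache, rather than a single boolean, makes it easier to see which cache is holding a node out of rotation.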
@aerosol from your basecamp comment
This is exactly what the separate liveness probe does. The app will only be restarted when the container is unreachable or is reachable and cannot even respond with 200 OK.
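To make the division of responsibility concrete, here is a simplified sketch (in Python, for illustration only) of how an orchestrator might act on the two probes. The `Action` names and the `decide` function are hypothetical, and real orchestrators also apply failure thresholds and timeouts that are left out here.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"                  # keep running and keep routing traffic
    STOP_ROUTING = "stop_routing"  # keep running, take out of the load balancer
    RESTART = "restart"            # restart the container


def decide(live_status: Optional[int], ready_status: Optional[int]) -> Action:
    """Map probe results to an action.

    A status of None stands for "unreachable" (no HTTP response at all).
    """
    # Liveness: anything other than 200 OK is treated as fatal.
    if live_status != 200:
        return Action.RESTART
    # Readiness: the app is alive but a dependency is down, so stop
    # sending it traffic and let it recover without a restart.
    if ready_status != 200:
        return Action.STOP_ROUTING
    return Action.NONE
```

Under this split, database downtime only ever produces `STOP_ROUTING`, never `RESTART`.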
Sorry, I don't understand that sentence, @aerosol.
Sorry, hope it's all cleared up in basecamp. I think we might give it a go. Please consider merging #5057; it's not critical, mostly cosmetic.
c7201f9 to 69bbae4
Changes
Adds two new endpoints:
1. /api/system/health/live does nothing more than return 200 OK with a JSON body {"ok": "true"}
2. /api/system/health/ready checks all the necessary dependencies for accepting traffic (identical to the old /api/health behaviour):
   2.1 Postgres reachable
   2.2 ClickHouse reachable
   2.3 Ingestion caches ready
The old /api/health endpoint is still usable, to give time to migrate external checks.

Technically, ingestion can keep working while Postgres is unreachable, as long as the caches are loaded. A future possibility is separating the ingestion readiness check from app readiness.
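The behaviour of the two endpoints can be sketched roughly as follows. This is an illustrative Python model, not the project's implementation: the handler names, the check callables, the 503 status on failure and the per-dependency response body for the ready endpoint are all assumptions. Only the 200 OK with {"ok": "true"} for the live endpoint and the three dependency checks come from the description above.

```python
import json
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def live() -> Tuple[int, str]:
    """Liveness: no dependency checks, just prove the app can answer."""
    return 200, json.dumps({"ok": "true"})


def ready(checks: Dict[str, Check]) -> Tuple[int, str]:
    """Readiness: every dependency must pass before traffic is accepted."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    # Assumption: 503 signals "not ready"; the real status code may differ.
    status = 200 if all(results.values()) else 503
    return status, json.dumps(results)
```

A caller would pass one check per dependency, for example `ready({"postgres": pg_ok, "clickhouse": ch_ok, "caches": caches_ok})`, where the three callables are placeholders for real connectivity checks.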