Improved health checks #11161
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
@kaxil Shall I pick this one up or do we need to analyze this first?
It's not that easy to implement if we want multiple schedulers. It also means that if the webserver is down, we won't be able to check the status of any component. I think we should try to find a better alternative. This endpoint is fine as long as it is only informative and is not abused as the basis for continuous monitoring.
@mik-laj Maybe expose a metrics endpoint on each component? That way each component can be checked and restarted individually on Kubernetes.
What about sending StatsD metrics as one of the options? We can set up alerts on top of the collected metrics. I think we should add this anyway, since StatsD is already integrated for several things. (At my company, we already use this method to monitor everything related to Airflow.)
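As a rough illustration of the StatsD idea in the comment above, a component could emit a heartbeat gauge over UDP, and an alert could fire when the gauge goes stale. This is only a sketch: the metric name `airflow.example.heartbeat`, the host, and the port are assumptions made for illustration, not metrics that Airflow is known to emit.

```python
import socket
import time


def format_gauge(name: str, value: int) -> bytes:
    """Encode a gauge in the plain-text StatsD line format: <name>:<value>|g."""
    return f"{name}:{value}|g".encode("ascii")


def send_heartbeat(
    host: str = "127.0.0.1",  # assumed StatsD host
    port: int = 8125,  # conventional StatsD UDP port
    name: str = "airflow.example.heartbeat",  # hypothetical metric name
) -> None:
    """Send the current Unix time as a gauge; UDP is fire-and-forget."""
    payload = format_gauge(name, int(time.time()))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    send_heartbeat()
```

Because UDP delivery is not acknowledged, a missing heartbeat tells you that something is wrong but not what, which is part of why the thread treats metrics and health checks as separate concerns.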
@rootcss But it's metrics not health checks? Seeing at least 2 problematic things:
Totally agree @Alien2150, StatsD is not the best solution for health checks, but it's good to have since it's already used in other Airflow modules. A foolproof health check will need further discussion.
Meanwhile, I think a temporary solution could be to run a simple sidecar that maps the HTTP results to a Kubernetes probe format. Are you folks managing the Helm chart as well?
Running an HTTP server on every component to provide metrics sounds like a much better solution to me. Do we want to continue using a custom JSON view? If we want other tools to be able to integrate with it, rather than using it only in the Web UI, I think it is worth considering a common standard, e.g. JMX or Prometheus.
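The sidecar idea could look roughly like the sketch below: a tiny HTTP server that fetches the webserver's health JSON and answers with a status code a Kubernetes HTTP probe can act on (probes treat 200 to 399 as success and anything else as failure). Several things here are assumptions: the upstream URL, the sidecar port 8081, and the payload shape, taken to be a JSON object whose values each carry a `"status"` field equal to `"healthy"` when the component is fine. 503 is used for the unhealthy case, although any code outside the success range would fail the probe.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the webserver is reachable at this address from the sidecar.
UPSTREAM_URL = "http://localhost:8080/health"


def probe_status(payload) -> int:
    """Return 200 only if every reported component is 'healthy', else 503."""
    if not isinstance(payload, dict):
        return 503
    statuses = [
        component.get("status")
        for component in payload.values()
        if isinstance(component, dict)
    ]
    if statuses and all(status == "healthy" for status in statuses):
        return 200
    return 503


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        try:
            with urllib.request.urlopen(UPSTREAM_URL, timeout=5) as response:
                code = probe_status(json.load(response))
        except (OSError, ValueError):
            # Upstream unreachable or returned non-JSON: report unhealthy.
            code = 503
        self.send_response(code)
        self.end_headers()


if __name__ == "__main__":
    # Assumption: 8081 is free inside the pod; point the probe at this port.
    HTTPServer(("0.0.0.0", 8081), ProbeHandler).serve_forever()
```

This inherits the limitation raised earlier in the thread: if the webserver is down, every component looks unhealthy, so it is a stopgap rather than a per-component check.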
We should look at integrating Prometheus natively in Airflow for metrics. I remember creating an issue for it (not sure if it was Jira or GH).
We should find out what Celery's plans are for deeper integration with Kubernetes. It may coincide with our needs.
@kaxil That would be awesome. Right now the only way to do that is to use the exporter: https://pypi.org/project/airflow-prometheus-exporter/
It seems simpler to add a CLI command that will check if the process is healthy. We can be inspired by the Helm Chart, but it doesn't support multiple schedulers yet.
@mik-laj Looks good. Potentially these values should be adjustable. What do you think?
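A CLI-style check of this kind could be sketched as a heartbeat freshness test whose exit code a Kubernetes exec probe can consume (0 for healthy, non-zero otherwise). This is not the command that was eventually added to Airflow: the script, its flags, and the 30-second default threshold are hypothetical. A real check would read the latest heartbeat from the metadata database, whereas here it is passed in as an argument.

```python
import argparse
import sys
from datetime import datetime, timedelta, timezone


def is_alive(latest_heartbeat: datetime, now: datetime, threshold_seconds: int) -> bool:
    """A process counts as alive if its last heartbeat is recent enough."""
    return now - latest_heartbeat <= timedelta(seconds=threshold_seconds)


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Hypothetical heartbeat freshness check")
    parser.add_argument("--latest-heartbeat", required=True,
                        help="ISO 8601 timestamp, e.g. 2021-01-01T00:00:00+00:00")
    parser.add_argument("--threshold", type=int, default=30,
                        help="maximum heartbeat age in seconds (adjustable)")
    args = parser.parse_args(argv)

    heartbeat = datetime.fromisoformat(args.latest_heartbeat)
    if heartbeat.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        heartbeat = heartbeat.replace(tzinfo=timezone.utc)

    healthy = is_alive(heartbeat, datetime.now(timezone.utc), args.threshold)
    print("healthy" if healthy else "unhealthy")
    return 0 if healthy else 1


if __name__ == "__main__":
    sys.exit(main())
```

Exposing the threshold as a flag keeps the check usable across deployments with different heartbeat intervals.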
Part of: #11161 To have a full description of the monitoring of all Airflow components, I add information about HTTP and CLI checks for Celery. Thanks to this, we will not have to search for information one by one, but everything will be in one place. Documentation for Celery does not describe the inspect ping command, but hopefully, this will be added soon.
Hi. I made some improvements and we have a health check for each component. This is described in the documentation:
Related: #13838
Awesome. Thx @mik-laj
Part of: #11161 To have a full description of the monitoring of all Airflow components, I add information about HTTP and CLI checks for Celery. Thanks to this, we will not have to search for information one by one, but everything will be in one place. Documentation for Celery does not describe the inspect ping command, but hopefully, this will be added soon. (cherry picked from commit 8cabae7)
Description
Improve health checks for Airflow
Use case / motivation
According to https://airflow.apache.org/docs/stable/howto/check-health.html, the health endpoint includes health information for the database and the scheduler. However, the HTTP status code cannot be used as a health indicator, which makes the endpoint hard to use with orchestration software like Kubernetes.
I think it would be better to add two additional routes:
On top of that, it would be great to use the HTTP status code as the health signal (e.g. 200 for healthy and 4xx for unhealthy). Kubernetes health probes could then check the status code and, for example, restart the scheduler.
Related Issues