Improved health checks #11161

Closed
Alien2150 opened this issue Sep 26, 2020 · 17 comments

@Alien2150

Description

Improve health checks for Airflow

Use case / motivation

According to https://airflow.apache.org/docs/stable/howto/check-health.html the health endpoint includes health information for the metadatabase and the scheduler. However, the HTTP status code does not reflect that health, which makes the endpoint hard to use from orchestration software such as Kubernetes.

I think it would be better to add two additional routes:

  • /health/scheduler
  • /health/metadatabase

On top of that, it would be great to use the HTTP status code as the health signal (e.g. 200 for healthy and 4xx for unhealthy). Kubernetes health probes could then check the HTTP status code and, for example, restart the scheduler.
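
A minimal sketch of the behavior proposed above, assuming the JSON payload shape documented for `/health` (one `status` field per component). The function name `health_route_status`, the route table, and the use of 503 for an unhealthy component are illustrative assumptions, not Airflow's API; the proposal itself mentions a 4xx code.

```python
# Hypothetical sketch: map the proposed per-component routes to HTTP status codes.
# Only the payload shape comes from the documented /health response; the names
# and the 503 choice are assumptions made for illustration.

ROUTES = {
    "/health/scheduler": "scheduler",
    "/health/metadatabase": "metadatabase",
}


def health_route_status(path, payload):
    """Return the HTTP status code a per-component route could answer with."""
    component = ROUTES.get(path)
    if component is None:
        return 404  # route is not one of the proposed health routes
    status = payload.get(component, {}).get("status")
    return 200 if status == "healthy" else 503


payload = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "unhealthy", "latest_scheduler_heartbeat": None},
}
print(health_route_status("/health/metadatabase", payload))
print(health_route_status("/health/scheduler", payload))
```

With status codes like these, a Kubernetes HTTP probe would need no knowledge of the JSON body.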

Related Issues

@boring-cyborg

boring-cyborg bot commented Sep 26, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@rootcss
Contributor

rootcss commented Oct 6, 2020

@kaxil Shall I pick this one up or do we need to analyze this first?

@mik-laj
Member

mik-laj commented Oct 7, 2020

It's not that easy to implement if we want multiple schedulers. Also, it means that if the webserver doesn't work, we won't be able to check the status of any component. I think we should try to find a better alternative. This endpoint is fine as long as it is only informative and is not abused to build continuous monitoring on top of it.

@Alien2150
Author

@mik-laj Maybe having metrics endpoints on each component? That way each component can be checked and restarted individually on Kubernetes.

@rootcss
Contributor

rootcss commented Oct 8, 2020

What about sending statsd metrics as one option? We can set up alerts on top of the collected metrics. I think we should add this anyway, since statsd is already integrated for several things (at my company, we already use this method to monitor everything related to Airflow).
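
As a rough illustration of this idea, the sketch below formats and sends one metric in the StatsD line protocol (`<name>:<value>|<type>`) over UDP. The metric name `airflow.example.component_alive` and the helper names are hypothetical, and the host and port are assumptions (8125 is the conventional StatsD port); this is not how Airflow's own statsd integration is implemented.

```python
import socket


def statsd_line(name, value, kind="c"):
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{kind}".encode("ascii")


def send_metric(name, value, kind="c", host="127.0.0.1", port=8125):
    """Send one metric as a single UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_line(name, value, kind), (host, port))
```

Calling `send_metric("airflow.example.component_alive", 1, kind="g")` would emit a gauge. Because UDP is connectionless, the sender normally does not notice when no collector is listening, so alerting has to be built on the absence of data.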

@Alien2150
Author

Alien2150 commented Oct 8, 2020

@rootcss But those are metrics, not health checks? I see at least two problems:

  1. With your proposed approach I would need to map metrics back to health. Just imagine what would happen if the metrics service were unavailable: is it then the metrics service or Airflow that is down?
  2. Also, statsd sounds like a technology-bound solution, while HTTP is technology-agnostic?

@rootcss
Contributor

rootcss commented Oct 8, 2020

Totally agree @Alien2150. Statsd is not the best solution for health checks, but it's good to have since it's already used in other Airflow modules. A foolproof health check will need further discussion.

@Alien2150
Author

Alien2150 commented Oct 8, 2020

Meanwhile, I think a temporary solution could be to run a simple sidecar that maps the HTTP results to a Kubernetes probe format. Are you folks managing the Helm chart as well?
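
One way such a sidecar could look, as a sketch under stated assumptions: it polls the webserver's `/health` endpoint and answers `GET /<component>` with 200 or 503 so that an HTTP probe can act on the status code. The upstream URL, the listening port 8081, the 503 choice, and all function and class names are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Assumption: the Airflow webserver is reachable at this address from the sidecar.
AIRFLOW_HEALTH_URL = "http://localhost:8080/health"


def fetch_health(url=AIRFLOW_HEALTH_URL, timeout=5):
    """Fetch and decode the JSON body of the /health endpoint."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def probe_status(fetch, component):
    """Translate one component's reported status into a probe status code."""
    try:
        status = fetch().get(component, {}).get("status")
    except Exception:
        return 503  # webserver unreachable or unexpected payload
    return 200 if status == "healthy" else 503


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The last path segment selects the component, e.g. GET /scheduler.
        component = self.path.rstrip("/").rsplit("/", 1)[-1]
        self.send_response(probe_status(fetch_health, component))
        self.end_headers()


def main(port=8081):
    """Serve probe requests until the process is stopped."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Running `main()` in the sidecar container would let a probe target `/scheduler` or `/metadatabase`. The sketch still depends on the webserver being up, so an unreachable webserver is reported as unhealthy for every component.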

@mik-laj
Member

mik-laj commented Oct 8, 2020

Maybe having metrics - endpoints on each component?

Running an HTTP server on every component to provide metrics sounds like a much better solution to me. Do we want to continue using a custom JSON view? If we want other tools to be able to integrate with it, rather than using it only in the Web UI, I think it is worth considering a common standard, e.g. JMX or Prometheus.
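
To make the "common standard" option concrete, here is a minimal sketch that renders per-component health in the Prometheus text exposition format. The metric name `airflow_component_healthy`, the `component` label, and the function name are hypothetical; Airflow does not necessarily expose anything like this.

```python
def render_health_metrics(statuses):
    """Render {component: is_healthy} as Prometheus text exposition lines."""
    lines = [
        "# HELP airflow_component_healthy 1 if the component reports healthy, else 0.",
        "# TYPE airflow_component_healthy gauge",
    ]
    for component in sorted(statuses):
        value = 1 if statuses[component] else 0
        lines.append(f'airflow_component_healthy{{component="{component}"}} {value}')
    return "\n".join(lines) + "\n"


print(render_health_metrics({"scheduler": True, "metadatabase": False}), end="")
```

A component-local HTTP server could return this text from a `/metrics` route, which any Prometheus-compatible scraper can read without custom JSON parsing.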

@kaxil
Member

kaxil commented Oct 8, 2020

We should look at integrating Prometheus natively in Airflow for metrics. I remember creating an issue for this (not sure if it was in Jira or on GitHub).

@mik-laj
Member

mik-laj commented Oct 8, 2020

We should find out what Celery's plans are for deeper integration with Kubernetes. It may coincide with our needs.

@Alien2150
Author

@kaxil That would be awesome. Right now the only way to do that is to use the exporter: https://pypi.org/project/airflow-prometheus-exporter/

@mik-laj
Member

mik-laj commented Nov 4, 2020

It seems simpler to add a CLI command that checks whether the process is healthy. We can take inspiration from the Helm Chart, although it doesn't support multiple schedulers yet.
https://github.com/apache/airflow/blob/master/chart/templates/scheduler/scheduler-deployment.yaml#L115-L130
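
The core of such a check can be sketched as a heartbeat freshness test whose result becomes the process exit code, which is what an exec-style probe inspects. This is an illustration only: the function names are hypothetical, the 30-second default threshold is an assumption, and how the latest heartbeat is looked up (for example from the metadatabase) is left out.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_fresh(latest_heartbeat, now=None, threshold_seconds=30):
    """Return True if the last heartbeat is recent enough to count as alive.

    Both datetimes are expected to be timezone-aware.
    """
    if latest_heartbeat is None:
        return False  # the component never reported a heartbeat
    now = now or datetime.now(timezone.utc)
    return now - latest_heartbeat <= timedelta(seconds=threshold_seconds)


def exit_code(latest_heartbeat, now=None, threshold_seconds=30):
    """Exit code for a CLI check: 0 means healthy, 1 means unhealthy."""
    return 0 if heartbeat_is_fresh(latest_heartbeat, now, threshold_seconds) else 1
```

Keeping `threshold_seconds` as a parameter leaves the staleness limit adjustable per deployment.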

@Alien2150
Author

@mik-laj Looks good. Potentially these values should be adjustable. What do you think?

kaxil pushed a commit that referenced this issue Mar 1, 2021
Part of: #11161

To have a full description of the monitoring of all Airflow components, I add information about HTTP and CLI checks for Celery. Thanks to this, we will not have to search for information one by one, but everything will be in one place.

Documentation for Celery does not describe the inspect ping command, but hopefully, this will be added soon.
@mik-laj
Member

mik-laj commented Mar 1, 2021

Hi.

I made some improvements and we now have a health check for each component. This is described in the documentation:
http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow/latest/logging-monitoring/check-health.html
You can also see it in our docker-compose.yaml file.
https://github.com/apache/airflow/blob/master/docs/apache-airflow/start/docker-compose.yaml
I think that this ticket can now be closed.

@mik-laj mik-laj closed this as completed Mar 1, 2021
@mik-laj
Member

mik-laj commented Mar 2, 2021

Related: #13838

@Alien2150
Author

Awesome. Thx @mik-laj

potiuk pushed a commit that referenced this issue Mar 3, 2021
Part of: #11161

To have a full description of the monitoring of all Airflow components, I add information about HTTP and CLI checks for Celery. Thanks to this, we will not have to search for information one by one, but everything will be in one place.

Documentation for Celery does not describe the inspect ping command, but hopefully, this will be added soon.

(cherry picked from commit 8cabae7)