Improved health checks #11161

Alien2150 · 2020-09-26T04:25:51Z

Description

Improve health checks for Airflow

Use case / motivation

According to https://airflow.apache.org/docs/stable/howto/check-health.html the health endpoint includes database and scheduler health information. However, the HTTP status code does not reflect that health, so the endpoint is not useful for orchestration software such as Kubernetes.

I think it would be better to add two additional routes:

/health/scheduler
/health/metadatabase

On top of that, it would be great to use the HTTP status code as the health indicator (e.g. 200 for healthy and 4xx for unhealthy). Kubernetes health probes could then check the HTTP status code and, for example, restart the scheduler.
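A minimal sketch of the proposed mapping, assuming the JSON payload shape described on the linked check-health page (a `metadatabase` and a `scheduler` entry, each with a `status` field). The function name `health_status_code` and the default of 503 for unhealthy are illustrative assumptions, not part of Airflow; the proposal above suggests a 4xx code, and either works for Kubernetes because HTTP probes treat any status outside 200-399 as a failure.

```python
HEALTHY = "healthy"


def health_status_code(payload, component=None, unhealthy_code=503):
    """Map a /health JSON payload to an HTTP status code.

    component=None checks every component present in the payload;
    otherwise only the named component is checked.
    """
    if component is None:
        statuses = [info.get("status") for info in payload.values()]
    else:
        statuses = [payload.get(component, {}).get("status")]
    # An empty or missing entry counts as unhealthy.
    all_healthy = bool(statuses) and all(s == HEALTHY for s in statuses)
    return 200 if all_healthy else unhealthy_code


# Hypothetical payload: database fine, scheduler not reporting.
payload = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "unhealthy", "latest_scheduler_heartbeat": None},
}
print(health_status_code(payload))                  # 503
print(health_status_code(payload, "metadatabase"))  # 200
print(health_status_code(payload, "scheduler"))     # 503
```

With a mapping like this, `/health/scheduler` and `/health/metadatabase` would each call the function with their own component name.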

Related Issues

boring-cyborg · 2020-09-26T04:25:52Z

Thanks for opening your first issue here! Be sure to follow the issue template!

rootcss · 2020-10-06T18:25:55Z

@kaxil Shall I pick this one up or do we need to analyze this first?

mik-laj · 2020-10-07T23:00:22Z

It's not that easy to implement if we want multiple schedulers. Also, it means that if the webserver doesn't work, we won't be able to check the status of any component. I think we should try to find a better alternative. This endpoint is fine as long as it is only informative and is not abused as the basis for continuous monitoring.

Alien2150 · 2020-10-08T09:01:51Z

@mik-laj Maybe having metrics endpoints on each component? That way each component can be checked and restarted individually on Kubernetes.

rootcss · 2020-10-08T11:40:35Z

What about sending statsd metrics as one of the ways? We can set up alerts on top of the collected metrics. I think we should add this anyway, since we already have statsd integrated for several things. (In my company, we already use this method to monitor everything related to Airflow.)
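For illustration only, a sketch of what emitting a health gauge over the statsd line protocol (`<name>:<value>|g` sent over UDP) could look like. The metric name `airflow.scheduler.healthy`, the helper names, and the host and port defaults are assumptions made for this example; this is not Airflow's own statsd integration.

```python
import socket


def format_gauge(name, value):
    """Encode a statsd gauge datagram, e.g. b"name:1|g"."""
    return f"{name}:{value}|g".encode("ascii")


def send_gauge(name, value, host="localhost", port=8125):
    """Send one gauge over UDP without waiting for any acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_gauge(name, value), (host, port))


# Hypothetical metric name; 1 means healthy, 0 means unhealthy.
datagram = format_gauge("airflow.scheduler.healthy", 1)
```

Because the datagram is sent without acknowledgement, the sender cannot tell whether the metrics service received it, so alerts have to be derived on the collector side.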

Alien2150 · 2020-10-08T12:02:54Z

@rootcss But those are metrics, not health checks? I see at least two problems:

With your proposed way I would need to map metrics back to health. Just imagine what would happen if the metrics service is not available: is it then the metrics service or Airflow that is down?
Also, statsd sounds like a technology-bound solution, while HTTP is very agnostic?

rootcss · 2020-10-08T12:24:54Z

Totally agree @Alien2150. Statsd is not the best solution for health checks, but it's good to have since it's being used in other Airflow modules. For a foolproof health check, we'll need further discussion.

Alien2150 · 2020-10-08T12:27:51Z

Meanwhile, I think a temporary solution could be to run a simple sidecar that maps the HTTP results to a Kubernetes probe format. Are you folks managing the Helm chart as well?
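A rough sketch of such a sidecar using only the Python standard library. It assumes the webserver's `/health` endpoint is reachable at `http://localhost:8080/health` and that the sidecar listens on port 8081; both values, the helper names, and the choice of 503 for unhealthy are assumptions for illustration. The probe paths reuse the routes proposed at the top of this issue.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

AIRFLOW_HEALTH_URL = "http://localhost:8080/health"  # assumed webserver address
COMPONENTS = ("scheduler", "metadatabase")


def fetch_health(url=AIRFLOW_HEALTH_URL, timeout=5.0):
    """Fetch and decode the JSON document served by the health endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def probe_response(path, payload):
    """Translate a probe path such as /health/scheduler into (code, body)."""
    component = path.rstrip("/").rsplit("/", 1)[-1]
    if component not in COMPONENTS:
        return 404, "unknown component"
    status = payload.get(component, {}).get("status")
    return (200, "healthy") if status == "healthy" else (503, "unhealthy")


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            code, body = probe_response(self.path, fetch_health())
        except (OSError, ValueError):
            # Webserver unreachable or invalid JSON: report unhealthy.
            code, body = 503, "health endpoint unavailable"
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())


def serve(port=8081):
    """Run the sidecar until interrupted."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Calling `serve()` starts the sidecar. Note that it still depends on the webserver: if the webserver is down, every component is reported as unhealthy.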

mik-laj · 2020-10-08T12:55:16Z

> Maybe having metrics endpoints on each component?

Running an HTTP server on every component to provide metrics sounds like a much better solution to me. Do we want to continue using a custom JSON view? If we want other tools to be able to integrate with it, rather than it being used only in the Web UI, I think it is worth considering a common standard, e.g. JMX or Prometheus.
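As a sketch of what a common standard could look like here, the function below renders component health in the Prometheus text exposition format instead of custom JSON. The metric name `airflow_component_healthy`, the `component` label, and the input shape (the same `status` fields as the current health endpoint) are assumptions for illustration, not an existing Airflow metric.

```python
def render_health_metrics(payload, metric="airflow_component_healthy"):
    """Render per-component health as Prometheus text-format gauge samples."""
    lines = [
        f"# HELP {metric} 1 if the component reports healthy, 0 otherwise.",
        f"# TYPE {metric} gauge",
    ]
    for component in sorted(payload):
        value = 1 if payload[component].get("status") == "healthy" else 0
        lines.append(f'{metric}{{component="{component}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical input mirroring the current health endpoint's fields.
text = render_health_metrics(
    {"scheduler": {"status": "healthy"}, "metadatabase": {"status": "unhealthy"}}
)
print(text)
```

Each component would serve this text from its own HTTP endpoint, so a scraper can collect it without going through the webserver.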

kaxil · 2020-10-08T13:05:20Z

We should look at integrating Prometheus natively in Airflow for metrics -- I remember creating an Issue (not sure if it was Jira or GH)

mik-laj · 2020-10-08T13:09:21Z

We should find out what Celery's plans are for deeper integration with Kubernetes. It may coincide with our needs.

Alien2150 · 2020-10-08T13:33:25Z

@kaxil That would be awesome. Right now the only way to do that is to use the exporter: https://pypi.org/project/airflow-prometheus-exporter/

mik-laj · 2020-11-04T20:34:04Z

It seems simpler to add a CLI command that checks whether the process is healthy. We can take inspiration from the Helm Chart, but it doesn't support multiple schedulers yet.
https://github.com/apache/airflow/blob/master/chart/templates/scheduler/scheduler-deployment.yaml#L115-L130
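The core of such a CLI check can be sketched as a heartbeat freshness test that is turned into a process exit code (0 for healthy, non-zero otherwise), which is what an exec-style probe looks at. This is a simplified stand-in, not the Helm Chart's probe: how the latest heartbeat is obtained is left out, and the function names and the 30-second default threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def scheduler_is_alive(latest_heartbeat, now=None, threshold_seconds=30):
    """Return True if the last heartbeat is recent enough.

    latest_heartbeat is a timezone-aware datetime, or None if the
    scheduler has never reported.
    """
    if latest_heartbeat is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now - latest_heartbeat <= timedelta(seconds=threshold_seconds)


def exit_code(latest_heartbeat, **kwargs):
    """Exit status for a CLI check: 0 when alive, 1 otherwise."""
    return 0 if scheduler_is_alive(latest_heartbeat, **kwargs) else 1


# Hypothetical timestamps: the heartbeat is 20 seconds old.
now = datetime(2020, 11, 4, 20, 0, 30, tzinfo=timezone.utc)
heartbeat = datetime(2020, 11, 4, 20, 0, 10, tzinfo=timezone.utc)
print(exit_code(heartbeat, now=now))                        # 0
print(exit_code(heartbeat, now=now, threshold_seconds=10))  # 1
```

A command-line entry point would pass the result to `sys.exit()`. Keeping the threshold a parameter leaves room for making it configurable.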

Alien2150 · 2020-11-05T08:06:10Z

@mik-laj Looks good. Potentially these values should be adjustable. What do you think?

Part of: #11161 To have a full description of the monitoring of all Airflow components, I add information about HTTP and CLI checks for Celery. Thanks to this, we will not have to search for information one by one, but everything will be in one place. Documentation for Celery does not describe the inspect ping command, but hopefully, this will be added soon.

mik-laj · 2021-03-01T22:12:44Z

Hi.

I made some improvements and we have a health check for each component. This is described in the documentation:
http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow/latest/logging-monitoring/check-health.html
You can also see it in our docker-compose.yaml file.
https://github.com/apache/airflow/blob/master/docs/apache-airflow/start/docker-compose.yaml
I think that this ticket can now be closed.

mik-laj · 2021-03-02T06:50:02Z

Related: #13838

Alien2150 · 2021-03-02T07:25:29Z

Awesome. Thx @mik-laj

Part of: #11161 To have a full description of the monitoring of all Airflow components, I add information about HTTP and CLI checks for Celery. Thanks to this, we will not have to search for information one by one, but everything will be in one place. Documentation for Celery does not describe the inspect ping command, but hopefully, this will be added soon. (cherry picked from commit 8cabae7)

mik-laj added kind:feature Feature Requests area:monitoring labels Sep 26, 2020

This was referenced Feb 24, 2021

change the status code which healthcheck API returns #14385 (Closed)
Add CLI check for scheduler(s) #14519 (Merged)
Add health-check for celery worker #14522 (Merged)
Add docs about Celery monitoring #14533 (Merged)
Add more tips about health checks #14537 (Merged)

mik-laj mentioned this issue Mar 1, 2021

Disable in-container health checks for ad-hoc containers #14536 (Merged)

mik-laj closed this as completed Mar 1, 2021
