Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Production Helm chart support #8777

Merged
merged 27 commits into from
Jun 17, 2020
Merged

Conversation

schnie
Copy link
Contributor

@schnie schnie commented May 8, 2020

This PR adds a default helm chart for Airflow. This chart is based on the one we use at Astronomer to manage hundreds of production deployments.

The chart has been cleaned up to remove any references to the Astronomer platform or anything specific to the surrounding infrastructure.

The bin directory contains some scripts to help run this chart locally or in CI using kind, as well as packaging and releasing it to a helm repository. The scripts reference the Astronomer helm repo right now, but I figured we could use it as a starting point and tweak these scripts, or remove them completely to fit into how we want to manage this going forward. Opening this as a draft PR for now to start some discussion.


Make sure to mark the boxes below before creating PR: [x]

  • Description above provides context of the change
  • Unit tests coverage for changes (not needed for documentation changes)
  • Target Github ISSUE in description if exists
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@kaxil kaxil changed the title Add helm chart Add Production Helm chart support May 8, 2020
chart/values.yaml Outdated Show resolved Hide resolved
@thesuperzapper
Copy link
Contributor

I am also working on something similar, which is initially a rewrite of the stable/airflow chart.

@thesuperzapper
Copy link
Contributor

@schnie I have released v7.0.0 of the stable/airflow chart, see here.

There definitely are a few changes from your chart which would improve things in stable/airflow (like how you do health checks), but there are features in the stable/airflow chart which users need for production/enterprise environments, so we should either aim to include them in a new official chart, or start from the stable/airflow one.

Regardless, I think we should decouple the release of airflow from its Helm chart (and potentially keep it in a seperate repo). Also note, the official helm/charts repo is being depreciated at the end of the year, so the community will need an upgrade path.

tag: ~
pullPolicy: IfNotPresent
flower:
repository: astronomerinc/ap-airflow
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to use the official airflow image, is it just changing this image?
Or changes required across the chart?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point. @dimberman has some experience with this. I believe the intention is to follow up this PR with a better integration into the official image as well as Airflow 2.0. Any thoughts on this Daniel?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schnie I'd say let's merge what we have and I will then

  1. Create a PR making it compatible with the airflow 2.0 official image
  2. Create a backport making it compatible with the 1.10 image.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't understand how the 'offical' helm chart for airflow could use anything but the offical Dockerfile.
For reference, we already use the offical Dockerfile at stable/airflow

@schnie schnie marked this pull request as ready for review May 22, 2020 18:30
@marclamberti
Copy link

marclamberti commented May 23, 2020

@thesuperzapper There is an issue in the Helm chart.
redis: enabled: false
produces the following error:
spec.template.spec.containers[0].env[1].valueFrom.secretKeyRef.name: Invalid value: \"\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
which is related to this in _helpers.tpl :
name: REDIS_PASSWORD valueFrom: secretKeyRef: name: {{ .Values.externalRedis.passwordSecret }} key: {{ .Values.externalRedis.passwordSecretKey }}
where passwordSecret is by default sets to "" and so the error.

Furthermore, I think it should be easier to set it for the KubernetesExecutor as well. It depends too much of the CeleryExecutor whereas both are quite used now.

Hope it helps :)

@thesuperzapper
Copy link
Contributor

@marclamberti, thanks for the heads up, you'll be glad to know that fixes for both issues you raised have made their way into v7.1.0 of the chart.

@thesuperzapper
Copy link
Contributor

@schnie, I can't see a section of this chart which will allow users to sync their DAGs from an external git repo. I think this is probably the single most common deployment config for airflow, you need to support it.

@thesuperzapper
Copy link
Contributor

thesuperzapper commented May 26, 2020

@schnie, not all K8S clusters support Ingress that are not on the public internet, therefore not all companies be able to use this chart.

You should add support for exposing the webserver/flower as LoadBalancer type Services, (and therefore using Airflow's inbuilt SSL support for HTTPS.

@turbaszek
Copy link
Member

Hi @schnie, can we help you somehow with moving this PR forward? 😉

@marclamberti
Copy link

marclamberti commented Jun 2, 2020

Wouldn't be easier to start with the chart of @thesuperzapper ? In the sense that it already include a lot of features ? I'm pretty sure it is widely used too as it is available on the stable Helm charts repo

@streetmapp
Copy link

I would definitely agree in terms of basing it from the chart that is in the stable Helm charts repo. There are some interesting components in the Astronomer and Bitnami charts that could make this chart really good. I think another good thing would be increasing the ease of getting going with the Kubernetes executor too.

@dimberman
Copy link
Contributor

@marclamberti we get a lot of questions from users of the community chart who can't seem to get it to work/have all sorts of weird issues. Has this not been your experience?

@streetmapp completely agreed. I'm working on backporting a fix that will allow users to apply arbitrary pod specs via yaml files on the KubernetesExecutor (hopefully in time for 1.10.11), and will then write much more documentation around that.

@marclamberti
Copy link

@dimberman Well indeed, I got some issues too, one at this moment actually. I agreed with @streetmapp my point was, there are three charts right now, having some very interesting features that could be nice to merge in some way. Btw, I would be glad to help :)

@schnie
Copy link
Contributor Author

schnie commented Jun 2, 2020

@schnie, not all K8S clusters support Ingress that are not on the public internet, therefore not all companies be able to use this chart.

You should add support for exposing the webserver/flower as LoadBalancer type Services, (and therefore using Airflow's inbuilt SSL support for HTTPS.

Great point, I meant to fix this before pushing. For now, I've removed the Ingress object and made the Webserver and Flower Service types configurable, defaulting to ClusterIP.

@dimberman
Copy link
Contributor

@schnie can you make the yamllint tests pass please? I think we can merge an initial PR just to get the code in and then start fixing it on master.

@streetmapp
Copy link

The ingress components were already capable of being on/off were they not? Seems odd to remove them if that was the case.

@thesuperzapper
Copy link
Contributor

thesuperzapper commented Jun 3, 2020

@dimberman, I honestly think we should consider putting any official helm charts in a separate repository, as there is little reason to pin their releases to airflow itself.

@potiuk
Copy link
Member

potiuk commented Jun 3, 2020

@dimberman, I honestly think we should consider putting any official helm charts in a separate repository, as there is little reason to pin their releases to airflow itself.

@thesuperzapper The main reason is that we're going to make both helm charts and production image used in automated tests of airflow - this way they will both be automatically tested during the tests of airflow. We are basically going to "eat of our dog food" as the term goes - so we are going to use airlfow tests in master to test our Dockerfile and Helm chart. At the same time we use our Helm and Dockerfile to test airflow - it already happens for Dockerfile - we are running Kubernetes Executor tests using production image and we are going to switch to helm charts for that soon.

Even now both Production image and Helm chart from master can be used to install older versions of Airflow - see https://github.com/apache/airflow/blob/master/IMAGES.rst - then you will how to use the current image from master to build any airflow 1.10* version of the image.

You cam treat those Dockerfile and Charts as development versions - same as all other airflow code.

Once this is more stable and proven, both production docker file and help chart will land in their dedicated official repositories (additionally to continuing development in master of Airflow).

This is all described in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-26+Production-ready+Airflow+Docker+Image+and+helm+chart

and we are approximately half way from what we considered as Done

Helm chart will be here:

Doickefile will become an official docker image

@thesuperzapper
Copy link
Contributor

@potiuk I maintain the stable/airflow helm chart, but this chart will need to move before the end of the year, as that repository is being archived.

I think we (Apache Airflow) should have a helm repository, similar to how ArgoProj has their own helm repo, initially this can contain the existing stable/airflow chart.

Please see here: https://github.com/helm/charts#deprecation-timeline

@potiuk
Copy link
Member

potiuk commented Jun 3, 2020

Great. Then we will have to aadapt our plans and make whatever everyone will be doing if "official" helm chart will no longer be official.

Regardless of that the Dockerfile an Helm Chart will be here to stay and once we stabilise it and make both image and helm robust we will find a good place for both I am sure.

chart/values.yaml Outdated Show resolved Hide resolved
chart/values.yaml Outdated Show resolved Hide resolved
@ashb
Copy link
Member

ashb commented Jun 3, 2020

Looking at the yamllint errors, I think we have to make it not check anything under chart/templates/ -- they aren't YAML files as such, but yaml+gotexttemplate, so yamllint is never going to like them.

@kaxil kaxil merged commit 66e7382 into apache:master Jun 17, 2020
@kaxil kaxil deleted the add-helm-chart branch June 17, 2020 22:38
@kaxil
Copy link
Member

kaxil commented Jun 17, 2020

Thank you all for the comments and discussions. PRs are now welcome to make any changes to existing code or add new features.

@kaxil
Copy link
Member

kaxil commented Jun 17, 2020

Big thanks to @schnie 🚀

@marclamberti
Copy link

Awesome! Time to use the official chart 🚀🚀

@potiuk
Copy link
Member

potiuk commented Jun 18, 2020

Fantastic! 🚀 🚀 . I am going to switch to the helm chart for our Kubernetes tests soon and iterate on the prod image @dimberman @schnie @kaxil @ashb -> I guess I am free to do so ? You are not working on this?

@Siddharthk
Copy link

Siddharthk commented Jun 18, 2020

Thanks for this. Was waiting for this eagerly. I tested it out quickly and I am getting below error using helm 2. Am I doing something wrong? I just want to get the yaml files and not install via helm.

$ helm template --values ./airflow/chart/values.yaml --output-dir ./flow ./airflow/chart
Error: found in requirements.yaml, but missing in charts/ directory: postgresql

@kaxil
Copy link
Member

kaxil commented Jun 18, 2020

Thanks for this. Was waiting for this eagerly. I tested it out quickly and I am getting below error using helm 2. Am I doing something wrong? I just want to get the yaml files and not install via helm.

$ helm template --values ./airflow/chart/values.yaml --output-dir ./flow ./airflow/chart
Error: found in requirements.yaml, but missing in charts/ directory: postgresql

You will need to run helm dep update first :)

@streetmapp
Copy link

What's the repo URL to be used for this chart? The one listed in the Readme points to the stable Helm repo so if you are following that, you end up pulling down the Airflow chart from there and not this one.

Also of note, later in the Readme the Astronomer repo is listed in the Walkthrough using kind section.

@kaxil
Copy link
Member

kaxil commented Jun 18, 2020

What's the repo URL to be used for this chart? The one listed in the Readme points to the stable Helm repo so if you are following that, you end up pulling down the Airflow chart from there and not this one.

Also of note, later in the Readme the Astronomer repo is listed in the Walkthrough using kind section.

We haven't published this chart yet, PRs are welcome to improve the chart including documentation and we would let everyone know once we publish it.

@thesuperzapper
Copy link
Contributor

Is the master branch really the correct place to be developing this? Especially as there is currently no release/testing infrastructure in master for this chart.

I am worried that we will end up clogging master's commit history, and confusing users about the maturity/availability of the chart feature.

@potiuk
Copy link
Member

potiuk commented Jun 19, 2020

Yes. it is good the @thesuperzapper . IMHO Monorepo with all stuff in is one of the best ideas for development for such project size. We are going to use it for our kubernetes tests and we already have unit tests in progress #9371 and we are going to develop it at the same time as developing production image and airflow code itself.

We are definitely going to release - both images and helm charts in an "official" way - the same way as all the code is released for Apache Airflow. Note that there is a very formal process to follow to release any code for Apache Airflow once the code becomes "community managed" http://www.apache.org/legal/release-policy.html - this includes voting by PMCs and at least 72 hours of voting period. So we will likely be releasing Helm chart together with official releases of Apache Airlfow (same with Docker Image). Everything in master is in "development" for Apache Airflow.

@Siddharthk
Copy link

Siddharthk commented Jun 19, 2020

I have some questions:

  1. I see that the recommended way to store dags is the docker image. But when we deploy new image, the webserver restarts thus causing UI interruption for some time. Is this a bug? Should I run 2 replicas?
airflow-postgresql-0                 1/1     Running           0          25h
airflow-scheduler-5f9c688c54-qxrd8   0/2     PodInitializing   0          14s
airflow-scheduler-76c7949bb6-qwstg   2/2     Running           0          64m
airflow-statsd-7445645ddd-f6m4c      1/1     Running           0          25h
airflow-webserver-5c4bc5d886-7zj5h   0/1     Init:0/1          0          14s
  1. By default, the chart does not use persistent volumes for task/pod logs. What is the recommended way for logs? Can pv become default?

  2. Getting below error from task pod logs while running a simple tutorial:

$ kubectl logs -f tutorial1sleep300-1ca52dc981174d3daa29e4f606355a3d  -n airflow
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 23, in <module>
    import argcomplete
ModuleNotFoundError: No module named 'argcomplete'

@kaxil @ashb @potiuk any idea about this?

@aneesh-joseph
Copy link
Contributor

aneesh-joseph commented Jun 23, 2020

@Siddharthk , for 3, I added the below config to the airflow.cfg kubernetes section of the configmap

run_as_user = {{ .Values.uid }}
fs_group = {{ .Values.gid }}

so that it runs as the correct airflow user, instead of root

Have added this to the config map in a PR

@ryw
Copy link
Member

ryw commented Jun 23, 2020

@schnie see a few questions from @Siddharthk (was chatting w/ him in Airflow Slack)

@marclamberti
Copy link

@Siddharthk ,
1 / It's not a bug, it's the way Kubernetes works. If you update your Helm chart with a new image tag, then Kubernetes will update your app to keep the state you desire. Take a look at the document right here: https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/.
At the bottom you can read:

Similar to application Scaling, if a Deployment is exposed publicly, the Service will load-balance the traffic only to available Pods during the update.
In case your Airflow instance is running in dev/staging, that should not be a problem as it may be ok to have a little downtime between updates. In case you are in production, you should have your webserver in HA (highly available), and so, replicas running with a load balancer in front of them. During the rolling update, while one pod is being updated, the others are still accessible as well as the Airflow UI.

2/ I don't think there is a recommended way to store logs as it depends on your environment and requirements. For instance, I store my logs in S3. S3 is cheap and reliable. I can process the logs later with other tools to make analysis if needed and I can archive them at a defined time. 3

3/ About your error, I got it once with the "unofficial" chart. Sadly I don't remember how I fixed it. Check with kubectl describe your pod if you don't get additional information and keep your workers after completion so that you can debug them.

Hope it helps :)

@Siddharthk
Copy link

Thanks @aneesh-joseph @marclamberti

I have made webserver as HA for now and configured S3 for logs. For python module error, I have created below env for now which worked:

 - name: "AIRFLOW__KUBERNETES__RUN_AS_USER"
   value: "50000"

Another issue I am facing is mounting secrets to task pods. Basically I am trying to use ssh_key file for ssh operator. I added additional volumes and volume mounts for the scheduler. Will this reflect on the task pods as well?

Task pod:
airflow@tutorial1sleep300-2f920b4c375b4d54bd15a4535b49d368:/opt/airflow$ ls
airflow.cfg  config  dags  logs  requirements.txt  unittests.cfg  webserver_config.py

Scheduler pod:
airflow@airflow-scheduler-5474b69b67-j5c2c:/opt/airflow$ ls
airflow.cfg  dags  **id_rsa.pub**  logs  requirements.txt  unittests.cfg  webserver_config.py

@marclamberti
Copy link

Well I did that too actually but it was with the unofficial helm chart. Are you using the official one here?

@Siddharthk
Copy link

Siddharthk commented Jun 24, 2020

Well I did that too actually but it was with the unofficial helm chart. Are you using the official one here?

Yes, using the official one.

@aneesh-joseph
Copy link
Contributor

aneesh-joseph commented Jun 24, 2020

Another issue I am facing is mounting secrets to task pods. Basically I am trying to use ssh_key file for ssh operator. I added additional volumes and volume mounts for the scheduler. Will this reflect on the task pods as well?

No, I don't think it will mount them automatically to the worker pods created by the KubernetesExecutor. If this is for a specific task, you could probably specify the volumes and volumeMounts via the executor_config param of that task. If you want it to be mounted on all worker pods, you may have to use a pod mutation hook via airflowLocalSettings

kaxil pushed a commit to kaxil/airflow that referenced this pull request Jun 27, 2020
potiuk pushed a commit that referenced this pull request Jul 1, 2020
Co-authored-by: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit 66e7382)
@potiuk potiuk added this to the Airflow 1.10.11 milestone Jul 1, 2020
kaxil pushed a commit that referenced this pull request Jul 2, 2020
Co-authored-by: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit 66e7382)
kaxil pushed a commit that referenced this pull request Jul 13, 2020
Co-authored-by: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit 66e7382)
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
Co-authored-by: Ash Berlin-Taylor <[email protected]>
(cherry picked from commit 66e7382)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.