
[AIRFLOW-3718] [WIP/SPLIT] Multi-layered version of the docker image #4543

Closed

Conversation

potiuk (Member) commented Jan 16, 2019

[no-ci]

This PR was split into several smaller sub-tasks to make it easier to review and to merge this change in several steps.

The JIRA subtasks:

codecov-io commented Jan 16, 2019

Codecov Report

Merging #4543 into master will not change coverage.
The diff coverage is n/a.


@@          Coverage Diff           @@
##           master   #4543   +/-   ##
======================================
  Coverage    74.1%   74.1%           
======================================
  Files         421     421           
  Lines       27662   27662           
======================================
  Hits        20498   20498           
  Misses       7164    7164

Powered by Codecov. Last update 31318ba...bde8de4. Read the comment docs.

@potiuk potiuk force-pushed the multi-layered-dockerfile branch from 53fb830 to b72088d Compare January 16, 2019 22:14
Fokko (Contributor) previously requested changes Jan 17, 2019

Thanks for the work @potiuk

I'm still a bit hesitant about the caching. If we keep caching, how can we be sure that the image will still build from scratch? As the famous quote goes:

There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton
For example, Python does a minor release every couple of months, and the base images are also updated a couple of times a month: https://github.com/docker-library/official-images/commits/master/library/python
The same goes for the implicit updates of the libraries. I'm more in favour of invalidating the cache periodically, if not all the time. Maybe other people have an opinion on this as well.


# Always run apt-get update here to get the latest dependencies before we redo pip install
RUN apt-get update \
&& apt-get upgrade -y --no-install-recommends \
Contributor:

This is actually a bad practice: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

In addition, cleaning up the apt cache and removing /var/lib/apt/lists helps keep the image size down. Since the RUN statement starts with apt-get update, the package cache will always be refreshed prior to apt-get install.

potiuk (Member, Author):

What exactly is the bad practice here? I am not sure. I've added cleaning of /var/lib/apt/lists/* now.
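For reference, the pattern from the Docker best-practices page quoted above combines the update, install, and cleanup into a single RUN layer. This is only a sketch; the package name is a placeholder, not a line from this PR:

```dockerfile
# Single RUN layer: the package lists are refreshed, packages installed,
# and the apt lists removed in the same layer so they never bloat the image.
# ("some-package" is a hypothetical placeholder.)
RUN apt-get update \
    && apt-get upgrade -y --no-install-recommends \
    && apt-get install -y --no-install-recommends some-package \
    && rm -rf /var/lib/apt/lists/*
```

Splitting `apt-get update` into its own RUN layer is what the best-practices doc warns against, since the cached update layer can go stale independently of the install layer.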

potiuk (Member, Author):

Please take a look @Fokko to see if the problem is already addressed, and resolve it if so (I am not sure what it was in the first place). I will give hadolint a spin later on to detect more potential bad practices, but for now I think I use pretty good practices :)

# Airflow sources change frequently but dependency configuration won't change that often
# We copy setup.py and other files needed to perform setup of dependencies
# This way cache here will only be invalidated if any of the
# version/setup configuration change but not when airflow sources change
Contributor:

How do we handle implicit updates of dependencies? Right now not everything is pinned at a fixed version, so the version might have a minor update.

potiuk (Member, Author) commented Jan 17, 2019:

The dependencies should be resolved properly and minor updates should be installed as needed. It works as follows:

  1. The first pip install will install the dependencies as they are at the moment the dockerfile is generated for the first time (line 74).

  2. Whenever any of the Airflow sources change, the docker image at line 77 gets invalidated and the image is rebuilt starting from that line. Then we upgrade the apt-get dependencies to the latest ones (line 80) and run pip install again (without using the old cache) to see if the transitive pip dependencies resolve to different versions of related packages. In most cases this will be a no-op, but if some transitive dependencies resolve differently, they will get installed at that time (line 89)

  3. Whenever any of the setup.*, README, version.py, or airflow script files change (all files related to setup), we go back even a step further - the Docker cache is invalidated at one of the lines 66-70 and the build continues from there - which means that a completely fresh 'pip install' is run from scratch.

  4. We can always force installation from scratch by increasing the value of ENV FORCE_REINSTALL_APT_GET_DEPENDENCIES (currently = 1). If we manually increase it to 2 and commit such a Dockerfile, it will invalidate the docker images at line 25 and all the installation is run from scratch.

  5. The only thing I am not 100% sure of (but about 99%) is what happens when a new version of python3.6-slim gets released. I believe it will trigger the build from scratch, but that's something we will have to confirm with DockerHub (it might depend on how they do caching, as this part might be done differently). It's not a big problem even if this is not the case, because at point 3) we do apt-get upgrade, and at that time the python version will get upgraded as well if it has been released. So this is more of an optimisation question than a problem.

So eventually - whenever you make any change to the Airflow sources, you are sure that both 'apt-get upgrade' and 'pip install' have been called. Whenever you make a change to setup.py, you are sure that 'pip install' is run from scratch; whenever you force it, you can run the apt-get install from scratch.

I think this is fairly robust and expected behaviour.
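The five points above can be sketched as a rough Dockerfile skeleton. Paths, extras, and the exact commands here are assumptions for illustration, not the PR's actual Dockerfile:

```dockerfile
FROM python:3.6-slim

# Point 4: bumping this value by hand invalidates every layer below,
# forcing apt-get and pip installation to run completely from scratch.
ENV FORCE_REINSTALL_APT_GET_DEPENDENCIES=1

# Point 1: initial installation of system dependencies.
RUN apt-get update \
    && apt-get upgrade -y --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

# Point 3: only setup-related files are copied first, so the fresh
# 'pip install' below reruns only when dependency configuration changes.
COPY setup.py setup.cfg README.md /opt/airflow/
WORKDIR /opt/airflow
RUN pip install --no-cache-dir -e .[devel]

# Point 2: the full sources are copied last; any source change invalidates
# only the layers from here on, so apt-get upgrade and a second pip install
# run again and pick up newer transitive dependency versions, if any.
COPY . /opt/airflow/
RUN apt-get update \
    && apt-get upgrade -y --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir -e .[devel]
```

Point 5 (a new python3.6-slim release) would show up as a changed base image, which rebuilds everything when the build uses --pull.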

potiuk (Member, Author):

@Fokko -> I think what we have now is pretty stable and I did my best to describe it in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image

I think we might resolve the conversation here and maybe move it to the AIP-10 doc?


potiuk commented Jan 17, 2019

@Fokko - I am pushing an updated version. I know that famous quote, but I think in this case cache invalidation works in our favour. That quote is really about never knowing when to do the invalidation, and in our case we will do very smart invalidation (as explained in detail in the answer to your question about implicit dependencies). PTAL and let me know if the strategy I explained makes sense to you.

Actually, we could even build in some mechanism to invalidate such a cache automatically from time to time.

My whole problem with the current image is how it is done today - the whole image is always built from scratch. There is no need for that, and it has the nasty side effect for users that it pollutes their docker lib directory with a lot of unused, frequently invalidated images.

In this case the problem is in fact with cache invalidation on the user side. Docker does not know when an already downloaded image is no longer needed, so it will cache it until someone runs 'docker system prune'. Otherwise the /var/lib/docker directory will grow forever for anyone who regularly pulls Airflow images.

@potiuk potiuk force-pushed the multi-layered-dockerfile branch 2 times, most recently from 5415b15 to a55814a Compare January 18, 2019 05:07

potiuk commented Jan 18, 2019

Hey @Fokko - I pushed all the changes.

Please let me know if you have more concerns about caching and invalidation proposed there.

Additionally, if you have a concern about semi-regularly building everything from scratch, I have done a simple POC of a slight improvement to this approach - to accommodate automated building and pushing not only of the "latest" tag but also the tag/release/branch ones. With the advanced build settings of DockerHub we can very easily configure custom builds and have full control over caching and building images for both tags and master.

Here is the work-in-progress of the POC (I will be testing it today):
PolideaInternal@16ffada

It allows building master -> airflow:latest images with incremental layers as described above, but then every time a new release is created - i.e. a tag matching the appropriate regex is pushed - the whole image can be rebuilt from scratch based on the AIRFLOW_TAG custom build argument that you can pass. I put the AIRFLOW_TAG argument at the beginning of the Dockerfile, which means it will invalidate the whole cache and rebuild it from scratch for every different tag we build.
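A hedged sketch of how such a cache-busting build argument can work (placement and names assumed, not the PR's exact Dockerfile):

```dockerfile
FROM python:3.6-slim

# Passing a different value (e.g. --build-arg AIRFLOW_TAG=1.10.3) changes the
# layer below, which invalidates every subsequent layer - so each release tag
# gets a completely fresh build while master keeps reusing the cache.
ARG AIRFLOW_TAG=latest
RUN echo "Building image for: ${AIRFLOW_TAG}"
```

Note that an ARG only busts the cache for layers that actually consume it, which is why it is referenced in a RUN right after being declared.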

This way you can actually (and automatically) re-test whether rebuilding the whole image from scratch still works when you do RC builds. Without any additional effort or actions from the release manager, it will all be automated. It's just a matter of defining appropriate regex rules that match the tags or branches you want to build, and subscribing to notifications in case the build fails.

It also allows anyone to build their own copy of the Airflow image in their own DockerHub based on their own branches - very useful for anyone who wants to make changes to the Dockerfile.


Fokko commented Jan 18, 2019

Having to regularly invalidate the cache would take away a couple of my concerns. Good to see that we can script the advanced features, since the committers do not have access to DockerHub.
I feel like we have fundamentally different opinions on this. As this is not my decision but the community's, I would like to know what other people think.

ashb (Member) left a comment:

I'd like to run this and go through a few rebuild cycles with different types of Airflow changes before we merge this, please.

WORKDIR $AIRFLOW_HOME
RUN mkdir -p $AIRFLOW_HOME
# First install only dependencies but no Apache Airflow itself - this way regular Airflow
RUN pip install --no-cache-dir -e.[$AIRFLOW_EXTRAS]
Member:

Do we want the extras here, or just core airflow here?

potiuk (Member, Author):

I think it depends on what the scope of the "official" image is. If the official image should be only core, then yes - only core should be installed; but if we decide to install all extras (as in the current Dockerfile), then they should be used consistently in both pip installs.

potiuk (Member, Author):

For now I do it this way: by default, when I build the main Airflow image I use "all" extras, but when I build the CI image (via hooks/build) I use "devel_ci" extras. That sounds like a reasonable approach. Let me know if you think we can do better :) Or resolve this one if you think it's good.

That's the relevant part of the hooks/build:

build_image "Airflow CI" \
            "${AIRFLOW_CI_IMAGE}" \
            "main" \
            "airflow-ci-apt-deps" \
            "devel_ci"


potiuk commented Jan 18, 2019

Ok. Cool. It's good to disagree sometimes as it might lead to better final results :).

I can do those things now:

  • I will wrap up the proposal - I will try to summarise the discussion here and extract a few more 'prior art' examples with different approaches. I'd leave the 'cons' part for you to fill in, @Fokko, if that's ok with you?

  • I will update the Dockerfile with the fixes and the setup for the automated builds for tags/releases as well.

  • I will ask community for opinion in the thread.

  • What I would love to do (following also the request by @ashb) is to merge it in the form of an experimental Dockerfile (in an experimental folder, producing experimental-tagged images) so that it is built automatically in DockerHub - otherwise it will be hard to get results.

Does it sound reasonable?

@Fokko Fokko dismissed their stale review January 20, 2019 17:08

Let's discuss


Fokko commented Jan 20, 2019

@potiuk To recap my concern: with an aggressive caching strategy, the container and the master repository might get out of sync. But before we do this, I think we should:

  1. Pin the versions of the dependencies, please refer to: https://lists.apache.org/thread.html/9e775d11cce6a3473cbe31908a17d7840072125be2dff020ff59a441@%3Cdev.airflow.apache.org%3E
  2. Make sure that we always pull the base image first, i.e. run docker build --pull normally, so we always build from the latest base image. This also ties in with why you shouldn't run apt-get upgrade -y in your docker image, as your base image should be up to date: https://serverfault.com/questions/924394/is-it-a-good-practice-to-put-update-upgrade-statements-inside-the-dockerfile/924417

Please don't get me wrong, I'm not against a layered docker image, but I see some problems with the current state of Airflow, where the image might build fine with the cached version on Airflow's DockerHub but fail when people try to build it locally.

@potiuk potiuk force-pushed the multi-layered-dockerfile branch from 8818096 to 6328eaf Compare January 20, 2019 21:07

potiuk commented Jan 20, 2019

@Fokko -> the --pull flag is a good idea. It won't change DockerHub build behaviour, as they always start from a clean docker environment when building an image (to get repeatable builds). But it's good for anyone building it in their own environment, which might not be clean.

As for why apt-get upgrade -y is a bad idea - I am not sure any of the comments in the Server Fault question you linked actually say that. And I am not sure how docker build --pull is tied to not having to use apt-get upgrade. Just build --pull does not solve all security issues - it only forces a rebuild when there is an update to the python3.6-slim base image, which happens every few weeks - mostly because a new python3.6 minor version has been released or the pip version upgraded. We install a lot more via apt-get, and keeping apt-get upgrade is a good idea (mostly for security).

Plus, if we are really concerned that building from scratch might not work for some people, why don't we simply run two image builds in parallel - latest (using the cache) and latest-clean (without cache, similar to release builds)? This way we can be notified immediately when the "clean" build goes out of sync with the cached one enough that it starts to fail (which should be extremely rare, if it happens at all, taking into account that we will rebuild the whole image from scratch whenever a new python version is released and when we do a release candidate of a new release). And if we find and fix any issue, we can force-rebuild the latest image by increasing the variable in the Dockerfile.

In my latest PR change I did exactly this - whenever a "release" or "latest-clean" image is built, I replace --cache-from with the --no-cache option. If we configure those parallel builds (I did it in my DockerHub), they will all be built in parallel - but latest will use the cache, which is beneficial for users downloading it regularly.
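The switch between cached and clean builds described here could be sketched roughly like this (the function and tag-matching rules are assumptions for illustration, not the actual hooks/build code):

```shell
#!/usr/bin/env bash
# Pick docker build caching flags based on the image tag being built:
# clean and release builds get --no-cache, everything else reuses layers
# from a previously pushed image via --cache-from.
choose_cache_flags() {
    local tag="$1" cache_image="$2"
    case "${tag}" in
        *-clean|release-*) echo "--no-cache" ;;
        *) echo "--cache-from=${cache_image}" ;;
    esac
}
```

A build script could then invoke `docker build $(choose_cache_flags "${TAG}" "${CACHE_IMAGE}") -t "airflow:${TAG}" .` so that the right mode is selected per build.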

@potiuk potiuk force-pushed the multi-layered-dockerfile branch 3 times, most recently from 021c9eb to bde8de4 Compare January 21, 2019 08:37

potiuk commented Jan 21, 2019

Ok. I will shortly show you some working examples/POC of how full control over caching works with a real DockerHub setup. It takes a bit of time to iterate, as DockerHub builds take a while, but I think it will get there shortly.

Just to summarise what I will have now:

I will show you running examples where we can see, with every commit, that the build from scratch still works (:latest-clean), while the (:latest) build is usually incremental, for the benefit of the downloading users.

The latest image will be periodically refreshed from scratch whenever any of these events happens:

  • the base image is updated (with the --pull flag of the build) - the whole build is rerun
  • setup.py changes (pip install is run from scratch)
  • we decide to force a rebuild from scratch by incrementing the variable in the Dockerfile

I will also show an example where apt-get upgrade makes perfect sense. In one of the latest builds I made, two security fixes and a timezone fix were applied thanks to running apt-get upgrade - it took about 4 minutes to build and likely a few MB more to download for the latest layer, in order to get the latest security fixes in the latest build.

I hope this is a good solution that addresses all your concerns, as well as my concern about full downloads with every commit. And again - I can add it as experimental for the time being, but I think it's worth merging to see how it behaves with real usage.


potiuk commented Jan 21, 2019

@ashb, @Fokko - I am more than happy to help with solving the dependency issues as well. I hope to have more time to work on it soon (more than I had a couple of months back). I am just not sure this is the same issue, but this one can help with the dependency one. I think if we finally agree that the layered approach is workable, it might open up some new ways of dealing with dependencies, and the dependency issue might actually get solved more easily. And likely we won't have to do much if we move to the layered approach for the CI images as well.

What we get with the layered solution is a kind of "temporary stability" of the image. Unless setup.py or the base image changes, the dependencies in the base image will remain pretty stable. We could use a very similar approach in the CI Docker images: move them to the airflow repository and skip 'pip install' after every commit. I think that might solve most of the concerns I had about the stability of CI builds (and why I originally started the thread). Most of the CI builds will be "stable" - they will use cached dependencies from the last successful build of the CI image.

It's only when someone upgrades setup.py that the transitive dependencies might cause installation problems (and that's the right moment to solve them). At the same time, we will always get early warnings about dependency problems from the "latest-clean" image that is built with every commit - and this will happen without disturbing the work of people running their builds with the latest working CI image.

The good thing is that tests will continue to run for users who make simple commits - using the cached dependencies - even if the "latest-clean" image fails with transitive dependency problems. Until the "clean" build is fixed, tests will continue to run using the latest cached dependencies. This is great for regular committers (because their work won't be disturbed by external factors they can do nothing about). On the other hand, solving dependency issues will be in the hands of those who are already dealing with setup.py changes in their own work.

This looks like a frictionless, self-healing setup that might provide both stability for committers and "eventual consistency" for maintainers, and we will not have to change the approach where we have "upper-open" requirements in Airflow.

Last but not least (though this is quite independent) - we could think about freezing the requirements in the scripts used to prepare the release. But I think that is quite a different task.


potiuk commented Jan 21, 2019

Here are some examples from the POC I set up in my own repo. These are three different builds in DockerHub, built automatically on push. I hope you do not need any permissions to see them - if not, I can copy the logs elsewhere:

  1. "latest" build using previously cached layers:
    https://cloud.docker.com/repository/registry-1.docker.io/potiuk/airflow/builds/c68a9564-f40a-492f-8cb3-d7126af73e12

It was built using the previous "latest" as cache. You can see that only the sources of Airflow changed; no setup.py-related files changed. Since the last "clean" build there were 3 packages released upstream, and they are automatically upgraded with apt-get upgrade. These are security fixes plus a timezone change:

The following packages will be upgraded:
libsystemd0 libudev1 tzdata
3 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 679 kB of archives.
After this operation, 14.3 kB of additional disk space will be used.

After that, pip install -e was run again and all the requirements were "already satisfied". It took 5 minutes to run the build, and it is an incremental difference vs. the previous one - less than 1MB to download for anyone who already downloaded the previous "latest" image.

  2. Then (as a result of the same push) we have the "latest-clean" build - rebuilt from scratch: https://cloud.docker.com/repository/registry-1.docker.io/potiuk/airflow/builds/b8e1ac70-a8fa-49a3-93ad-d27a43670e1c. The --no-cache directive was used to build that one (based on the latest-clean image tag). It took 30 minutes to build. This is equivalent to the current build from master. It's almost 1GB to download for anyone who wants this version and has already downloaded the previous "latest-clean" - and that's how it should be.

  3. Then we have the "release build" - I pushed an artificial tag 1.10.99999, which resulted in
    a release-1.10.9999 image. It is built from scratch as well, based on the "release-.*" regexp matching the tag: https://cloud.docker.com/repository/registry-1.docker.io/potiuk/airflow/builds/12ede2d4-235d-4e1e-a127-7349bb94abce . This one is only built if you push a tag matching the regexp "^[0-9].+".

Here is a screenshot of the DockerHub config I use (in the apache repo, the source should eventually be "master" rather than "multi-layered-dockerfile" as in my POC):

(screenshot: DockerHub automated build configuration)

@potiuk potiuk force-pushed the multi-layered-dockerfile branch 7 times, most recently from 2b9cf85 to 6fa9131 Compare March 4, 2019 12:30
@potiuk potiuk force-pushed the multi-layered-dockerfile branch 2 times, most recently from dcc5f75 to 3f968a6 Compare March 13, 2019 09:39

potiuk commented Mar 13, 2019

@Fokko @ashb - a further update. I am progressing with fixing the tests, and I hope to send more information to the dev list today/tomorrow and start a wider discussion. Hadoop is already fixed (an env problem, as suspected). For most of the other tests I know why they do not work (some missing dependencies like virtualenv, changed paths for log reading, etc.). MySQL is still not starting (but I guess that's also a simple configuration issue).

I also improved the "development environment" a lot - I brought in more of our past experience, and I think I have something rather useful. I will split it off and make a separate PR out of it - related to AIP-7 (dependent on the multi-docker change).

In the meantime, you can take a look at the docs I updated along the way (this will also go into the separate PR for AIP-7):

https://github.com/PolideaInternal/airflow/blob/multi-layered-dockerfile/BREEZE.md
https://github.com/PolideaInternal/airflow/blob/multi-layered-dockerfile/CONTRIBUTING.md#setting-up-a-development-environment

Fokko (Contributor) left a comment:

I gave Breeze a spin, and it works very well from what I've seen so far.

Some suggestions:

  • Remove the tox from .flake8
  • Why are all the versions changed in airflow/www/package-lock.json? Maybe revert this? I also see that we're going back to lower versions. What do you think?
  • Personally I would leave the .bash_aliases and the .inputrc out, since these settings are personal preferences.
  • For the go docker-show-context stuff, shouldn't we be able to use a binary instead of adding the source to the Airflow repository?
  • What do you think of adding hadolint to the pre-steps of the CI?

Further thoughts: ideally we should have one image for testing Airflow. In that case we should deprecate https://github.com/apache/airflow-ci and use this image for testing as well.

MY_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
export PYTHON_VERSION=${PYTHON_VERSION:=$(python -c 'import sys; print("%s.%s" % (sys.version_info.major, sys.version_info.minor))')}
export AIRFLOW_VERSION=$(grep version airflow/version.py | awk '{print $3}' | tr -d "'+")
export DOCKERHUB_USER=${DOCKERHUB_USER:="potiuk"} # TODO: change back to "airflow"
Contributor:

I noticed this as well:

MacBook-Pro-van-Fokko:incubator-airflow fokkodriesprong$ docker ps
CONTAINER ID        IMAGE                                      COMMAND                  CREATED             STATUS              PORTS                                         NAMES
c556cab3143e        potiuk/airflow:latest-3.6-ci-v2.0.0.dev0   "/bin/bash -c '/opt/…"   6 minutes ago       Up 6 minutes        0.0.0.0:28080->8080/tcp                       ci_airflow-testing_run_dfff807d8d98
563729cd183f        rabbitmq:3.7                               "docker-entrypoint.s…"   6 minutes ago       Up 6 minutes        4369/tcp, 5671-5672/tcp, 25672/tcp            ci_rabbitmq_1
e04e043b8cf7        cassandra:3.0                              "docker-entrypoint.s…"   6 minutes ago       Up 6 minutes        7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp   ci_cassandra_1
2fd6654702cd        postgres:9.6                               "docker-entrypoint.s…"   6 minutes ago       Up 6 minutes        0.0.0.0:25433->5432/tcp                       ci_postgres_1
f41320c92c7b        godatadriven/krb5-kdc-server               "/bin/bash /app/star…"   6 minutes ago       Up 6 minutes                                                      ci_krb5-kdc-server_1
0dda7f9a6e3e        mysql:5.6                                  "docker-entrypoint.s…"   6 minutes ago       Up 6 minutes        0.0.0.0:23306->3306/tcp                       ci_mysql_1
7c1dd4bb3448        osixia/openldap:1.2.0                      "/container/tool/run…"   6 minutes ago       Up 6 minutes        389/tcp, 636/tcp                              ci_openldap_1
5a7b4a387b0b        redis:5.0.1                                "docker-entrypoint.s…"   6 minutes ago       Up 6 minutes        6379/tcp                                      ci_redis_1
76f435ff6386        mongo:3                                    "docker-entrypoint.s…"   6 minutes ago       Up 6 minutes        27017/tcp                                     ci_mongo_1

potiuk (Member, Author) commented Mar 17, 2019:

Yeah. For now I keep "potiuk" in there because I can run testing with my own DockerHub/Travis, but when we get to merging/splitting into separate commits I will bring it back to "airflow" (that's why the TODO is in the configuration). Actually, this is also a feature requested in "Simplified Development Workflow" - you will be able to use your own DockerHub account/repo to build, test, and use images on your own instance before anything hits Airflow's DockerHub or TravisCI. Kinda nice for testing.

It's still not ready to merge until I actually fix all the test failures.

@potiuk potiuk force-pushed the multi-layered-dockerfile branch 3 times, most recently from 8e2acab to cbc019a Compare March 18, 2019 00:08

potiuk commented Mar 18, 2019

@Fokko Thanks for comments :) really appreciate!

I'm glad you liked Breeze. We find our version of Breeze pretty indispensable in our daily work.

Remove the tox from .flake8

Will do. Good catch.

Why are all the versions changed in airflow/www/package-lock.json, maybe revert this? I see also that we're going back to lower versions. What do you think?

Already reverted. That was done automatically by 'npm install'. This is the default behaviour of NPM and something that maybe, long term, we would like to have with setup.py/requirements - exactly what I advocated for some time ago: have a fixed set of deps (a lock file) and update them automatically during development ('npm install') so that it can be committed and used in the CI environment ('npm ci'). I find this workflow very stable. But that's a side comment. We have something similar now with the layered build (as the requirements are frozen in the CI image in between setup.py changes). It's something we should discuss whether the community is fine with. I will re-open the discussion when I am ready to do so.

Personally I would leave the .bash_aliases and the .inputrc out, since these settings are personal preferences.

No problem, I will remove them, though I love having them inside the docker image. What I can do instead is implement conditional mounting of those two files if they are placed in a specific directory in the sources. This way I can keep them "local" for myself but still have them used every time I enter the environment.

For the go docker-show-context stuff, shouldn't we be able to use a binary instead of adding the source to the Airflow repository?

I just removed it. I only needed it temporarily while debugging some missing .dockerignore entries. I found a better solution now with docker-show-all-context-files, which is quite nice - it is controlled with an environment variable, and it saves the list of all files in the Docker context to a file in the "dist" directory. It's kinda smart and super simple (I found it on Stack Overflow somewhere) - it creates an empty container with the same context and lists all the files from the container :)
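A rough sketch of that trick, with hypothetical file and tag names (the actual docker-show-all-context-files script may differ):

```dockerfile
# Throwaway Dockerfile (e.g. Dockerfile.context-check) that copies the whole
# build context into a minimal image so its files can be listed:
FROM busybox
COPY . /context
# Then, from the shell (commands are illustrative):
#   docker build -q -f Dockerfile.context-check -t context-check .
#   docker run --rm context-check find /context -type f > dist/context-files.txt
```

Because the build sends exactly the same context (after .dockerignore filtering) as the real image build, listing the copied files shows precisely what .dockerignore lets through.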

What do you think of adding hadolint to the pre-steps of the CI?

Great idea. I will give it a spin! Looks cool.

Further thoughts, ideally we should have one image for testing Airflow. So in that case we should deprecate https://github.com/apache/airflow-ci and use this image for testing as well.

Absolutely. This is what already happens in this change, and it is one of the biggest reasons to implement it. The CI scripts build the image locally (using the cache from DockerHub and local sources to incrementally add the latest sources or packages - whatever has changed) and then use those locally built images to run the tests (not the incubator-ci one).

@potiuk
Member Author

potiuk commented Mar 18, 2019

@Fokko @ash - one more comment - I am now ready to involve the community. I actually managed to hugely simplify the layer structure. I did a lot of testing - including all the timing calculations: how much we save at each stage, whether the layers are needed, whether the wheel cache layer really helps. All of this is documented now in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image

From those tests it seems that the wheel cache did more harm than good. It complicated the whole setup and increased the number of images we had to build, while eventually providing only marginal savings. It also made building the images take much longer in general (bigger Dockerfile, more layers etc.). Moreover, when I disabled CYTHON compilation for the Cassandra driver, pip installation time went down by 5(!) minutes or so and became comparable with installation from wheels.

So my most recent proposal is even more simplified - we now have only three stages in the Dockerfile and two images in DockerHub (just the slim Airflow Image and fat Airflow CI image). And it's also a bit easier to explain and understand.

I will explain it in the community devlist email.

@potiuk potiuk force-pushed the multi-layered-dockerfile branch from cbc019a to 41562d2 Compare March 18, 2019 08:53
@potiuk
Member Author

potiuk commented Mar 18, 2019

@ashb @Fokko @mik-laj -> I split off the "Breeze" environment into the separate draft PR #4932 - I agreed with @gerardo, creator of AIP-7, that I will take the lead on documenting and discussing the AIP-7 work.

@potiuk
Member Author

potiuk commented Mar 18, 2019

@Fokko ->

  • .tox removed
  • .bash_aliases and .inputrc are removed from the Dockerfile

@@ -144,6 +144,14 @@ echo
# So for example 'latest-3.6' will produce PYTHON_VERSION=3.6
export PYTHON_VERSION=$(echo "${LOCAL_IMAGE}" | rev | cut -d '-' -f1 | rev )


# In case of CRON jobs on Travis we run builds without cache
Member Author

@Fokko -> this will take care of running a full daily build from scratch on Travis. I think that might address a lot of your concerns about "will it build from scratch". No caches are used for that build and no images are pulled from DockerHub - it just reinstalls everything from scratch. I guess we can set up notifications to send error messages to the commit list (or maybe it will happen automatically). This way committers can be warned about broken transitive dependencies without actually breaking open pull requests (people will continue to use the "working" versions of pip dependencies from the last successfully built DockerHub master build). When it gets fixed by committers and merged into master, it will build and the new set of dependencies will be used by everyone. I think it will work in a self-healing way with minimal disruption for contributors, committers and maintainers.

I scheduled a daily build on my Travis CI copy and will let you know when it runs so that you can take a look:
https://travis-ci.org/potiuk/airflow
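The cache-switching logic itself boils down to a few lines. This is a hedged sketch (the flag values and image tag are assumptions, not the exact script in this PR, but `TRAVIS_EVENT_TYPE` is the standard variable Travis sets to `cron` for scheduled builds):

```shell
# Pick docker build cache flags based on how the build was triggered.
# Cron builds rebuild everything; regular builds reuse the DockerHub cache.
cache_flags() {
    if [ "${1:-}" = "cron" ]; then
        echo "--no-cache"                   # full from-scratch rebuild
    else
        echo "--cache-from=airflow:latest"  # incremental, cached build
    fi
}
# In the real script this would feed a docker build invocation, e.g.:
#   docker build $(cache_flags "${TRAVIS_EVENT_TYPE:-push}") .
cache_flags "${TRAVIS_EVENT_TYPE:-push}"
```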

Member Author

Here is an example of such a cron job @Fokko: https://travis-ci.org/potiuk/airflow/jobs/509211696. It can run daily on Travis and it runs all the tests on a "from scratch" build. It takes 7:20 minutes to build the image from scratch on a Travis worker (per job), but we get a clear indication of whether a clean Docker build works from scratch with the latest apt-get and pip dependencies, and whether all tests are successful.

This way all the PRs will use the much faster incrementally built image, but we also have a full, longer build that is performed automatically and regularly.

At the same time, such a daily build might be an early indication that one of the transitive dependencies now fails to install (but it will not disrupt tests for open PRs, because they will use the latest "correctly building" dependencies from the last successful master build).

I think this is a very nice way of addressing both problems - stability for people working on their PRs and assurance for maintainers that the build is "safe".

I think if we do it this way, we might not even need to solve the transitive dependencies case for work in progress - we could focus only on somehow making sure that there is a stable set of dependencies for official releases.

@potiuk potiuk force-pushed the multi-layered-dockerfile branch from cee9fc0 to 45b1f35 Compare March 18, 2019 10:26
@dimberman
Contributor

@potiuk is there any way to create some subtasks for this ticket and submit smaller PRs? There is a LOT going on here and I'm worried some issues might fly in under the radar.

@potiuk potiuk force-pushed the multi-layered-dockerfile branch 2 times, most recently from 0c466f3 to a0e293f Compare March 18, 2019 17:40
@potiuk potiuk force-pushed the multi-layered-dockerfile branch from a0e293f to 4ab6000 Compare March 18, 2019 17:42
@potiuk
Member Author

potiuk commented Mar 18, 2019

@dimberman - sure, I can. I imagine it's a pain for reviewers to review it all. I am quite used to splitting and joining commits and re-arranging branches as needed, so this should be no big problem. It actually makes sense to split it, as some of the changes might wait until all the tests are fixed.

I already thought about it and I think this approach might make sense:

  1. first replace the original Dockerfile with the multi-layered one (plus the accompanying .ignore files) - the Dockerfile will build stand-alone and on DockerHub and will produce a Docker image similar to the one we have currently. I can even extract the CI-relevant stage and submit it in a later PR

  2. then, as the next PR, I can add the hook/build scripts that enable a custom build on DockerHub and produce both the CI and the Airflow images - at this point I could add the CI stage back.

  3. the next PR would change the tests to use the new Dockerfile/images to run the Travis tests.

  4. the final PR (which I have already split off into a separate PR) is the improved development environment.

It will be difficult to split it into more granular changes, I am afraid, but having those four PRs might actually make sense - it might also make it easier to introduce: we can check step by step that things work, reconfigure DockerHub gradually and verify that everything works fine.

Do you think such a split addresses your concern? Or did you have a more incremental approach in mind?

@Fokko @ashb - what do you think ?

@potiuk potiuk changed the title [AIRFLOW-3718] [WIP] Multi-layered version of the docker image [AIRFLOW-3718] [WIP/SPLIT] Multi-layered version of the docker image Mar 18, 2019
@potiuk
Member Author

potiuk commented Mar 18, 2019

@Fokko @ashb @dimberman: I think the suggestion to split the change into three separate PRs/commits makes perfect sense. I just did it, so I am closing this one. I will keep returning to the still-unresolved discussions here to make sure they are addressed.

The JIRA subtasks:

The sequence of implementation is described in AIP-10

@potiuk potiuk closed this Mar 18, 2019
@potiuk potiuk deleted the multi-layered-dockerfile branch September 19, 2019 22:45