[AIRFLOW-3718] [WIP/SPLIT] Multi-layered version of the docker image #4543
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master     #4543   +/-   ##
========================================
  Coverage    74.1%     74.1%
========================================
  Files         421       421
  Lines       27662     27662
========================================
  Hits        20498     20498
  Misses       7164      7164

Continue to review full report at Codecov.
Thanks for the work @potiuk
I'm still a bit hesitant about the caching. If we keep caching things, how can we be sure that it will also build from scratch? As the famous quote goes:
There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton
For example, Python is doing a minor release every couple of months, and also the base images are updated a couple of times a month: https://github.com/docker-library/official-images/commits/master/library/python
The same goes for the implicit updates of the libraries. I'm more in favour of invalidating the cache periodically, if not all the time. Maybe other people have an opinion on this as well.
# Always add get update here to get latest dependencies before we redo pip install
RUN apt-get update \
    && apt-get upgrade -y --no-install-recommends \
This is actually a bad practice: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/
In addition, cleaning up the apt cache and removing /var/lib/apt/lists helps keep the image size down. Since the RUN statement starts with apt-get update, the package cache will always be refreshed prior to apt-get install.
What exactly is the bad practice here? I am not sure. I've added cleaning of /var/lib/apt/lists/* now.
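For reference, the pattern recommended by the linked best-practices guide combines the update, install and cleanup steps in a single RUN layer. A minimal sketch (the package names are purely illustrative):

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*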
Please take a look @Fokko if the problem is already addressed, and resolve if it is (I am not sure what it was in the first place). I will give hadolint a spin later on to detect more potential bad practices, but for now I think I use pretty good practices :)
# Airflow sources change frequently but dependency configuration won't change that often
# We copy setup.py and other files needed to perform setup of dependencies
# This way cache here will only be invalidated if any of the
# version/setup configuration change but not when airflow sources change
How do we handle implicit updates of dependencies? Right now not everything is pinned at a fixed version, so the version might have a minor update.
The dependencies should be resolved properly, and minor updates should be installed as needed. It works as follows (see the sketch after this list):

- The first pip install installs the dependencies as they are at the moment the Dockerfile is built for the first time (line 74).
- Whenever any of the Airflow sources change, the Docker layer at line 77 gets invalidated and the image is rebuilt starting from that line. We then upgrade the apt-get dependencies to the latest ones (line 80) and run pip install again (without using the old cache) to see whether the transitive pip dependencies resolve to different versions of related packages. In most cases this is a no-op, but if some transitive dependencies resolve differently, they are installed at that point (line 89).
- Whenever any of the setup.* files, README, version.py or the airflow script change (all files related to setup), we go back one step further: the cache is invalidated at one of lines 66-70 and the build continues from there, which means a completely fresh 'pip install' is run from scratch.
- We can always force installation from scratch by increasing the value of ENV FORCE_REINSTALL_APT_GET_DEPENDENCIES (currently = 1). If we manually increase it to 2 and commit such a Dockerfile, it invalidates the cache at line 25 and the whole installation runs from scratch.
- The only thing I am not 100% sure about (but about 99%) is what happens when a new version of python3.6-slim gets released. I believe it will trigger a build from scratch, but that's something we will have to confirm with DockerHub (it might depend on how they do caching, as this part might be done differently). Even if it does not, it's not a big problem, because at that later stage we run apt-get upgrade anyway, and at that time the Python version will get upgraded as well if a new one has been released. So this is more of an optimisation question than a problem.

So eventually: whenever you make any change to the Airflow sources, you are sure that both 'apt-get upgrade' and 'pip install' have been called. Whenever you make a change to setup.py, you are sure that 'pip install' is called from scratch, and whenever you force it, you can have the apt-get installation run from scratch as well.

I think this is fairly robust and expected behaviour.
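A condensed sketch of that layering idea (the real Dockerfile in this PR has more layers and slightly different paths, so the file names, extras and exact instructions below are illustrative only):

FROM python:3.6-slim

# Bumping this value manually invalidates every layer below it and forces
# a completely fresh apt-get/pip installation.
ENV FORCE_REINSTALL_APT_GET_DEPENDENCIES=1

RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /opt/airflow

# Copy only the files that define the dependencies - this layer is only
# invalidated when setup/version files change, not on every source change.
COPY setup.py setup.cfg README.md /opt/airflow/
COPY airflow/version.py airflow/__init__.py /opt/airflow/airflow/
RUN pip install --no-cache-dir -e .[all]

# Copy the frequently changing sources - from here on the cache is
# invalidated by every commit.
COPY . /opt/airflow/

# Refresh apt packages and re-run pip install so transitive dependencies can
# resolve to newer versions; usually a no-op thanks to the cached layer above.
RUN apt-get update \
    && apt-get upgrade -y --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir -e .[all]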
@Fokko -> I think what we have now is pretty stable and I did my best to describe it in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image
I think we might resolve the conversation here and maybe move it to the AIP-10 doc?
@Fokko - I am pushing an updated version. I know that famous quote, but I think in this case cache invalidation works in our favour. The quote is really about never knowing when to do the invalidation, and in our case we do very smart invalidation (as explained in detail in the answer to your question about implicit dependencies). PTAL and let me know if the strategy I explained makes sense to you. We could even build in a mechanism to invalidate such a cache automatically from time to time.

My whole problem with the current image is that it should not be built the way it is today, where the whole image is always built from scratch. There is no need for that, and it has the nasty side effect for users that it pollutes their Docker lib directory with a lot of unused, frequently invalidated images. In this case the problem is really cache invalidation on the user side: Docker does not know when an already downloaded image will no longer be needed, so it caches it until someone runs 'docker system prune'. Otherwise the /var/lib/docker directory grows forever for anyone who regularly pulls Airflow images.
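For anyone who has already hit this, the standard way to inspect and reclaim that space looks like this:

# Show how much space images, containers and the build cache are using
docker system df

# Remove stopped containers, dangling images, unused networks and build cache
docker system prune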
Hey @Fokko - I pushed all the changes. Please let me know if you have more concerns about the caching and invalidation proposed there.

Additionally, if you have a concern about semi-regularly building everything from scratch, I have done a simple POC of a slight improvement to this approach - to accommodate automated building and pushing of not only the "latest" tag but also the tag/release/branch ones. With the advanced build settings of DockerHub we can very easily configure custom builds and have full control over caching and building images for both tags and master. Here is the work-in-progress of the POC (I will be testing it today):

It builds master -> airflow:latest images with incremental layers as described above, but every time a new release is created - i.e. a tag matching an appropriate regex is pushed - the whole image can be rebuilt from scratch based on an AIRFLOW_TAG custom build argument that you can pass. I put the AIRFLOW_TAG argument at the beginning of the Dockerfile, which means it invalidates the whole cache and rebuilds the image from scratch for every different tag we build (see the sketch below). This way you can actually re-test (automatically) that rebuilding the whole image from scratch still works when you do RC builds. Without any additional effort or actions from the release manager, it will all be automated. It's just a matter of defining appropriate regex rules that match the tags or branches you want to build and subscribing to notifications in case the build fails. It also allows anyone to build their own copy of the Airflow image in their own DockerHub, based on their own branches - very useful for anyone making changes to the Dockerfile.
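Roughly, the AIRFLOW_TAG mechanism works like this (a sketch only - the tag value and image name are illustrative):

FROM python:3.6-slim

# Declared and used right at the top: passing a different value changes the
# environment of every later instruction, so the whole cache below this point
# is invalidated for each new tag we build.
ARG AIRFLOW_TAG="latest"
RUN echo "Building Airflow image for: ${AIRFLOW_TAG}"

A release build would then pass the tag explicitly, e.g.:

docker build --build-arg AIRFLOW_TAG="1.10.3rc1" --tag airflow:1.10.3rc1 .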
Having to regularly invalidate the cache would take away a couple of my concerns. Good to see that we can script the advanced features, since the committers do not have access to DockerHub.
I'd like to run this and go through a few rebuild cycles with different types of Airflow changes before we merge this, please.
Dockerfile (outdated):
WORKDIR $AIRFLOW_HOME
RUN mkdir -p $AIRFLOW_HOME
# First install only dependencies but no Apache Airflow itself - this way regular Airflow
RUN pip install --no-cache-dir -e.[$AIRFLOW_EXTRAS] |
Do we want the extras here, or just core Airflow?
I think it depends on what the scope of the "official" image is. If the official image should be only core, then yes - only core should be installed. But if we decide to install all extras (as in the current Dockerfile), then they should be used consistently in both pip installs.
For now I do it in such a way that by default, when I build the main Airflow image, I use the "all" extras, but when I build the CI image (via hooks/build) I use the "devel_ci" extras. That sounds like a reasonable approach. Let me know if you think we can do better :) Or resolve this one if you think it's good.
That's the relevant part of the hooks/build:
build_image "Airflow CI" \
"${AIRFLOW_CI_IMAGE}" \
"main" \
"airflow-ci-apt-deps" \
"devel_ci"
Ok. Cool. It's good to disagree sometimes as it might lead to better final results :). I can do those things now:
Does it sound reasonable?
@potiuk To recap my concern: with an aggressive caching strategy, the container and the master repository might get out of sync. But before we do this, I think we should:
Please don't get me wrong, I'm not against a layered Docker image, but I see some problems with the current state of Airflow, where the image might build with the cached version on Airflow's DockerHub but won't build when people try to build it locally.
@Fokko -> the --pull flag is a good idea. It won't change DockerHub build behaviour, as DockerHub always starts from a clean Docker environment when building an image (to get repeatable builds). But it's good for anyone building in their own environment, which might not be clean. As far as why

Plus, if we are really concerned that building from scratch might not work for some people, why don't we simply run two image builds in parallel - latest (using the cache) and latest-clean (without cache, similar to release builds)? This way we will be notified immediately when the "clean" build goes out of sync with the cached one enough that it starts to fail (which should be extremely rare, if it happens at all, taking into account that we will rebuild the whole image from scratch whenever a new Python version gets released and whenever we do a release candidate of a new release). And if we find any issue and fix it, we can force-rebuild the latest image afterwards by increasing the variable in the Dockerfile.

In my latest PR change I did exactly this - whenever a "release" or "latest-clean" image is built I replace --cache-from with the --no-cache option, and if we configure those parallel builds (I did it in my DockerHub), they will all be built in parallel, but latest will use the cache, which is beneficial for the users downloading it regularly.
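In build-script terms the two variants look roughly like this (the image names are illustrative; the real hooks/build script uses its own naming):

# Incremental "latest" build - reuse layers from the previously published image
docker pull apache/airflow:latest || true
docker build --pull --cache-from apache/airflow:latest --tag apache/airflow:latest .

# "latest-clean" / release build - ignore every cache and rebuild from scratch
docker build --pull --no-cache --tag apache/airflow:latest-clean .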
Ok. I will show you some working examples/POC shortly where full control over caching works with a real DockerHub setup. It takes a bit of time to iterate with DockerHub builds, as they take a while, but I think it will get there shortly. Just to summarise what I will have now: I will show you running examples where we can see with every commit that the build from scratch still works (:latest-clean), while the :latest build is usually incremental for the benefit of the downloading users. The :latest image will be periodically refreshed from scratch whenever any of these events happens:
I hope that might be a good solution that addresses all your concerns and my concern about full downloads with every commit by the users. And again - I can add it as experimental for the time being, but I think it's worth merging to see how it behaves with real usage.
@ashb, @Fokko - I am more than happy to help with solving the dependency issues as well. I hope to have more time to work on it soon (more than I had a couple of months back). I am just not sure it is the same issue, but this one can help with the dependency one. I think if we finally agree that the layered approach is workable, it might open up some new ways of dealing with dependencies, and the dependency issue might actually get solved more easily. And likely we won't have to do much if we move to the layered approach for CI images as well.

What we get with the layered solution is a kind of "temporary stability" of the image. Unless setup.py or the base image changes, the dependencies in the base image remain pretty stable. We could use a very similar approach for the CI Docker images: move them to the Airflow repository and skip 'pip install' after every commit. I think that might solve most of the concerns I had about the stability of CI builds (and why I originated the thread). Most of the builds in CI will be "stable" - they will use the cached dependencies from the last successful build of the CI image. It's only when someone upgrades setup.py that the transitive dependencies might cause installation problems (and that is the right moment to solve them).

At the same time, we will always get early warnings about dependency problems from the "latest-clean" image that is built with every commit - and this will happen without disturbing the work of people running their builds with the latest working CI image. The good thing is that the tests will continue to run for users making simple commits - using the cached dependencies - even if the "latest-clean" image fails with transitive dependency problems. Until the "clean" build is fixed, tests will continue to run using the latest cached dependencies. That is great for regular committers (their work won't be disturbed by external factors they can do nothing about). On the other hand, solving dependency issues will be in the hands of those who are already dealing with setup.py changes in their own work. This looks like a frictionless, self-healing setup that might provide both stability for committers and "eventual consistency" for maintainers, and we will not have to change the approach where we have "upper-open" requirements in Airflow.

Last but not least (and this is quite independent) - we could think about freezing the requirements in the scripts that you use to prepare the release. But I think this is quite a different task.
Here are some examples from the POC I set up in my own repo. These are three different builds in DockerHub and they are built automatically on push. I hope you do not need any permissions to see them - if you can't, I can copy the logs elsewhere:
It was built using the previous "latest" image as cache. You can see that only the Airflow sources changed; no setup.py-related files changed. Since the last "clean" build, 3 packages were released in Ubuntu. They are automatically upgraded with
After that
Here is the screenshot of the DockerHub config I use (in the apache repo the source should eventually be "master" rather than "multi-layered-dockerfile" as in my POC):
@Fokko @ashb. Further update. I am progressing with fixing tests and I hope to send more information to the dev list today/tomorrow and start a wider discussion. Hadoop is already fixed (an env problem, as suspected). For most of the other tests I know why they do not work (some missing dependencies like virtualenv, changed paths for log reading, etc.). MySQL is still not starting (but I guess that's also a simple configuration issue). I also improved the "development environment" a lot - I brought in more from our past experience, and I think I have something rather useful. I will split it off into a separate PR related to AIP-7 (dependent on the multi-docker change). In the meantime you can take a look at the docs I updated along the way (this will also go into a separate PR for AIP-7): https://github.com/PolideaInternal/airflow/blob/multi-layered-dockerfile/BREEZE.md
I gave Breeze a spin, and it works very well from what I've seen so far.
Some suggestions:
- Remove the tox from .flake8
- Why are all the versions changed in airflow/www/package-lock.json, maybe revert this? I see also that we're going back to lower versions. What do you think?
- Personally I would leave the .bash_aliases and the .inputrc out, since these settings are personal preferences.
- For the go docker-show-context stuff, shouldn't we be able to use a binary instead of adding the source to the Airflow repository?
- What do you think of adding hadolint to the pre-steps of the CI?
Further thoughts: ideally we should have one image for testing Airflow. In that case we should deprecate https://github.com/apache/airflow-ci and use this image for testing as well.
MY_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
export PYTHON_VERSION=${PYTHON_VERSION:=$(python -c 'import sys; print("%s.%s" % (sys.version_info.major, sys.version_info.minor))')}
export AIRFLOW_VERSION=$(cat airflow/version.py | grep version | awk {'print $3'} | tr -d "'+")
export DOCKERHUB_USER=${DOCKERHUB_USER:="potiuk"} # TODO: change back to "airflow" |
I noticed this as well:
MacBook-Pro-van-Fokko:incubator-airflow fokkodriesprong$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c556cab3143e potiuk/airflow:latest-3.6-ci-v2.0.0.dev0 "/bin/bash -c '/opt/…" 6 minutes ago Up 6 minutes 0.0.0.0:28080->8080/tcp ci_airflow-testing_run_dfff807d8d98
563729cd183f rabbitmq:3.7 "docker-entrypoint.s…" 6 minutes ago Up 6 minutes 4369/tcp, 5671-5672/tcp, 25672/tcp ci_rabbitmq_1
e04e043b8cf7 cassandra:3.0 "docker-entrypoint.s…" 6 minutes ago Up 6 minutes 7000-7001/tcp, 7199/tcp, 9042/tcp, 9160/tcp ci_cassandra_1
2fd6654702cd postgres:9.6 "docker-entrypoint.s…" 6 minutes ago Up 6 minutes 0.0.0.0:25433->5432/tcp ci_postgres_1
f41320c92c7b godatadriven/krb5-kdc-server "/bin/bash /app/star…" 6 minutes ago Up 6 minutes ci_krb5-kdc-server_1
0dda7f9a6e3e mysql:5.6 "docker-entrypoint.s…" 6 minutes ago Up 6 minutes 0.0.0.0:23306->3306/tcp ci_mysql_1
7c1dd4bb3448 osixia/openldap:1.2.0 "/container/tool/run…" 6 minutes ago Up 6 minutes 389/tcp, 636/tcp ci_openldap_1
5a7b4a387b0b redis:5.0.1 "docker-entrypoint.s…" 6 minutes ago Up 6 minutes 6379/tcp ci_redis_1
76f435ff6386 mongo:3 "docker-entrypoint.s…" 6 minutes ago Up 6 minutes 27017/tcp ci_mongo_1
Yeah. For now I keep "potiuk" in there because I can run testing with my own DockerHub/Travis, but when we get to merging/splitting into separate commits I will bring it back to "airflow" (that's why the TODO is in the configuration). Actually this is also a feature requested in "Simplified Development Workflow" - you will be able to use your own DockerHub account/repo to build, test and use images on your own instance before anything hits Airflow's DockerHub or TravisCI. Kinda nice for testing.
It's still not ready to merge until I actually fix all the test failures.
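With the defaulting shown in the snippet above, pointing the scripts at your own namespace would be as simple as the following (assuming the hooks/CI scripts all read the same variable):

# Build images under your own DockerHub namespace instead of the default one
export DOCKERHUB_USER=yourname
./hooks/build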
@Fokko Thanks for the comments :) really appreciated! I'm glad you liked Breeze. We find our version of Breeze pretty indispensable in our daily work.
Will do. Good catch.
Already reverted. That was done automatically by 'npm install'. This is the default behaviour of NPM, and something that maybe, long term, we would like to have for setup.py/requirements - exactly what I advocated some time ago: have a fixed set of deps (a lock file) that is updated automatically during development (npm install) so that it can be committed and used in the CI environment ('npm ci'). I find this workflow very stable. But that's a side comment. We have something similar now with the layered build (the requirements are frozen in the CI image in between setup.py changes). Something we should discuss if the community is fine with it. I will re-open the discussion when I am ready to do so.
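For reference, the NPM workflow being described is roughly:

# During development: resolve dependency versions, update package-lock.json
# and commit the lock file together with package.json
npm install
git add package.json package-lock.json

# In CI: install exactly what the lock file pins (fails if it is out of sync)
npm ci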
No problem, I will remove them, though I love having those inside the Docker image. What I can do instead is implement conditional mounting of those two files in case I place them in a specific directory in the sources. This way I can keep them "local" for myself but still get them used every time I enter the environment.
I just removed it. I only needed it temporarily while I debugged some missing .dockerignore entries. I found a better solution now with docker-show-all-context-files, which is kinda nice - it is controlled with an environment variable and it saves a list of all files that are in the Docker context to a file in the "dist" directory. It's kinda smart and super simple (found it on Stack Overflow somewhere) - it creates an empty container with the same context and lists all files from the container :)
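The trick itself is roughly the following (my paraphrase - the file and image names here are made up):

# Write a throwaway Dockerfile that copies the whole build context into an image
cat > Dockerfile.show-context <<'EOF'
FROM busybox
COPY . /build-context
CMD find /build-context
EOF

# Build it with the same context as the real image, then list every file
# that actually ended up in the context
docker build --no-cache -t airflow-show-context -f Dockerfile.show-context .
mkdir -p dist
docker run --rm airflow-show-context > dist/docker-context-files.txt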
Great idea. I will give it a spin! Looks cool.
Absolutely. This is what already happens in this change, and it is one of the biggest reasons to implement it. The CI scripts build the image locally (using the cache from DockerHub and local sources to incrementally add the latest sources or packages - whatever has changed) and then use those locally built images to run the tests (not the incubator-ci one).
@Fokko @ashb - one more comment - I am now ready to involve the community. I actually managed to hugely simplify the layer structure. I did a lot of testing - including all the timing calculations: how much we save at each stage, whether the layers are needed, whether the wheel-cache layer is really helping. All of this is documented now in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-10+Multi-layered+and+multi-stage+official+Airflow+image

From those tests it seems that the wheel cache did more harm than good. It complicated the whole setup, increased the number of images we had to build, and eventually only provided marginal savings. It took much longer to build the images in general (bigger Dockerfile, more layers, etc.). Also, when I disabled Cython compilation for the Cassandra driver, pip installation time went down by 5(!) minutes or so, and then it became comparable with installation from wheels.

So my most recent proposal is even more simplified - we now have only three stages in the Dockerfile and two images in DockerHub (just the slim Airflow image and the fat Airflow CI image). It's also a bit easier to explain and understand. I will explain it in the community devlist email.
@Fokko ->
@@ -144,6 +144,14 @@ echo
# So for example 'latest-3.6' will produce PYTHON_VERSION=3.6
export PYTHON_VERSION=$(echo "${LOCAL_IMAGE}" | rev | cut -d '-' -f1 | rev )
# In case of CRON jobs on Travis we run builds without cache |
@Fokko -> this will take care of running the full build from scratch daily on Travis. I think that might address a lot of your concerns regarding "will it build from scratch". No caches are used for that build and no images are pulled from DockerHub; it just reinstalls everything from scratch. I guess we can set up notifications to send error messages to the commit list (or maybe it will happen automatically). This way committers can be warned about broken transitive dependencies without actually breaking open pull requests (people will continue to use "working" versions of pip dependencies from the last successfully built DockerHub master build). When it gets fixed by committers and merged into master, it will build, and the new set of dependencies will be used by everyone. I think it will work in a self-healing way with minimal disruption for contributors and committers (and maintainers).
I scheduled a daily build on my Travis CI copy and will let you know when it runs so that you can take a look:
https://travis-ci.org/potiuk/airflow
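A sketch of how the script can branch on cron builds - TRAVIS_EVENT_TYPE is set by Travis itself, the other variable names are illustrative:

# Travis sets TRAVIS_EVENT_TYPE to "cron" for scheduled builds
if [[ "${TRAVIS_EVENT_TYPE:-}" == "cron" ]]; then
    # Daily cron build: no cache at all, everything reinstalled from scratch
    DOCKER_CACHE_FLAGS="--no-cache"
else
    # Regular PR/push build: reuse layers from the last published CI image
    docker pull "${AIRFLOW_CI_IMAGE}" || true
    DOCKER_CACHE_FLAGS="--cache-from ${AIRFLOW_CI_IMAGE}"
fi

# Intentionally unquoted so the flags split into separate arguments
docker build ${DOCKER_CACHE_FLAGS} --tag "${AIRFLOW_CI_IMAGE}" .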
Here is an example of such a cron job @Fokko: https://travis-ci.org/potiuk/airflow/jobs/509211696. It can run daily on Travis, and it runs all the tests on a "from scratch" build. It takes 7:20 minutes to build the image from scratch on a Travis worker (per job), but we get a clear indication of whether a clean Docker build works from scratch with the latest apt-get and pip dependencies, and whether all tests are successful.
This way all the PRs will use the much faster incrementally built image, but we also have a full, longer build that is performed automatically and regularly.
At the same time, such a daily build - since it will be built daily (for example) - might be an early indication that one of the transitive dependencies is now failing installation (but it will not disrupt tests for open PRs, because they will use the latest "correctly building" dependencies from the last successful master build).
I think this is a very nice way of addressing both problems - stability for people working on their PRs and assurance for maintainers that the build is "safe".
I think if we do it this way, we might not even need to solve the transitive dependencies case for work in progress - we could focus only on making sure that there is a stable set of dependencies for official releases somehow.
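As a minimal example of that last idea, the release scripts could simply capture the resolved versions at release time (the file name and location here are just an illustration):

# Run inside the release image/virtualenv to pin what the release was tested with
pip freeze > dist/requirements-release.txt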
@potiuk is there any way to create some subtasks for this ticket and submit smaller PRs? There is a LOT going on here and I'm worried some issues might fly in under the radar.
@dimberman - sure, I can. I imagine it's a pain for reviewers to review it all. I am quite used to splitting and joining commits and re-arranging branches as needed, so this should be no big problem. It actually makes sense to split it, as some of the changes might wait until all the tests are fixed. I already thought about it, and I think this approach might make sense:
It will be difficult to split it into more granular changes, I am afraid, but having those four PRs might actually make sense - it might also make it easier to introduce: we can check step by step that things work and reconfigure DockerHub gradually, verifying that it works fine. Do you think such a split addresses your concern? Or did you have a more incremental approach in mind?
@Fokko @ashb @dimberman: I think the suggestion to split the change into three separate PRs/commits makes perfect sense. I just did it, so I am closing this one. I will keep returning to the still-unresolved discussions here to make sure they are addressed. The JIRA subtasks:
The sequence of implementation is described in AIP-10.
[no-ci]
This PR was split into several smaller sub-tasks to make it easier to review and to merge this change in several steps.
The JIRA subtasks: