Skip to content
This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Create base-image and minimize layer count #324

Conversation

johscheuer
Copy link

What changes were proposed in this pull request?

Use one single layer for the run command and tool installation. Also move all the common parts into one base-image.

Fixes #323

How was this patch tested?

Tested with the Minikube integration test (https://github.com/apache-spark-on-k8s/spark/blob/branch-2.1-kubernetes/resource-managers/kubernetes/README.md#preserve-the-minikube-vm)

cc @ash211

@johscheuer johscheuer force-pushed the rework-dockerfiles branch from b40dec5 to 3ef935f Compare June 2, 2017 07:48
@johscheuer johscheuer force-pushed the rework-dockerfiles branch from b37671b to 5011163 Compare June 2, 2017 08:13
@mccheah
Copy link

mccheah commented Jun 2, 2017

Thanks for contributing this - I see no problems and think we can afford to put this in the next release. What do you think @ash211 ?

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark
COPY examples /opt/spark/examples
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between ADD and COPY?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like ADD supports URLs and extracting files out of tarballs which COPY does not

https://stackoverflow.com/questions/24958140/what-is-the-difference-between-the-copy-and-add-commands-in-a-dockerfile

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good summary of ADD vs COPY : https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#add-or-copy

ADD is somewhat deprecated because it has complex behaviors. Recommended best practice is to copy tar-balls and unpack them explicitly in a RUN

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The support for unpacking tarballs is actually one of the side-effects of ADD that is considered harder to reason about. I think the idea is, just looking at a docker-file you may not be able to tell exactly what operations that ADD is doing. I don't have a real strong opinion on that. It's arguably more convenient too. But that's the current "official" position.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like @erikerlandson said, I would prefer to use COPY since the files only copied into the image (and no magic is needed).

@@ -15,25 +15,12 @@
# limitations under the License.
#

FROM openjdk:8-alpine
FROM spark-base
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have a tag, i.e. latest? How do we ensure that the child images are kept in sync with the right tag of the base?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo there should be a tag per spark release, something like spark-base:2.1.0, etc

I'd propose renaming it spark-kubernetes-base

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we handle snapshots then? Or are we only going to push images on every release, at which point we update all of the Dockerfiles?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally these would have version tags like 2.1.0-kubernetes-0.3.0 rather than something like latest (I'm a fan of immutable artifacts)

We'd need to have some way of templating Dockerfiles to do that though I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latest is for snapshots

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with how people treat hadoop versions, but if we need multiple hadoop versions as part of the build crossproduct, then that would also be a part of the tag

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we need is some string replacement in the release process e.q. when version 2.1 is released a placeholder in the files (like RELEASE) get's replaced by the actual value.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as we build in the correct order (spark-base first) I think we're fine.

@erikerlandson do you feel strongly about renaming this to spark-kubernetes-base? The others are just driver executor etc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My instinct is that it would be preferable to name them all spark-kubernetes-*, but I don't consider it a show stopper

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want spark-kubernetes because we might namespace conflict with Nomad in particular, which is another effort being done to run the driver and executors in Docker containers.

@mccheah
Copy link

mccheah commented Jun 2, 2017

Actually the image tagging question stands as something that might make us want to reconsider this approach.

@ash211
Copy link

ash211 commented Jun 5, 2017

@johscheuer I'm not sure what spark-base gives us that Docker doesn't already provide through its layer deduplication. When I pull the existing driver and executor images before this change, the docker daemon recognizes that the earlier layers have the same hash and reuses the layer, even without the separate tagging for the spark-base layer.

Can you explain more what spark-base does?

It seems most of the benefits are in squashing the RUN commands vs the introduction of spark-base

@mccheah
Copy link

mccheah commented Jun 5, 2017

In using a base image we can avoid duplicating the Dockerfile code itself.

@mccheah
Copy link

mccheah commented Jun 5, 2017

So for example when we start supporting Python we only have to include the Python installation in one Dockerfile and not two.

@johscheuer
Copy link
Author

The main idea is like @mccheah said to avoid duplicating Dockerfile code. Assume we need to adjust a library or we want to include another one in all images then we would need to modify all images.

If you want I can separate these changes into two PR's -> one for the merge of the RUN commands and another one for the base image (open for discussion).

docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
docker build -t registry-host:5000/spark-init:latest -f dockerfiles/init-container/Dockerfile .
docker push registry-host:5000/spark-base:latest
docker push registry-host:5000/spark-driver:latest
docker push registry-host:5000/spark-executor:latest
docker push registry-host:5000/spark-init:latest
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a note after this that the build order matters -- the spark-base must be built first, then the others can be built afterwards in any order

@@ -15,25 +15,12 @@
# limitations under the License.
#

FROM openjdk:8-alpine
FROM spark-base
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as we build in the correct order (spark-base first) I think we're fine.

@erikerlandson do you feel strongly about renaming this to spark-kubernetes-base? The others are just driver executor etc

@ash211
Copy link

ash211 commented Jun 8, 2017

LGTM pending question for Erik (and updating the branch)

@ash211 ash211 merged commit 78baf9b into apache-spark-on-k8s:branch-2.1-kubernetes Jun 8, 2017
@ash211
Copy link

ash211 commented Jun 8, 2017

Thanks @johscheuer !

foxish pushed a commit that referenced this pull request Jul 24, 2017
* Create base-image and minimize layer count

* Create running-on-kubernetes.md
ifilonenko pushed a commit to ifilonenko/spark that referenced this pull request Feb 26, 2019
…-kubernetes

Rebase to upstream's version of Kubernetes support.
puneetloya pushed a commit to puneetloya/spark that referenced this pull request Mar 11, 2019
* Create base-image and minimize layer count

* Create running-on-kubernetes.md
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants