Create base-image and minimize layer count #324
Conversation
Thanks for contributing this - I see no problems and think we can afford to put this in the next release. What do you think @ash211?
ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark
COPY examples /opt/spark/examples
What's the difference between ADD and COPY?
Looks like ADD supports URLs and extracting files out of tarballs, which COPY does not.
Good summary of ADD vs COPY: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#add-or-copy
ADD is somewhat deprecated because it has complex behaviors. The recommended best practice is to copy tar-balls and unpack them explicitly in a RUN.
The support for unpacking tarballs is actually one of the side effects of ADD that is considered harder to reason about. I think the idea is that, just looking at a Dockerfile, you may not be able to tell exactly what operations that ADD is doing. I don't have a really strong opinion on that; it's arguably more convenient too. But that's the current "official" position.
Like @erikerlandson said, I would prefer to use COPY, since the files are only copied into the image (and no magic is needed).
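To make the trade-off concrete, here is a minimal sketch (the archive name is illustrative, not taken from this PR). ADD would unpack a local tarball implicitly; the COPY-plus-RUN form makes the same operation visible in the Dockerfile:

# Implicit: ADD auto-extracts local tar archives into the target directory
# ADD spark-bin.tgz /opt/

# Explicit equivalent, per the Docker best-practices guide linked above
COPY spark-bin.tgz /tmp/
RUN tar -xzf /tmp/spark-bin.tgz -C /opt/ && rm /tmp/spark-bin.tgz

One caveat: with COPY the tarball still occupies its own layer even after the rm, so ADD can yield a smaller image; the trade-off is explicitness versus size.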
@@ -15,25 +15,12 @@
 # limitations under the License.
 #

-FROM openjdk:8-alpine
+FROM spark-base
Should this have a tag, i.e. latest? How do we ensure that the child images are kept in sync with the right tag of the base?
imo there should be a tag per Spark release, something like spark-base:2.1.0, etc.
I'd propose renaming it spark-kubernetes-base.
How would we handle snapshots then? Or are we only going to push images on every release, at which point we update all of the Dockerfiles?
Ideally these would have version tags like 2.1.0-kubernetes-0.3.0 rather than something like latest (I'm a fan of immutable artifacts). We'd need to have some way of templating the Dockerfiles to do that, though, I think.
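One way that templating could work, sketched here as an assumption rather than anything in this PR, is a build argument for the base tag (this requires a Docker version that supports ARG before FROM, i.e. 17.05+):

ARG BASE_TAG=latest
FROM spark-base:${BASE_TAG}

built with, e.g., docker build --build-arg BASE_TAG=2.1.0-kubernetes-0.3.0 ... so that latest stays the default for snapshot builds.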
latest is for snapshots.
I'm not familiar with how people treat Hadoop versions, but if we need multiple Hadoop versions as part of the build cross-product, then that would also be a part of the tag.
I think what we need is some string replacement in the release process, e.g. when version 2.1 is released, a placeholder in the files (like RELEASE) gets replaced by the actual value.
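For instance, a hypothetical release-script step (the file paths are assumed for the example, not taken from this PR):

# substitute the RELEASE placeholder with the version being cut
sed -i 's/RELEASE/2.1.0-kubernetes-0.3.0/g' dockerfiles/*/Dockerfile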
As long as we build in the correct order (spark-base first) I think we're fine.
@erikerlandson do you feel strongly about renaming this to spark-kubernetes-base? The others are just driver, executor, etc.
My instinct is that it would be preferable to name them all spark-kubernetes-*, but I don't consider it a show-stopper.
I think we want spark-kubernetes because we might namespace-conflict with Nomad in particular, which is another effort being done to run the driver and executors in Docker containers.
Actually, the image tagging question stands as something that might make us want to reconsider this approach.
@johscheuer I'm not sure what the spark-base image buys us on its own. Can you explain more what the benefit is? It seems most of the benefits are in squashing the RUN layers.
In using a base image we can avoid duplicating the Dockerfile code itself. So for example, when we start supporting Python, we only have to include the Python installation in one Dockerfile and not two.
The main idea is, like @mccheah said, to avoid duplicating Dockerfile code. Assume we need to adjust a library or want to include another one in all images; then we would need to modify every image. If you want, I can separate these changes into two PRs: one for the merge of the RUN layers and one for the base image.
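As an illustration, a minimal sketch of that layout (the file paths and the Python package are assumptions for the example, not the PR's actual contents):

# dockerfiles/spark-base/Dockerfile -- common layers live here, once
FROM openjdk:8-alpine
RUN apk add --no-cache bash python
ENV SPARK_HOME /opt/spark
WORKDIR /opt/spark

# dockerfiles/driver/Dockerfile -- each child image inherits all of the above
FROM spark-base
# only driver-specific setup goes here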
docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
docker build -t registry-host:5000/spark-init:latest -f dockerfiles/init-container/Dockerfile .
docker push registry-host:5000/spark-base:latest
docker push registry-host:5000/spark-driver:latest
docker push registry-host:5000/spark-executor:latest
docker push registry-host:5000/spark-init:latest
I think we need a note after this that the build order matters: spark-base must be built first, and then the others can be built afterwards in any order.
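Concretely, something like the following (the spark-base Dockerfile path is an assumption based on the layout of the other images):

# build the base first so that "FROM spark-base" resolves locally
docker build -t spark-base -t registry-host:5000/spark-base:latest -f dockerfiles/spark-base/Dockerfile .
# the dependent images can then be built in any order
docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .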
LGTM pending question for Erik (and updating the branch).
Thanks @johscheuer!
* Create base-image and minimize layer count * Create running-on-kubernetes.md
…-kubernetes Rebase to upstream's version of Kubernetes support.
What changes were proposed in this pull request?
Use a single layer for the RUN commands and tool installation. Also move all the common parts into one base image.
Fixes #323
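Schematically, the layer-count reduction replaces a chain of separate RUN instructions with one (an illustrative before/after, not the exact diff):

# before: every RUN instruction creates its own image layer
RUN apk upgrade --no-cache
RUN apk add --no-cache bash
RUN mkdir -p /opt/spark

# after: one chained RUN, one layer
RUN apk upgrade --no-cache && \
    apk add --no-cache bash && \
    mkdir -p /opt/spark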
How was this patch tested?
Tested with the Minikube integration test (https://github.com/apache-spark-on-k8s/spark/blob/branch-2.1-kubernetes/resource-managers/kubernetes/README.md#preserve-the-minikube-vm)
cc @ash211