This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

Create base-image and minimize layer count #324

Merged
4 changes: 3 additions & 1 deletion docs/running-on-kubernetes.md
@@ -49,17 +49,19 @@ If you wish to use pre-built docker images, you may use the images published in
</table>

You may also build these docker images from sources, or customize them as required. Spark distributions include the
-Docker files for the driver, executor, and init-container at `dockerfiles/driver/Dockerfile`,
+Docker files for the base-image, driver, executor, and init-container at `dockerfiles/spark-base/Dockerfile`, `dockerfiles/driver/Dockerfile`,
`dockerfiles/executor/Dockerfile`, and `dockerfiles/init-container/Dockerfile` respectively. Use these Docker files to
build the Docker images, and then tag them with the registry that the images should be sent to. Finally, push the images
to the registry.

For example, if the registry host is `registry-host` and the registry is listening on port 5000:

cd $SPARK_HOME
+docker build -t registry-host:5000/spark-base:latest -f dockerfiles/spark-base/Dockerfile .
docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
docker build -t registry-host:5000/spark-init:latest -f dockerfiles/init-container/Dockerfile .
+docker push registry-host:5000/spark-base:latest
docker push registry-host:5000/spark-driver:latest
docker push registry-host:5000/spark-executor:latest
docker push registry-host:5000/spark-init:latest

I think we need a note after this that the build order matters -- the spark-base must be built first, then the others can be built afterwards in any order
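
For illustration, a minimal sketch of that ordering, reusing the registry-host:5000 example from the docs above:

    # spark-base must be built first; the other images can then be built in any order
    docker build -t registry-host:5000/spark-base:latest -f dockerfiles/spark-base/Dockerfile .
    docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
    docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
    docker build -t registry-host:5000/spark-init:latest -f dockerfiles/init-container/Dockerfile .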

@@ -15,26 +15,13 @@
# limitations under the License.
#

-FROM openjdk:8-alpine
+FROM spark-base

# If this docker file is being used in the context of building your images from a Spark distribution, the docker build
# command should be invoked from the top level directory of the Spark distribution. E.g.:
# docker build -t spark-driver:latest -f dockerfiles/driver/Dockerfile .

-RUN apk upgrade --update
-RUN apk add --update bash tini
-RUN mkdir -p /opt/spark
-RUN touch /opt/spark/RELEASE

-ADD jars /opt/spark/jars
-ADD examples /opt/spark/examples
-ADD bin /opt/spark/bin
-ADD sbin /opt/spark/sbin
-ADD conf /opt/spark/conf

-ENV SPARK_HOME /opt/spark

-WORKDIR /opt/spark
+COPY examples /opt/spark/examples

What's the difference between ADD and COPY?

Looks like ADD supports URLs and extracting files out of tarballs, which COPY does not.

https://stackoverflow.com/questions/24958140/what-is-the-difference-between-the-copy-and-add-commands-in-a-dockerfile

Member

Good summary of ADD vs COPY: https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#add-or-copy

ADD is somewhat deprecated because it has complex behaviors. The recommended best practice is to COPY tar-balls and unpack them explicitly in a RUN step.
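
For illustration, a hedged sketch of the two styles (the spark-dist.tgz file name here is hypothetical, not something in this repo):

    # ADD silently auto-extracts a recognized local tarball into the image
    ADD spark-dist.tgz /opt/

    # COPY only copies; unpacking is an explicit, visible RUN step
    COPY spark-dist.tgz /tmp/spark-dist.tgz
    RUN tar -xzf /tmp/spark-dist.tgz -C /opt/ && rm /tmp/spark-dist.tgz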

Member

The support for unpacking tarballs is actually one of the side effects of ADD that is considered harder to reason about. I think the idea is that, just looking at a Dockerfile, you may not be able to tell exactly what operations ADD is doing. I don't have a really strong opinion on that; it's arguably more convenient too. But that's the current "official" position.

Author

Like @erikerlandson said, I would prefer to use COPY, since the files are only copied into the image (and no magic is needed).


CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && \
if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && \
@@ -15,26 +15,13 @@
# limitations under the License.
#

-FROM openjdk:8-alpine
+FROM spark-base

# If this docker file is being used in the context of building your images from a Spark distribution, the docker build
# command should be invoked from the top level directory of the Spark distribution. E.g.:
# docker build -t spark-executor:latest -f dockerfiles/executor/Dockerfile .

-RUN apk upgrade --update
-RUN apk add --update bash tini
-RUN mkdir -p /opt/spark
-RUN touch /opt/spark/RELEASE

-ADD jars /opt/spark/jars
-ADD examples /opt/spark/examples
-ADD bin /opt/spark/bin
-ADD sbin /opt/spark/sbin
-ADD conf /opt/spark/conf

-ENV SPARK_HOME /opt/spark

-WORKDIR /opt/spark
+COPY examples /opt/spark/examples

# TODO support spark.executor.extraClassPath
CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && \
@@ -15,24 +15,10 @@
# limitations under the License.
#

-FROM openjdk:8-alpine
+FROM spark-base

# If this docker file is being used in the context of building your images from a Spark distribution, the docker build
# command should be invoked from the top level directory of the Spark distribution. E.g.:
# docker build -t spark-executor:latest -f dockerfiles/executor/Dockerfile .

-RUN apk upgrade --update
-RUN apk add --update bash tini
-RUN mkdir -p /opt/spark
-RUN touch /opt/spark/RELEASE

-ADD jars /opt/spark/jars
-ADD bin /opt/spark/bin
-ADD sbin /opt/spark/sbin
-ADD conf /opt/spark/conf

-ENV SPARK_HOME /opt/spark

-WORKDIR /opt/spark

ENTRYPOINT [ "/sbin/tini", "--", "bin/spark-class", "org.apache.spark.deploy.rest.kubernetes.KubernetesSparkDependencyDownloadInitContainer" ]
@@ -15,24 +15,10 @@
# limitations under the License.
#

-FROM openjdk:8-alpine
+FROM spark-base

# If this docker file is being used in the context of building your images from a Spark distribution, the docker build
# command should be invoked from the top level directory of the Spark distribution. E.g.:
# docker build -t spark-executor:latest -f dockerfiles/executor/Dockerfile .

-RUN apk upgrade --update
-RUN apk add --update bash tini
-RUN mkdir -p /opt/spark
-RUN touch /opt/spark/RELEASE

-ADD jars /opt/spark/jars
-ADD bin /opt/spark/bin
-ADD sbin /opt/spark/sbin
-ADD conf /opt/spark/conf

-ENV SPARK_HOME /opt/spark

-WORKDIR /opt/spark

ENTRYPOINT [ "/sbin/tini", "--", "bin/spark-class", "org.apache.spark.deploy.rest.kubernetes.ResourceStagingServer" ]
@@ -15,25 +15,12 @@
# limitations under the License.
#

-FROM openjdk:8-alpine
+FROM spark-base

Should this have a tag, i.e. latest? How do we ensure that the child images are kept in sync with the right tag of the base?

Member

imo there should be a tag per spark release, something like spark-base:2.1.0, etc

I'd propose renaming it spark-kubernetes-base
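
For illustration only, the kind of pinned reference that suggestion implies (both the spark-kubernetes-base name and the 2.1.0 tag come from this thread, not from the PR as written):

    FROM spark-kubernetes-base:2.1.0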

How would we handle snapshots then? Or are we only going to push images on every release, at which point we update all of the Dockerfiles?

ideally these would have version tags like 2.1.0-kubernetes-0.3.0 rather than something like latest (I'm a fan of immutable artifacts)

We'd need to have some way of templating Dockerfiles to do that though I think

Member

latest is for snapshots

Member

I'm not familiar with how people treat hadoop versions, but if we need multiple hadoop versions as part of the build crossproduct, then that would also be a part of the tag

Author

I think what we need is some string replacement in the release process, e.g. when version 2.1 is released, a placeholder in the files (like RELEASE) gets replaced by the actual value.
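
As a rough sketch of that kind of release-time substitution (the BASE_TAG placeholder is hypothetical; the version string is borrowed from the comment above):

    # replace a placeholder in a child Dockerfile at release time,
    # then build the image with the same immutable tag
    sed -i 's/spark-base:BASE_TAG/spark-base:2.1.0-kubernetes-0.3.0/' dockerfiles/driver/Dockerfile
    docker build -t registry-host:5000/spark-driver:2.1.0-kubernetes-0.3.0 -f dockerfiles/driver/Dockerfile .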

as long as we build in the correct order (spark-base first) I think we're fine.

@erikerlandson do you feel strongly about renaming this to spark-kubernetes-base? The others are just driver, executor, etc.

Member

My instinct is that it would be preferable to name them all spark-kubernetes-*, but I don't consider it a show stopper

I think we want spark-kubernetes because we might have a namespace conflict with Nomad in particular, which is another effort to run the driver and executors in Docker containers.


# If this docker file is being used in the context of building your images from a Spark distribution, the docker build
# command should be invoked from the top level directory of the Spark distribution. E.g.:
# docker build -t spark-shuffle:latest -f dockerfiles/shuffle/Dockerfile .

-RUN apk upgrade --update
-RUN apk add --update bash tini
-RUN mkdir -p /opt/spark
-RUN touch /opt/spark/RELEASE

-ADD jars /opt/spark/jars
-ADD examples /opt/spark/examples
-ADD bin /opt/spark/bin
-ADD sbin /opt/spark/sbin
-ADD conf /opt/spark/conf

-ENV SPARK_HOME /opt/spark

-WORKDIR /opt/spark
+COPY examples /opt/spark/examples

ENTRYPOINT [ "/sbin/tini", "--", "bin/spark-class", "org.apache.spark.deploy.ExternalShuffleService", "1" ]
@@ -0,0 +1,35 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

FROM openjdk:8-alpine

# If this docker file is being used in the context of building your images from a Spark distribution, the docker build
# command should be invoked from the top level directory of the Spark distribution. E.g.:
# docker build -t spark-base:latest -f dockerfiles/spark-base/Dockerfile .

RUN apk upgrade --no-cache && \
apk add --no-cache bash tini && \
mkdir -p /opt/spark && \
touch /opt/spark/RELEASE

COPY jars /opt/spark/jars
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY conf /opt/spark/conf

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark
@@ -28,6 +28,7 @@ private[spark] class SparkDockerImageBuilder(private val dockerEnv: Map[String,

private val DOCKER_BUILD_PATH = Paths.get("target", "docker")
// Dockerfile paths must be relative to the build path.
+private val BASE_DOCKER_FILE = "dockerfiles/spark-base/Dockerfile"
private val DRIVER_DOCKER_FILE = "dockerfiles/driver/Dockerfile"
private val EXECUTOR_DOCKER_FILE = "dockerfiles/executor/Dockerfile"
private val SHUFFLE_SERVICE_DOCKER_FILE = "dockerfiles/shuffle-service/Dockerfile"
@@ -60,6 +61,7 @@ private[spark] class SparkDockerImageBuilder(private val dockerEnv: Map[String,

def buildSparkDockerImages(): Unit = {
Eventually.eventually(TIMEOUT, INTERVAL) { dockerClient.ping() }
buildImage("spark-base", BASE_DOCKER_FILE)
buildImage("spark-driver", DRIVER_DOCKER_FILE)
buildImage("spark-executor", EXECUTOR_DOCKER_FILE)
buildImage("spark-shuffle", SHUFFLE_SERVICE_DOCKER_FILE)