Speed up docker image building and switch base image to alpine #17731

Open · wants to merge 11 commits into master
Conversation

@FrankChen021 (Member) commented Feb 15, 2025

There are several problems in the Dockerfile:

1. Extremely slow builds on Apple Silicon chips

Previously, to allow building the Docker image on Apple Silicon chips like the M1, the Dockerfile forced the build to run under the amd64 platform. This was done to work around node-sass not supporting ARM; see #13012

FROM --platform=linux/amd64 maven:3.9 as builder

However, this drastically slows down the Docker build on these platforms: it takes more than 15 minutes to build an image on my M1 laptop.

The main reason is that Apple Silicon has to run the entire build through an x86 emulation layer.

2. Unfriendly to debug

Currently a distroless base image is used. It's a secure image, but it's unfriendly to debug: there's no curl, no wget, no lsof, and no net-tools. That makes it painful to investigate live issues.

3. web-console is repeatedly built even if it's not changed

Most development does not involve the web-console module, yet it is built as part of the `mvn package` command alongside the other backend services.
Since the web-console module takes time to build, this slows down the whole build process.

There are some other problems as well, described in the following section.

Description of Changes

  1. The entire build is split into two stages: a web-console build stage that runs under the amd64 platform, and a distribution build stage that runs on the local development platform. During the distribution build stage, the pre-built web-console is copied into the final distribution package.

    This improves the build drastically. On my laptop it now takes 120 seconds to complete the web-console build stage and 210 seconds to complete the backend build stage, which is acceptable.

 => [web-console-builder 4/4] RUN --mount=type=cache,target=/root/.m2 if [ "true" = "true" ]; then     cd /src/web-console && mvn -B -ff -DskipUTs clean package; fi       126.4s
 => [builder 4/7] WORKDIR /src                                                                                                                                               0.0s
 => [builder 5/7] COPY --from=web-console-builder /src/web-console/target/web-console*.jar /src/web-console/target/                                                          0.0s
 => [builder 6/7] RUN --mount=type=cache,target=/root/.m2 if [ "true" = "true" ]; then       mvn -B -ff       clean install       -Pdist,bundle-contrib-exts       -Pskip  211.5s
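The two stages roughly mirror the log above; a minimal sketch (the stage names and maven flags follow the build log, the rest is illustrative, not the PR's exact Dockerfile):

```dockerfile
# Stage 1: build the web-console under amd64 (node-sass has no ARM build)
FROM --platform=linux/amd64 maven:3.9 AS web-console-builder
WORKDIR /src
COPY web-console /src/web-console
RUN --mount=type=cache,target=/root/.m2 \
    cd /src/web-console && mvn -B -ff -DskipUTs clean package

# Stage 2: build the backend distribution on the native platform
FROM maven:3.9 AS builder
WORKDIR /src
COPY . /src
# reuse the pre-built web-console jar instead of rebuilding it under emulation
COPY --from=web-console-builder /src/web-console/target/web-console*.jar /src/web-console/target/
RUN --mount=type=cache,target=/root/.m2 \
    mvn -B -ff clean install -Pdist,bundle-contrib-exts
```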
  2. DO NOT use mvn to build the web-console
    This greatly improves build performance when the contents of the web-console directory are unchanged, by leveraging the Docker layer cache.

To achieve this, we build the web-console in a node image directly. In development, when the web-console module is unchanged, this skips the entire web-console build.
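A sketch of building the console in a plain node image (the image tag, paths, and npm scripts are assumptions for illustration):

```dockerfile
FROM node:18 AS web-console-builder
WORKDIR /src/web-console

# copy the lockfiles first, so the dependency install layer is cached
# independently of source edits
COPY web-console/package.json web-console/package-lock.json ./
RUN npm ci

# as long as nothing under web-console/ changed, docker reuses the cached
# layers and this whole stage is effectively free
COPY web-console/ ./
RUN npm run build
```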

  3. Unified the JDK between the build stage and the final run environment

Previously, maven:3.9, which ships with JDK 17, was used for the build stage. This did NOT respect the JDK_VERSION argument in the Dockerfile: if we asked for a JDK 21 build by specifying JDK_VERSION, the distribution was still built under JDK 17 but packaged to run in a JRE 21 environment.

This PR fixes that: the build stage and the final image use the SAME JDK version.
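A sketch of how a single JDK_VERSION argument can drive both stages (the eclipse-temurin image names are an illustrative choice, not necessarily what the PR uses):

```dockerfile
ARG JDK_VERSION=17

# the same ARG selects the JDK for the build stage...
FROM eclipse-temurin:${JDK_VERSION}-jdk AS builder
# ... build the distribution here ...

# ...and the matching JRE for the runtime image, so the artifact is
# built and run under the same Java version
FROM eclipse-temurin:${JDK_VERSION}-jre
```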

  4. Switched the base image from gcr.io/distroless/java$JDK_VERSION-debian12 to alpine

This also drastically simplifies the Dockerfile. Previously we had to install busybox and download bash from somewhere in the Dockerfile, which made the Dockerfile very complicated.

Since alpine ships with a shell, these steps are eliminated. The change does NOT bloat the image: locally, the alpine-based image is 746MB, slightly smaller than the distroless one.

druid                         latest                     6eb4ec6dc77f   34 minutes ago   746MB
druid                         distroless                 1daa75c32b0c   7 hours ago      761MB

Some commonly used tools such as curl, lsof, and net-tools are also packaged into the final Docker image.
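With alpine as the base, adding the debugging tools is a single layer (package names as found in the alpine repositories; the PR's exact list may differ):

```dockerfile
FROM alpine:3

# alpine already ships /bin/sh via busybox; only the extra debug tools
# need to be installed, in one cached layer
RUN apk add --no-cache bash curl lsof net-tools
```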

  5. Removed the evaluation of VERSION

Previously we used the following command to evaluate the project version, but this step takes a VERY long time on my laptop:

RUN --mount=type=cache,target=/root/.m2 VERSION=$(mvn -B -q org.apache.maven.plugins:maven-help-plugin:3.2.0:evaluate \
      -Dexpression=project.version -DforceStdout=true \
    ) \
...

We can see that after 254 seconds, the command is still running.

 => [builder 7/8] RUN VERSION=$(mvn -B -q org.apache.maven.plugins:maven-help-plugin:3.2.0:evaluate       -Dstyle.color=never -Dexpression=project.version -DforceStdout=  254.3s

This step is eliminated: by applying 'clean' to the maven command, we ensure there is only one tarball under the distribution directory, so a wildcard is enough to find the file and decompress it.
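A self-contained demo of the glob approach (the paths are illustrative, not druid's exact layout): since `clean` guarantees exactly one tarball, a wildcard finds it without asking maven for the project version.

```shell
# fabricate a build output directory with a single versioned tarball
rm -rf /tmp/druid-demo && mkdir -p /tmp/druid-demo/dist && cd /tmp/druid-demo
mkdir apache-druid-31.0.0
echo "ok" > apache-druid-31.0.0/README
tar -czf dist/apache-druid-31.0.0-bin.tar.gz apache-druid-31.0.0
rm -rf apache-druid-31.0.0

# the glob resolves to the single tarball a clean build produces
tar -xzf dist/apache-druid-*-bin.tar.gz
cat apache-druid-31.0.0/README
```

This avoids spawning maven at all, which is what made the old VERSION evaluation step so slow.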

  6. Test-related modules are excluded from the distribution stage.

  7. druid.sh is also updated to ensure druid.host has a value before starting the Java process. This helps surface problems earlier.
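A sketch of such a fail-fast guard (the function and variable names are illustrative, not the PR's exact druid.sh code): refuse to start the JVM when druid.host would be empty, so the misconfiguration surfaces immediately rather than as a confusing runtime failure.

```shell
#!/bin/sh
# abort before exec'ing java if the host value is empty
require_druid_host() {
  if [ -z "$1" ]; then
    echo "druid.host is empty; refusing to start" >&2
    return 1
  fi
  echo "starting with druid.host=$1"
}

require_druid_host "$(hostname 2>/dev/null)"
```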

Release note

The default image is switched from gcr.io/distroless/java17-debian12 to alpine

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • been tested in a test Druid cluster.

@FrankChen021 FrankChen021 added the Docker https://hub.docker.com/r/apache/druid label Feb 15, 2025
@github-actions github-actions bot added the GHA label Feb 17, 2025
@kgyrtkirk (Member)

I was taking a look and was wondering about the following:

  • I feel like the BUILD_FROM_SOURCE option is very weird; why build these inside docker instead of packaging the release into a docker image?
    ** is it possible with an M1 to build the dist build on the host and avoid building inside docker?
  • I don't really like that the new docker build customizes the distribution build logic inside the Dockerfile - with the hazard of using differently versioned tools
  • I think the web-console module doesn't correctly support incremental builds, so it gets rebuilt every time; fixing that properly would also help here - using the docker cache implicitly adds an incremental build option...

@FrankChen021 (Member, Author)

I was taking a look and was wondering about the following:

  • I feel like the BUILD_FROM_SOURCE option is very weird; why build these inside docker instead of packaging the release into a docker image?
    ** is it possible with an M1 to build the dist build on the host and avoid building inside docker?

The BUILD_FROM_SOURCE is a legacy feature that I didn't change and kept as-is. However, that is how I built the docker image on my M1 when building directly in docker took too long. The problem on M1 is that building a docker image is divided into two steps: first build the distribution on the host machine, then use that to build the docker image. This should be fixed, because sometimes I can't even remember that I need to follow these two steps to get a docker image.

  • I don't really like that the new docker build customizes the distribution build logic inside the Dockerfile - with the hazard of using differently versioned tools

The core problem here is that the web-console is different from the backend services: it's a front-end project with its own build toolchain.

  • I think the web-console module doesn't correctly support incremental builds, so it gets rebuilt every time; fixing that properly would also help here - using the docker cache implicitly adds an incremental build option...

This is why I made some changes to the web-console module, so that we can use the docker cache.

@kgyrtkirk (Member)

sometimes I can't even remember that I need to follow these two steps to get a docker image.

yeah - things could be complicated; maybe it would be useful to place a script under the dev folder?

This is why I made some changes to the web-console module so that we can use the docker cache

I wonder if there is a way to convince maven not to rebuild it all the time; that comes up in a lot of other places as well, so fixing it more deeply could address those too.

Thank you for the insights. I think the best approach would be to re-pack the dist tarball that was produced outside docker (make BUILD_FROM_SOURCE=false the default); do you think that would work well with your M1-based system?
