Repeated builds using cache produce broken images #1162
I can confirm that this issue also happens randomly to me when caching is enabled. It happens in non-python project as well and the only workaround I found so far is to remove every cache image and re-build the image. The error message when pulling the image is different on OSX and on Linux:
I think it started occurring in
Thanks @gilbsgilbs, I will take a look into this.
Seems related/equal to #1158:

```dockerfile
FROM continuumio/miniconda3:latest
COPY ./environment.yml /app/environment.yml
RUN conda env update -n base -f /app/environment.yml && conda clean --all -y -q && rm -rf /opt/conda/pkgs
COPY ./ourproject.a /app/ourproject.a
COPY ./ourproject.b /app/ourproject.b
COPY ./ourproject.c /app/ourproject.c
COPY ./ourproject.d /app/ourproject.d
COPY ./ourproject.e /app/ourproject.e
COPY ./ourproject.f /app/ourproject.f
ARG VERSION=docker.dev
ENV DOCKER_BUILD_VERSION $VERSION
RUN SETUPTOOLS_SCM_PRETEND_VERSION=$DOCKER_BUILD_VERSION pip install -e /app/ourproject.a && SETUPTOOLS_SCM_PRETEND_VERSION=$DOCKER_BUILD_VERSION pip install -e /app/ourproject.b -e /app/ourproject.c -e /app/ourproject.d -e /app/ourproject.e -e /app/ourproject.f
WORKDIR /work
```

Our kaniko is invoked like this in our .gitlab-ci.yml (basically as recommended by GitLab):

```yaml
.kaniko_job_template: &kaniko_job
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  timeout: 1h

setup:anaconda docker image:
  stage: setup
  <<: *kaniko_job
  script:
    - 'echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json'
    - '/kaniko/warmer --cache-dir=/cache --image=continuumio/miniconda3:latest --verbosity=debug || true'
    - '/kaniko/executor version'
    - '/kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/dockerfiles/conda/Dockerfile --cache=true --cache-dir=/cache --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG-latest --verbosity=debug --cleanup'
```
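The escaped echo that writes /kaniko/.docker/config.json is easy to get wrong. A small sketch (not from the thread) that builds the same JSON with `json.dumps` instead; the environment variable names are the ones GitLab CI provides, and the fallback values are placeholders:

```python
import json
import os

# Same auth file the echo above produces, but built with json.dumps to avoid
# shell-quoting and escaping mistakes. Fallbacks are placeholders for local runs.
config = {
    "auths": {
        os.environ.get("CI_REGISTRY", "registry.example.com"): {
            "username": os.environ.get("CI_REGISTRY_USER", "user"),
            "password": os.environ.get("CI_REGISTRY_PASSWORD", "secret"),
        }
    }
}

# In a CI job this would be written to /kaniko/.docker/config.json.
print(json.dumps(config))
```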
We're also seeing that corrupted images increase in size by O(duplicate of one of the layers). Have you managed to work out which Dockerfile command the additional layer corresponds to? We have suspected that it has to do with layers that don't change the image size.

Re version: we used 0.16 briefly, but didn't notice this issue until 0.17 and up (we upgraded due to other bugs). We assumed that this particular issue was due to our Dockerfiles being a bit "loose" (we haven't linted them), so we just tried to upgrade our way out of it, but now we're close to the target date for our project and it's the last blocker. Also for context, this is how we invoke kaniko:

What we're seeing is that even though the broken layer is in the cache registry, kaniko doesn't detect it and doesn't fail the build. Instead it pushes an image with that broken layer in place.
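The size increase described above suggests one layer blob being referenced twice in the pushed manifest. As a rough check, a sketch (not from the thread; the manifest dict and digests are made up) that finds duplicate layer digests and the wasted bytes in a registry manifest:

```python
import json
from collections import Counter

def find_duplicate_layers(manifest: dict):
    """Return (digest, count, wasted_bytes) for layers listed more than once."""
    layers = manifest.get("layers", [])
    counts = Counter(layer["digest"] for layer in layers)
    sizes = {layer["digest"]: layer["size"] for layer in layers}
    return [(d, n, sizes[d] * (n - 1)) for d, n in counts.items() if n > 1]

# Hypothetical manifest in which one ~130 MB layer was pushed twice.
manifest = {
    "layers": [
        {"digest": "sha256:aaa", "size": 130_000_000},
        {"digest": "sha256:bbb", "size": 60_000_000},
        {"digest": "sha256:aaa", "size": 130_000_000},
    ]
}
print(find_duplicate_layers(manifest))  # → [('sha256:aaa', 2, 130000000)]
```

In practice you would feed this the JSON returned by the registry's manifest endpoint for the broken tag.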
Also running into the error:
Building with image:
Command:
Dockerfile:
Is there anything else we can provide to help with this?
@swist Sorry, with the current situation I haven't been able to get to this sooner.
Is there anything else I am missing? I also deployed a pod with the image produced on GKE and the pod was running fine. I am looking into the code now to see what more I can do.
We're pretty sure that adding a dependency is going to invalidate the broken layer, since it will force a reinstall. What we have found sometimes fails is adding an extra empty file to the context so that it gets copied in just before the last layer. I'll try to get you a minimal reproducible example; unfortunately the only thing I have currently is my org's code.
According to community slack, something like this should do:
I'm having this issue with a non-multistage build for a python application with
It's unclear what triggered the issue, though clearing the cache "fixes" it. If additional information would help, please ask and I can try to provide it.
The only solution that still lets us use cache layers in a Docker registry is to roll back to 0.16.0 from any version from 0.17.0 to 0.19.0. This is a major regression and should be P0.
Still happens for us. Take any multistage build. Add/remove a line that sets
@tejal29, do you have any more context as to how this is happening? Were you able to reproduce it on your setup?
We're seeing this problem also. It seems to happen whenever we build this Dockerfile:
Yes, I was able to reproduce this when I ran the above Dockerfile a third time.
We are also hitting this problem.
image:
command:
More detail in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/28988
Having previously run into the same problem (Python-based, multi-step Dockerfile), I've tried to use the files here to reproduce the issue, but in vain.

@cx1111: I've tried your Dockerfile (the one based on continuumio/miniconda3:4.7.12) on both v0.18.0 and v0.19.0, with an environment.yml file of a single dependency.

@rhs: same here, I didn't manage to reproduce it. Any chance you can share the requirements.txt file? I tried it with a pretty small set of Python requirements, so I don't know if it might be related.
@swist I have tried to use your Dockerfile with a very basic
I've created a separate issue here: #1202
I also encountered this problem. I'm using Google Cloud Build. It builds and works completely fine with
Dockerfile:

```dockerfile
FROM node:12-slim
WORKDIR /app
EXPOSE 8080
CMD ["node", "app-client.js"]
RUN apt-get update && apt-get install --yes --no-install-recommends \
    ca-certificates \
    curl \
    gnupg \
    && echo "deb http://packages.cloud.google.com/apt gcsfuse-stretch main" \
    | tee /etc/apt/sources.list.d/gcsfuse.list \
    && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
    && apt-get update \
    && apt-get install --yes gcsfuse \
    && apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
COPY package*.json ./
RUN npm ci
COPY . .
```

cloudbuild.yaml:

```yaml
steps:
- id: Build and push container
  name: 'gcr.io/kaniko-project/executor:latest'
  args: [
    '--dockerfile=Dockerfile',
    '--destination=$_IMAGE_NAME:$COMMIT_SHA',
    '--destination=$_IMAGE_NAME:$BRANCH_NAME',
    '--cache=true',
  ]
options:
  substitutionOption: ALLOW_LOOSE
substitutions:
  _IMAGE_NAME: gcr.io/$PROJECT_ID/project_name
  _DOCKERFILE_NAME: Dockerfile
  _DOCKERFILE_DIR: ''
```
Finally managed to reproduce it consistently and figured out that it has to do with incorrect whiteout of certain files. I am still not very familiar with the logic in that area, but seeing that it was recently refactored in #1069, perhaps @tejal29 or @cvgw have a better intuition about what may have gone wrong?

To reproduce: if you run on certain Linux kernels, you may fail to build it with the latest kaniko image (see issue #1202).

Dockerfile:
Context:
pyproject.toml is the only file whose contents are meaningful. The text files are just "hello world".
Build command to GCR:
Build it once, and you'll see an image on GCR with size of ~190MB that can be pulled.
Some observations:
The produced image still has the wrong size (320MB instead of 190MB), but it can actually be pulled normally, which indicates that the bug that doubles a layer's size is still not resolved.
We're continuously bumping into this as well and it makes the cache completely unusable. Is there anything we can do to help prioritise this? Thanks for all your effort.
I'm experiencing the same issue on 0.19.0. It works fine on a small Node.js image, but on a more complicated PHP image it dies. Trying now to revert to 0.16.0 to see if that will solve it for me. |
Hey folks, sorry for the delay. I spent yesterday debugging this issue and here are my findings.

Note: this does not happen consistently. After doing some debugging and from the error logs, in the failing cases there were two entries for the same file, i.e. the file appeared in the layer twice, either within the first uncached layer or across layers.
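A layer tar with this symptom can be spotted directly. The sketch below is not from the thread (the file name inside it is made up): it builds a tiny in-memory layer containing the same path twice, then flags the duplicate entry.

```python
import io
import tarfile
from collections import Counter

def duplicate_entries(tar_bytes: bytes) -> dict:
    """Return {name: count} for paths that appear more than once in a layer tar."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        counts = Counter(tar.getnames())
    return {name: n for name, n in counts.items() if n > 1}

# Build a demo layer with a duplicated file entry, as seen in the broken images.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for _ in range(2):  # the same path added twice
        info = tarfile.TarInfo("app/environment.yml")
        data = b"hello"
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

print(duplicate_entries(buf.getvalue()))  # → {'app/environment.yml': 2}
```

On a real image you would run this over each (decompressed) layer blob pulled from the registry.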
I fixed this in PR #1214 by moving Step 0, i.e. the initial snapshot, to when there is a first layer which is not cached. I verified the Dockerfile given by @cx1111 and the example provided by @dani29, and I have pushed this fix in the following images:

Can someone give it a try?
@nethunter Can you please try:
@mitchfriedman Sorry things have been slow. But please try
@IlyaSemenov, thanks for verifying. You would have to delete the remote cache.
Unfortunately
@TimShilov are you making code changes before running each build? Also, are you by any chance reusing the same kaniko pod?
@tejal29 I was making dummy changes (just whitespace) in one file between commits. I just tried one more time to make sure I didn't miss anything: removed the remote cache, and the first image worked fine. After the second build I got the error again. Is there anything else I can do or try to help?
OK, let me try with a GCB build. I am trying to find a small
@TimShilov, I tried your Dockerfile and it did work for me.
Can you please check if there is a substantial increase in the size of the two images (like almost doubling)?
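To check for that "almost doubling", a quick sketch (not from the thread; the manifests and sizes are hypothetical, in bytes) that compares the summed layer sizes of two builds:

```python
def total_layer_size(manifest: dict) -> int:
    """Sum of compressed layer sizes as reported in a registry manifest."""
    return sum(layer["size"] for layer in manifest["layers"])

def looks_doubled(first: dict, second: dict, threshold: float = 1.8) -> bool:
    """True when the second build's layers are roughly 2x the first build's."""
    return total_layer_size(second) >= threshold * total_layer_size(first)

# Hypothetical manifests: a healthy ~190 MB build vs. a suspicious ~380 MB one.
good = {"layers": [{"size": 190_000_000}]}
bad = {"layers": [{"size": 190_000_000}, {"size": 130_000_000}, {"size": 60_000_000}]}
print(looks_doubled(good, bad))  # → True (380 MB vs 190 MB)
```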
I am going to do the build one more time!
Confirming, I saw the error on the 4th build.
@TimShilov I finally got your issue fixed. The condition to detect whether the initial snapshot should be taken was not complete.
I will do some more testing and give you an image to try soon.
@TimShilov , Can you try with this image:
@tejal29 Tried again with
Thanks for your help! 👍
Thanks @TimShilov for your help verifying this build. |
@tejal29 Thanks for fixing this, really appreciate it! I wanted to try using
Previously I was using
It is NOT fixed in debug-v0.20.0
before your echo. And GitLab needs to change its documentation.
Actual behavior
Hey, we're having some problems with kaniko 0.19 and Python. We have a Dockerfile that looks vaguely like this:

For reasons independent of us, we can't actually reorganize the source code to remove the `RUN rm` lines.

What we're seeing is that if our cache ECR repo is empty, then the image is fine. We then build an image using the cache, but change one of the later layers (for example, adding a file to `/service/cat`). The build then completes, but upon pulling the image from ECR to a kubelet, we see:

Failed to pull image
Expected behavior
Subsequent builds using the cache should produce an image that can be run on Kubernetes.
To Reproduce
See above
Additional Information
See above
Set up a poetry project analogous to the issue described in python-poetry/poetry#1757 ("Relative path imports are resolved relative to CWD instead of pyproject.toml").
Triage Notes for the Maintainers
- `--cache` flag