Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why are COPY layers cacheable? #1357

Closed
isker opened this issue Jul 23, 2020 · 9 comments · Fixed by #1408
Closed

Why are COPY layers cacheable? #1357

isker opened this issue Jul 23, 2020 · 9 comments · Fixed by #1408
Labels
kind/question Further information is requested

Comments

@isker
Copy link
Contributor

isker commented Jul 23, 2020

The README says that only RUN layers are cacheable. But last year COPYs were also made cacheable
in dbabcb1 (cc @donmccasland).

I'm not an expert in docker, but it's not clear at face value why you'd want to cache such layers. If you're COPYing some data in, you already have it locally, right? But when layer caching is enabled with a remote repository, instead of just executing that local copy, you hash that local data, then go to the network to download a layer containing the same data. Or, upon a miss, you spend time pushing the layer to your cache.

We have a performance-sensitive image build using Kaniko that has a few RUNs that we'd like cached, and some large COPYs that, being cached, are actually offsetting performance improvements from caching RUNs. And our image cache is much larger than we'd like.

@donmccasland can you clarify the intent of cacheable COPYs? And would you be receptive to a PR adding a flag to disable caching for COPY commands?

Thanks for your time!

@kamaln7
Copy link
Contributor

kamaln7 commented Jul 30, 2020

My guess is that this was added for non disk-local contexts like S3 & friends. If that is the case maybe disabling COPY caching when the context is a dir:// or tar:// would be a good balance?

@isker
Copy link
Contributor Author

isker commented Aug 10, 2020

As far as I can tell from reading the docs, you can only COPY things that are already in the local docker build context.

Even if they could be remote, you'd still have to hash them in order to check the layer cache. And to hash them you'd presumably have to download them locally first.

@isker
Copy link
Contributor Author

isker commented Aug 10, 2020

@donmccasland sorry to ping you again, but if you can provide any guidance here I'd appreciate it! If there's anything to do I'm more than happy to implement.

@tejal29
Copy link
Contributor

tejal29 commented Aug 12, 2020

@isker Sorry for the delay. @donmccasland is no longer actively working on this project.

Reason why copy layers are cacheable is to determine if the subsequent run commands can be used from cache.
e.g.
in order to determine if the subsequent Run command can be used from cached layer, kaniko computes if package.json has changed. if it has changed, then cache layer for Run yarn install is invalidated.
If not, then we use the cache layer for Run command.

Copy package.json . 
Run yarn install

The way kaniko caching works is, it uses the cache until a cache key for one command in dockerfile is changed.
once it detects a change, it stops using cache for subsequent run commands.

I would suggest, you could re-arrange your dockerfile so that file copies not affecting run command should come later.
Doed that make sense?

@tejal29 tejal29 added the kind/question Further information is requested label Aug 12, 2020
@isker
Copy link
Contributor Author

isker commented Aug 12, 2020

@tejal29 Thanks for responding.

The way kaniko caching works is, it uses the cache until a cache key for one command in dockerfile is changed.
once it detects a change, it stops using cache for subsequent run commands.

This is what I did not understand. Unfortunately our Dockerfile is fundamentally copy-heavy. The ordering of COPYs and RUNs is already optimal in this regard.

If you'll entertain more questions, why does Kaniko's layer cache work that way? Why can't it go back to the cache after producing a layer locally?

I will open a PR documenting this behavior in the README.

@kamaln7
Copy link
Contributor

kamaln7 commented Aug 15, 2020

Thanks for the explanation @tejal29 that makes sense!

I'm trying to follow the caching logic here and I'm probably missing something obvious, but would something like this work?

With COPY commands specifically, instead of:

  1. calculate a composite key
  2. attempt to retrieve a layer with the same key
  3. if:
    • found: replace the COPY command with a "fake" one that applies the cached layer instead
    • not found: stop the caching process, keep the COPY command as-is.
      • later after it's executed, upload the new cached layer to the registry with the new key for future builds

do something like:

  1. calculate a composite key
  2. check if a remote layer with the same key exists - but do not retrieve it
  3. if:
    • found: keep the COPY command as-is, don't stop the caching process
    • not found: keep the COPY command as-is, stop the caching process
      • later after it's executed, upload the an empty layer to the registry with the new key for future builds

This way the local context is always used as a source for COPY commands, but would also be used to determine whether the existing cache for subsequent layers can be used or not.

Edit: thinking about it some more, we don’t even need to upload an empty layer or check for one’s existence. Just include the COPY command’s key in the subsequent commands’ composite keys-which is already done?

@tejal29
Copy link
Contributor

tejal29 commented Aug 21, 2020

@isker we can do in a person sync sometime if that helps.
I understand that the image cache size issue and would love to speed up builds for Copy heavy dockerfiles.

I like @kamaln7's solution to not stop caching where

  1. calculate the composite key for copy command but only use to seed subsequent Run commands cache key.
    All copy command will be executed from local sources
  2. for Run command, if remote layer with the same key is present, retrieve the layer
  3. if remote layer with the same key does not exist, stop caching subsequent commands.

@kamaln7 would like to submit a PR or a Design Doc for this?
Would love to get this in

@isker
Copy link
Contributor Author

isker commented Aug 31, 2020

@tejal29 I will try to produce a PR for this. It looks to me that we could achieve all of the things you listed just by deleting the code that makes COPYs cacheable.

Like @kamaln7 said, we already calculate the composite cache key for non-cacheable COPYs:

case *commands.CopyCommand:
compositeKey = s.populateCopyCmdCompositeKey(command, v.From(), compositeKey)

This makes me wonder why COPYs were cacheable in the first place 🤔 .

isker pushed a commit to isker/kaniko that referenced this issue Aug 31, 2020
Cached COPY layers are expensive in that they both need to beretrieved
over the network and occupy space in the layer cache.

They are unnecessary in that we already have all resources needed to
execute the COPY locally, and doing so is a trivial file-system
operation.  This is in contrast to RUN layers, which can do
arbitrary and unbounded work.

The end result is that cached COPY commands were more expensive when
cached, not less.  Remove them.

Resolves GoogleContainerTools#1357
isker pushed a commit to isker/kaniko that referenced this issue Aug 31, 2020
Cached COPY layers are expensive in that they both need to be retrieved
over the network and occupy space in the layer cache.

They are unnecessary in that we already have all resources needed to
execute the COPY locally, and doing so is a trivial file-system
operation.  This is in contrast to RUN layers, which can do
arbitrary and unbounded work.

The end result is that cached COPY commands were more expensive when
cached, not less.  Remove them.

Resolves GoogleContainerTools#1357
isker added a commit to isker/kaniko that referenced this issue Aug 31, 2020
Cached COPY layers are expensive in that they both need to be retrieved
over the network and occupy space in the layer cache.

They are unnecessary in that we already have all resources needed to
execute the COPY locally, and doing so is a trivial file-system
operation.  This is in contrast to RUN layers, which can do
arbitrary and unbounded work.

The end result is that cached COPY commands were more expensive when
cached, not less.  Remove them.

Resolves GoogleContainerTools#1357
@isker isker mentioned this issue Aug 31, 2020
4 tasks
isker added a commit to isker/kaniko that referenced this issue Aug 31, 2020
Cached COPY layers are expensive in that they both need to be retrieved
over the network and occupy space in the layer cache.

They are unnecessary in that we already have all resources needed to
execute the COPY locally, and doing so is a trivial file-system
operation.  This is in contrast to RUN layers, which can do
arbitrary and unbounded work.

The end result is that cached COPY commands were more expensive when
cached, not less.  Remove them.

Resolves GoogleContainerTools#1357
isker added a commit to isker/kaniko that referenced this issue Aug 31, 2020
Cached COPY layers are expensive in that they both need to be retrieved
over the network and occupy space in the layer cache.

They are unnecessary in that we already have all resources needed to
execute the COPY locally, and doing so is a trivial file-system
operation.  This is in contrast to RUN layers, which can do
arbitrary and unbounded work.

The end result is that cached COPY commands were more expensive when
cached, not less.  Remove them.

Resolves GoogleContainerTools#1357
isker added a commit to isker/kaniko that referenced this issue Sep 1, 2020
Cached COPY layers are expensive in that they both need to be retrieved
over the network and occupy space in the layer cache.

They are unnecessary in that we already have all resources needed to
execute the COPY locally, and doing so is a trivial file-system
operation.  This is in contrast to RUN layers, which can do
arbitrary and unbounded work.

The end result is that cached COPY commands were more expensive when
cached, not less.  Remove them.

Resolves GoogleContainerTools#1357
isker added a commit to isker/kaniko that referenced this issue Sep 1, 2020
Cached COPY layers are expensive in that they both need to be retrieved
over the network and occupy space in the layer cache.

They are unnecessary in that we already have all resources needed to
execute the COPY locally, and doing so is a trivial file-system
operation.  This is in contrast to RUN layers, which can do
arbitrary and unbounded work.

The end result is that cached COPY commands were more expensive when
cached, not less.  Remove them.

Resolves GoogleContainerTools#1357
tejal29 pushed a commit that referenced this issue Oct 1, 2020
Cached COPY layers are expensive in that they both need to be retrieved
over the network and occupy space in the layer cache.

They are unnecessary in that we already have all resources needed to
execute the COPY locally, and doing so is a trivial file-system
operation.  This is in contrast to RUN layers, which can do
arbitrary and unbounded work.

The end result is that cached COPY commands were more expensive when
cached, not less.  Remove them.

Resolves #1357
@kamaln7
Copy link
Contributor

kamaln7 commented Oct 3, 2020

Amazing, thanks @isker & @tejal29!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants