Proposals/pipeline caching #113

Merged
merged 13 commits into from
Apr 26, 2019
Binary file added design/images/pipeline-caching-task-editor.jpg
157 changes: 157 additions & 0 deletions design/pipeline-caching.md
@@ -0,0 +1,157 @@
# Pipeline Caching

*** Status: Work in Progress ***

We have observed the need for Azure Pipelines to implement a dependency caching mechanism that allows steps within a pipeline to be optimized (or even skipped) if a suitable cached version of the outputs of those steps already exists. Caching is effective provided the cost of determining a cache hit and acquiring the contents of the cache is lower than the cost of producing the output again from scratch.

## Proposal

This proposal contains a number of interconnected elements. We have attempted to layer them in this document from the most primitive building blocks up to simplified YAML syntax to make caching "just work" in the majority of cases. Our approach to building a caching mechanism for Azure Pipelines will be to get the fundamental building blocks right and prove we can increase build performance and then evolve the YAML syntax once we know exactly what that syntax will need to describe.

### Restore and Save Cache Tasks

The basic functions of Pipeline Cache will be implemented by two tasks: the ```RestoreCache``` task and the ```SaveCache``` task. Both tasks will specify a collection of files and environment variables which together form a key used to look up the cache and identify a hit or a miss. Each task will also take a collection of paths which will be fetched from or stored in the cache entry associated with that key.

Consider the following Azure Pipelines pipeline defined in YAML:

```yaml
steps:
- script: yarn
- script: yarn build
```

In this example we are working on a React-based application which was bootstrapped using ```create-react-app```. The output of this command is a directory with about 28,000 files, including the ```node_modules``` directory. On a local developer machine the ```yarn``` command will trigger the installation of any missing packages. Typically this will execute quickly because all packages already exist on the developer machine. However, in cloud-based CI environments where build machines are recycled after every build there is no persistent state, which means that Yarn must download all packages from scratch.

Applying caching to this pipeline would involve adding a restore cache step before the ```- script: yarn``` step and a save cache step after it.

```yaml
steps:
- task: RestoreCache@0
  inputs:
    key: |
      package.json
      yarn.lock
    paths: |
      node_modules
- script: yarn
- task: SaveCache@0
  inputs:
    key: |
      package.json
      yarn.lock
    paths: |
      node_modules
- script: yarn build
```

This is the most verbose version of the task syntax. We will provide shortcuts to correctly configure pipeline caching, but under the covers these are the tasks that will be emitted into the pipeline job. When the job runs, the ```RestoreCache``` task will hash ```package.json``` and ```yarn.lock``` and combine the result with some other caching elements such as the operating system and a cache salt (for forced cache invalidation scenarios). It will then query the Pipeline Caching service using the fingerprint derived from those inputs and download the cached content from the Azure Artifacts blob store (if the content is present).

In either case the ```- script: yarn``` command will be executed, and in the case of a cache hit this will be close to a no-op for that command. Restoring the cache takes some time, so it is important to make sure that the technique used for restoring the cache is appropriate for the scenario.
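The fingerprinting step described above can be illustrated with a minimal Python sketch. This is not the task's actual implementation: the function name `compute_fingerprint`, the choice of SHA-256, and the way the operating system name and salt are mixed in are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def compute_fingerprint(key_files, os_name, salt=""):
    """Derive one cache fingerprint from key files, an OS name and a salt.

    Hypothetical helper: the proposal does not specify the hash algorithm
    or how the individual elements are combined.
    """
    digest = hashlib.sha256()
    # Sort so that the order in which key files are listed does not matter.
    for path in sorted(Path(p) for p in key_files):
        # Hash each file separately so file boundaries stay unambiguous.
        file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
        # Only the file name is mixed in, so the result does not depend on
        # where the workspace happens to be checked out.
        digest.update(f"{path.name}:{file_hash}\n".encode())
    digest.update(f"os:{os_name}\n".encode())
    digest.update(f"salt:{salt}\n".encode())
    return digest.hexdigest()
```

Under these assumptions, a change to either key file, a different operating system, or a new salt each produces a different fingerprint and therefore a cache miss. A caller might pass `["package.json", "yarn.lock"]` together with the value of `platform.system()`.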

### Caching Strategies

When restoring from the cache it is important that the technique used to deliver content to disk is optimized for the content being handled. The size and number of files, as well as environmental conditions, can affect which approach is most efficient. Continuing the scenario above, we ran a number of experiments to deliver the content of the ```node_modules``` folder.

| Experiment # | Execution Time | Approach |
| ------------ | -------------- | -------------------------------------------------------------- |
| 1 | 47 seconds | ```yarn``` with clean cache |
| 2 | 19 seconds | ```yarn``` with dirty cache |
| 3 | 47 seconds | dedup file transfer with ```tar``` extraction |
| 4 | 22 seconds | dedup file transfer with direct file placement |
| 5 | 44 seconds | dedup file transfer with direct file placement into yarn cache |
| 6 | 42 seconds | dedup file transfer with ```tar``` & gzip compression |

The timings above are relative and were measured in a local test environment. They are shown here to establish the impact that selecting different caching strategies can have on performance.

To establish a performance baseline, experiment #1 just executes the ```yarn``` command in a workspace with no ```node_modules``` folder and a clean Yarn cache. In experiment #2 we simulate the local development workstation scenario by removing the ```node_modules``` folder but maintaining the Yarn cache directory.

For experiment #3 we used a dedup file transfer mechanism built into Azure Artifacts (used behind the scenes by Universal Packages and Pipeline Artifacts). As part of the experiment we packed the node_modules folder into a tar file (without compression) which resulted in a 150MB file. We were able to transfer that file in about 19 seconds and then unpacking took about another 28 seconds - resulting in the same performance as a Yarn install on a clean cache (no improvement).

In experiment #4 we eliminated the tar file and allowed the dedup file transfer to directly place files on disk. This process took 22 seconds end to end which roughly halves the Yarn installation time.

In experiment #5 we cached the Yarn package cache directory instead of the local ```node_modules``` folder. The download of the packages into the Yarn cache directory took 22 seconds, similar to experiment #4; however, Yarn installation added 22 seconds for linking, which meant it was roughly the same as a clean install. By comparison, running Yarn install after experiment #4 is close to a no-op for Yarn.

In order to explore the archive file scenario a little bit more we did experiment #6 where in addition to creating a ```.tar``` archive, we also added compression to the file (```.tgz```). Due to file size reductions the transfer time was reduced to about 7 seconds, but decompression and placing files using ```tar``` took 35 seconds.
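The phase timings quoted in the experiment descriptions can be added up to check them against the table. The sketch below only restates numbers given in this document; the dictionary layout and function names are illustrative choices, not part of the proposal.

```python
# Baseline: experiment 1, `yarn` with a clean cache (seconds).
BASELINE_SECONDS = 47

# Per-phase timings in seconds, as quoted in the experiment descriptions.
EXPERIMENTS = {
    3: {"transfer": 19, "tar extraction": 28},
    4: {"transfer and placement": 22},
    5: {"transfer": 22, "yarn linking": 22},
    6: {"transfer": 7, "decompress and extract": 35},
}


def total_seconds(phases):
    """Sum the phases of one experiment."""
    return sum(phases.values())


def saving_vs_baseline(phases):
    """Seconds saved compared with a clean `yarn` install."""
    return BASELINE_SECONDS - total_seconds(phases)


for number, phases in EXPERIMENTS.items():
    print(f"experiment {number}: {total_seconds(phases)}s total, "
          f"{saving_vs_baseline(phases)}s saved")
```

Compression cuts the transfer time considerably in experiment #6, but the extraction cost outweighs that gain, so only direct file placement (experiment #4) saves a substantial amount of time in this particular test.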

Based on the above results it would appear that using our dedup file transfer mechanism with direct file placement is the most performant option. However, from internal experimentation at Microsoft we've observed that some larger code bases with hundreds of thousands of files in the ```node_modules``` directory do better with ```.tgz```. We therefore believe that, to serve the breadth of caching requirements of Azure Pipelines users, we'll need to support a number of ways of placing files.

Other, more elaborate caching mechanisms may also be available, such as virtualized file systems with prefetch. These may provide significant performance boosts, particularly in sparse file access scenarios, but they introduce other issues which would require that approach to be strictly opt-in.

The caching strategy would default to the dedup file transfer mechanism, but an option would allow it to be overridden with another approach.

Selecting the cache strategy would simply be a matter of specifying a strategy on the task inputs (for both the restore and save tasks, restore shown below):

```yaml
steps:
- task: RestoreCache@0
  inputs:
    strategy: dedup
    key: |
      package.json
      yarn.lock
    paths: |
      node_modules
```

### YAML syntax

The examples above reference the tasks explicitly. We also want to provide a streamlined syntax for caching. Applied to the example above, it looks like this:

```yaml
steps:
- restoreCache: yarn.lock
Member

@mitchdenny Are we missing a path keyword here? I thought that the caching service wasn't able to save off the relative path. (It would be great if we don't have to make the user say it twice! But I don't recall us solving that problem yet.)

Member Author

No, it's not solved yet - good catch. I'll push an update.

- script: yarn
- saveCache:
    key: yarn.lock
Member

Do we have to say "key" here or can we mirror the restoreCache model of just putting the key after saveCache?

Member Author

It will be possible (and sometimes necessary) to interleave multiple cache restore and save operations. For example:

steps:
- restoreCache: yarn.lock
  path: node_modules
- script: npm install
- script: ./node_modules/bin/grunt build
- restoreCache: Service.csproj
  path: packages
- saveCache: yarn.lock
  path: node_modules
  path: node_modules
- script: dotnet build && dotnet test
- saveCache: Service.csproj
  path: packages

It's a bit contrived but you can probably see how sometimes people will interleave caches (especially if caches are related and interdependent).

That said - maybe there is a simple scenario that we can optimize here. I recall when we caught up in NC that you mentioned that there could be a way of doing something here?

Member

Not sure I follow. I was asking a syntactic question about whether key has to live as a separate input vs if it could just go after saveCache. Given that you did it in your example, I think your answer is yes? 😛

Regarding the path stuff -- you can't have duplicate path keys. You have two options: (1) make path take either a string or an array of strings or (2) take only a string, and in the task, expect a newline-delimited list.

Member Author

What would happen if in V1 of the task we converted the type to string[]? The YAML ends up being the same doesn't it?

Member

We have the ability to say "take a scalar or an array". But if you choose just one in the schema, then your YAML is limited to that one. Does that make sense?

scalar: foo
array:
- bar
- baz

    paths: node_modules
- script: yarn build
```

This more compact syntax just reduces some of the noise around using the caching tasks. Additionally, we are planning to integrate caching into the ```use:``` syntax. This will look like the following:

```yaml
steps:
- use: node
  cache: true
- script: |
    yarn
    yarn build
    yarn test
```

The idea behind the ```use:``` syntax is that given what we are using (in this case ```node```) we make some automatic decisions about what and how to cache. This logic will use various heuristics to make the decision about what to cache, for example, it will look for the presence of a ```yarn.lock``` file or a ```package-lock.json``` to determine where the ```node_modules``` path may be located.

Other ```use:``` statements will have similar heuristics to pick the best defaults.
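As a rough illustration of the lock-file heuristic described for ```use: node```, the Python sketch below searches a workspace for a ```yarn.lock``` or ```package-lock.json``` and assumes that ```node_modules``` sits next to whichever file it finds. The function name `detect_node_cache`, the preference for ```yarn.lock```, and the preference for the shallowest match are assumptions made for this example, not decisions taken in this proposal.

```python
from pathlib import Path

# Lock files the heuristic looks for, in order of preference (assumed order).
LOCK_FILES = ("yarn.lock", "package-lock.json")


def detect_node_cache(root):
    """Guess a cache key file and cache path for a Node workspace.

    Returns a dict with workspace-relative "key" and "path" entries,
    or None when no lock file is found.
    """
    root = Path(root)
    for name in LOCK_FILES:
        candidates = [
            p.relative_to(root)
            for p in root.rglob(name)
            # Ignore lock files that belong to installed packages.
            if "node_modules" not in p.relative_to(root).parts
        ]
        if candidates:
            # Prefer the lock file closest to the workspace root.
            lock = min(candidates, key=lambda p: (len(p.parts), p.as_posix()))
            return {
                "key": lock.as_posix(),
                "path": (lock.parent / "node_modules").as_posix(),
            }
    return None
```

For a workspace with ```yarn.lock``` at its root this yields the same key and path that the explicit task examples spell out by hand.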

Note that the ```use:``` syntax will inject the cache save step at the end of the build process which will not always be desirable. In those cases developers can fall back to specifying the ```restoreCache:``` and ```saveCache:``` YAML statements.

### Cache Scoping

Getting scoping right is important for maximising cache hits while also preventing the cache from becoming an attack vector for inserting malicious code into official/master builds.

As a result, we will automatically scope caches to the branch that they are running against. The caches will also be hierarchical: a build on a feature branch will be able to get a hit on the cache for master, but content that the feature branch stores in its ```saveCache:``` step won't be used by builds of master.

For PR builds, the build will use the cache of either the branch it is merging from or the branch it is merging into, with preference being given to the branch it is merging from. At this stage, PR builds will not be able to store content in the cache, regardless of whether they come from a branch within the repo or from a forked clone.
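The scoping rules above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function names, the default upstream branch name `master`, and the representation of the cache as a dictionary keyed by `(scope, fingerprint)` are all hypothetical and serve only to make the lookup order concrete.

```python
def restore_scopes(branch, upstream="master", pr_target=None):
    """Return the scopes to search on restore, most preferred first."""
    if pr_target is not None:
        # PR build: prefer the branch being merged from, then the target.
        return [branch, pr_target]
    if branch == upstream:
        return [branch]
    # Branch build: own scope first, then fall back to the upstream branch.
    return [branch, upstream]


def save_scope(branch, is_pr=False):
    """Return the scope a build may write to, or None if it may not write."""
    # PR builds cannot store content in the cache at this stage.
    return None if is_pr else branch


def lookup(cache, fingerprint, scopes):
    """Return the first scope holding the fingerprint, or None on a miss."""
    for scope in scopes:
        if (scope, fingerprint) in cache:
            return scope
    return None
```

Because a master build only ever searches its own scope, content saved under a feature branch scope can never be restored into a master build, which is the property that keeps the cache from becoming an attack vector.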

### Cache Expiry

Cache lifetime will be best effort. Our underlying storage will generally keep content for 7 days before it becomes a candidate for eviction, but we won't initially make a guarantee here. We will evaluate the effectiveness of cache durations and listen to community feedback.

### Step Over Support

In the interests of correctness we will not skip build steps by default based on a cache hit. However we will provide a way to emit a variable which can be used in subsequent steps to skip a task if a cache is hit. The usage is as follows:

```yaml
steps:
- restoreCache: yarn.lock
  skipVariable: cache.skipyarn
- script: yarn
  # Run yarn only when the cache was not hit on the source branch.
  condition: ne(variables['cache.skipyarn'], 'sourcehit')
```

The value inserted into the variable specified by ```skipVariable:``` will change depending on whether there was a cache hit or miss, and what kind of hit it was. Example values are:

* sourcehit; used to signal that there is a cache hit on the source branch in a PR.
* targethit; used to signal that there is a cache hit on the target branch in a PR.
* hit; used to signal that there is a cache hit on the current branch (non PR build scenario).
* upstreamhit; used to signal that there is a cache hit from an upstream branch (applies to branch builds and PR builds).
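The mapping from a lookup result to the example values listed above could look like the following Python sketch. The function name and parameters are hypothetical, and the value `miss` for a cache miss is an assumption: the proposal lists only example hit values and does not say what is emitted on a miss.

```python
def skip_variable_value(hit_scope, branch, pr_target=None):
    """Map the scope a cache hit came from to a skipVariable value.

    hit_scope is the branch whose cache produced the hit (None on a miss),
    branch is the branch being built (the source branch for a PR build),
    and pr_target is the PR's target branch (None for non-PR builds).
    """
    if hit_scope is None:
        return "miss"  # assumed value, not specified by the proposal
    if pr_target is not None:
        if hit_scope == branch:
            return "sourcehit"
        if hit_scope == pr_target:
            return "targethit"
        return "upstreamhit"
    if hit_scope == branch:
        return "hit"
    return "upstreamhit"
```

A condition on a later step can then compare the variable against whichever of these values the pipeline author considers safe enough to skip work.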