Use CI Caching during Interop Tests #31

laurentsenta · 2022-08-22T14:01:36Z

Description

Running the ping interop test takes ~1min30.
Building Testground takes ~5min.
Building the Testground instances takes between 15 and 25 min.

This means 2 things:

Maintainers don't want to wait for the test results
On every run, we download every docker image, go package, rust package, etc., so the test is likely to fail because of a network issue (the total size of build + run images in the full interop test is ~14GB).

Path for improvements

Cache build steps

I am investigating this at the moment because I believe this is where we get the most improvement:

We don't re-download & rebuild every image and package (lower failure rate due to downloads),
We speed up build time
- maybe 15min -> 3 min, it might be better with Rust that relies on Docker layers to cache the base packages.

It's not trivial:

We need to pick precisely which artifacts need to be cached,
We have to work around multi-stage builds and the layer changes introduced in Docker 1.10
- Related: https://stackoverflow.com/questions/49965396/can-i-obtain-the-docker-layer-history-on-non-final-stage-docker-builds and https://stackoverflow.com/questions/35310212/docker-missing-layer-ids-in-output/35312577#35312577
- I think the solution is to inject a few LABELS in our Dockerfile, I'm investigating this at the moment.

Cache build artifacts

At the end of the build step, it is possible to retrieve the build artifacts (docker images) produced for each group in the composition file. We could send a "pre-filled" composition file to Testground and skip the part that rebuilds the artifacts.

The issue here is: how do you know whether an artifact needs to be rebuilt?
Even for "old" releases (like go-libp2p v0.11), we might change a test in ways that require rebuilding it.

benefits: we cache a 50MB image and never rebuild it,
issues: we need a manual intervention to tell CI when a test needs to be rebuilt from scratch.

Do not rebuild Testground on every test

This would save around 4 min of building go & docker images. But we wouldn't get the stability benefits of NOT downloading the go & rust images + packages.

use the go cache
use Testground binaries
Use pre-built docker containers

Notes

Current times

(measured with a single run, but these Github CI timings are pretty consistent)

rust
- without custom branch + success: 17 min
- with custom branch + success: 24 min
- on fail: ~28 min (timeout)
go
- with or without a custom branch, success or fail: ~15 min
rust + go
- case = latest (go master + rust master + custom branch):
  - on success: ~17min
  - on failure: ~28min
- case = all (~10 versions + custom branch):
  - on success: ~28min
  - on failure: ~44min

laurentsenta · 2022-08-25T13:31:48Z

A few notes:

I managed to back up and restore the docker cache for the interop (all) tests. The archive is 4.7GB and extracts to 13GB of docker layers.
I'm going to try this in CI, only for the interop (all) tests, which is the largest one.
- My worry is size: GitHub cache is 10GB. The cache for this test will use almost half of the repo's cache space. We could backup to S3, but then we must deal with authorizations, etc.
Next I'll take a step back and maybe look into the build artifact options:
- We could hash the build folder, cache a mapping source hash -> [artifact ids], and reuse this to populate the composition. We would build legacy code only once (on test changes) but always rebuild the master + custom branches from scratch.
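A minimal sketch of the "source hash -> [artifact ids]" mapping, assuming the mapping lives in a JSON file stored inside the CI cache. The function names, the JSON layout, and the choice of SHA-256 are assumptions made for illustration; none of this exists in Testground today.

```python
import hashlib
import json
from pathlib import Path
from typing import List, Optional


def hash_build_folder(folder: Path) -> str:
    """Deterministic SHA-256 over the relative paths and contents of every file."""
    digest = hashlib.sha256()
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        digest.update(path.relative_to(folder).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def lookup_artifacts(mapping_file: Path, source_hash: str) -> Optional[List[str]]:
    """Return the cached artifact ids for this source hash, or None on a miss."""
    if not mapping_file.exists():
        return None
    return json.loads(mapping_file.read_text()).get(source_hash)


def record_artifacts(mapping_file: Path, source_hash: str, artifact_ids: List[str]) -> None:
    """Remember which artifacts were built from this exact source tree."""
    mapping = json.loads(mapping_file.read_text()) if mapping_file.exists() else {}
    mapping[source_hash] = artifact_ids
    mapping_file.write_text(json.dumps(mapping, indent=2, sort_keys=True))
```

On a hit, the artifact ids would be used to pre-fill the composition; on a miss (any file in the build folder changed), the group is rebuilt and the new ids are recorded.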

laurentsenta · 2022-08-30T12:12:00Z

Notes about current stats:
#32 (comment)
#32 (comment)

laurentsenta · 2022-08-30T12:21:09Z

Related PR: #29

laurentsenta · 2022-09-01T08:54:03Z

Run 1: 16 minutes https://github.com/laurentsenta/test-plans/actions/runs/2965877583/attempts/1
Run 2: 6 minutes https://github.com/laurentsenta/test-plans/runs/8119963605?check_suite_focus=true

the cache for rust is 897MB

load cache = 10s
restore images = 1m3s
build = 4s (10m on first attempt)
save images = 2m51s
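To make the breakdown easier to compare, here is a small sketch that sums the step timings reported above. The duration parser is a helper written for this note and only handles the "1m3s" style used in this comment.

```python
import re

# Step timings reported above for the cached run (run 2).
CACHED_RUN = {
    "load cache": "10s",
    "restore images": "1m3s",
    "build": "4s",
    "save images": "2m51s",
}
UNCACHED_BUILD = "10m"  # build time reported for the first attempt


def to_seconds(duration: str) -> int:
    """Parse durations written like '10s', '1m3s' or '10m'."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", duration)
    if not duration or match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds or 0)


total = sum(to_seconds(value) for value in CACHED_RUN.values())
overhead = total - to_seconds(CACHED_RUN["build"])

print(f"cached run, build-related steps: {total}s")
print(f"of which cache handling: {overhead}s")
print(f"uncached build alone: {to_seconds(UNCACHED_BUILD)}s")
```

Cache handling dominates the cached run, and saving the images is the largest single step, so skipping the save when nothing changed would matter more than speeding up the build itself.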

laurentsenta · 2022-09-01T09:02:57Z

At the moment I am investigating whether images are shared between tests and whether we can have one 4GB cache for every test.
The rough idea is that we could use the cache to store both the list of docker images and the docker images themselves.

It requires a few changes to testground, but basically we'd have:

testground eval -f ./mycomposition --output ./evald-composition.toml
# output the composition with templates & other information eval'd
# we export the `buildKey` for every group here.

# for every group in the evald-composition
# load `cache/test-hash(group.buildKey)-hash(test-files)/images`
#   the list of docker images required by this build (found using labels)
# load `cache/docker-{imageId}/file.tar.gz`
#   contains the `docker save` for that one image
load-cache ./evald-composition.toml

testground build -f ./evald-composition.toml --output ./built-composition.toml
# build the artifacts, & exports artifact ids too

testground run -f ./built-composition.toml
# run the test; thanks to the artifact id parameters we don't even hit the docker cache

# for every group... do the opposite: save
# do not upload docker images that already exist in the cache.
save-cache ./evald-composition.toml

Note we have programmatic access to github cache via: https://github.com/actions/toolkit/tree/main/packages/cache#usage
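A sketch of how the cache paths in the workflow above could be derived, and how save-cache could skip images that are already stored. The helper names and the truncated SHA-256 are assumptions: the workflow only says "hash(...)" and does not pick a hash function.

```python
import hashlib
from typing import Iterable, List, Set


def short_hash(text: str) -> str:
    """Hypothetical stand-in for the unspecified hash() in the workflow above."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def image_list_path(build_key: str, test_files_hash: str) -> str:
    """Path of the file listing the docker images required by one group."""
    return f"cache/test-{short_hash(build_key)}-{test_files_hash}/images"


def image_archive_path(image_id: str) -> str:
    """Path of the `docker save` archive for a single image."""
    return f"cache/docker-{image_id}/file.tar.gz"


def images_to_upload(required: Iterable[str], already_cached: Set[str]) -> List[str]:
    """save-cache step: only upload images that are not in the cache yet."""
    return [image_id for image_id in required if image_id not in already_cached]
```

For example, `images_to_upload(["9f650b6804cd", "6d18a5432ba4"], {"9f650b6804cd"})` would leave only the second image to upload.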

Questions:

Are images shared as we expect between every composition runs (just rust, just go, rust + go)
How does the cache size relate to the number of versions? (go is quite naive here)

Rough plan

Once we know if the cache is a reasonable long term solution, the plan is:

Implement the testground eval operation (~1 day + 1 day review)
- we already output the composition on demand, that will be fast.
- add buildkey to the composition output
implement load-cache, save-cache (~1 day + 1 day review)

laurentsenta · 2022-09-01T09:41:05Z

Note: at the moment go + rust and just go / just rust tests are not reusing the caches because we use a different structure (one with path: /go other with test plan = libp2p/ping/go). We'd have to merge the approaches so that build configs are reused.

laurentsenta · 2022-09-01T12:59:10Z

the all test generates a tar that is 4.7GB,
Uses 275 images,
Which contains 375 layers.

cache is reused between tests if we tweak the other composition files to reuse the same parameters.

we can't cache per image because there are a lot of duplicated layers in that case (gzip is probably less efficient here). I wrote a script to test that assumption and killed it after generating 53GB of images.
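The blow-up can be illustrated with a toy model: when each image is saved on its own, every archive repeats the layers it shares with the others. The layer names and sizes below are made up for the example; only the mechanism reflects what I observed.

```python
from typing import Dict, List


def per_image_size(images: Dict[str, List[str]], layer_sizes: Dict[str, int]) -> int:
    """Total size when every image is archived separately (shared layers repeated)."""
    return sum(layer_sizes[layer] for layers in images.values() for layer in layers)


def deduplicated_size(images: Dict[str, List[str]], layer_sizes: Dict[str, int]) -> int:
    """Total size when all images go into one archive (each layer stored once)."""
    unique_layers = {layer for layers in images.values() for layer in layers}
    return sum(layer_sizes[layer] for layer in unique_layers)


# Hypothetical sizes in MB: three images built on the same base layer.
LAYER_SIZES = {"base": 900, "deps-a": 176, "deps-b": 176, "deps-c": 176}
IMAGES = {
    "image-a": ["base", "deps-a"],
    "image-b": ["base", "deps-b"],
    "image-c": ["base", "deps-c"],
}

print(per_image_size(IMAGES, LAYER_SIZES), "MB when saved per image")
print(deduplicated_size(IMAGES, LAYER_SIZES), "MB when saved together")
```

With 275 images sharing a handful of large base layers, the per-image total grows with the number of images rather than with the number of distinct layers.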

Question:

How does the cache size relate to the number of versions? (go builder is quite naive when it comes to caching)
Can we live by "just" caching the "all" build on merge to master for example?

Alternative:

is to dig into buildkit and fancier build caches
Or to toy with creating our own docker save | docker load that reuses layers from different gzip files

=> exciting options but too demanding for now.

laurentsenta · 2022-09-01T13:04:50Z

build all with only 1 rust master and 1 go master: 67 images, 1.3GB compressed
build all with 2 rust + 1 go: 80 images, 1.3GB
build all with 3 rust + 1 go: 93 images, 1.4GB
build all with 5 rust + 1 go: 119 images, 1.4GB
build all with 5 rust + 2 go: 139 images, 1.5GB
build all with 5 rust + 4 go: 179 images, 1.8GB
build all with 5 rust + 7 go: 275 images, 4.7GB
go layer caching breaks when we copy the go mod files, because they are all different.
ideally we wouldn't cache the base images, but layer caching breaks when I try to pull them instead of running docker save on them.
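To read the measurements above as a cost per added version, here is a small sketch that computes the extra images and extra compressed size between consecutive rows. The data is copied from the list above; the helper is only for this note.

```python
from typing import List, Tuple

# (rust versions, go versions, images, compressed size in GB)
MEASUREMENTS = [
    (1, 1, 67, 1.3),
    (2, 1, 80, 1.3),
    (3, 1, 93, 1.4),
    (5, 1, 119, 1.4),
    (5, 2, 139, 1.5),
    (5, 4, 179, 1.8),
    (5, 7, 275, 4.7),
]


def marginal_costs(rows) -> List[Tuple[str, float, float]]:
    """For each consecutive pair of rows, return
    (language added, images per added version, GB per added version)."""
    costs = []
    for prev, cur in zip(rows, rows[1:]):
        added = (cur[0] - prev[0]) + (cur[1] - prev[1])
        language = "rust" if cur[0] > prev[0] else "go"
        costs.append((language, (cur[2] - prev[2]) / added, (cur[3] - prev[3]) / added))
    return costs


for language, images, gigabytes in marginal_costs(MEASUREMENTS):
    print(f"+1 {language}: {images:.0f} images, {gigabytes:.2f} GB")
```

Each added rust version costs 13 images and almost no archive growth, while the last three go versions cost 32 images and roughly 1GB each, which matches the note about go layer caching breaking.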

Todo

move the composition file from the test folder to prevent invalidating the cache or implement .testgroundignore
look into .(docker|testground)ignore so that adding or changing a go.vXXX.mod file doesn't invalidate every other cache
add a run_id to our layers so that we can cache the current run id and avoid piling up images in our cache
use && rm -rf /var/lib/apt/lists/* after apt installs.

laurentsenta · 2022-09-02T13:45:06Z

Notes:

Our backups contain the base images (like go and rust), because we can't docker save IMAGE without saving its whole history. It's too bad: we're caching something we could easily docker pull.
The base images are quite large: rust slim is 700MB and go is ~1GB. There is a go-alpine image that is smaller, but it is not officially supported (issues with segfaults were reported).
We could tweak the docker backup to drop the base image data, for example, but that means messing with docker internals, which would be fragile.
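A rough way to quantify the base-image share is to sum the SIZE column of docker history, split between base-image rows and build rows. The sketch below uses the non-zero sizes from the go image listing posted in this thread. It assumes decimal units (1 kB = 1000 B) and assumes that the rows created 9 to 10 days before the build belong to the base image; the parser is a helper written for this note.

```python
import re
from typing import Iterable

UNITS = {"B": 1, "kB": 1000, "MB": 1000**2, "GB": 1000**3}


def parse_size(text: str) -> int:
    """Convert a docker history size such as '36.8kB' or '1.82GB' to bytes."""
    match = re.fullmatch(r"([\d.]+)(B|kB|MB|GB)", text)
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    return int(round(float(match.group(1)) * UNITS[match.group(2)]))


def total_bytes(sizes: Iterable[str]) -> int:
    return sum(parse_size(size) for size in sizes)


# Non-zero sizes from the go image's docker history, oldest rows last.
GO_BUILD_LAYERS = ["36.8kB", "176MB", "117kB", "909kB", "117kB", "174MB", "112kB", "5.49kB"]
GO_BASE_LAYERS = ["431MB", "182MB", "146MB", "17.5MB", "16.5MB", "114MB"]

base = total_bytes(GO_BASE_LAYERS)
build = total_bytes(GO_BUILD_LAYERS)
print(f"base image: {base / 1e6:.0f} MB, build layers: {build / 1e6:.0f} MB")
print(f"base share: {base / (base + build):.0%}")
```

For this one go image the base layers add up to 907MB against about 351MB of build layers, so the part we could docker pull is the larger half of what docker save writes out.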

laurentsenta · 2022-09-02T13:47:37Z

layers in a go image:
(go mod download is ~176MB)

in GARDEN22:~/dev/plabs/testground/caching-experiment
› docker history 9f650b6804cd
IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
9f650b6804cd   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   36.8kB
7b4cc7f59fc4   2 hours ago   /bin/sh -c #(nop)  LABEL testground.run_id=4…   0B
11bb1af54c17   2 hours ago   /bin/sh -c #(nop)  LABEL testground.test.tes…   0B
434c5e66d487   2 hours ago   /bin/sh -c #(nop)  LABEL testground.test.lan…   0B
356164bc2a63   2 hours ago   /bin/sh -c #(nop)  LABEL testground.test.lif…   0B
1231f6f8edf1   2 hours ago   /bin/sh -c #(nop)  LABEL testground.image.ty…   0B
0230f1a83a76   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   176MB
d9f254c7a015   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   117kB
0ce33deec313   2 hours ago   /bin/sh -c #(nop) COPY dir:3c8d3422de6eb5e98…   909kB
8eb3fc0a2415   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   117kB
8a69fe45a752   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   174MB
38f067cf0ec5   2 hours ago   /bin/sh -c #(nop) COPY file:c8b0c0135406ab80…   112kB
ae9a2436e533   2 hours ago   /bin/sh -c #(nop) COPY file:247204431e86496f…   5.49kB
518e77e44692   2 hours ago   /bin/sh -c #(nop)  ARG MODFILE_SUM=go.sum       0B
d246605be057   2 hours ago   /bin/sh -c #(nop)  ARG MODFILE=go.mod           0B
f5c875a96ae1   2 hours ago   /bin/sh -c #(nop)  ENV GOCACHE=/go/cache        0B
bdeb647db115   2 hours ago   /bin/sh -c #(nop)  ENV TESTPLAN_EXEC_PKG=.      0B
8093040e067f   2 hours ago   /bin/sh -c #(nop)  ARG BUILD_TAGS               0B
3b62279b1ecb   2 hours ago   /bin/sh -c #(nop)  ARG GO_PROXY=direct          0B
15f7f3bca9a3   2 hours ago   /bin/sh -c #(nop)  ARG TESTPLAN_EXEC_PKG=.      0B
8a333770ad95   2 hours ago   |1 PLAN_PATH=./go/ /bin/sh -c rm -rf ${PLAN_…   0B
aede486e05d3   2 hours ago   /bin/sh -c #(nop)  ENV SDK_DIR=/sdk             0B
589d82316fd6   2 hours ago   /bin/sh -c #(nop)  ENV PLAN_DIR=/plan/./go/     0B
6b093ed21020   2 hours ago   /bin/sh -c #(nop)  ARG PLAN_PATH                0B
a798dce34acd   9 days ago    /bin/sh -c #(nop) WORKDIR /go                   0B
<missing>      9 days ago    /bin/sh -c mkdir -p "$GOPATH/src" "$GOPATH/b…   0B
<missing>      9 days ago    /bin/sh -c #(nop)  ENV PATH=/go/bin:/usr/loc…   0B
<missing>      9 days ago    /bin/sh -c #(nop)  ENV GOPATH=/go               0B
<missing>      9 days ago    /bin/sh -c set -eux;  arch="$(dpkg --print-a…   431MB
<missing>      9 days ago    /bin/sh -c #(nop)  ENV GOLANG_VERSION=1.18.5    0B
<missing>      9 days ago    /bin/sh -c #(nop)  ENV PATH=/usr/local/go/bi…   0B
<missing>      9 days ago    /bin/sh -c set -eux;  apt-get update;  apt-g…   182MB
<missing>      10 days ago   /bin/sh -c apt-get update && apt-get install…   146MB
<missing>      10 days ago   /bin/sh -c set -ex;  if ! command -v gpg > /…   17.5MB
<missing>      10 days ago   /bin/sh -c set -eux;  apt-get update;  apt-g…   16.5MB
<missing>      10 days ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      10 days ago   /bin/sh -c #(nop) ADD file:d420ffdf63082e035…   114MB

layers in a rust image
(cargo download is ~1.8GB)

in GARDEN22:~/dev/plabs/testground/caching-experiment
› docker history 6d18a5432ba4
IMAGE          CREATED             CREATED BY                                      SIZE      COMMENT
6d18a5432ba4   About an hour ago   /bin/sh -c #(nop)  LABEL testground.run_id=4…   0B
8344b15b1583   About an hour ago   /bin/sh -c #(nop)  ARG RUN_ID=-                 0B
373f506d8dc2   About an hour ago   /bin/sh -c #(nop)  LABEL testground.test.tes…   0B
fd5a8464fe71   About an hour ago   /bin/sh -c #(nop)  LABEL testground.test.lan…   0B
49c2ba7e1643   About an hour ago   /bin/sh -c #(nop)  LABEL testground.test.lif…   0B
f4cd1f8f0984   About an hour ago   /bin/sh -c #(nop)  LABEL testground.image.ty…   0B
704eb8858173   About an hour ago   |4 CARGO_FEATURES=libp2pv0470 CARGO_PATCH= C…   91.2MB
31b8650be5b0   About an hour ago   /bin/sh -c #(nop)  ARG CARGO_FEATURES=          0B
0fa2e9728a08   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   106kB
01fdcf463b24   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   7.04kB
8a8bbf1698ac   About an hour ago   /bin/sh -c #(nop) COPY dir:287c465ded8f0a863…   115kB
683076d07cbd   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   106kB
9dbaef9562fa   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   1.81kB
47904060fca2   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   0B
6d20ccc0d9df   About an hour ago   /bin/sh -c #(nop)  ARG CARGO_REMOVE=            0B
042e0ed44c14   About an hour ago   /bin/sh -c #(nop)  ARG CARGO_PATCH=             0B
b17a92778733   About an hour ago   |1 PLAN_PATH=./rust/ /bin/sh -c cd ./plan/ &…   1.82GB
b18bd29fe2b6   About an hour ago   /bin/sh -c #(nop) COPY multi:84d89aebb51da85…   106kB
bb7c9d4aeb54   About an hour ago   |1 PLAN_PATH=./rust/ /bin/sh -c echo "fn mai…   124B
829fd4576be7   About an hour ago   |1 PLAN_PATH=./rust/ /bin/sh -c mkdir -p ./p…   0B
f76ade60a89a   About an hour ago   /bin/sh -c #(nop)  ARG PLAN_PATH=./             0B
8b1077dd3fc7   About an hour ago   /bin/sh -c apt-get update &&         apt-get…   93.6MB
03bd094589a7   About an hour ago   /bin/sh -c #(nop) WORKDIR /usr/src/testplan     0B
6a1351d237d8   4 weeks ago         /bin/sh -c set -eux;     apt-get update;    …   634MB
<missing>      4 weeks ago         /bin/sh -c #(nop)  ENV RUSTUP_HOME=/usr/loca…   0B
<missing>      4 weeks ago         /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      4 weeks ago         /bin/sh -c #(nop) ADD file:0eae0dca665c7044b…   80.4MB

laurentsenta · 2022-09-02T15:25:55Z

Large test experiment:
first run (using testground build only): https://github.com/laurentsenta/test-plans/runs/8158513878?check_suite_focus=true

second run: https://github.com/laurentsenta/test-plans/runs/8159145324?check_suite_focus=true

laurentsenta · 2022-09-02T16:36:39Z

the temporary branch is master...laurentsenta:test-plans:feat/docker-caching (it's a really rough draft).

Draft notes, will complete on Monday:

We can drop the build time from 15 minutes to 2 minutes, but we need to move around 4GB of images, which means only a ~33% improvement in CI build time.
The frustrating part: approx. half of these 4GB are rust & go base images we could just docker pull. But docker won't let us back up ranges of images, and the only workaround we found means messing with the internals of the backups.
We timebox'd this effort because we don't want to over-optimize for Github CI if our end goal is moving this workload to EKS. It might be worth looking into BuildKit support too. On Monday I will gather a few more notes on options, and propose alternative approaches.
create caches ONLY on merges to master,
don't run the large interop tests on PR, only the last versions + current branch. Run the rest on releases.
create two caches: go cache and rust cache (we can split them using the LABEL I setup)
mess with the docker archive's internals: we could tweak the archives before caching them to drop docker images that exist on dockerhub (go and rust images).
look into buildkit? I believe this might bring other improvements like shared caches, etc. But Testground doesn't support it from my short experiments.

laurentsenta · 2022-09-06T15:18:33Z

Docker + Ideal Caching:

Some things are harder than they seem
- It's hard to cache builder layers with multi-stage
  - We worked around this by using LABEL
  - Related: https://stackoverflow.com/questions/49965396/can-i-obtain-the-docker-layer-history-on-non-final-stage-docker-builds
- It's hard to get consistent information on how caching & layers work since Docker 1.10
  - Related: https://stackoverflow.com/questions/35310212/docker-missing-layer-ids-in-output/35312577#35312577
We can't implement the "ideal cache" because docker save does not let us backup ranges
- If you docker save IMAGE, you will save ALL the layers, including the base image.
- The "ideal" caching workflow is described in a github comment but we can't implement it if we docker save the base image with every layer.
We need to dig into Docker's internals to get docker save to back up ranges
- Ideally we would like to back up build layers separately and WITHOUT caching the base images (~1GB for rust, 4 x 1GB for Go)
- We can't for now. The feature is being discussed.
- We could also modify the archives produced by docker save, but it would be fragile.
docker save + docker load + docker build behaves curiously when it comes to reusing base images
- If you do not docker save all the history used by your image, I experienced cases where the cache would not be reused at all. It looks like docker pull the-base-image generates a new image id which breaks the cache, even though the contents are identical.
- Probably related
  - "docker save" does not store image tags. moby/moby#3877

Interop Practical Caching Notes:

It's hard to cache the "all" test
- It produces images of ~4GB. We reach GH disk size limits (~10GB) often.
We can implement a "simple" caching workflow:
- When we build master branches, we cache the images produced.
- When we test a PR, we reuse the build from master, but DO NOT update the cache. Because our caches are so large at the moment (~1 - 4GB), we don't want a pull request to evict the master cache.
We have options that seem practical:
- If we use a single testplan, we can create one cache per language (rust / go) and reuse this cache across test-plans.
- There would be one cache per libp2p implementation instead of one cache per workflow (which won't scale above 3 - 4 workflows).
We have other opportunities:
- Caching layers directly by digging into docker internals
- Look into buildkit features like using layer caching through dockerhub
The question is now: Is it worth it?
- We could dig further, but all this work might become useless when we have an EKS cluster.
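The "simple" caching workflow above boils down to one decision per CI run: always restore, but only save when building master. A sketch of that rule follows, together with a per-language cache key. The event and ref strings follow the GitHub Actions conventions ("push", "pull_request", "refs/heads/master"); the function names and the key format are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePolicy:
    restore: bool
    save: bool


def cache_policy(event_name: str, ref: str, default_branch: str = "master") -> CachePolicy:
    """Every run may restore; only builds of the default branch may save,
    so that a pull request never evicts the master cache."""
    on_default_branch = event_name == "push" and ref == f"refs/heads/{default_branch}"
    return CachePolicy(restore=True, save=on_default_branch)


def cache_key(language: str) -> str:
    """One cache per libp2p implementation (hypothetical key format)."""
    return f"interop-docker-{language}"
```

For example, `cache_policy("pull_request", "refs/pull/38/merge")` allows restoring but not saving, while `cache_policy("push", "refs/heads/master")` allows both.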

Follow-Ups

We should implement all of our tests under a single binary (testplan)
- It will be easier to maintain the tests (only one compat module)
- It will be easier to cache (only one build)
- It's how testground is designed (one testplan + many testcase)
Recommendation:
- run the interop (all) tests nightly + before a release
- run the interop (latests) + interop (cross-versions) on every PR if we're fine with the build time we get from feat: use docker caching during build #38
- Regroup when the libp2p maintainers have more feedback.

laurentsenta self-assigned this Aug 22, 2022

laurentsenta mentioned this issue Aug 22, 2022

ping/_composition/go.toml: disable go custom & master branches #30

Closed

galargh moved this to 🤔 Triage in InterPlanetary Developer Experience Aug 29, 2022

galargh added this to InterPlanetary Developer Experience Aug 29, 2022

laurentsenta mentioned this issue Aug 30, 2022

EPIC: interop testing for go and rust libp2p - Ping Test #35

Closed

6 tasks

laurentsenta mentioned this issue Sep 6, 2022

feat: use docker caching during build #38

Closed

8 tasks

This was referenced Sep 21, 2022

ping/: Update to rust-libp2p release v0.48.0 and master v0.49.0 #41

Merged

.github/workflows: use write-artifacts and prevent rebuild #40

Merged

p-shahi closed this as completed Feb 7, 2023

github-project-automation bot moved this from 🤔 Triage to 🥳 Done in InterPlanetary Developer Experience Feb 7, 2023
