Use CI Caching during Interop Tests #31

Closed
3 tasks
laurentsenta opened this issue Aug 22, 2022 · 13 comments

@laurentsenta
Collaborator

Description

Running the ping interop test takes ~1min30.
Building Testground takes ~5min.
Building the Testground instances takes between 15 and 25 min.

This means 2 things:

  • Maintainers don't want to wait for the test results.
  • Every run downloads every Docker image, Go package, Rust package, etc., so the test is likely to fail because of a network issue (the build + run images in the full interop test total ~14GB).

Path for improvements

Cache build steps

I am investigating this at the moment because I believe this is where we get the most improvement:

  • We don't re-download & rebuild every image and package (lower failure rate due to downloads),
  • We speed up build time
    • maybe 15 min -> 3 min; it might be even better for Rust, which relies on Docker layers to cache the base packages.

It's not trivial:

Cache build artifacts

At the end of the build step, it is possible to retrieve the build artifacts (docker images) produced for each group in the composition file. We could send a "pre-filled" composition file to Testground and skip the part that rebuilds artifacts.

The issue here is: how do you know whether an artifact needs to be rebuilt?
Even for "old" releases (like go-libp2p v0.11), we might change a test in ways that require rebuilding it.

  • benefits: we cache a 50MB image and never rebuild it,
  • issues: we need manual intervention to tell CI when a test needs to be rebuilt from scratch.

Do not rebuild Testground on every test

This would save around 4 min of building go & docker images. But we wouldn't get the stability benefits of NOT downloading the go & rust images + packages.

  • use the go cache
  • use Testground binaries
  • Use pre-built docker containers

Notes

Current times

(measured with a single run, but these Github CI timings are pretty consistent)

@laurentsenta
Collaborator Author

A few notes:

  • I managed to back up and restore the docker cache for the interop (all) tests. The archive is 4.7GB and extracts to 13GB of docker layers.

  • I'm going to try this in CI, only for the interop (all) tests, which is the largest one.

    • My worry is size: the GitHub cache is 10GB, so the cache for this test will use almost half of the repo's cache space. We could back up to S3, but then we must deal with authorizations, etc.
  • Next I'll take a step back and maybe look into the build artifact options:

    • We could hash the build folder, cache a mapping source hash -> [artifact ids], and reuse this to populate the composition. We would build legacy code only once (on test changes) but always rebuild the master + custom branches from scratch.
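
A minimal sketch of the "source hash -> [artifact ids]" idea, assuming a Linux runner with `sha256sum` available. The helper names `hash_dir` and `lookup_artifacts`, and the tab-separated mapping file, are hypothetical and not part of Testground:

```shell
# Hypothetical helper: print one SHA-256 digest summarizing every file under a directory.
hash_dir() {
  dir=$1
  (
    cd "$dir" || exit 1
    # Hash each file, sort the listing by path for a stable order, then hash the listing.
    find . -type f -exec sha256sum {} + | LC_ALL=C sort -k 2 | sha256sum | cut -d ' ' -f 1
  )
}

# Hypothetical helper: print the artifact ids recorded for a source hash.
# $1 = mapping file with lines of "<source hash><TAB><artifact id>", $2 = source hash.
lookup_artifacts() {
  awk -F '\t' -v h="$2" '$1 == h { print $2 }' "$1"
}
```

An empty result from `lookup_artifacts` would mean the build folder changed (or was never built), so the group has to be rebuilt and the mapping updated.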

@laurentsenta
Collaborator Author

Notes about current stats:
#32 (comment)
#32 (comment)

@laurentsenta
Collaborator Author

Related PR: #29

@laurentsenta
Collaborator Author

Run 1: 16 minutes https://github.com/laurentsenta/test-plans/actions/runs/2965877583/attempts/1
Run 2: 6 minutes https://github.com/laurentsenta/test-plans/runs/8119963605?check_suite_focus=true

The cache for rust is 897MB.

  • load cache = 10s
  • restore images = 1m3s
  • build = 4s (10m on first attempt)
  • save images = 2m51s
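
The "save images" and "restore images" steps can be sketched roughly as below. This is an illustration under assumptions, not the script used for these runs: the function names are hypothetical, and it saves every local image by ID (which drops repository tags on reload).

```shell
# Hypothetical helper: save every local image into one gzip-compressed archive.
save_images() {
  archive=$1
  images=$(docker images --quiet | sort -u)
  [ -n "$images" ] || return 0   # nothing to save
  # Word splitting on $images is intentional: one argument per image ID.
  docker save $images | gzip > "$archive"
}

# Hypothetical helper: reload the archive if the cache step produced one.
restore_images() {
  archive=$1
  [ -f "$archive" ] || return 0   # cache miss: nothing to restore
  gunzip -c "$archive" | docker load
}
```

Calling `restore_images` before the build and `save_images` after it mirrors the load/restore/build/save sequence timed above.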

@laurentsenta
Collaborator Author

laurentsenta commented Sep 1, 2022

At the moment I am investigating whether images are shared between tests and whether we can have one 4GB cache for every test.
The rough idea is that we could use the cache to store the list of docker images and also cache the docker images themselves.

It requires a few changes to testground, but basically we'd have:

testground eval -f ./mycomposition --output ./evald-composition.toml
# output the composition with templates & other information eval'd
# we export the `buildKey` for every group here.

# for every group in the evald-composition
# load `cache/test-hash(group.buildKey)-hash(test-files)/images`
#   the list of docker images required by this build (found using labels)
# load `cache/docker-{imageId}/file.tar.gz`
#   contains the `docker save` for that one image
load-cache ./evald-composition.toml

testground build -f ./evald-composition.toml --output ./built-composition.toml
# build the artifacts, & exports artifact ids too

testground run -f ./built-composition.toml
# run the test; thanks to the artifact IDs in the composition we don't even hit the docker cache

# for every group... do the opposite save
# do not upload docker images that already exists in the cache.
save-cache ./evald-composition.toml
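
As a rough illustration of the cache layout described in the comments above, the two paths could be derived like this. `group_cache_dir` and `image_cache_file` are hypothetical names, `sha256sum` is an assumed hash choice, and the test-files digest is assumed to be computed elsewhere:

```shell
# Hypothetical helper: cache directory holding the image list for one composition group.
# $1 = the group's buildKey, $2 = a digest of the test files.
group_cache_dir() {
  key_hash=$(printf '%s' "$1" | sha256sum | cut -d ' ' -f 1)
  printf 'cache/test-%s-%s/images\n' "$key_hash" "$2"
}

# Hypothetical helper: cache file holding the `docker save` output for one image.
# $1 = docker image ID.
image_cache_file() {
  printf 'cache/docker-%s/file.tar.gz\n' "$1"
}
```

Keying the image list on both the buildKey and the test files means a change to either one produces a new directory, so stale lists are never reused.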

Note we have programmatic access to github cache via: https://github.com/actions/toolkit/tree/main/packages/cache#usage

Questions:

  • Are images shared as we expect between all composition runs (just rust, just go, rust + go)?
  • How does the cache size relate to the number of versions? (go is quite naive here)

Rough plan

Once we know if the cache is a reasonable long term solution, the plan is:

  • Implement the testground eval operation (~1 day + 1 day review)
    • we already output the composition on demand, that will be fast.
    • add buildkey to the composition output
  • implement load-cache, save-cache (~1 day + 1 day review)

@laurentsenta
Collaborator Author

Note: at the moment the go + rust tests and the just go / just rust tests are not reusing each other's caches because they use different structures (one uses path: /go, the other uses test plan = libp2p/ping/go). We'd have to merge the approaches so that build configs are reused.

@laurentsenta
Collaborator Author

laurentsenta commented Sep 1, 2022

The all test generates a tar that is 4.7GB. It uses 275 images, which contain 375 layers.

cache is reused between tests if we tweak the other composition files to reuse the same parameters.

we can't cache per image because there are a lot of duplicated layers in that case (gzip is probably less efficient here). I wrote a script to test that assumption and killed it after generating 53GB of images.

Question:

  • How does the cache size relate to the number of versions? (go builder is quite naive when it comes to caching)
  • Can we live by "just" caching the "all" build on merge to master for example?

Alternatives:

  • dig into buildkit and fancier build caches,
  • or toy with creating our own docker save | docker load that reuses layers from different gzip files.

=> exciting options but too demanding for now.

@laurentsenta
Collaborator Author

laurentsenta commented Sep 1, 2022

  • build all with only 1 rust master and 1 go master: 67 images, 1.3GB compressed

  • build all with 2 rust + 1 go: 80 images, 1.3GB

  • build all with 3 rust + 1 go: 93 images, 1.4GB

  • build all with 5 rust + 1 go: 119 images, 1.4GB

  • build all with 5 rust + 2 go: 139 images, 1.5GB

  • build all with 5 rust + 4 go: 179 images, 1.8GB

  • build all with 5 rust + 7 go: 275 images, 4.7GB

  • go layer caching breaks when we copy the go mod, because they are all different.

  • ideally we wouldn't cache the base images, but layer caching breaks when I docker pull them instead of saving them with docker save.
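
A quick back-of-the-envelope calculation on the measurements above shows how differently the two builders scale. It uses only the first and last data points of each series, so the per-version figures are averages (the go growth is visibly not linear):

```shell
# Compressed sizes in MB, taken from the measurements above.
rust1_go1=1300   # 1 rust + 1 go
rust5_go1=1400   # 5 rust + 1 go
rust5_go7=4700   # 5 rust + 7 go

# Average growth per extra version, holding the other language fixed.
per_rust=$(( (rust5_go1 - rust1_go1) / (5 - 1) ))
per_go=$(( (rust5_go7 - rust5_go1) / (7 - 1) ))

echo "per extra rust version: ~${per_rust}MB"
echo "per extra go version: ~${per_go}MB"
```

On these numbers each extra go version costs on average about 550MB of cache against about 25MB for rust, which matches the note that go layer caching breaks once the go mod files differ.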

Todo

  • move the composition file from the test folder to prevent invalidating the cache or implement .testgroundignore
  • look into .(docker|testground)ignore so that adding or changing a go.vXXX.mod file doesn't invalidate every other cache
  • add a run_id to our layers so that we can cache the current run id and avoid piling up images in our cache
  • use && rm -rf /var/lib/apt/lists/* after apt installs.

@laurentsenta
Collaborator Author

laurentsenta commented Sep 2, 2022

Notes:

  • our backups contain the base images (like go and rust), because we can't docker save IMAGE without saving the whole history. It's too bad: we're caching something we can easily docker pull.
  • the base images are quite large: rust slim is 700MB and go is ~1GB. There is a go-alpine image that is smaller, but it is not officially supported (issues with segfaults were reported).
  • we could tweak the docker backup to drop the base image data, for example, but that means messing with docker internals, which will be fragile.

@laurentsenta
Collaborator Author

laurentsenta commented Sep 2, 2022

layers in a go image:
(go mod download is ~176MB)

in GARDEN22:~/dev/plabs/testground/caching-experiment
› docker history 9f650b6804cd
IMAGE          CREATED       CREATED BY                                      SIZE      COMMENT
9f650b6804cd   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   36.8kB
7b4cc7f59fc4   2 hours ago   /bin/sh -c #(nop)  LABEL testground.run_id=4…   0B
11bb1af54c17   2 hours ago   /bin/sh -c #(nop)  LABEL testground.test.tes…   0B
434c5e66d487   2 hours ago   /bin/sh -c #(nop)  LABEL testground.test.lan…   0B
356164bc2a63   2 hours ago   /bin/sh -c #(nop)  LABEL testground.test.lif…   0B
1231f6f8edf1   2 hours ago   /bin/sh -c #(nop)  LABEL testground.image.ty…   0B
0230f1a83a76   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   176MB
d9f254c7a015   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   117kB
0ce33deec313   2 hours ago   /bin/sh -c #(nop) COPY dir:3c8d3422de6eb5e98…   909kB
8eb3fc0a2415   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   117kB
8a69fe45a752   2 hours ago   |5 BUILD_TAGS=-tags v0.22 GO_PROXY=https://p…   174MB
38f067cf0ec5   2 hours ago   /bin/sh -c #(nop) COPY file:c8b0c0135406ab80…   112kB
ae9a2436e533   2 hours ago   /bin/sh -c #(nop) COPY file:247204431e86496f…   5.49kB
518e77e44692   2 hours ago   /bin/sh -c #(nop)  ARG MODFILE_SUM=go.sum       0B
d246605be057   2 hours ago   /bin/sh -c #(nop)  ARG MODFILE=go.mod           0B
f5c875a96ae1   2 hours ago   /bin/sh -c #(nop)  ENV GOCACHE=/go/cache        0B
bdeb647db115   2 hours ago   /bin/sh -c #(nop)  ENV TESTPLAN_EXEC_PKG=.      0B
8093040e067f   2 hours ago   /bin/sh -c #(nop)  ARG BUILD_TAGS               0B
3b62279b1ecb   2 hours ago   /bin/sh -c #(nop)  ARG GO_PROXY=direct          0B
15f7f3bca9a3   2 hours ago   /bin/sh -c #(nop)  ARG TESTPLAN_EXEC_PKG=.      0B
8a333770ad95   2 hours ago   |1 PLAN_PATH=./go/ /bin/sh -c rm -rf ${PLAN_…   0B
aede486e05d3   2 hours ago   /bin/sh -c #(nop)  ENV SDK_DIR=/sdk             0B
589d82316fd6   2 hours ago   /bin/sh -c #(nop)  ENV PLAN_DIR=/plan/./go/     0B
6b093ed21020   2 hours ago   /bin/sh -c #(nop)  ARG PLAN_PATH                0B
a798dce34acd   9 days ago    /bin/sh -c #(nop) WORKDIR /go                   0B
<missing>      9 days ago    /bin/sh -c mkdir -p "$GOPATH/src" "$GOPATH/b…   0B
<missing>      9 days ago    /bin/sh -c #(nop)  ENV PATH=/go/bin:/usr/loc…   0B
<missing>      9 days ago    /bin/sh -c #(nop)  ENV GOPATH=/go               0B
<missing>      9 days ago    /bin/sh -c set -eux;  arch="$(dpkg --print-a…   431MB
<missing>      9 days ago    /bin/sh -c #(nop)  ENV GOLANG_VERSION=1.18.5    0B
<missing>      9 days ago    /bin/sh -c #(nop)  ENV PATH=/usr/local/go/bi…   0B
<missing>      9 days ago    /bin/sh -c set -eux;  apt-get update;  apt-g…   182MB
<missing>      10 days ago   /bin/sh -c apt-get update && apt-get install…   146MB
<missing>      10 days ago   /bin/sh -c set -ex;  if ! command -v gpg > /…   17.5MB
<missing>      10 days ago   /bin/sh -c set -eux;  apt-get update;  apt-g…   16.5MB
<missing>      10 days ago   /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      10 days ago   /bin/sh -c #(nop) ADD file:d420ffdf63082e035…   114MB

layers in a rust image:
(cargo download is ~1.8GB)

in GARDEN22:~/dev/plabs/testground/caching-experiment
› docker history 6d18a5432ba4
IMAGE          CREATED             CREATED BY                                      SIZE      COMMENT
6d18a5432ba4   About an hour ago   /bin/sh -c #(nop)  LABEL testground.run_id=4…   0B
8344b15b1583   About an hour ago   /bin/sh -c #(nop)  ARG RUN_ID=-                 0B
373f506d8dc2   About an hour ago   /bin/sh -c #(nop)  LABEL testground.test.tes…   0B
fd5a8464fe71   About an hour ago   /bin/sh -c #(nop)  LABEL testground.test.lan…   0B
49c2ba7e1643   About an hour ago   /bin/sh -c #(nop)  LABEL testground.test.lif…   0B
f4cd1f8f0984   About an hour ago   /bin/sh -c #(nop)  LABEL testground.image.ty…   0B
704eb8858173   About an hour ago   |4 CARGO_FEATURES=libp2pv0470 CARGO_PATCH= C…   91.2MB
31b8650be5b0   About an hour ago   /bin/sh -c #(nop)  ARG CARGO_FEATURES=          0B
0fa2e9728a08   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   106kB
01fdcf463b24   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   7.04kB
8a8bbf1698ac   About an hour ago   /bin/sh -c #(nop) COPY dir:287c465ded8f0a863…   115kB
683076d07cbd   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   106kB
9dbaef9562fa   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   1.81kB
47904060fca2   About an hour ago   |3 CARGO_PATCH= CARGO_REMOVE= PLAN_PATH=./ru…   0B
6d20ccc0d9df   About an hour ago   /bin/sh -c #(nop)  ARG CARGO_REMOVE=            0B
042e0ed44c14   About an hour ago   /bin/sh -c #(nop)  ARG CARGO_PATCH=             0B
b17a92778733   About an hour ago   |1 PLAN_PATH=./rust/ /bin/sh -c cd ./plan/ &…   1.82GB
b18bd29fe2b6   About an hour ago   /bin/sh -c #(nop) COPY multi:84d89aebb51da85…   106kB
bb7c9d4aeb54   About an hour ago   |1 PLAN_PATH=./rust/ /bin/sh -c echo "fn mai…   124B
829fd4576be7   About an hour ago   |1 PLAN_PATH=./rust/ /bin/sh -c mkdir -p ./p…   0B
f76ade60a89a   About an hour ago   /bin/sh -c #(nop)  ARG PLAN_PATH=./             0B
8b1077dd3fc7   About an hour ago   /bin/sh -c apt-get update &&         apt-get…   93.6MB
03bd094589a7   About an hour ago   /bin/sh -c #(nop) WORKDIR /usr/src/testplan     0B
6a1351d237d8   4 weeks ago         /bin/sh -c set -eux;     apt-get update;    …   634MB
<missing>      4 weeks ago         /bin/sh -c #(nop)  ENV RUSTUP_HOME=/usr/loca…   0B
<missing>      4 weeks ago         /bin/sh -c #(nop)  CMD ["bash"]                 0B
<missing>      4 weeks ago         /bin/sh -c #(nop) ADD file:0eae0dca665c7044b…   80.4MB

@laurentsenta
Collaborator Author

the temporary branch is master...laurentsenta:test-plans:feat/docker-caching (it's a really rough draft).

Draft notes, will complete on Monday:

  • We can drop the build time from 15 minutes to 2 minutes, but we then need to move around 4GB of images, so the net improvement in CI build time is only ~33%.

  • The frustrating part: approx. half of these 4GB are rust & go base images we could just docker pull. But docker won't let us back up ranges of images, and the only workaround is to mess with the backup's internals.

  • We timeboxed this effort because we don't want to over-optimize for GitHub CI if our end goal is moving this workload to EKS. It might be worth looking into BuildKit support too. On Monday I will gather a few more notes on options, and propose alternative approaches.

  • create caches ONLY on merges to master,

  • don't run the large interop tests on PR, only the last versions + current branch. Run the rest on releases.

  • create two caches: a go cache and a rust cache (we can split them using the LABEL I set up)

  • mess with the docker archive's internals: we could tweak the archives before caching them to drop docker images that exist on dockerhub (go and rust images).

  • look into buildkit? I believe this might bring other improvements like shared caches, etc. But Testground doesn't support it from my short experiments.
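
Splitting the cache per language could rely on docker's label filter, sketched below. The helper name `images_with_label` and the label `example.language=rust` are placeholders; the real label keys are whatever the build sets on the images.

```shell
# Hypothetical helper: print the IDs of local images carrying a given label.
# $1 = label selector, either "key" or "key=value".
images_with_label() {
  docker images --quiet --filter "label=$1" | sort -u
}
```

For example, `docker save $(images_with_label example.language=rust) | gzip > rust-cache.tar.gz` would produce the rust-only archive, and the same call with the go label would produce the go one.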

@laurentsenta
Collaborator Author

Docker + Ideal Caching:

Interop Practical Caching Notes:

  • It's hard to cache the "all" test
    • It produces images of ~4GB. We reach GH disk size limits (~10GB) often.
  • We can implement a "simple" caching workflow:
    • When we build master branches, we cache the images produced.
    • When we test a PR, we reuse the build from master, but DO NOT update the cache. Because our caches are so large at the moment (~1 - 4GB), we don't want a pull request to evict the master cache.
  • We have options that seem practical:
    • If we use a single testplan, we can create one cache per language (rust / go) and reuse this cache across test-plans.
    • There would be one cache per libp2p implementation instead of one cache per workflow (which won't scale above 3 - 4 workflows).
  • We have other opportunities:
    • Caching layers directly by digging into docker internals
    • Look into buildkit features like using layer caching through dockerhub
  • The question is now: Is it worth it?
    • We could dig further, but all this work might become useless when we have an EKS cluster.
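
The "simple" workflow above (write the cache only on master builds, read-only on pull requests) boils down to one check on the ref being built. A minimal sketch, assuming it runs inside GitHub Actions where `GITHUB_REF` is set; `should_save_cache` is a hypothetical name:

```shell
# Hypothetical helper: succeed only when the given ref is the master branch.
should_save_cache() {
  [ "$1" = "refs/heads/master" ]
}

if should_save_cache "${GITHUB_REF:-}"; then
  echo "master build: saving images to the cache"
else
  echo "other ref: reusing the master cache without updating it"
fi
```

Pull request runs carry a different `GITHUB_REF`, so they take the read-only branch and cannot evict the master cache.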

Follow-Ups

  • We should implement all of our tests under a single binary (testplan)
    • It will be easier to maintain the tests (only one compat module)
    • It will be easier to cache (only one build)
    • It's how testground is designed (one testplan + many testcase)
  • Recommendation:
    • run the interop (all) tests nightly + before a release
    • run the interop (latests) + interop (cross-versions) on every PR if we're fine with the build time we get from feat: use docker caching during build #38
    • Regroup when the libp2p maintainers have more feedback.
