
metrics for cache size / prune pauses #4401

Open
nicks opened this issue Nov 6, 2023 · 5 comments · May be fixed by #4476
@nicks
Contributor

nicks commented Nov 6, 2023

Why

We've seen issues where buildkit will sometimes pause for a long time while the cache manager is doing a prune().

It's fairly straightforward to reproduce this in isolation. A test that creates 1,000 cache entries and then garbage-collects them holds the cache lock for about 15s on my machine. But we don't have good visibility into how frequent these pauses are in a live system.

Feature request

Can we add a prometheus handler that exposes some metrics about cache / prune behavior?

A good place to start would be to take some inspiration from the pause metrics that Go itself uses for its GC (PauseTotalNs and PauseNs in https://pkg.go.dev/runtime#MemStats)

  • A Gauge metric that tracks how many cache records are currently tracked by the CacheManager
  • A Counter metric that tracks how much time, cumulatively, has been spent in prune()
  • A Histogram metric that tracks how long a single prune() takes (maybe buckets at 0.1s, 1s, 10s, 100s+)
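For illustration, here is a minimal sketch of what such metrics could look like with the Prometheus Go client (client_golang). The metric names, and the idea of wrapping prune() this way, are assumptions made for the sketch rather than existing buildkit code:

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Hypothetical gauge: how many records the cache manager currently tracks.
	cacheRecords = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "buildkit_cache_records",
		Help: "Number of cache records currently tracked by the cache manager.",
	})

	// Hypothetical histogram: duration of a single prune() call, using the
	// bucket boundaries suggested above.
	pruneDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "buildkit_cache_prune_duration_seconds",
		Help:    "Time spent in a single prune() call.",
		Buckets: []float64{0.1, 1, 10, 100},
	})
)

func init() {
	prometheus.MustRegister(cacheRecords, pruneDuration)
}

// observePrune wraps a prune call and records its duration.
func observePrune(prune func() error) error {
	start := time.Now()
	err := prune()
	pruneDuration.Observe(time.Since(start).Seconds())
	return err
}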
@tonistiigi
Member

Regarding Prometheus, I think this should first be added on its own (if we think it is needed); see #1544. That would mean the default Go metrics, plus figuring out how to configure the listening addresses/auth, etc. (and the binary size increase). Then the custom counters can follow.

We already have a debug handler address for access to the Go profilers/tracers, as well as expvar (https://pkg.go.dev/expvar). If there are more properties that should be added to expvar, that can be done.
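For comparison, a minimal sketch of publishing such numbers through expvar; the variable names and the standalone HTTP wiring here are illustrative assumptions, not buildkit's actual debug handler setup:

package main

import (
	"expvar"
	"net/http"
	"time"
)

var (
	// Hypothetical expvar variables; expvar serves everything registered here
	// as JSON under /debug/vars on the default mux.
	cacheRecords = expvar.NewInt("cache_records")
	pruneTotalNs = expvar.NewInt("prune_total_ns")
)

func main() {
	// Example updates; in a real setup the cache manager would keep these current.
	cacheRecords.Set(2006)
	pruneTotalNs.Add(int64(5 * time.Second))

	_ = http.ListenAndServe("localhost:6060", nil)
}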

Regarding prune, if there is a test case that shows prune having an unreasonable impact on a running build, then prune can be updated to avoid that; this shouldn't be very hard. Prune time by itself does not show slowness, because it only matters if something is waiting on it to continue. There is no reason we need to hold a single long lock for the whole prune (and we don't for --keep-storage type prunes). If we release the lock, then just the prune itself would take longer, as it is fastest to do it all at once.
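To illustrate the batching idea, here is a rough sketch of pruning in small batches and releasing the lock in between; this is a hypothetical outline over simplified types, not how the cache manager is actually structured:

package cachesketch

import "sync"

type record struct{ id string }

type manager struct {
	mu      sync.Mutex
	records []record
}

// pruneBatched removes records in small batches, releasing the lock between
// batches so concurrent builds can make progress. The prune as a whole takes
// longer than doing everything under one lock, but individual pauses stay short.
func (m *manager) pruneBatched(batchSize int, remove func(record)) {
	for {
		m.mu.Lock()
		n := batchSize
		if n > len(m.records) {
			n = len(m.records)
		}
		batch := m.records[:n]
		m.records = m.records[n:]
		m.mu.Unlock()

		if len(batch) == 0 {
			return
		}
		for _, r := range batch {
			remove(r) // the potentially slow IO happens outside the lock
		}
	}
}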

@nicks
Contributor Author

nicks commented Nov 9, 2023

Sure! Here's the test case I wrote:

package cache

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"testing"
	"time"

	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/snapshots/native"
	"github.com/moby/buildkit/client"
	"github.com/stretchr/testify/require"
)

// TestBenchmarkPrune creates a batch of cache records and times a single
// Prune() call over all of them.
func TestBenchmarkPrune(b *testing.T) {
	ctx := namespaces.WithNamespace(context.Background(), "buildkit-test")

	tmpdir := b.TempDir()

	snapshotter, err := native.NewSnapshotter(filepath.Join(tmpdir, "snapshots"))
	require.NoError(b, err)
	b.Cleanup(func() {
		require.NoError(b, snapshotter.Close())
	})

	co, cleanup, err := newCacheManager(ctx, b, cmOpt{
		snapshotter:     snapshotter,
		snapshotterName: "native",
	})
	require.NoError(b, err)
	b.Cleanup(cleanup)

	cm := co.manager

	// Create 100 released cache records for prune to collect.
	for j := 0; j < 1; j++ {
		for i := 0; i < 100; i++ {
			active, err := cm.New(ctx, nil, nil, CachePolicyRetain)
			require.NoError(b, err)

			err = active.Release(ctx)
			require.NoError(b, err)
		}
	}

	// Time a single prune over everything created above.
	start := time.Now()
	buf := pruneResultBuffer()
	err = cm.Prune(ctx, buf.C, client.PruneInfo{})
	buf.close()
	require.NoError(b, err)

	_, _ = fmt.Fprintln(os.Stdout, "prune took", time.Since(start))
}

Prune speed is linear in the number of items: 100 items leads to a ~1.5s pause, 1000 items to a ~15s pause, etc.

Ya, I think a good place to start would be to add some metrics around how many cache records there are and how many records are being pruned in a batch, so that our monitoring systems can tell us how often this happens and whether it's worth optimizing further.

I could certainly make expvar work for metric export, though promhttp is definitely more widespread.
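For reference, serving Prometheus metrics over HTTP is only a few lines with promhttp; the listen address below is an arbitrary example, and where such an endpoint would live in buildkitd is exactly the open question in #1544:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler() serves the default registry, which already includes
	// the standard Go runtime and process metrics.
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe("localhost:9090", nil)
}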

@tonistiigi
Member

Prune speed is linear in the number of items: 100 items leads to a ~1.5s pause, 1000 items to a ~15s pause, etc.

So what do you suggest for the max batch size?

native.NewSnapshotter may not represent the real timing that well. But the main consideration is, again, that the prune time itself does not really matter (and with this change we will make prune slower). The task is to find out how much slower we should make prune so that the build endpoint remains somewhat responsive while a prune happens. Obviously the build will always be slower when other resource-intensive tasks like prune are running, but if prune is, for example, blocked on IO, then the build task could use the CPU during that time.

@tonistiigi
Member

Added a limit for this case in #4413.

But I don't really have a reproducible case for build stalls, as basic batching already happens.

What I did (this is without the #4413 patch):

  • Make a Dockerfile that creates 100 layers.
  • Run it 20 times to create 2000 layers:

for i in $(seq 1 20); do docker buildx build --no-cache .; done

» docker buildx du | wc -l
    2006

  • Test how long pruning all of this takes:

docker buildx prune --all -f  0.15s user 0.18s system 6% cpu 5.133 total

  • Now I repeat the same, but while the prune is running I start a new build that creates containers all the time.

Prune is a bit slower: docker buildx prune --all -f  0.16s user 0.24s system 6% cpu 6.458 total

Build runs slower, but there are no big pauses. The biggest pause is 1.2s.

 => [internal] load build definition from Dockerfile                        0.1s
 => => transferring dockerfile: 2.32kB                                      0.0s
 => [internal] load metadata for docker.io/library/alpine:latest            0.5s
 => [internal] load .dockerignore                                           0.1s
 => => transferring context: 2B                                             0.0s
 => CACHED [  1/101] FROM docker.io/library/alpine:latest@sha256:eece0      0.1s
 => => resolve docker.io/library/alpine:latest@sha256:eece025e432126ce      0.0s
 => [  2/101] RUN touch /file1.txt                                          0.4s
 => [  3/101] RUN touch /file2.txt                                          0.1s
 => [  4/101] RUN touch /file3.txt                                          0.1s
 => [  5/101] RUN touch /file4.txt                                          0.1s
 => [  6/101] RUN touch /file5.txt                                          0.1s
 => [  7/101] RUN touch /file6.txt                                          0.1s
 => [  8/101] RUN touch /file7.txt                                          0.1s
 => [  9/101] RUN touch /file8.txt                                          0.1s
 => [ 10/101] RUN touch /file9.txt                                          0.1s
 => [ 11/101] RUN touch /file10.txt                                         0.1s
 => [ 12/101] RUN touch /file11.txt                                         0.1s
 => [ 13/101] RUN touch /file12.txt                                         0.1s
 => [ 14/101] RUN touch /file13.txt                                         0.1s
 => [ 15/101] RUN touch /file14.txt                                         0.1s
 => [ 16/101] RUN touch /file15.txt                                         0.1s
 => [ 17/101] RUN touch /file16.txt                                         0.1s
 => [ 18/101] RUN touch /file17.txt                                         0.1s
 => [ 19/101] RUN touch /file18.txt                                         0.1s
 => [ 20/101] RUN touch /file19.txt                                         0.1s
 => [ 21/101] RUN touch /file20.txt                                         0.1s
 => [ 22/101] RUN touch /file21.txt                                         0.1s
 => [ 23/101] RUN touch /file22.txt                                         0.1s
 => [ 24/101] RUN touch /file23.txt                                         0.1s
 => [ 25/101] RUN touch /file24.txt                                         0.1s
 => [ 26/101] RUN touch /file25.txt                                         0.2s
 => [ 27/101] RUN touch /file26.txt                                         0.1s
 => [ 28/101] RUN touch /file27.txt                                         1.2s
 => [ 29/101] RUN touch /file28.txt                                         0.1s
 => [ 30/101] RUN touch /file29.txt                                         0.0s
 => [ 31/101] RUN touch /file30.txt                                         0.0s
 => [ 32/101] RUN touch /file31.txt                                         0.0s
 => [ 33/101] RUN touch /file32.txt                                         0.0s
 => [ 34/101] RUN touch /file33.txt                                         0.0s
 => [ 35/101] RUN touch /file34.txt                                         0.0s
 => [ 36/101] RUN touch /file35.txt                                         0.0s
 => [ 37/101] RUN touch /file36.txt                                         0.0s

@jsternberg
Collaborator

Just a note:

A Counter metric that tracks how much time, cumulatively, has been spent in prune()

I think this is covered by the histogram case. Histograms usually include the raw count and sum, which should match this requirement. I'll make sure of that, but is using those parameters fine for this?
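For context, a Prometheus histogram always carries a sample count and sum alongside its buckets, so a separate cumulative counter would indeed be redundant. A small sketch, reusing the hypothetical histogram from the earlier example:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	pruneDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "buildkit_cache_prune_duration_seconds",
		Help:    "Time spent in a single prune() call.",
		Buckets: []float64{0.1, 1, 10, 100},
	})

	// Two hypothetical prune() calls.
	pruneDuration.Observe(1.5)
	pruneDuration.Observe(15)

	// Every histogram carries a sample count and sum alongside its buckets,
	// which together give "number of prunes" and "cumulative time in prune()".
	var m dto.Metric
	_ = pruneDuration.Write(&m)
	fmt.Println("count:", m.Histogram.GetSampleCount()) // 2
	fmt.Println("sum:  ", m.Histogram.GetSampleSum())   // 16.5
}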
