
metrics for cache size / prune pauses #4401

Open
nicks opened this issue Nov 6, 2023 · 5 comments · May be fixed by #4476
@nicks
Contributor

nicks commented Nov 6, 2023

Why

We've seen issues where buildkit will sometimes pause for a long time while the cache manager is doing a prune().

It's fairly straightforward to reproduce this in isolation. A test that creates 1,000 cache entries and then garbage-collects them holds the cache lock for about 15s on my machine. But we don't have good visibility into how frequent these pauses are in a live system.

Feature request

Can we add a prometheus handler that exposes some metrics about cache / prune behavior?

A good place to start would be to take some inspiration from the pause metrics that Go itself uses for its GC (PauseTotalNs and PauseNs in https://pkg.go.dev/runtime#MemStats)

  • A Gauge metric that tracks how many cache records are currently tracked by the CacheManager
  • A Counter metric that tracks how much time, cumulatively, has been spent in prune()
  • A Histogram metric that tracks how long a single prune() takes (maybe buckets at 0.1s, 1s, 10s, 100s+)
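For illustration, here is a minimal sketch of what such metrics could look like with the Prometheus Go client (client_golang). The metric names, and the idea of wrapping prune() this way, are assumptions made for the sketch rather than existing buildkit code:

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Hypothetical gauge: how many records the cache manager currently tracks.
	cacheRecords = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "buildkit_cache_records",
		Help: "Number of cache records currently tracked by the cache manager.",
	})

	// Hypothetical histogram: duration of a single prune() call, using the
	// bucket boundaries suggested above.
	pruneDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "buildkit_cache_prune_duration_seconds",
		Help:    "Time spent in a single prune() call.",
		Buckets: []float64{0.1, 1, 10, 100},
	})
)

func init() {
	prometheus.MustRegister(cacheRecords, pruneDuration)
}

// observePrune wraps a prune call and records its duration.
func observePrune(prune func() error) error {
	start := time.Now()
	err := prune()
	pruneDuration.Observe(time.Since(start).Seconds())
	return err
}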
@tonistiigi
Member

Regarding Prometheus, I think this should first be added on its own (if we think it is needed); see #1544. That would mean the default Go metrics, plus figuring out how to configure the listening addresses/auth, etc. (and the binary size increase). Then the custom counters can follow.

We already have a debug handler address for access to the Go profilers/tracers, as well as expvar (https://pkg.go.dev/expvar). If there are more properties that should be added to expvar, that can be done.
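For comparison, a minimal sketch of publishing such numbers through expvar; the variable names and the standalone HTTP wiring here are illustrative assumptions, not buildkit's actual debug handler setup:

package main

import (
	"expvar"
	"net/http"
	"time"
)

var (
	// Hypothetical expvar variables; expvar serves everything registered here
	// as JSON under /debug/vars on the default mux.
	cacheRecords = expvar.NewInt("cache_records")
	pruneTotalNs = expvar.NewInt("prune_total_ns")
)

func main() {
	// Example updates; in a real setup the cache manager would keep these current.
	cacheRecords.Set(2006)
	pruneTotalNs.Add(int64(5 * time.Second))

	_ = http.ListenAndServe("localhost:6060", nil)
}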

Regarding prune, if there is a test case that shows prune having an unreasonable impact on a running build, then prune can be updated to avoid that; this shouldn't be very hard. Prune time by itself does not show slowness, because it only matters if something is waiting on it to continue. There is no reason we need to hold a single long lock for the whole prune (and we don't for --keep-storage type prunes). If we release the lock, then just the prune itself would take longer, as it is fastest to do it all at once.
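To illustrate the batching idea, here is a rough sketch of pruning in small batches and releasing the lock in between; this is a hypothetical outline over simplified types, not how the cache manager is actually structured:

package cachesketch

import "sync"

type record struct{ id string }

type manager struct {
	mu      sync.Mutex
	records []record
}

// pruneBatched removes records in small batches, releasing the lock between
// batches so concurrent builds can make progress. The prune as a whole takes
// longer than doing everything under one lock, but individual pauses stay short.
func (m *manager) pruneBatched(batchSize int, remove func(record)) {
	for {
		m.mu.Lock()
		n := batchSize
		if n > len(m.records) {
			n = len(m.records)
		}
		batch := m.records[:n]
		m.records = m.records[n:]
		m.mu.Unlock()

		if len(batch) == 0 {
			return
		}
		for _, r := range batch {
			remove(r) // the potentially slow IO happens outside the lock
		}
	}
}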

@nicks
Contributor Author

nicks commented Nov 9, 2023

Sure! Here's the test case I wrote:

package cache

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"testing"
	"time"

	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/snapshots/native"
	"github.com/moby/buildkit/client"
	"github.com/stretchr/testify/require"
)

// TestBenchmarkPrune creates a batch of cache records and times a single
// Prune() call over all of them.
func TestBenchmarkPrune(b *testing.T) {
	ctx := namespaces.WithNamespace(context.Background(), "buildkit-test")

	tmpdir := b.TempDir()

	snapshotter, err := native.NewSnapshotter(filepath.Join(tmpdir, "snapshots"))
	require.NoError(b, err)
	b.Cleanup(func() {
		require.NoError(b, snapshotter.Close())
	})

	co, cleanup, err := newCacheManager(ctx, b, cmOpt{
		snapshotter:     snapshotter,
		snapshotterName: "native",
	})
	require.NoError(b, err)
	b.Cleanup(cleanup)

	cm := co.manager

	// Create 100 released cache records for prune to collect.
	for j := 0; j < 1; j++ {
		for i := 0; i < 100; i++ {
			active, err := cm.New(ctx, nil, nil, CachePolicyRetain)
			require.NoError(b, err)

			err = active.Release(ctx)
			require.NoError(b, err)
		}
	}

	// Time a single prune over everything created above.
	start := time.Now()
	buf := pruneResultBuffer()
	err = cm.Prune(ctx, buf.C, client.PruneInfo{})
	buf.close()
	require.NoError(b, err)

	_, _ = fmt.Fprintln(os.Stdout, "prune took", time.Since(start))
}

Prune speed is linear in the number of items: 100 items leads to a ~1.5s pause, 1000 items to a ~15s pause, etc.

Ya, I think a good place to start would be to add some metrics around how many cache records there are and how many records are being pruned in a batch, so that our monitoring systems can tell us how often this happens and whether it's worth optimizing further.

I could certainly make expvar work for metric export, though promhttp is definitely more widespread.
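For reference, serving Prometheus metrics over HTTP is only a few lines with promhttp; the listen address below is an arbitrary example, and where such an endpoint would live in buildkitd is exactly the open question in #1544:

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler() serves the default registry, which already includes
	// the standard Go runtime and process metrics.
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe("localhost:9090", nil)
}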

@tonistiigi
Member

Prune speed is linear in the number of items: 100 items leads to a ~1.5s pause, 1000 items to a ~15s pause, etc.

So what do you suggest for the max batch size?

native.NewSnapshotter may not represent the real timing that well. But the main consideration is, again, that the prune time itself does not really matter (and with this change we will make prune slower). The task is to find out how much slower we should make prune so that the build endpoint remains somewhat responsive while a prune happens. Obviously the build will always be slower when other resource-intensive tasks like prune are running, but if prune is, for example, blocked on IO, then the build task could use the CPU during that time.

@tonistiigi
Member

Added a limit for this case in #4413.

But I don't really have a reproducible case for build stalls, as basic batching already happens.

What I did (this is without the #4413 patch):

  • Make a Dockerfile that creates 100 layers.
  • Run it 20 times to create 2000 layers:

for i in $(seq 1 20); do docker buildx build --no-cache .; done

» docker buildx du | wc -l
    2006

  • Test how long pruning all of this takes:

docker buildx prune --all -f  0.15s user 0.18s system 6% cpu 5.133 total

  • Now I repeat the same, but while the prune is running I start a new build that creates containers all the time.

Prune is a bit slower: docker buildx prune --all -f  0.16s user 0.24s system 6% cpu 6.458 total

Build runs slower, but there are no big pauses. The biggest pause is 1.2s.

 => [internal] load build definition from Dockerfile                        0.1s
 => => transferring dockerfile: 2.32kB                                      0.0s
 => [internal] load metadata for docker.io/library/alpine:latest            0.5s
 => [internal] load .dockerignore                                           0.1s
 => => transferring context: 2B                                             0.0s
 => CACHED [  1/101] FROM docker.io/library/alpine:latest@sha256:eece0      0.1s
 => => resolve docker.io/library/alpine:latest@sha256:eece025e432126ce      0.0s
 => [  2/101] RUN touch /file1.txt                                          0.4s
 => [  3/101] RUN touch /file2.txt                                          0.1s
 => [  4/101] RUN touch /file3.txt                                          0.1s
 => [  5/101] RUN touch /file4.txt                                          0.1s
 => [  6/101] RUN touch /file5.txt                                          0.1s
 => [  7/101] RUN touch /file6.txt                                          0.1s
 => [  8/101] RUN touch /file7.txt                                          0.1s
 => [  9/101] RUN touch /file8.txt                                          0.1s
 => [ 10/101] RUN touch /file9.txt                                          0.1s
 => [ 11/101] RUN touch /file10.txt                                         0.1s
 => [ 12/101] RUN touch /file11.txt                                         0.1s
 => [ 13/101] RUN touch /file12.txt                                         0.1s
 => [ 14/101] RUN touch /file13.txt                                         0.1s
 => [ 15/101] RUN touch /file14.txt                                         0.1s
 => [ 16/101] RUN touch /file15.txt                                         0.1s
 => [ 17/101] RUN touch /file16.txt                                         0.1s
 => [ 18/101] RUN touch /file17.txt                                         0.1s
 => [ 19/101] RUN touch /file18.txt                                         0.1s
 => [ 20/101] RUN touch /file19.txt                                         0.1s
 => [ 21/101] RUN touch /file20.txt                                         0.1s
 => [ 22/101] RUN touch /file21.txt                                         0.1s
 => [ 23/101] RUN touch /file22.txt                                         0.1s
 => [ 24/101] RUN touch /file23.txt                                         0.1s
 => [ 25/101] RUN touch /file24.txt                                         0.1s
 => [ 26/101] RUN touch /file25.txt                                         0.2s
 => [ 27/101] RUN touch /file26.txt                                         0.1s
 => [ 28/101] RUN touch /file27.txt                                         1.2s
 => [ 29/101] RUN touch /file28.txt                                         0.1s
 => [ 30/101] RUN touch /file29.txt                                         0.0s
 => [ 31/101] RUN touch /file30.txt                                         0.0s
 => [ 32/101] RUN touch /file31.txt                                         0.0s
 => [ 33/101] RUN touch /file32.txt                                         0.0s
 => [ 34/101] RUN touch /file33.txt                                         0.0s
 => [ 35/101] RUN touch /file34.txt                                         0.0s
 => [ 36/101] RUN touch /file35.txt                                         0.0s
 => [ 37/101] RUN touch /file36.txt                                         0.0s

@jsternberg
Collaborator

Just a note:

A Counter metric that tracks how much time, cumulatively, has been spent in prune()

I think this is covered by the histogram case. Histograms usually include the raw count and sum, which should match this requirement. I'll make sure of that, but is using those parameters fine for this?
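For context, a Prometheus histogram always carries a sample count and sum alongside its buckets, so a separate cumulative counter would indeed be redundant. A small sketch, reusing the hypothetical histogram from the earlier example:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	pruneDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "buildkit_cache_prune_duration_seconds",
		Help:    "Time spent in a single prune() call.",
		Buckets: []float64{0.1, 1, 10, 100},
	})

	// Two hypothetical prune() calls.
	pruneDuration.Observe(1.5)
	pruneDuration.Observe(15)

	// Every histogram carries a sample count and sum alongside its buckets,
	// which together give "number of prunes" and "cumulative time in prune()".
	var m dto.Metric
	_ = pruneDuration.Write(&m)
	fmt.Println("count:", m.Histogram.GetSampleCount()) // 2
	fmt.Println("sum:  ", m.Histogram.GetSampleSum())   // 16.5
}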
