Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache: capture metrics related to cache records and pruning #4476

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jsternberg
Copy link
Collaborator

Fixes #4401.

@crazy-max
Copy link
Member

I guess this replaces #4464?

@jsternberg
Copy link
Collaborator Author

I wasn't aware of that PR but yes I think this would likely replace the need for that PR.

@jsternberg jsternberg force-pushed the otel-cache-manager-metrics branch from a2a194d to 16ac232 Compare December 8, 2023 18:58
@jsternberg
Copy link
Collaborator Author

@crazy-max I took a look at it again and it seems that PR is trying to do something different. It seems like they're trying to find the information about a single build based on the progress writer and output that as JSON. This PR is more about the overall system itself. So this PR would capture how many times and how long we spent in pruning the cache and also would show how many cache entries there are.

return stats
}

func (cm *cacheManager) collectMetrics(ctx context.Context, o metric.Observer) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this get called?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets called automatically during a metrics collection interval (defined by the reader). So for the periodic reader, every 60 seconds by default. For the prometheus reader, whenever /metrics is called.

See the invocation of RegisterCallback earlier in this file and the observable gauge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for the periodic reader, every 60 seconds by default. For the prometheus reader, whenever /metrics is called.

q: Does this mean that the calls are duplicated because there are 2 readers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked this. Short answer, yes.

The call happens for each reader so it'll happen each time /metrics is hit and also every 60 seconds.

Alternatively, if the call never happens because the metrics are never checked, it happens zero times. So if the otel collector isn't configured and the prometheus endpoint doesn't exist or isn't invoked.

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would help to get a simple guide under docs for a workflow using this data. Like we have for tracing https://docs.docker.com/build/building/opentelemetry/ . I guess /metrics side is somewhat more obvious but more interested in the otlp export side (eg. how to see it in grafana or similar).

Question: Do you think it could make sense to move #3860 to use otlp exporters as well and use recorder mechanism like we do for traces to capture the metrics for individual containers. Maybe this would have some builtin logic for combining profiling points or make it easier to visualize. Otoh. atm we have neat typed structures while this is a single string key for each value.

docker-bake.hcl Outdated
Comment on lines 140 to 143
//{ name = "labs", tags = "dfrunsecurity dfparents", target = "golangci-lint" },
//{ name = "nydus", tags = "nydus", target = "golangci-lint" },
//{ name = "yaml", tags = "", target = "yamllint" },
//{ name = "proto", tags = "", target = "protolint" },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be reverted since #4490

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. This was for testing and was accidentally committed.

@jsternberg
Copy link
Collaborator Author

@tonistiigi yes I think updating the docs would be helpful. I've also been considering adapting the local docker compose I use for development and adapting it to be a little more general just to help facilitate some workflows. I was thinking something like having the compose file launch buildkit while including configuration for the debugger, jaeger (for tracing), grafana (for viewing metrics), and then something to store the metrics outputs. If there's some interest there, I'll try some stuff out.

For the build resource metrics, I do think it likely makes sense to convert some of those to OTLP metrics or to have OTLP metrics be mimicked along with those. I think it might be worth discussing more of what this might look like as I'm not really sure what the best way is.

@jsternberg jsternberg force-pushed the otel-cache-manager-metrics branch from 16ac232 to 7538521 Compare December 27, 2023 20:05
@jsternberg jsternberg force-pushed the otel-cache-manager-metrics branch from 7538521 to 9b98442 Compare December 27, 2023 21:42

const (
instrumentationName = "github.com/moby/buildkit/cache"
metricCacheRecords = "cache.records.count"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we should have a cache.records.size to collect the cache size or maybe split between cache.records.shared.size, cache.records.private.size, cache.records.reclaimable.size?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure to be honest. I do think that will likely take a bit more effort though to implement so I left it out of this initial version. The reason is because I wasn't quite sure how to determine the disk usage in an efficient way taking into account mutable and immutable. I figured immutable entries didn't need to be continuously updated while mutable ones would have to be rechecked. I also didn't want to hold a lock on the disk usage or perform a potentially expensive computation to count the size.

I do think it's likely worth making a new issue for cache sizes and adding some metrics to it. I'd say all of the above are good metrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to hold a lock on the disk usage or perform a potentially expensive computation to count the size.

Oh right that's a very good point, disk usage is indeed a resource intensive call.

@jsternberg jsternberg marked this pull request as draft January 4, 2024 21:32
@jsternberg
Copy link
Collaborator Author

Converting this back to a draft for a little bit. I want to iterate on the format of the metric before this would get merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

metrics for cache size / prune pauses
4 participants