-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
util: expose otel metrics through grpc and prometheus #4460
Conversation
f0f15ef
to
c233e37
Compare
go.mod
Outdated
go.opentelemetry.io/otel/sdk v1.21.0 | ||
go.opentelemetry.io/otel/trace v1.21.0 | ||
go.opentelemetry.io/proto/otlp v1.0.0 | ||
golang.org/x/crypto v0.14.0 | ||
golang.org/x/net v0.17.0 | ||
golang.org/x/sync v0.3.0 | ||
golang.org/x/sys v0.13.0 | ||
golang.org/x/sys v0.14.0 | ||
golang.org/x/time v0.3.0 | ||
google.golang.org/genproto/googleapis/rpc v0.0.0-20230711160842-782d3b101e98 | ||
google.golang.org/grpc v1.58.3 | ||
google.golang.org/genproto/googleapis/rpc v0.0.0-20230822172742-b8732ec3820d | ||
google.golang.org/grpc v1.59.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to bump otel deps for this feature? We tend to infer these updates through containerd, see #4318 (review)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to. I'm attempting to get this working with v1.19.0
since that was the first version where the metrics were considered stable. So that should be fine.
8586a33
to
a54d044
Compare
f025d9c
to
c52719b
Compare
I have a PR almost ready for some cache metrics, but I also found that opentelemetry has grpc metrics built in so I included them in this PR so this PR actually uses the metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some questions but overall LGTM
@@ -28,6 +29,8 @@ func setupDebugHandlers(addr string) error { | |||
bklog.G(req.Context()).Debugf("triggered GC from debug endpoint") | |||
})) | |||
|
|||
m.Handle("/metrics", promhttp.Handler()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering myself why not /debug/metrics
as route but this is part of the debug handler so good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's mostly because /metrics
is the typical endpoint for prometheus. I don't think it matters though so I can always move it to /debug/metrics
.
The main reason why it's in the debug handler is because buildkit doesn't expose an HTTP server otherwise. At the same time, this pathway is only relevant for buildkit when it's run in isolation. If buildkit is embedded, such as moby, we'll piggyback on their own /metrics
endpoint and be included there.
util/tracing/detect/detect.go
Outdated
// Register the prometheus reader if there was no error. | ||
// Don't fail if there was an error. | ||
// TODO(jsternberg): Maybe log the error here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think we should log the error using github.com/moby/buildkit/util/bklog
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot that I put this todo there. Good catch. I'll log the error as a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
util/tracing/detect/detect.go
Outdated
} | ||
|
||
sdktp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(sp), sdktrace.WithResource(res)) | ||
closers = append(closers, sdktp.Shutdown) | ||
if r, err := prometheus.New(); err == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC prometheus is a pull metrics exporter right? If we fail to register it and a scraper tries to pull, what happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it'll just include the defaults for prometheus. It just won't include any of our own metrics.
@@ -43,3 +60,36 @@ func otlpExporter() (sdktrace.SpanExporter, error) { | |||
|
|||
return otlptrace.New(context.Background(), c) | |||
} | |||
|
|||
func otlpMetricExporter() (sdkmetric.Exporter, error) { | |||
set := os.Getenv("OTEL_METRICS_EXPORTER") == "otlp" || os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") != "" || os.Getenv("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT") != "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR but I think we should start writing a dedicated "observability" docs? Atm we just have
Lines 765 to 777 in c2baaa9
## OpenTelemetry support | |
BuildKit supports [OpenTelemetry](https://opentelemetry.io/) for buildkitd gRPC | |
API and buildctl commands. To capture the trace to | |
[Jaeger](https://github.com/jaegertracing/jaeger), set `JAEGER_TRACE` | |
environment variable to the collection address. | |
```bash | |
docker run -d -p6831:6831/udp -p16686:16686 jaegertracing/all-in-one:latest | |
export JAEGER_TRACE=0.0.0.0:6831 | |
# restart buildkitd and buildctl so they know JAEGER_TRACE | |
# any buildctl command should be traced to http://127.0.0.1:16686/ | |
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I think that would probably help.
c52719b
to
378b588
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall SGTM
This exposes the otel metrics through the environment variables for the otel collector or through the prometheus `/metrics` endpoint through the debug address. This also adds metrics for the grpc endpoints from the opentelemetry libraries. For the opentelemetry collector, the metric temporality is delta which reduces the amount of data that goes over the wire overall. Prometheus metrics are cumulative because prometheus metrics require that. Signed-off-by: Jonathan A. Sternberg <[email protected]>
378b588
to
7de2e4f
Compare
moby/buildkit#4460 tries and fails to send metrics if OTEL_METRICS_EXPORTER is 'none' Signed-off-by: Tom Plant <[email protected]>
This exposes the otel metrics through the environment variables for the
otel collector or through the prometheus
/metrics
endpoint through thedebug address.
This also adds metrics for the grpc endpoints from the opentelemetry
libraries.
For the opentelemetry collector, the metric temporality is delta which
reduces the amount of data that goes over the wire overall. Prometheus
metrics are cumulative because prometheus metrics require that.