From 448561c17f2bf98054d2cc5b1154c32edae1d86f Mon Sep 17 00:00:00 2001
From: Yuri Shkuro
Date: Wed, 27 Nov 2024 23:48:03 -0400
Subject: [PATCH] Update Monitoring and Troubleshooting pages for v2 (#805)

Signed-off-by: Yuri Shkuro
---
 content/docs/next-release-v2/monitoring.md | 53 +++++++--------
 .../docs/next-release-v2/troubleshooting.md | 64 +++++--------------
 content/docs/next-release/troubleshooting.md | 17 +----
 scripts/cspell/project-words.txt | 1 +
 4 files changed, 47 insertions(+), 88 deletions(-)

diff --git a/content/docs/next-release-v2/monitoring.md b/content/docs/next-release-v2/monitoring.md
index 790060f8..ca228b25 100644
--- a/content/docs/next-release-v2/monitoring.md
+++ b/content/docs/next-release-v2/monitoring.md
@@ -6,43 +6,44 @@ hasparent: true
 
 Jaeger itself is a distributed, microservices based system. If you run it in production, you will likely want to setup adequate monitoring for different components, e.g. to ensure that the backend is not saturated by too much tracing data.
 
-## Metrics
+Please refer to [OpenTelemetry Collector documentation](https://opentelemetry.io/docs/collector/internal-telemetry/) for details on configuring the internal telemetry.
 
-By default Jaeger microservices expose metrics in Prometheus format. It is controlled by the following command line options:
+## Metrics
 
-* `--admin.http.host-port` the port number where the HTTP admin server is running
-* `--metrics-backend` controls how the measurements are exposed. The default value is `prometheus`, another option is `expvar`, the Go standard mechanism for exposing process level statistics.
-* `--metrics-http-route` specifies the name of the HTTP endpoint used to scrape the metrics (`/metrics` by default).
+Here's a sample `curl` call to obtain the metrics:
 
-Each Jaeger component exposes the metrics scraping endpoint on the admin port:
+```
+curl -s http://jaeger-collector:8888/metrics
+```
 
-Component | Port
---------------------- | ---
-**jaeger-agent** | 14271
-**jaeger-collector** | 14269
-**jaeger-query** | 16687
-**jaeger-ingester** | 14270
-**all-in-one** | 14269
+The following metrics are of special interest:
 
-### Prometheus monitoring mixin for Jaeger
+```
+otelcol_receiver_accepted_spans
+otelcol_receiver_refused_spans
 
-The Prometheus monitoring mixin for Jaeger provides a starting point for people wanting to monitor Jaeger using Prometheus, Alertmanager, and Grafana. This includes a prebuilt [dashboard](https://github.com/jaegertracing/jaeger/blob/master/monitoring/jaeger-mixin/dashboard-for-grafana.json). For more information, see [the documentation](https://github.com/jaegertracing/jaeger/tree/master/monitoring/jaeger-mixin).
+otelcol_exporter_sent_spans
+otelcol_exporter_send_failed_spans
+```
 
-## Logging
+The first two metrics describe how many spans are being received by Jaeger. The last two metrics indicate how many spans are being sent to the storage. Under normal conditions the `accepted` and `sent_spans` counters should be close to each other.
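+
+For a quick check you can filter these counters on the command line, for example (this assumes the default internal metrics port `8888` used in the `curl` call above):
+
+```
+# the accepted and sent counters should grow at a similar rate
+curl -s http://jaeger-collector:8888/metrics | grep -E 'otelcol_(receiver_accepted|exporter_sent)_spans'
+```
+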
-Jaeger components only log to standard out, using structured logging library [go.uber.org/zap](https://github.com/uber-go/zap) configured to write log lines as JSON encoded strings, for example:
+The labels on the metrics make it possible to distinguish between different receivers and exporters. For example, the first metric with all labels might look like this (formatted for readability):
 
-```json
-{"level":"info","ts":1615914981.7914007,"caller":"flags/admin.go:111","msg":"Starting admin HTTP server","http-addr":":14269"}
-{"level":"info","ts":1615914981.7914548,"caller":"flags/admin.go:97","msg":"Admin server started","http.host-port":"[::]:14269","health-status":"unavailable"}
+```
+otelcol_receiver_accepted_spans{
+    receiver="otlp",
+    service_instance_id="f91d66c2-0445-42bf-a062-32aaed09facf",
+    service_name="jaeger",
+    service_version="2.0.0",
+    transport="http"
+} 44
 ```
 
-The log level can be adjusted via `--log-level` command line switch; default level is `info`.
+## Logging
 
-## Traces
+By default, logs go to `stderr` in plain text format. For production deployments a log verbosity of `info` or `warning` is recommended.
 
-Jaeger has the ability to trace some of its own components, namely the requests to the Query service. For example, if you start `all-in-one` as described in [Getting Started](../getting-started/), and refresh the UI screen a few times, you will see `jaeger-all-in-one` populated in the Services dropdown. If you prefer not to see these traces in the Jaeger UI, you can disable them by running Jaeger backend components with `OTEL_TRACES_SAMPLER=always_off` environment variable, for example:
+## Traces
 
-```
-docker run -e OTEL_TRACES_SAMPLER=always_off -p 16686:16686 jaegertracing/all-in-one:{{< currentVersion >}}
-```
+Jaeger has the ability to trace some of its own components, namely the requests to the Query service. For example, if you start `all-in-one` as described in [Getting Started](../getting-started/), and refresh the UI screen a few times, you will see `jaeger` populated in the Services dropdown.
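+
+You can also verify this from the command line by querying the Jaeger query API for the list of known services (the host and port below assume a default local deployment):
+
+```
+# the returned list should include "jaeger" once Jaeger has traced some of its own requests
+curl -s http://localhost:16686/api/services
+```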
diff --git a/content/docs/next-release-v2/troubleshooting.md b/content/docs/next-release-v2/troubleshooting.md
index 9745b5c0..b4101dae 100644
--- a/content/docs/next-release-v2/troubleshooting.md
+++ b/content/docs/next-release-v2/troubleshooting.md
@@ -20,24 +20,13 @@ If you are using OpenTelemetry SDKs, they should default to `parentbased_always_
 
 OpenTelemetry SDKs can be configured with an exporter that prints recorded spans to `stdout`. Enabling it allows you to verify if the spans are actually being recorded.
 
-#### Use the logging reporter
-
-Most Jaeger SDKs are able to log the spans that are being reported to the logging facility provided by the instrumented application. Typically, this can be done by setting the environment variable `JAEGER_REPORTER_LOG_SPANS` to `true`, but refer to the Jaeger SDK's documentation for the language you are using. In some languages, specifically in Go and Node.js, there are no de-facto standard logging facilities, so you need to explicitly pass a logger to the SDK that implements a very narrow `Logger` interface defined by the Jaeger SDKs. When using the Jaeger SDK for Java, spans are reported like the following:
-
-    2018-12-10 17:20:54 INFO LoggingReporter:43 - Span reported: e66dc77b8a1e813b:6b39b9c18f8ef082:a56f41e38ca449a4:1 - getAccountFromCache
-
-The log entry above contains three IDs: the trace ID `e66dc77b8a1e813b`, the span ID `6b39b9c18f8ef082` and the span's parent ID `a56f41e38ca449a4`. When the backend components have the log level set to `debug`, the span and trace IDs should be visible on their standard output (see [Increase the logging in the backend components](#increase-the-logging-in-the-backend-components) below).
-
-The logging reporter follows the sampling decision made by the sampler, meaning that if the span is logged, it should also reach the backend.
-
 ### Remote Sampling
 
 The Jaeger backend supports [Remote Sampling](../sampling/#remote-sampling), i.e., configuring sampling strategies centrally and making them available to the SDKs. Some, but not all, OpenTelemetry SDKs support remote sampling, often via extensions (refer to [Migration to OpenTelemetry](../../../sdk-migration/#migration-to-opentelemetry) for details).
 
 If you suspect the remote sampling is not working correctly, try these steps:
 
-1. Make sure that the SDK is actually configured to use remote sampling, points to the correct sampling service address (see [APIs](../apis/#remote-sampling-configuration)), and that address is reachable from your application's [networking namespace](#networking-namespace).
-1. Look at the root span of the traces that are captured in Jaeger. If you are using Jaeger SDKs, the root span will contain the tags `sampler.type` and `sampler.param`, which indicate which strategy was used. (TBD - do OpenTelemetry SDKs record that?)
+1. Make sure that the SDK is actually configured to use remote sampling, points to the correct sampling service address (see [APIs](../apis/#remote-sampling-configuration)), and that address is reachable from your application's [networking namespace](#network-connectivity).
 1. Verify that the server is returning the appropriate sampling strategy for your service:
    ```
    $ curl "jaeger-collector:14268/api/sampling?service=foobar"
    {"strategyType":"PROBABILISTIC","probabilisticSampling":{"samplingRate":0.001}}
    ```
@@ -48,56 +37,35 @@ If your applications are not sending data directly to Jaeger but to intermediate layers, for example an OpenTelemetry Collector running as a host agent, try configuring the SDK to send data directly to Jaeger to narrow down the problem space.
 
-## Networking Namespace
-
-If your Jaeger backend is still not able to receive spans (see the following sections on how to check logs and metrics for that), then the issue is most likely with your networking namespace configuration. When running the Jaeger backend components as Docker containers, the typical mistakes are:
-
- * Not exposing the appropriate ports outside of the container. For example, the collector may be listening on `:14268` inside the container network namespace, but the port is not reachable from the outside.
- * Not making **jaeger-agent**'s or **jaeger-collector**'s host name visible from the application's network namespace. For example, if you run both your application and Jaeger backend in separate containers in Docker, they either need to be in the same namespace, or the application's container needs to be given access to Jaeger backend using the `--link` option of the `docker` command.
+## Network connectivity
 
-## Increase the logging in the backend components
+If your Jaeger backend is still not able to receive spans (see the following sections on how to check logs and metrics for that), then the issue is most likely with your networking namespace configuration. When running the Jaeger backend components as containers, the typical mistakes are:
 
-**jaeger-agent** and **jaeger-collector** provide useful debugging information when the log level is set to `debug`. Every UDP packet that is received by **jaeger-agent** is logged, as well as every batch that is sent by **jaeger-agent** to **jaeger-collector**. **jaeger-collector** also logs every batch it receives and logs every span that is stored in the permanent storage.
+ * Not exposing the appropriate ports outside of the container. For example, the collector may be listening on `:4317` inside the container network namespace, but the port is not reachable from the outside (see the example after this list).
+ * Using `localhost` as the host name for server endpoints. `localhost` is fine when running on bare metal, but in a container it is recommended to listen on `0.0.0.0` instead.
+ * Not making Jaeger's host name visible from the application's network namespace. For example, if you run both your application and Jaeger backend in separate containers in Docker, they either need to be in the same namespace, or the application's container needs to be given access to Jaeger backend using the `--link` option of the `docker` command.
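+
+For example, when running Jaeger as a single Docker container, the relevant default ports can be published explicitly (the image name, tag, and port list below are illustrative; adjust them to your deployment):
+
+```
+# 16686 = web UI, 4317/4318 = OTLP receivers, 8888 = internal metrics
+docker run --rm --name jaeger \
+  -p 16686:16686 -p 4317:4317 -p 4318:4318 -p 8888:8888 \
+  jaegertracing/jaeger:{{< currentVersion >}}
+```
+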
-Here's what to expect when **jaeger-agent** is started with the `--log-level=debug` flag:
+## Increase the log verbosity
 
-    {"level":"debug","ts":1544458854.5367086,"caller":"processors/thrift_processor.go:113","msg":"Span(s) received by the agent","bytes-received":359}
-    {"level":"debug","ts":1544458854.5408711,"caller":"tchannel/reporter.go:133","msg":"Span batch submitted by the agent","span-count":3}
+Jaeger provides useful debugging information when the log level is set to `debug`. See [Monitoring](../monitoring/#logging) for more details on increasing logging verbosity.
 
-On the **jaeger-collector** side, these are the expected log entries when the flag `--log-level=debug` is specified:
-
-    {"level":"debug","ts":1544458854.5406284,"caller":"app/span_handler.go:90","msg":"Span batch processed by the collector.","ok":true}
-    {"level":"debug","ts":1544458854.5406587,"caller":"app/span_processor.go:105","msg":"Span written to the storage by the collector","trace-id":"e66dc77b8a1e813b","span-id":"6b39b9c18f8ef082"}
-    {"level":"debug","ts":1544458854.54068,"caller":"app/span_processor.go:105","msg":"Span written to the storage by the collector","trace-id":"e66dc77b8a1e813b","span-id":"d92976b6055e6779"}
-    {"level":"debug","ts":1544458854.5406942,"caller":"app/span_processor.go:105","msg":"Span written to the storage by the collector","trace-id":"e66dc77b8a1e813b","span-id":"a56f41e38ca449a4"}
 
 ## Check the /metrics endpoint
 
-For the cases where it's not possible or desirable to increase the logging on the **jaeger-collector** side, the `/metrics` endpoint can be used to check if spans for specific services are being received. The `/metrics` endpoint is served from the admin port, which is different for each binary (see [Deployment](../deployment/)). Assuming that **jaeger-collector** is available under a host named `jaeger-collector`, here's a sample `curl` call to obtain the metrics:
-
-    curl http://jaeger-collector:14269/metrics
-
-The following metrics are of special interest:
+For the cases where it's not possible or desirable to increase the logging verbosity, the `/metrics` endpoint can be used to check how trace data is being received and processed by Jaeger. See [Monitoring](../monitoring/#metrics) for more details on how the metrics are produced. Here's a sample `curl` call to obtain the metrics:
 
-    jaeger_collector_spans_received
-    jaeger_collector_spans_saved_by_svc
-    jaeger_collector_traces_received
-    jaeger_collector_traces_saved_by_svc
-
-The first two metrics should have similar values for the same service. Similarly, the two `traces` metrics should also have similar values. For instance, this is an example of a setup that is working as expected:
+```
+curl -s http://jaeger-collector:8888/metrics
+```
 
-    jaeger_collector_spans_received{debug="false",format="jaeger",svc="order"} 8
-    jaeger_collector_spans_saved_by_svc{debug="false",result="ok",svc="order"} 8
-    jaeger_collector_traces_received{debug="false",format="jaeger",svc="order"} 1
-    jaeger_collector_traces_saved_by_svc{debug="false",result="ok",svc="order"} 1
+If Jaeger is able to receive traces, the counter `otelcol_receiver_accepted_spans` should be going up. If it is able to successfully write traces into storage, the counter `otelcol_exporter_sent_spans` should also be going up at the same rate.
 
-## Istio: missing spans
+## Service Mesh: missing spans
 
-When deploying your application as part of a service mesh like Istio, the number of moving parts increases significantly and might affect how (and which) spans are reported. If you expect to see spans generated by Istio but they aren't being visible in the Jaeger UI, check the troubleshooting guide on [Istio's website](https://istio.io/faq/distributed-tracing/#no-tracing).
+When deploying your application as part of a service mesh like Istio, the number of moving parts increases significantly and might affect how (and which) spans are reported. If you expect to see spans generated by a service mesh but they are not visible in the Jaeger UI, check the troubleshooting guides for the service mesh you are using, for example on [Istio's website](https://istio.io/faq/distributed-tracing/#no-tracing).
 
 ## Run debug images of the backend components
 
-We provide debug images for each Jaeger component. These images have [delve](https://github.com/go-delve/delve) and respective Jaeger component compiled with optimizations disabled. When you run these images, delve triggers the execution of the Jaeger component as its child process and immediately attaches to it to begin a new debug session and start listening on TCP port 12345 for remote connections. You can then use your IDEs like [Visual Studio Code](https://code.visualstudio.com/) or [GoLand](https://www.jetbrains.com/go/) to connect to this port and attach with it remotely and perform [debugging](https://golangforall.com/en/post/go-docker-delve-remote-debug.html) by adding breakpoints.
+We provide debug images for Jaeger, which include the Jaeger binary compiled with optimizations disabled and the [delve debugger](https://github.com/go-delve/delve). When you run these images, delve starts Jaeger as its child process, immediately attaches to it to begin a new debug session, and listens on TCP port 12345 for remote connections. You can then use an IDE such as [Visual Studio Code](https://code.visualstudio.com/) or [GoLand](https://www.jetbrains.com/go/) to connect to this port, attach to the process remotely, and perform [debugging](https://golangforall.com/en/post/go-docker-delve-remote-debug.html) by adding breakpoints.
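+
+If you prefer to work from a terminal instead of an IDE, the same port can also be reached with the delve command-line client, for example (this assumes `dlv` is installed locally):
+
+```
+# connect to the remote debug session exposed by the debug image on port 12345
+dlv connect localhost:12345
+```
+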
 For Visual Studio Code, you need to have the following configuration at the root of your local clone of the Jaeger source code:
diff --git a/content/docs/next-release/troubleshooting.md b/content/docs/next-release/troubleshooting.md
index f15cd9aa..63365f9e 100644
--- a/content/docs/next-release/troubleshooting.md
+++ b/content/docs/next-release/troubleshooting.md
@@ -33,16 +33,6 @@ For example, when using the Jaeger SDK for Java, the strategy is usually printed
 
     2018-12-10 16:41:25 INFO Configuration:236 - Initialized tracer=JaegerTracer(..., sampler=ConstSampler(decision=true, tags={sampler.type=const, sampler.param=true}), ...)
 
-#### Use the logging reporter
-
-Most Jaeger SDKs are able to log the spans that are being reported to the logging facility provided by the instrumented application. Typically, this can be done by setting the environment variable `JAEGER_REPORTER_LOG_SPANS` to `true`, but refer to the Jaeger SDK's documentation for the language you are using. In some languages, specifically in Go and Node.js, there are no de-facto standard logging facilities, so you need to explicitly pass a logger to the SDK that implements a very narrow `Logger` interface defined by the Jaeger SDKs. When using the Jaeger SDK for Java, spans are reported like the following:
-
-    2018-12-10 17:20:54 INFO LoggingReporter:43 - Span reported: e66dc77b8a1e813b:6b39b9c18f8ef082:a56f41e38ca449a4:1 - getAccountFromCache
-
-The log entry above contains three IDs: the trace ID `e66dc77b8a1e813b`, the span ID `6b39b9c18f8ef082` and the span's parent ID `a56f41e38ca449a4`. When the backend components have the log level set to `debug`, the span and trace IDs should be visible on their standard output (see [Increase the logging in the backend components](#increase-the-logging-in-the-backend-components) below).
-
-The logging reporter follows the sampling decision made by the sampler, meaning that if the span is logged, it should also reach the backend.
-
 ### Remote Sampling
 
 The Jaeger backend supports [Remote Sampling](../sampling/#remote-sampling), i.e., configuring sampling strategies centrally and making them available to the SDKs. Some, but not all, OpenTelemetry SDKs support remote sampling, often via extensions (refer to [Migration to OpenTelemetry](../../../sdk-migration/#migration-to-opentelemetry) for details).
@@ -50,7 +40,6 @@ The Jaeger backend supports [Remote Sampling](../sampling/#remote-sampling), i.e
 
 If you suspect the remote sampling is not working correctly, try these steps:
 
 1. Make sure that the SDK is actually configured to use remote sampling, points to the correct sampling service address (see [APIs](../apis/#remote-sampling-configuration-stable)), and that address is reachable from your application's [networking namespace](#networking-namespace).
-1. Look at the root span of the traces that are captured in Jaeger. If you are using Jaeger SDKs, the root span will contain the tags `sampler.type` and `sampler.param`, which indicate which strategy was used. (TBD - do OpenTelemetry SDKs record that?)
 1. Verify that the server is returning the appropriate sampling strategy for your service:
    ```
    $ curl "jaeger-collector:14268/api/sampling?service=foobar"
    {"strategyType":"PROBABILISTIC","probabilisticSampling":{"samplingRate":0.001}}
    ```
@@ -61,12 +50,12 @@ If you suspect the remote sampling is not working correctly, try these steps:
 
 If your Jaeger backend is still not able to receive spans (see the following sections on how to check logs and metrics for that), then the issue is most likely with your networking namespace configuration. When running the Jaeger backend components as Docker containers, the typical mistakes are:
 
- * Not exposing the appropriate ports outside of the container. For example, the collector may be listening on `:14268` inside the container network namespace, but the port is not reachable from the outside.
+ * Not exposing the appropriate ports outside of the container. For example, the collector may be listening on `:4317` inside the container network namespace, but the port is not reachable from the outside.
  * Not making **jaeger-collector**'s host name visible from the application's network namespace. For example, if you run both your application and Jaeger backend in separate containers in Docker, they either need to be in the same namespace, or the application's container needs to be given access to Jaeger backend using the `--link` option of the `docker` command.
 
 ## Increase the logging in the backend components
 
-**jaeger-collector** provides useful debugging information when the log level is set to `debug`. **jaeger-collector** logs every batch it receives and logs every span that is stored in the permanent storage.
+Jaeger provides useful debugging information when the log level is set to `debug`. **jaeger-collector** logs information about every batch it receives and every span that is stored in the permanent storage.
 
 On the **jaeger-collector** side, these are the expected log entries when the flag `--log-level=debug` is specified:
 
@@ -77,7 +66,7 @@ On the **jaeger-collector** side, these are the expected log entries when the fl
 
 ## Check the /metrics endpoint
 
-For the cases where it's not possible or desirable to increase the logging on the **jaeger-collector** side, the `/metrics` endpoint can be used to check if spans for specific services are being received. The `/metrics` endpoint is served from the admin port, which is different for each binary (see [Deployment](../deployment/)). Assuming that **jaeger-collector** is available under a host named `jaeger-collector`, here's a sample `curl` call to obtain the metrics:
+For the cases where it's not possible or desirable to increase the logging in the Jaeger backend, the `/metrics` endpoint can be used to check if spans for specific services are being received. The `/metrics` endpoint is served from the admin port, which is different for each binary (see [Deployment](../deployment/)). Assuming that **jaeger-collector** is available under a host named `jaeger-collector`, here's a sample `curl` call to obtain the metrics:
 
     curl http://jaeger-collector:14269/metrics
 
diff --git a/scripts/cspell/project-words.txt b/scripts/cspell/project-words.txt
index 328c8e33..7f232b4a 100644
--- a/scripts/cspell/project-words.txt
+++ b/scripts/cspell/project-words.txt
@@ -74,6 +74,7 @@ opentracing
 openzipkin
 operatorhub
 otel
+otelcol
 otlp
 outreachy
 parentbased