Added specification for OpenTelemetry metric shipping #742

Merged: 9 commits from otel-metrics into elastic:main on Apr 27, 2023

Conversation

JonasKunz (Contributor) commented on Jan 4, 2023

  • May the instrumentation collect sensitive information, such as secrets or PII (ex. in headers)?
    • Yes
      • Add a section to the spec on how agents should apply sanitization (such as sanitize_field_names)
    • No
      • Why? -> The agent only collects what the user specifies, so I would not consider sanitization a requirement on our side. In addition, PII in labels is a terrible idea anyway due to cardinality.
    • n/a
  • Create PR as draft
  • Approval by at least one other agent
  • Mark as Ready for Review (automatically requests reviews from all agents and PM via CODEOWNERS)
    • Remove PM from reviewers if impact on product is negligible
    • Remove agents from reviewers if the change is not relevant for them
  • Approved by at least 2 agents + PM (if relevant)
  • Merge after 7 days passed without objections
    To auto-merge the PR, add /schedule YYYY-MM-DD to the PR description.
  • Create implementation issues through the meta issue template (this will automate issue creation for individual agents)
  • If this spec adds a new dynamic config option, add it to central config.


OpenTelemetry metric names are allowed to contain dots. This can lead to mapping problems in Elasticsearch if, for example, metrics named `foo.bar` and `foo` are both ingested.
Due to the nested object representation within metricsets, `foo` would need to be mapped as both an object and a number at the same time, which is not possible.

To avoid such conflicts, agents MUST replace all dots in metric names with `_` upon export if, and only if, the `dedot_custom_metrics` configuration option is set to `true`:
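As a minimal illustration (not part of the spec text), de-dotting is just a string replacement applied to the metric name before export; the helper name below is hypothetical:

```js
// Hypothetical helper: replace dots in a metric name when dedot_custom_metrics is enabled.
function dedotMetricName (name, dedotCustomMetrics) {
  return dedotCustomMetrics ? name.replace(/\./g, '_') : name
}

dedotMetricName('foo.bar', true)  // -> 'foo_bar'
dedotMetricName('foo.bar', false) // -> 'foo.bar'
```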
JonasKunz (Contributor, Author):

The current spec text here is supposed to act more as a starting point for a discussion.

The Java agent already has the dedot_custom_metrics configuration, which is on by default. This currently only affects Micrometer metrics, as that is the only way of collecting custom metrics at the moment.

On the other side, the APM Server does not perform any de-dotting of metric names when receiving metrics via OTLP.

This means that with the current proposal, the transition from sending metrics via the OTLP exporter to shipping them via the agent would not be seamless: because dedot_custom_metrics is on by default, the metric names would change.

There are some alternatives to consider:

  • Add a dedot_custom_metrics option to the APM Server and make it on by default (breaking change)
  • Change the default of dedot_custom_metrics in the Java agent to disabled (breaking change)
  • Split dedot_custom_metrics in the Java agent into two different config options, one for OTel and one for Micrometer. A terrible idea in my opinion, because I don't see a use case where these two would be configured differently.

Please use this discussion to vote for what you think is best or to propose alternative solutions.

JonasKunz (Contributor, Author):

After some more thinking and offline discussions, what about the following "optimistic" solution:
In the mid to long term, we want to get rid of the de-dotting in general by mapping metricsets in Elasticsearch using subobjects: false. Until we get there, we should aim to minimize breaking changes.

So for all agents which do not have the dedot_custom_metrics config yet (= everyone but Java):

  • Agents MAY add the dedot_custom_metrics configuration option, but don't need to (it can be added later on demand if users run into issues)
  • If dedot_custom_metrics is added, the default MUST be false -> no breaking change required later on when we fully support dots in names

For all agents which already have the dedot_custom_metrics configuration (= the Java agent):

  • Keep the configuration dedot_custom_metrics with its default as is and respect it for OTel metrics
  • As soon as we fully support dots in metric names, switch the default to false (breaking but mitigable change)

Opinions?

Member:

SGTM, but proposing a slight variation that minimizes breaking changes in the future:

  • Ignore the dedot_custom_metrics flag for OTel metrics in the Java agent and don't implement it in other agents. As this can lead to data loss, we probably can't GA the feature (debatable, as we did GA the OTLP metrics intake).
  • Work on support for subobjects: false: we could consider flattening the metric documents in APM Server, try getting elastic/elasticsearch#88934 prioritized, or contribute this feature to ES (👨‍🚀⏳?).

Then we only need to introduce the breaking change of changing the default of dedot_custom_metrics for Micrometer. I suppose that we won't really need the dedot_custom_metrics flag anymore at that point, so we can deprecate it when we change the default to false and eventually remove it.

JonasKunz (Contributor, Author):

> we probably can't GA the feature (debatable as we did GA OTLP metrics intake)

Because we create separate metricsets for the OTel data instead of merging them with, e.g., the agent-collected JVM metricsets, I think the mapping conflicts would not be too bad, at least not worse than for the OTLP intake. E.g. we would only lose OTel metrics data; all other metrics would continue to work normally. So I'm not sure whether this really is a GA blocker.

| `foo_bar` | `baz` | `string` |
| `testnum_` | `42.42` | `number` |

The OpenTelemetry specification allows the definition of metrics with the same name as long as they reside within a different instrumentation scope ([spec link](https://opentelemetry.io/docs/reference/specification/metrics/api/#get-a-meter)). In order to avoid conflicts, agents MUST add a mandatory label to the metricset with the key `otel_instrumentation_scope_name` and the corresponding instrumentation scope name as label value.
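Purely as an illustration of the intended outcome (the metricset shape below is a sketch, not the exact intake schema), two same-named metrics from different instrumentation scopes would end up in separate metricsets, distinguished by the label:

```js
// Sketch only: two metrics named 'my_counter' from different scopes,
// kept apart via the otel_instrumentation_scope_name label.
const metricsets = [
  {
    samples: { my_counter: { value: 317 } },
    tags: { otel_instrumentation_scope_name: 'my-meter' }
  },
  {
    samples: { my_counter: { value: 317 } },
    tags: { otel_instrumentation_scope_name: 'my-meter2' }
  }
]
```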
JonasKunz (Contributor, Author):

This is another difference from how the APM Server currently handles metric data received via OTLP.
The APM Server will create a different metricset per instrumentation scope, but won't add any label to distinguish them (to my knowledge).
This means that when displaying the metrics in Kibana, metrics with the same name will get mixed up even if they come from different instrumentation scopes.

It would be great if @elastic/apm-server could clarify whether I'm missing something or whether my proposed label makes sense and should maybe also be added on the OTLP intake side.

Member:

Some notes on this.

  1. https://opentelemetry.io/docs/reference/specification/metrics/api/#get-a-meter and https://opentelemetry.io/docs/reference/specification/glossary/#instrumentation-scope are not very clear, but one could read from them that the instrumentation scope version (and maybe schema_url) fields may also be relevant for uniquely identifying a scope.
  2. FWIW, looking at the Prometheus exporter from the OTel JS SDK, the instrumentation scope is not included anywhere in the output. I created a JS script that has two my_counter metrics with different instrumentation scopes:
const otel = require('@opentelemetry/api') // needed for otel.metrics.getMeter below
// ... (Metrics SDK and Prometheus exporter setup elided)

const meter = otel.metrics.getMeter('my-meter') // 'my-meter' is the instrumentation scope name
const counter = meter.createCounter('my_counter', { description: 'My Counter' })

const meter2 = otel.metrics.getMeter('my-meter2')
const counter2 = meter2.createCounter('my_counter', { description: 'My Counter 2' })

setInterval(() => {
  counter.add(1)
  counter2.add(1)
}, 1000)

The resulting exported Prom metrics:

% curl localhost:3002/metrics
# HELP target_info Target metadata
# TYPE target_info gauge
target_info{service_name="use-otel-prom",telemetry_sdk_language="nodejs",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="1.9.1"} 1
# HELP my_counter_total My Counter
# TYPE my_counter_total counter
my_counter_total 317 1679701732718
# HELP my_counter_total My Counter 2
# TYPE my_counter_total counter
my_counter_total 317 1679701732718

So I wonder if this is a conflict we could come back to later, if/when there is a real use case.

  3. If we were to add an intake field or label for this, https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/mapping-to-non-otlp.md#instrumentationscope describes otel.scope.name and otel.scope.version. I'm not sure if that applies here.

JonasKunz (Contributor, Author):

> https://opentelemetry.io/docs/reference/specification/metrics/api/#get-a-meter and https://opentelemetry.io/docs/reference/specification/glossary/#instrumentation-scope are not very clear, but one could read from those that the instrumentation scope version (and maybe schema_url) fields may also be relevant for uniquely identifying a scope.

It is true that the instrumentation scope name alone might not be good enough to prevent clashes in all cases. It is good enough to prevent cross-library clashes (e.g. library A and B happen to export a metric with the same name), but it won't handle the case where you have multiple versions of the same library with breaking changes in their metrics.

My thought here was to start simple and to leave the option open to maybe just use a single label (name and version concatenated) instead of adding both immediately.

> FWIW, looking at the Prometheus exporter from the OTel JS SDK, the instrumentation scope is not included anywhere in the output. I created a JS script that has two my_counter metrics with different instrumentation scopes.
> So I wonder if this is a conflict we could come back to later, if/when there is a real use case.

Interesting observation that this is currently not even handled by the OTel Prometheus exporter.
We can definitely just ignore the instrumentation scope for now and add a label later on, as it would be a non-breaking change.

We should, however, in my opinion keep in the spec that instrumentation scopes have to use different metricsets, even if they have the same labels. Otherwise we would need to decide on a "winner" metric if we encounter a clash.

Member:

In the meeting today we discussed using otel.scope.name and otel.scope.version and perhaps adding those as top-level fields (@gregkalapos volunteered).

@Mpdreamz mentioned (if I understood correctly) that Metricbeat uses the ECS observer.name field (https://www.elastic.co/guide/en/ecs/current/ecs-observer.html#field-observer-name) for a similar purpose?

JonasKunz (Contributor, Author):

I've left the concrete addition of the otel.scope.name and otel.scope.version fields for the future, so that this doesn't block merging this spec:

> Agents MUST report metrics from different instrumentation scopes in separate metricsets to avoid naming conflicts at collection time. This separation MUST be done based on the instrumentation scope name and version. In the future, we might add dedicated intake fields to metricsets for differentiating them based on the instrumentation scope identifiers.

@JonasKunz changed the title from "Added initial proposal for otel metrics spec" to "Added specification for OpenTelemetry metric shipping" on Jan 4, 2023
## Labels

Conceptually, Elastic metric label keys correspond to [OpenTelemetry Attributes](https://opentelemetry.io/docs/reference/specification/common/#attribute).
When exporting, agents MUST convert the attributes to labels as described in this section and MUST group metrics with the same attributes and OpenTelemetry instrumentation scope together into a single metricset.
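A rough sketch of that grouping rule (helper and field names are hypothetical, and a stable serialization of the attribute set is assumed for the grouping key):

```js
// Hypothetical sketch: group metric samples into metricsets keyed by
// instrumentation scope + attribute set, converting attributes to labels.
function groupIntoMetricsets (metrics) {
  // metrics: [{ name, value, scopeName, attributes }]
  const metricsets = new Map()
  for (const m of metrics) {
    const key = JSON.stringify([m.scopeName, m.attributes]) // assumes stable attribute key order
    if (!metricsets.has(key)) {
      metricsets.set(key, { tags: { ...m.attributes }, samples: {} })
    }
    metricsets.get(key).samples[m.name] = { value: m.value }
  }
  return [...metricsets.values()]
}
```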
Member:

For the OTel tracing bridge, we introduced an otel.attributes property in the intake v2 spec: https://github.com/elastic/apm/blob/main/specs/agents/tracing-api-otel.md#attributes-mapping.

APM Server then maps known attributes to ECS and unknown attributes to labels. I wonder if we should do the same for metrics. Does APM Server currently map OTel metric attributes to ECS fields?

Maybe this isn't as relevant to metrics as it is to traces if the metric attributes are mostly used for custom user-defined attributes.

Contributor:

The server maps the metadata and some known and defined metrics.

gregkalapos (Contributor) left a comment:

I did a first parse, great progress on this 👍 .

A few minor comments below.

For **1.** we want to make sure that the existing user configuration is respected by our apm agents. Ideally, agents SHOULD just register an additional exporter to the existing OpenTelemetry Metrics SDK instance(s). If the agent and language capabilities allow it, the exporter registration SHOULD be done automatically without requiring code changes by the user.

For **2.** agents MAY automatically register an agent-provided SDK instance to bind the user provided OpenTelemetry API to, if this is possible in their language and does not cause too much overhead of any kind (e.g. implementation or agent package size).
Agents MUST NOT override a user-provided global OpenTelemetry metrics SDK with their own SDK.
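As an illustration of the "additional exporter" approach in a Node.js flavour (a sketch, not the agents' actual implementation: the exporter passed in is assumed to be an agent-internal one that ships metricsets to APM Server, and the SDK call shown is the @opentelemetry/sdk-metrics API of that era):

```js
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics')

// apmMetricExporter is a hypothetical agent-internal PushMetricExporter that
// converts OTel metric data into Elastic metricsets and sends them to APM Server.
function attachAgentExporter (userMeterProvider, apmMetricExporter) {
  // Register an additional reader on the user's MeterProvider; any exporters
  // the user already configured keep working side by side.
  userMeterProvider.addMetricReader(new PeriodicExportingMetricReader({
    exporter: apmMetricExporter,
    exportIntervalMillis: 30000
  }))
}
```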
Member:

[minor] Maybe worth mentioning that this is different from the OTel bridge, where the agent overrides the values returned by GlobalOpenTelemetry with its own implementation. Here, with the metrics SDK, the agent just adds an extra exporter at the last step of the metrics pipeline (potentially replacing another one that sends metrics to another backend).

JonasKunz (Contributor, Author):

> Maybe worth mentioning that this is different from the OTel bridge where the agent overrides the values returned by GlobalOpenTelemetry with its own implementation.

Isn't GlobalOpenTelemetry something Java-specific? It is just the Java approach of binding the OpenTelemetry API to an implementation (e.g. an SDK or our bridge).

> Here with metrics SDK the agent just adds an extra exporter at the last step of the metrics pipeline (potentially replacing another sending metrics to another backend)

I don't quite understand what you mean here. When we add our exporter, we don't replace existing exporters on the user-provided metrics SDK instance. Instead, both exporters live happily side by side.

@JonasKunz JonasKunz merged commit 29fd2d3 into elastic:main Apr 27, 2023
@JonasKunz JonasKunz deleted the otel-metrics branch April 27, 2023 07:10
trentm added a commit to elastic/apm-agent-nodejs that referenced this pull request Apr 28, 2023
With this change the APM agent adds support for using OTel Metrics in two ways:

1. The `@opentelemetry/api` package is instrumented so that usage of
   `.metrics.getMeterProvider(...)` will use one provided by the APM agent,
   if the user hasn't explicitly provided one of their own. This implementation
   uses a version of the OTel Metrics SDK included with the APM agent. It
   is configured to send metrics to APM server by default. This allows a user
   to use the OTel Metrics API and get metrics sent to APM server without any
   configuration.
2. Alternatively, the `@opentelemetry/sdk-metrics` package is instrumented so
   that when a user creates a `MeterProvider`, the APM agent will add a
   `MetricReader` that sends metrics to APM server. This allows a user to
   configure metrics as they wish using the OTel Metrics SDK and then
   automatically get those metrics sent to APM server.

This also adds some grouping in ".ci/tav.json" for the TAV workflow to avoid the
256-job GH Actions limit. I'm not yet sure if those will work well.

Closes: #2954
Refs: elastic/apm#691
Refs: elastic/apm#742 (spec PR)
trentm added a commit to elastic/apm-agent-nodejs that referenced this pull request May 16, 2023
fpm-peter pushed a commit to fpm-git/apm-agent-nodejs that referenced this pull request Aug 20, 2024