Added specification for OpenTelemetry metric shipping #742

Merged: 9 commits from otel-metrics into elastic:main on Apr 27, 2023

Conversation

JonasKunz (Contributor) commented on Jan 4, 2023

  • May the instrumentation collect sensitive information, such as secrets or PII (ex. in headers)?
    • Yes
      • Add a section to the spec on how agents should apply sanitization (such as sanitize_field_names)
    • No
      • Why? -> The agent only collects what the user specifies, so I would not consider sanitization a requirement on our side. In addition, PII in labels is a terrible idea anyway due to cardinality.
    • n/a
  • Create PR as draft
  • Approval by at least one other agent
  • Mark as Ready for Review (automatically requests reviews from all agents and PM via CODEOWNERS)
    • Remove PM from reviewers if impact on product is negligible
    • Remove agents from reviewers if the change is not relevant for them
  • Approved by at least 2 agents + PM (if relevant)
  • Merge after 7 days passed without objections
    To auto-merge the PR, add /schedule YYYY-MM-DD to the PR description.
  • Create implementation issues through the meta issue template (this will automate issue creation for individual agents)
  • If this spec adds a new dynamic config option, add it to central config.


OpenTelemetry metric names are allowed to contain dots. This can lead to mapping problems in Elasticsearch if, for example, metrics named `foo.bar` and `foo` are both ingested.
Due to the nested object representation within metricsets, `foo` would need to be mapped as both an object and a number at the same time, which is not possible.

To avoid such conflicts, agents MUST replace all dots in metric names with `_` upon export if, and only if, the `dedot_custom_metrics` configuration option is set to `true`:
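As a minimal illustration (not part of the spec text), de-dotting is just a string replacement applied to the metric name before export; the helper name below is hypothetical:

```js
// Hypothetical helper: replace dots in a metric name when dedot_custom_metrics is enabled.
function dedotMetricName (name, dedotCustomMetrics) {
  return dedotCustomMetrics ? name.replace(/\./g, '_') : name
}

dedotMetricName('foo.bar', true)  // -> 'foo_bar'
dedotMetricName('foo.bar', false) // -> 'foo.bar'
```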
JonasKunz (Contributor, Author):

The current spec text here is supposed to act more as a starting point for a discussion.

The Java agent already has the dedot_custom_metrics configuration, which is on by default. This currently only affects Micrometer metrics, as that is the only way of collecting custom metrics at the moment.

On the other side, the APM Server does not perform any de-dotting of metric names when receiving metrics via OTLP.

This means that with the current proposal, the transition from sending metrics via the OTLP exporter to shipping them via the agent would not be seamless: because dedot_custom_metrics is on by default, the metric names would change.

There are some alternatives to consider:

  • Add a dedot_custom_metrics option to the APM Server and make it on by default (breaking change)
  • Change the default of dedot_custom_metrics in the Java agent to disabled (breaking change)
  • Split dedot_custom_metrics in the Java agent into two different config options, one for OTel and one for Micrometer. A terrible idea in my opinion, because I don't see a use case where these two would be configured differently.

Please use this discussion to vote for what you think is best or to propose alternative solutions.

JonasKunz (Contributor, Author):

After some more thinking and offline discussions, what about the following "optimistic" solution:
In the mid to long term, we want to get rid of the de-dotting in general by mapping metricsets in Elasticsearch using subobjects: false. Until we get there, we should aim to minimize breaking changes.

So for all agents which do not have the dedot_custom_metrics config yet (= everyone but Java):

  • Agents MAY add the dedot_custom_metrics configuration option, but don't need to (it can be added later on demand if users run into issues)
  • If dedot_custom_metrics is added, the default MUST be false -> no breaking change required later on when we fully support dots in names

For all agents which already have the dedot_custom_metrics configuration (= the Java agent):

  • Keep the configuration dedot_custom_metrics with its default as is and respect it for OTel metrics
  • As soon as we fully support dots in metric names, switch the default to false (breaking but mitigable change)

Opinions?

Member:

SGTM, but proposing a slight variation that minimizes breaking changes in the future:

  • Ignore the dedot_custom_metrics flag for OTel metrics in the Java agent and don't implement it in other agents. As this can lead to data loss, we probably can't GA the feature (debatable, as we did GA the OTLP metrics intake).
  • Work on support for subobjects: false: we could consider flattening the metric documents in APM Server, try getting elastic/elasticsearch#88934 prioritized, or contribute this feature to ES (👨‍🚀⏳?).

Then we only need to introduce the breaking change of changing the default of dedot_custom_metrics for Micrometer. I suppose that we won't really need the dedot_custom_metrics flag anymore at that point, so we can deprecate it when we change the default to false and eventually remove it.

JonasKunz (Contributor, Author):

> we probably can't GA the feature (debatable as we did GA OTLP metrics intake)

Because we create separate metricsets for the OTel data instead of merging them with, e.g., the agent-collected JVM metricsets, I think the mapping conflicts would not be too bad, at least not worse than for the OTLP intake. E.g. we would only lose OTel metrics data; all other metrics would continue to work normally. So I'm not sure whether this really is a GA blocker.

| `foo_bar` | `baz` | `string` |
| `testnum_` | `42.42` | `number` |

The OpenTelemetry specification allows the definition of metrics with the same name as long as they reside within a different instrumentation scope ([spec link](https://opentelemetry.io/docs/reference/specification/metrics/api/#get-a-meter)). In order to avoid conflicts, agents MUST add a mandatory label to the metricset with the key `otel_instrumentation_scope_name` and the corresponding instrumentation scope name as label value.
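Purely as an illustration of the intended outcome (the metricset shape below is a sketch, not the exact intake schema), two same-named metrics from different instrumentation scopes would end up in separate metricsets, distinguished by the label:

```js
// Sketch only: two metrics named 'my_counter' from different scopes,
// kept apart via the otel_instrumentation_scope_name label.
const metricsets = [
  {
    samples: { my_counter: { value: 317 } },
    tags: { otel_instrumentation_scope_name: 'my-meter' }
  },
  {
    samples: { my_counter: { value: 317 } },
    tags: { otel_instrumentation_scope_name: 'my-meter2' }
  }
]
```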
JonasKunz (Contributor, Author):

This is another difference from how the APM Server currently handles metric data received via OTLP.
The APM Server will create a different metricset per instrumentation scope, but won't add any label to distinguish them (to my knowledge).
This means that when displaying the metrics in Kibana, metrics with the same name will get mixed up even if they come from different instrumentation scopes.

It would be great if @elastic/apm-server could clarify whether I'm missing something or whether my proposed label makes sense and should maybe also be added on the OTLP intake side.

Member:

Some notes on this.

  1. https://opentelemetry.io/docs/reference/specification/metrics/api/#get-a-meter and https://opentelemetry.io/docs/reference/specification/glossary/#instrumentation-scope are not very clear, but one could read from them that the instrumentation scope version (and maybe schema_url) fields may also be relevant for uniquely identifying a scope.
  2. FWIW, looking at the Prometheus exporter from the OTel JS SDK, the instrumentation scope is not included anywhere in the output. I created a JS script that has two my_counter metrics with different instrumentation scopes:
const otel = require('@opentelemetry/api') // needed for otel.metrics.getMeter below
// ... (Metrics SDK and Prometheus exporter setup elided)

const meter = otel.metrics.getMeter('my-meter') // 'my-meter' is the instrumentation scope name
const counter = meter.createCounter('my_counter', { description: 'My Counter' })

const meter2 = otel.metrics.getMeter('my-meter2')
const counter2 = meter2.createCounter('my_counter', { description: 'My Counter 2' })

setInterval(() => {
  counter.add(1)
  counter2.add(1)
}, 1000)

The resulting exported Prom metrics:

% curl localhost:3002/metrics
# HELP target_info Target metadata
# TYPE target_info gauge
target_info{service_name="use-otel-prom",telemetry_sdk_language="nodejs",telemetry_sdk_name="opentelemetry",telemetry_sdk_version="1.9.1"} 1
# HELP my_counter_total My Counter
# TYPE my_counter_total counter
my_counter_total 317 1679701732718
# HELP my_counter_total My Counter 2
# TYPE my_counter_total counter
my_counter_total 317 1679701732718

So I wonder if this is a conflict we could come back to later, if/when there is a real use case.

  3. If we were to add an intake field or label for this, https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/mapping-to-non-otlp.md#instrumentationscope describes otel.scope.name and otel.scope.version. I'm not sure if that applies here.

JonasKunz (Contributor, Author):

> https://opentelemetry.io/docs/reference/specification/metrics/api/#get-a-meter and https://opentelemetry.io/docs/reference/specification/glossary/#instrumentation-scope are not very clear, but one could read from those that the instrumentation scope version (and maybe schema_url) fields may also be relevant for uniquely identifying a scope.

It is true that the instrumentation scope name alone might not be good enough to prevent clashes in all cases. It is good enough to prevent cross-library clashes (e.g. library A and B happen to export a metric with the same name), but it won't handle the case where you have multiple versions of the same library with breaking changes in their metrics.

My thought here was to start simple and to leave the option open to maybe just use a single label (name and version concatenated) instead of adding both immediately.

> FWIW, looking at the Prometheus exporter from the OTel JS SDK, the instrumentation scope is not included anywhere in the output. I created a JS script that has two my_counter metrics with different instrumentation scopes.
> So I wonder if this is a conflict we could come back to later, if/when there is a real use case.

Interesting observation that this is currently not even handled by the OTel Prometheus exporter.
We can definitely just ignore the instrumentation scope for now and add a label later on, as it would be a non-breaking change.

We should, however, in my opinion keep in the spec that instrumentation scopes have to use different metricsets, even if they have the same labels. Otherwise we would need to decide on a "winner" metric if we encounter a clash.

Member:

In the meeting today we discussed using otel.scope.name and otel.scope.version and perhaps adding those as top-level fields (@gregkalapos volunteered).

@Mpdreamz mentioned (if I understood correctly) that Metricbeat uses the ECS observer.name field (https://www.elastic.co/guide/en/ecs/current/ecs-observer.html#field-observer-name) for a similar purpose?

JonasKunz (Contributor, Author):

I've left the concrete addition of the otel.scope.name and otel.scope.version fields for the future, so that this doesn't block merging this spec:

> Agents MUST report metrics from different instrumentation scopes in separate metricsets to avoid naming conflicts at collection time. This separation MUST be done based on the instrumentation scope name and version. In the future, we might add dedicated intake fields to metricsets for differentiating them based on the instrumentation scope identifiers.

@JonasKunz changed the title from "Added initial proposal for otel metrics spec" to "Added specification for OpenTelemetry metric shipping" on Jan 4, 2023
## Labels

Conceptually, Elastic metric label keys correspond to [OpenTelemetry Attributes](https://opentelemetry.io/docs/reference/specification/common/#attribute).
When exporting, agents MUST convert the attributes to labels as described in this section and MUST group metrics with the same attributes and OpenTelemetry instrumentation scope together into a single metricset.
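A rough sketch of that grouping rule (helper and field names are hypothetical, and a stable serialization of the attribute set is assumed for the grouping key):

```js
// Hypothetical sketch: group metric samples into metricsets keyed by
// instrumentation scope + attribute set, converting attributes to labels.
function groupIntoMetricsets (metrics) {
  // metrics: [{ name, value, scopeName, attributes }]
  const metricsets = new Map()
  for (const m of metrics) {
    const key = JSON.stringify([m.scopeName, m.attributes]) // assumes stable attribute key order
    if (!metricsets.has(key)) {
      metricsets.set(key, { tags: { ...m.attributes }, samples: {} })
    }
    metricsets.get(key).samples[m.name] = { value: m.value }
  }
  return [...metricsets.values()]
}
```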
Member:

For the OTel tracing bridge, we introduced an otel.attributes property in the intake v2 spec: https://github.com/elastic/apm/blob/main/specs/agents/tracing-api-otel.md#attributes-mapping.

APM Server then maps known attributes to ECS and unknown attributes to labels. I wonder if we should do the same for metrics. Does APM Server currently map OTel metric attributes to ECS fields?

Maybe this isn't as relevant to metrics as it is to traces if the metric attributes are mostly used for custom user-defined attributes.

Contributor:

The server maps the metadata and some known and defined metrics.

gregkalapos (Contributor) left a comment:

I did a first parse, great progress on this 👍 .

A few minor comments below.

For **1.** we want to make sure that the existing user configuration is respected by our apm agents. Ideally, agents SHOULD just register an additional exporter to the existing OpenTelemetry Metrics SDK instance(s). If the agent and language capabilities allow it, the exporter registration SHOULD be done automatically without requiring code changes by the user.

For **2.** agents MAY automatically register an agent-provided SDK instance to bind the user provided OpenTelemetry API to, if this is possible in their language and does not cause too much overhead of any kind (e.g. implementation or agent package size).
Agents MUST NOT override a user-provided global OpenTelemetry metrics SDK with their own SDK.
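As an illustration of the "additional exporter" approach in a Node.js flavour (a sketch, not the agents' actual implementation: the exporter passed in is assumed to be an agent-internal one that ships metricsets to APM Server, and the SDK call shown is the @opentelemetry/sdk-metrics API of that era):

```js
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics')

// apmMetricExporter is a hypothetical agent-internal PushMetricExporter that
// converts OTel metric data into Elastic metricsets and sends them to APM Server.
function attachAgentExporter (userMeterProvider, apmMetricExporter) {
  // Register an additional reader on the user's MeterProvider; any exporters
  // the user already configured keep working side by side.
  userMeterProvider.addMetricReader(new PeriodicExportingMetricReader({
    exporter: apmMetricExporter,
    exportIntervalMillis: 30000
  }))
}
```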
Member:

[minor] Maybe worth mentioning that this is different from the OTel bridge, where the agent overrides the values returned by GlobalOpenTelemetry with its own implementation. Here, with the metrics SDK, the agent just adds an extra exporter at the last step of the metrics pipeline (potentially replacing another one that sends metrics to another backend).

JonasKunz (Contributor, Author):

> Maybe worth mentioning that this is different from the OTel bridge where the agent overrides the values returned by GlobalOpenTelemetry with its own implementation.

Isn't GlobalOpenTelemetry something Java-specific? It is just the Java approach of binding the OpenTelemetry API to an implementation (e.g. an SDK or our bridge).

> Here with metrics SDK the agent just adds an extra exporter at the last step of the metrics pipeline (potentially replacing another sending metrics to another backend)

I don't quite understand what you mean here. When we add our exporter, we don't replace existing exporters on the user-provided metrics SDK instance. Instead, both exporters live happily side by side.

@JonasKunz JonasKunz merged commit 29fd2d3 into elastic:main Apr 27, 2023
@JonasKunz JonasKunz deleted the otel-metrics branch April 27, 2023 07:10
trentm added a commit to elastic/apm-agent-nodejs that referenced this pull request Apr 28, 2023
With this change the APM agent adds support for using OTel Metrics in two ways:

1. The `@opentelemetry/api` package is instrumented so that usage of
   `.metrics.getMeterProvider(...)` will use one provided by the APM agent,
   if the user hasn't explicitly provided one of their own. This implementation
   uses a version of the OTel Metrics SDK included with the APM agent. It
   is configured to send metrics to APM server by default. This allows a user
   to use the OTel Metrics API and get metrics sent to APM server without any
   configuration.
2. Alternatively, the `@opentelemetry/sdk-metrics` package is instrumented so
   that when a user creates a `MeterProvider`, the APM agent will add a
   `MetricReader` that sends metrics to APM server. This allows a user to
   configure metrics as they wish using the OTel Metrics SDK and then
   automatically get those metrics sent to APM server.

This also adds some grouping in ".ci/tav.json" for the TAV workflow to avoid the
256-job GH Actions limit. I'm not yet sure if those will work well.

Closes: #2954
Refs: elastic/apm#691
Refs: elastic/apm#742 (spec PR)
trentm added a commit to elastic/apm-agent-nodejs that referenced this pull request May 16, 2023
fpm-peter pushed a commit to fpm-git/apm-agent-nodejs that referenced this pull request Aug 20, 2024