Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate replacing libbeat monitoring usage with OpenTelemetry instrumentation #14488

Open
axw opened this issue Oct 31, 2024 · 10 comments
Open

Comments

@axw
Copy link
Member

axw commented Oct 31, 2024

We have had issues related to libbeat monitoring plaguing us for years, notably #8383.

We should consider removing our dependency on libbeat for monitoring apm-server, and instrumenting APM Server with the OpenTelemetry API instead. We would need to arrange for the metrics to be written to the monitoring Elasticsearch cluster with the same schema as produced by libbeat; we would also need to expose these metrics via the stats API, and reported in logs.

Hat tip to @marclop for the idea.

@axw
Copy link
Member Author

axw commented Dec 2, 2024

For context below, APM Server monitoring builds on libbeat monitoring, and relies on the .monitoring-beats index template at https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/core/template-resources/src/main/resources/monitoring-beats.json. There's tight coupling between this index template and any metrics we add to apm-server. Moreover, this existing template does not have any way of expressing metric dimensions, nor does it support complex metrics such as histograms.

There's a few approaches we could take to migrating to the OTel API, which have varying trade-offs.

  1. Use the OTel API only in apm-server code bases, and adapt that to the existing libbeat monitoring framework. libbeat would continue to produce metrics directly to the libbeat monitoring framework
  2. Use the OTel API for all metrics, export metrics to Elasticsearch in the same format as libbeat does with a new exporter
  3. Use the OTel API for all metrics, export metrics to Elasticsearch in the new OTel-native format and update the stack monitoring & stack telemetry code in Kibana to match

(1) is probably the least amount of work. It will not remove any of the limitations that come with the existing libbeat monitoring template. Moreover, since the OTel API enables users to provide dimensions (attributes) and complex metrics, developers may be surprised when these are dropped or don't behave as they should.
(2) is an incremental improvement over (1) in that it would remove any reliance on the libbeat/monitoring API, but comes with extra work in the form of implementing a new exporter.
(3) is I think the ideal approach technically, but comes with the most amount of work: a new exporter and changes to Kibana.

@simitt
Copy link
Contributor

simitt commented Dec 2, 2024

@marclop @axw could you describe the risk of not implementing this change in more detail. In a next step, we then need to gather @mlunadia 's input for prioritization of it.

@axw
Copy link
Member Author

axw commented Dec 4, 2024

I would say the main risks of not implementing this are:

The second issue is more pressing IMO. I'll see if I can do a quick spike on (1) from #14488 (comment), as I think we could get something working reasonably quickly to address that issue.

@axw
Copy link
Member Author

axw commented Dec 5, 2024

It's fairly hacky, but I got something working here: https://github.com/axw/apm-server/tree/otel-adapter

I've updated all the apm-server code that was registering metrics with libbeat monitoring to use opentelemetry-go metrics instead. Then I've configured a ManualReader, and registered libbeat monitoring functions to use that and adapt otel metrics to libbeat metrics.

The adaptation works by translating go-docappender metrics (a bit brittle, would need comprehensive unit tests) to libbeat equivalents. It also directly copies across any metrics that fit the following constraints:

  • integer gauge/counter type
  • name starts with apm-server.
  • within a scope that starts with github.com/elastic/apm-server
  • has no attributes (dimensions)

This should address the symptoms of #8383 since the metrics will be persistent across reloads, whereas previously we would recreate all the metrics on each reload.

@marclop I'd be keen to hear what you think about this approach. It's not the most robust thing we could do, but I think it'll be an incremental improvement and I expect we should be able to complete it in a week or so.

@mlunadia
Copy link

mlunadia commented Dec 5, 2024

To add to the risks section: a libbeat dependency was also the cause for us not being able to release EDOT on Windows which indicates to me that there are some headaches we can avoid by working on this.

Have we estimated the time required for a production-ready implementation?

@marclop
Copy link
Contributor

marclop commented Dec 6, 2024

@axw just scanned the linked branch and the approach looks good to me. I had forgotten how rudimentary libbeat metric registries are and how they don't have dimensions.

I think that estimate is accurate and we should be able to have unit tests and test it with a decent amount of confidence to make the change.

@simitt
Copy link
Contributor

simitt commented Dec 6, 2024

@mlunadia my interpretation is that getting the POC that @axw did to production ready should take roughly 1 week (maybe add a bit of buffer). I am strongly +1 on investing the time to apply this and then see which further steps we might want to take in the future.

@kruskall
Copy link
Member

kruskall commented Dec 9, 2024

Use the OTel API only in apm-server code bases, and adapt that to the existing libbeat monitoring framework. libbeat would continue to produce metrics directly to the libbeat monitoring framework

This feels like hiding the libbeat dependency instead of replacing it with otel (which was the goal of this issue) :(

(3) is I think the ideal approach technically, but comes with the most amount of work: a new exporter and changes to Kibana.

fully agree 👍


Mostly reposting here what we discussed last week but would it make sense to produce the otel metrics both to the libbeat monitoring and directly ?
That way we can gradually update stack monitoring, etc. and remove the libbeat monitoring once that's done

@axw
Copy link
Member Author

axw commented Dec 10, 2024

Use the OTel API only in apm-server code bases, and adapt that to the existing libbeat monitoring framework. libbeat would continue to produce metrics directly to the libbeat monitoring framework

This feels like hiding the libbeat dependency instead of replacing it with otel (which was the goal of this issue) :(

The ultimate goal is indeed to remove it, but I don't think we can commit to that amount of work for 9.0. To remove libbeat monitoring we will need to migrate to the OTel API. Once we're ready to fully switch off it, we'll just remove (or perhaps change) the adapter code.

It is hiding (aka encapsulating) the use of libbeat monitoring. That is a legitimate design pattern that will simplify its removal later.

Mostly reposting here what we discussed last week but would it make sense to produce the otel metrics both to the libbeat monitoring and directly ?
That way we can gradually update stack monitoring, etc. and remove the libbeat monitoring once that's done

@kruskall what would "and directly" mean here? We already capture some metrics with the OTel API, but they go through apmotel and do not end up in the stack monitoring indices or in a compatible format. For them to end up in the right indices, and in an appropriate format, would constitute implementing option (2) or option (3).

@kruskall
Copy link
Member

kruskall commented Jan 1, 2025

Opened #15094 to add @axw branch (it also has a commit to send the metrics directly as a demo, reverted since it's just for demonstration)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants