
Report detailed queue metrics #31113

Closed
cmacknz opened this issue Apr 1, 2022 · 19 comments
Labels: Team:Elastic-Agent-Data-Plane, v8.5.0

Comments

@cmacknz
Member

cmacknz commented Apr 1, 2022

The memory and disk queues implemented by beats report minimal metrics about their state, which makes it difficult to understand how a queue is changing over time. We should expand the set of queue metrics that are reported, as the queue is an excellent place to observe whether the beats system is experiencing backpressure.

For both queue types we should report at least the following additional metrics:

  • The count of times the queue reached its configured maximum size.
  • A boolean value indicating whether or not the queue is full.
  • The current queue fill level in the same units as the queue size limit.
  • The age of the oldest item in the queue.
  • The current queue lag. Lag is the difference between the producer position and the consumer position in the queue.

Of those metrics, the queue lag is the most useful for diagnosing backpressure. In systems built with Apache Kafka, the equivalent consumer lag measurement is a key performance indicator of the system. See example uses and graphs of consumer lag at https://www.confluent.io/blog/kafka-lag-monitoring-and-metrics-at-appsflyer/ (the article itself is only slightly relevant, but ideally we want to be able to easily generate and obtain the same types of graphs for the beats pipeline).

Queue or consumer lag trending up indicates the output cannot keep up with the produced data volume. Ideally we want the ability to easily graph this value over time, but having the instantaneous value is a starting point. Perhaps we can try to calculate the derivative as a metric to see the trend more easily. This should help address #7743, but for all beats, since it depends only on the queue state.
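
As a rough sketch only (hypothetical counter names, not the existing beats API), the lag and its trend could be derived from producer/consumer positions like this:

```go
package queuemetrics

import "time"

// queueLag is the instantaneous lag: how far the consumer position trails
// the producer position. Both values are assumed to be monotonically
// increasing event offsets maintained by the queue (hypothetical names).
func queueLag(producerPos, consumerPos uint64) uint64 {
	if consumerPos > producerPos {
		return 0 // defensive; positions should never cross
	}
	return producerPos - consumerPos
}

// lagRate approximates the derivative of the lag between two samples, in
// events per second. A sustained positive rate means the output is not
// keeping up with the produced data volume (backpressure is building).
func lagRate(prevLag, curLag uint64, prevAt, curAt time.Time) float64 {
	dt := curAt.Sub(prevAt).Seconds()
	if dt <= 0 {
		return 0
	}
	return (float64(curLag) - float64(prevLag)) / dt
}
```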

The metrics should be periodically published into the beat log, as is currently done here: https://github.com/elastic/beats/blob/main/libbeat/monitoring/report/log/log.go#L188. See #24700 for an example of how to add new beats metrics.
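
For context, a minimal sketch of how the additional gauges might be registered through libbeat's monitoring package; the registry and metric names here are illustrative assumptions, not a final schema:

```go
package queuemetrics

import "github.com/elastic/beats/v7/libbeat/monitoring"

// registerQueueMetrics attaches the proposed metrics to a monitoring
// registry (normally the pipeline's existing "queue" namespace). The
// names below are placeholders for discussion, not an agreed schema.
func registerQueueMetrics(parent *monitoring.Registry) {
	reg := parent.NewRegistry("detail")

	monitoring.NewUint(reg, "full.count")    // times the queue reached its configured maximum size
	monitoring.NewBool(reg, "full")          // whether the queue is currently full
	monitoring.NewUint(reg, "filled.events") // current fill level, same units as the size limit
	monitoring.NewInt(reg, "oldest.age.ms")  // age of the oldest item in the queue
	monitoring.NewUint(reg, "lag")           // producer position minus consumer position
}
```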

We can split this issue up for each of the two queue types if needed.

Note: The existing beats have an active counter (https://github.com/elastic/beats/blob/main/libbeat/publisher/pipeline/monitoring.go#L99) and a done/published counter. The difference between these two might give you a measurement similar to the queue lag described above. However, where the existing metrics are incremented and decremented will likely need to change when they are used in the shipper.

@cmacknz cmacknz added the v8.3.0 label Apr 1, 2022
@botelastic botelastic bot added the needs_team label Apr 1, 2022
@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane label Apr 1, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@fearful-symmetry
Contributor

Is there a reason we're linking this back to the shipper? Is the plan to more or less import the memory queue into the shipper?

Also, maybe a dumb question, but is this getting wrapped into whatever we have planned for APM-style tracing in the shipper, or is this considered a distinct metrics/debugging component that just serves the queues?

@cmacknz
Member Author

cmacknz commented Apr 4, 2022

Is there a reason we're linking this back to the shipper? Is the plan to more or less import the memory queue into the shipper?

Yes, I expect we'll import the beats queues initially which allows us to improve them for both projects. I don't see a reason to do otherwise yet.

Also, maybe a dumb question, but is this getting wrapped into whatever we have planned for APM-style tracing in the shipper, or is this considered a distinct metrics/debugging component that just serves the queues?

The second one. It is related in that it improves our observability, but you can't get queue/consumer lag via traces for example. You have to instrument the queue itself to obtain this.

I want to start with these metrics to see how much value we can get out of just having the queue lag metric for detecting back pressure and input/output rate mismatches without tracing. It might make sense to prioritize being able to visualize the metrics from the logs over adding tracing everywhere initially.

@zez3

zez3 commented Apr 4, 2022

@cmacknz
Would those new detailed queue metrics also be exposed via the HTTP endpoint metrics (default port 5066)?
The idea is to use those metrics (lag and/or fill) in a health check on the LB in front, so that I can declare that beat (backend) dead.

@cmacknz
Member Author

cmacknz commented Apr 4, 2022

Yes, they should be available under /stats or /debug/vars when the HTTP endpoint is enabled, alongside the current set of metrics.

@zez3

zez3 commented Apr 4, 2022

@joshdover Do we need a feature request (ER) for Fleet to have the beats HTTP endpoint stats option enabled in the policy?

@fearful-symmetry
Contributor

@faec With all the changes merged, should we actually begin work on implementing the interface we got in #31289?

Also, when we merged #31289 one thing @faec and I weren't sure about (and thus didn't add) was the concept of queue lag, since the way we both understood it, it seemed equivalent to the queue size, as it was just the difference between the newest and oldest event in the queue. @cmacknz can you elaborate a bit more on what that should look like versus the queue size/queue event count?

@cmacknz
Member Author

cmacknz commented May 16, 2022

Also, when we merged #31289 one thing @faec and I weren't sure about (and thus didn't add) was the concept of queue lag, since the way we both understood it, it seemed equivalent to the queue size, as it was just the difference between the newest and oldest event in the queue. @cmacknz can you elaborate a bit more on what that should look like versus the queue size/queue event count?

It depends on how queue size is defined. The queue lag is the difference between a consumer's position in the queue and the position of the newest item in the queue.

In the case where there is only ever a single consumer, and items are immediately deleted from the queue once they are successfully written to the output, the queue size and the lag are the same thing.

The two cases I can think of where queue lag and queue size would diverge are:

  1. Items are not immediately deleted from the queue when successfully written and deletion is a background task. I believe this is how the disk queue works with segment file deletions instead of event deletions. We would want a separation between the consumer position and the queue size on disk.
  2. There is more than one consumer. We don't have a use case for this yet.

I would think the queue size is probably equal to the lag for the memory queue but not the disk queue.
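
To make the distinction concrete, here is a sketch assuming the queue tracks three monotonically increasing offsets; the field names are illustrative, not the actual queue internals:

```go
package queuemetrics

// queuePositions captures the offsets that separate queue size from lag.
type queuePositions struct {
	produced uint64 // offset of the newest event written to the queue
	consumed uint64 // offset of the next event the consumer will process
	deleted  uint64 // oldest offset still physically retained (e.g. by segment files)
}

// size counts events still occupying storage; with deferred deletion
// (as with the disk queue's segment files) this can exceed the lag.
func (p queuePositions) size() uint64 { return p.produced - p.deleted }

// lag counts events the consumer has not yet processed; for the memory
// queue with immediate deletion, size and lag coincide.
func (p queuePositions) lag() uint64 { return p.produced - p.consumed }
```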

@fearful-symmetry
Contributor

@faec ^ does that make sense? I know we removed the ability to have more than one consumer, but I'm not sure how much of a possibility there is for that kind of lag between acking(?) and deleting an event.

@faec
Contributor

faec commented May 16, 2022

I know we removed the ability to have more than one consumer

Well, we didn't really remove any abilities -- there can still be multiple "consumers" in multiple threads, but now they can all make their polling calls against the same queue object instead of needing to allocate additional helpers. This was possible because the consumers carried no state of their own (although that wasn't obvious from the previous interface).

I think @cmacknz's description of lag makes more sense to me, though it's relying on some rather subtle order-of-operations for a user-visible metric, and it might be hard to interpret in isolation. As one example, besides asynchronous event deletion there is also a gap between the "position" as in the next event that will be read by a consumer and "position" as in the most recent fully-acknowledged event (which lags behind the former, sometimes by a lot), and this would be missed by just looking at events pending deletion.

There are three states a queue event could be in -- never been read, read but awaiting ack, and acked but awaiting cleanup. (Maybe four if we also distinguish events that have been read but not yet sent from events awaiting retry after a failed send.) The event cleanup issue certainly applies to the disk queue, but arguably also to the memory queue, since out-of-order ACKs are possible but will not result in event cleanup until every event prior to it is also acknowledged. In this case the metrics might look like the entire queue is still awaiting acknowledgment when really there is just one stuck event. Tracking out-of-order ACKs would require some work in both queues, and might not be worth it; but explaining why we don't track it when we describe a metric as excluding already-acked events also doesn't sound fun.

I wonder if a different framing would make sense. What questions are we trying to answer with the metric that aren't already answered by occupied vs total size? Occupied vs total can tell us roughly how constrained the throughput is relative to the input volume. But knowing whether the last 5-10% has already been acknowledged and is awaiting cleanup (as with a segment file) doesn't change that much -- those numbers will basically always look the same. The main failure mode it might detect is "nothing has been acknowledged ever" (and even that would require other metrics to confirm since all segment files start out that way).

It seems to me what we're trying to surface is, "if the queue is highly congested, why?" The obvious top-level reasons are:

  • everything is "fine" but the input's throughput is higher than the output's.
  • the output's throughput is low because we can't successfully communicate with the server
  • the output's throughput is low because we are communicating with the server but some events are getting rejected

An example from a current SDH is a Kafka case where consistent failure and retry of one event ends up blocking the entire pipeline because that event is never acknowledged, so the "oldest" entry in the queue can never move beyond that point -- the rest of the queue is acknowledged but idle.

I think what I'd like in a perfect world would be three metrics:

  • something like occupied_inactive or occupied_idle representing the events that have been acknowledged but, for whatever reason, still occupy the queue. This is similar to "lag" as described above except it would ideally include out-of-order ACKs.
  • occupied_read, representing the events that have been given to a consumer but not yet ACKed (showing what proportion of the queue we're trying to send).
  • occupied_retry, representing the events that have been read and failed to send at least once.

These numbers would give clear criteria to distinguish between the three types of congestion. I'm not sure all of them should be implemented by the queue itself (I expect our ACK handling to evolve and that will affect where it makes sense to monitor such things), and maybe we can get away with fewer values... but my hope is that whatever we put here will give as much clarity as possible about the conceptual state of the queue, rather than things like segment round-off that to me feels more like an implementation detail.
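
To summarize, a sketch of how the occupied portion of the queue might break down; the names follow the list above, but the struct itself is only an assumption about how it could be reported:

```go
package queuemetrics

// occupancy splits the events currently occupying the queue into the
// proposed buckets. retry is a subset of read, so it is not counted
// separately in the total.
type occupancy struct {
	inactive uint64 // occupied_inactive: ACKed but still occupying the queue
	read     uint64 // occupied_read: handed to a consumer, not yet ACKed
	retry    uint64 // occupied_retry: read and failed to send at least once
	unread   uint64 // never handed to a consumer
}

// occupied is the total number of events currently in the queue.
func (o occupancy) occupied() uint64 {
	return o.inactive + o.read + o.unread
}
```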

@fearful-symmetry
Contributor

I really like @faec's idea of the occupied_* metrics; I think that makes a lot of sense as far as reporting the causes of congestion in the queue.

I'm not sure all of them should be implemented by the queue itself (I expect our ACK handling to evolve and that will affect where it makes sense to monitor such things), and maybe we can get away with fewer values

Yah, right now so much of the queue, and shipper, is in such an evolving state with so many missing components that I'm not really sure if it's wise to try to implement stuff like this unless we're 100% sure about the APIs and designs, which it seems like we're not. The interface is all opt values, so we can implement stuff piecemeal as needed.

@cmacknz
Member Author

cmacknz commented May 16, 2022

Yes, excellent summary @faec, thanks! There are a lot of implementation details to take into account with this.

My mental model of lag is based on Kafka topic lag, where the consumer position is based on the last committed offset of a consumer for a given topic. Committing an offset in Kafka means the consumer has marked the event with that offset as read and does not want to consume it again. It is also decoupled from the deletion time of the event, which may be much later or never.

I think this maps to the point where the event has been successfully acknowledged by the output in the beats system.

It seems to me what we're trying to surface is, "if the queue is highly congested, why?" The obvious top-level reasons are:

  • everything is "fine" but the input's throughput is higher than the output's.
  • the output's throughput is low because we can't successfully communicate with the server
  • the output's throughput is low because we are communicating with the server but some events are getting rejected

Yes, exactly. Particularly for the "everything is 'fine' but the input's throughput is higher than the output's" case, this is extremely difficult to figure out right now. For the other two you would ideally see errors in the logs, but you may not be able to tell whether they are transient or backing up the system.

My biggest question to @faec from this is what is the right point in time to implement these metrics given all the changes that are going on?

@faec
Contributor

faec commented May 16, 2022

My biggest question to @faec from this is what is the right point in time to implement these metrics given all the changes that are going on?

It would make sense to think about it once there are real ACKs flowing in the shipper (which I hope will be this week). I expect occupied_read will be pretty straightforward in both queues and could be started anytime. inactive might take longer -- the memory queue in particular can't currently see out-of-order ACKs internally, so it would require some new mechanism for its ACK bookkeeping. It could be started anytime, but it's a bigger task.

retry is hard for different reasons -- last I knew, we're expecting outputs to be responsible for retries, and if they do end up reporting when that happens I don't really think it should be to the queue; it would require significant new APIs and metadata for something that's only indirectly related to queue operation. [Related: currently, acknowledgement is all-or-nothing for whatever batch the queue provided, so until we dig into ACK cleanups we have limited granularity, especially in the memory queue.] So retry, I think, might need to wait longer, until we figure out what the contract between the shipper and its outputs is -- to me it feels more like an output metric than a queue metric.

Part of my hope for simplifying this whole process is to move the logic of batch assembly and ACK bookkeeping outside the queue. Those are the cause of most of the memory queue's complexity and rigidity. Splitting them out would let us keep the current (quite opinionated) beats pipeline model while adopting a simpler and more trackable data flow in the shipper, and if this happened it would be an excellent place to add these kinds of monitoring hooks.

@fearful-symmetry
Contributor

It would make sense to think about it once there are real ACKs flowing in the shipper (which I hope will be this week).

Agreed. I think that once the shipper is connected to some kind of external component and does something conceptually useful, a lot more of the pieces will fall into place.

Part of my hope for simplifying this whole process is to move the logic of batch assembly and ACK bookkeeping outside the queue. Those are the cause of most of the memory queue's complexity and rigidity. Splitting them out would let us keep the current (quite opinionated) beats pipeline model while adopting a simpler and more trackable data flow in the shipper, and if this happened it would be an excellent place to add these kinds of monitoring hooks.

Is there an issue/PR for this? Been slowly trying to understand how the memory queue actually works, so I'm kinda interested.

@faec
Contributor

faec commented May 17, 2022

Is there an issue/PR for this? Been slowly trying to understand how the memory queue actually works, so I'm kinda interested.

No, this was initially an independent cleanup that I was hoping to do during ON week but it got delayed by test failures in an earlier stage. But I learned enough in the process that it's probably worth writing out a plan, since one way or another we have to integrate with this code soon. I'll write up a summary of the design I had in mind.

@cmacknz
Member Author

cmacknz commented May 17, 2022

@fearful-symmetry I spoke with @faec on the best path forward for this.

occupied_read, representing the events that have been given to a consumer but not yet ACKed (showing what proportion of the queue we're trying to send).

I expect occupied_read will be pretty straightforward in both queues and could be started anytime.

I would pursue the occupied_read queue lag metric first since it should be the simplest. We just need to get something implemented and experiment with how valuable it is. We can improve the accuracy of the metrics over time as we learn more about how they behave.

I strongly suspect we will also need a metric for the age of the oldest item in the queue to go along with this. This would allow us to distinguish the case where the queue is full but events are still flowing through it (high latency) from the case where the queue is full and nothing is flowing through it (zero throughput).

For an oldest-item-in-the-queue metric, there may be some trickiness around ensuring the event always has a timestamp. At least for recent versions of beats and agent, we are always writing data streams that must have a timestamp.
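
A sketch of the oldest-item age measurement, assuming the queue records when each event was enqueued (using the enqueue time rather than the event's own @timestamp would sidestep the missing-timestamp concern, though that is an assumption about the eventual design):

```go
package queuemetrics

import "time"

// oldestAge reports how long the oldest entry has been sitting in the queue.
// A full queue with a small, stable age means events are flowing but with
// high latency; a full queue with an ever-growing age means nothing is
// flowing (zero throughput).
func oldestAge(oldestEnqueuedAt, now time.Time) time.Duration {
	if oldestEnqueuedAt.IsZero() {
		return 0 // empty queue, or enqueue time not recorded
	}
	return now.Sub(oldestEnqueuedAt)
}
```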

I would recommend creating two separate issues to track both of these metric implementations separately.

@fearful-symmetry
Contributor

@cmacknz So, my plan was to actually start with the super basic metrics, the count and limit values, and hope that once I did that, I'd feel a little more confident diving into the more complex derived metrics. I'm not sure how many of the other metrics, like "oldest in queue", are soft-blocked by other refactoring work, or by elastic/elastic-agent-shipper#27.

@cmacknz
Member Author

cmacknz commented May 17, 2022

Ah yes, I forgot about those. Then let's go with the order:

  1. count & limit
  2. occupied_read
  3. oldest item in queue

@cmacknz
Member Author

cmacknz commented Oct 13, 2022

Closing in favour of the top level shipper metrics issue: elastic/elastic-agent-shipper#11

We will simplify the implementation to just tracking the age of the oldest item in the queue and add more detail as needed.
