
Report detailed queue metrics #31113

Closed
cmacknz opened this issue Apr 1, 2022 · 19 comments
Labels: Team:Elastic-Agent-Data-Plane, v8.5.0

Comments

@cmacknz
Member

cmacknz commented Apr 1, 2022

The memory and disk queues implemented by beats report minimal metrics about their state, which makes it difficult to understand how a queue is changing over time. We should expand the set of queue metrics that are reported, as the queue is an excellent place to observe whether the beats system is experiencing backpressure.

For both queue types we should report at least the following additional metrics:

  • The count of times the queue reached its configured maximum size.
  • A boolean value indicating whether or not the queue is full.
  • The current queue fill level in the same units as the queue size limit.
  • The age of the oldest item in the queue.
  • The current queue lag. Lag is the difference between the producer position and the consumer position in the queue.

Of those metrics, the queue lag is the most useful for diagnosing backpressure. In systems built with Apache Kafka, the equivalent consumer lag measurement is a key performance indicator of the system. See example uses and graphs of consumer lag at https://www.confluent.io/blog/kafka-lag-monitoring-and-metrics-at-appsflyer/ (the article itself is only slightly relevant, but ideally we want to be able to easily generate and obtain the same types of graphs for the beats pipeline).

Queue or consumer lag trending up indicates the output cannot keep up with the produced data volume. Ideally we want the ability to easily graph this value over time, but having the instantaneous value is a starting point. Perhaps we can try to calculate the derivative as a metric to see the trend more easily. This should help address #7743, but for all beats, since it depends only on the queue state.
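
As a rough sketch only (hypothetical counter names, not the existing beats API), the lag and its trend could be derived from producer/consumer positions like this:

```go
package queuemetrics

import "time"

// queueLag is the instantaneous lag: how far the consumer position trails
// the producer position. Both values are assumed to be monotonically
// increasing event offsets maintained by the queue (hypothetical names).
func queueLag(producerPos, consumerPos uint64) uint64 {
	if consumerPos > producerPos {
		return 0 // defensive; positions should never cross
	}
	return producerPos - consumerPos
}

// lagRate approximates the derivative of the lag between two samples, in
// events per second. A sustained positive rate means the output is not
// keeping up with the produced data volume (backpressure is building).
func lagRate(prevLag, curLag uint64, prevAt, curAt time.Time) float64 {
	dt := curAt.Sub(prevAt).Seconds()
	if dt <= 0 {
		return 0
	}
	return (float64(curLag) - float64(prevLag)) / dt
}
```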

The metrics should be periodically published into the beat log, as is currently done here: https://github.com/elastic/beats/blob/main/libbeat/monitoring/report/log/log.go#L188. See #24700 for an example of how to add new beats metrics.
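
For context, a minimal sketch of how the additional gauges might be registered through libbeat's monitoring package; the registry and metric names here are illustrative assumptions, not a final schema:

```go
package queuemetrics

import "github.com/elastic/beats/v7/libbeat/monitoring"

// registerQueueMetrics attaches the proposed metrics to a monitoring
// registry (normally the pipeline's existing "queue" namespace). The
// names below are placeholders for discussion, not an agreed schema.
func registerQueueMetrics(parent *monitoring.Registry) {
	reg := parent.NewRegistry("detail")

	monitoring.NewUint(reg, "full.count")    // times the queue reached its configured maximum size
	monitoring.NewBool(reg, "full")          // whether the queue is currently full
	monitoring.NewUint(reg, "filled.events") // current fill level, same units as the size limit
	monitoring.NewInt(reg, "oldest.age.ms")  // age of the oldest item in the queue
	monitoring.NewUint(reg, "lag")           // producer position minus consumer position
}
```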

We can split this issue up for each of the two queue types if needed.

Note: The existing beats have an active counter (https://github.com/elastic/beats/blob/main/libbeat/publisher/pipeline/monitoring.go#L99) and a done/published counter. The difference between these two might give you a measurement similar to the queue lag described above. However, where the existing metrics are incremented and decremented will likely need to change when they are used in the shipper.

@cmacknz cmacknz added the v8.3.0 label Apr 1, 2022
@botelastic botelastic bot added the needs_team label Apr 1, 2022
@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane label Apr 1, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@fearful-symmetry
Contributor

Is there a reason we're linking this back to the shipper? Is the plan to more or less import the memory queue into the shipper?

Also, maybe a dumb question, but is this getting wrapped into whatever we have planned for APM-style tracing in the shipper, or is this considered a distinct metrics/debugging component that just serves the queues?

@cmacknz
Member Author

cmacknz commented Apr 4, 2022

Is there a reason we're linking this back to the shipper? Is the plan to more or less import the memory queue into the shipper?

Yes, I expect we'll import the beats queues initially which allows us to improve them for both projects. I don't see a reason to do otherwise yet.

Also, maybe a dumb question, but is this getting wrapped into whatever we have planned for APM-style tracing in the shipper, or is this considered a distinct metrics/debugging component that just serves the queues?

The second one. It is related in that it improves our observability, but you can't get queue/consumer lag via traces for example. You have to instrument the queue itself to obtain this.

I want to start with these metrics to see how much value we can get out of just having the queue lag metric for detecting back pressure and input/output rate mismatches without tracing. It might make sense to prioritize being able to visualize the metrics from the logs over adding tracing everywhere initially.

@zez3

zez3 commented Apr 4, 2022

@cmacknz
Would those new detailed queue metrics also be exposed via the HTTP endpoint metrics (default port 5066)?
The idea is to use those metrics (lag and/or fill) in a health check on the LB in front, so that I can declare that beat (backend) dead.

@cmacknz
Member Author

cmacknz commented Apr 4, 2022

Yes, they should be available under /stats or /debug/vars when the HTTP endpoint is enabled, alongside the current set of metrics.

@zez3

zez3 commented Apr 4, 2022

@joshdover Do we need a feature request (ER) for Fleet to have the beats HTTP endpoint stats option enabled in the policy?

@fearful-symmetry
Contributor

@faec With all the changes merged, should we actually begin work on implementing the interface we got in #31289?

Also, when we merged #31289 one thing @faec and I weren't sure about (and thus didn't add) was the concept of queue lag, since the way we both understood it, it seemed equivalent to the queue size, as it was just the difference between the newest and oldest event in the queue. @cmacknz can you elaborate a bit more on what that should look like versus the queue size/queue event count?

@cmacknz
Member Author

cmacknz commented May 16, 2022

Also, when we merged #31289 one thing @faec and I weren't sure about (and thus didn't add) was the concept of queue lag, since the way we both understood it, it seemed equivalent to the queue size, as it was just the difference between the newest and oldest event in the queue. @cmacknz can you elaborate a bit more on what that should look like versus the queue size/queue event count?

It depends on how queue size is defined. The queue lag is the difference between a consumer's position in the queue and the position of the newest item in the queue.

In the case where there is only ever a single consumer, and items are immediately deleted from the queue once they are successfully written to the output, the queue size and the lag are the same thing.

The two cases I can think of where queue lag and queue size would diverge are:

  1. Items are not immediately deleted from the queue when successfully written and deletion is a background task. I believe this is how the disk queue works with segment file deletions instead of event deletions. We would want a separation between the consumer position and the queue size on disk.
  2. There is more than one consumer. We don't have a use case for this yet.

I would think the queue size is probably equal to the lag for the memory queue but not the disk queue.
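
To make the distinction concrete, here is a sketch assuming the queue tracks three monotonically increasing offsets; the field names are illustrative, not the actual queue internals:

```go
package queuemetrics

// queuePositions captures the offsets that separate queue size from lag.
type queuePositions struct {
	produced uint64 // offset of the newest event written to the queue
	consumed uint64 // offset of the next event the consumer will process
	deleted  uint64 // oldest offset still physically retained (e.g. by segment files)
}

// size counts events still occupying storage; with deferred deletion
// (as with the disk queue's segment files) this can exceed the lag.
func (p queuePositions) size() uint64 { return p.produced - p.deleted }

// lag counts events the consumer has not yet processed; for the memory
// queue with immediate deletion, size and lag coincide.
func (p queuePositions) lag() uint64 { return p.produced - p.consumed }
```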

@fearful-symmetry
Contributor

@faec ^ does that make sense? I know we removed the ability to have more than one consumer, but I'm not sure how much of a possibility there is for that kind of lag between acking(?) and deleting an event.

@faec
Contributor

faec commented May 16, 2022

I know we removed the ability to have more than one consumer

Well, we didn't really remove any abilities -- there can still be multiple "consumers" in multiple threads, but now they can all make their polling calls against the same queue object instead of needing to allocate additional helpers. This was possible because the consumers carried no state of their own (although that wasn't obvious from the previous interface).

I think @cmacknz's description of lag makes more sense to me, though it's relying on some rather subtle order-of-operations for a user-visible metric, and it might be hard to interpret in isolation. As one example, besides asynchronous event deletion there is also a gap between the "position" as in the next event that will be read by a consumer and "position" as in the most recent fully-acknowledged event (which lags behind the former, sometimes by a lot), and this would be missed by just looking at events pending deletion.

There are three states a queue event could be in -- never been read, read but awaiting ack, and acked but awaiting cleanup. (Maybe four if we also distinguish events that have been read but not yet sent from events awaiting retry after a failed send.) The event cleanup issue certainly applies to the disk queue, but arguably also to the memory queue, since out-of-order ACKs are possible but will not result in event cleanup until every event prior to it is also acknowledged. In this case the metrics might look like the entire queue is still awaiting acknowledgment when really there is just one stuck event. Tracking out-of-order ACKs would require some work in both queues, and might not be worth it; but explaining why we don't track it when we describe a metric as excluding already-acked events also doesn't sound fun.

I wonder if a different framing would make sense. What questions are we trying to answer with the metric that aren't already answered by occupied vs total size? Occupied vs total can tell us roughly how constrained the throughput is relative to the input volume. But knowing whether the last 5-10% has already been acknowledged and is awaiting cleanup (as with a segment file) doesn't change that much -- those numbers will basically always look the same. The main failure mode it might detect is "nothing has been acknowledged ever" (and even that would require other metrics to confirm since all segment files start out that way).

It seems to me what we're trying to surface is, "if the queue is highly congested, why?" The obvious top-level reasons are:

  • everything is "fine" but the input's throughput is higher than the output's.
  • the output's throughput is low because we can't successfully communicate with the server
  • the output's throughput is low because we are communicating with the server but some events are getting rejected

An example from a current SDH is a Kafka case where consistent failure and retry of one event ends up blocking the entire pipeline because that event is never acknowledged, so the "oldest" entry in the queue can never move beyond that point -- the rest of the queue is acknowledged but idle.

I think what I'd like in a perfect world would be three metrics:

  • something like occupied_inactive or occupied_idle representing the events that have been acknowledged but, for whatever reason, still occupy the queue. This is similar to "lag" as described above except it would ideally include out-of-order ACKs.
  • occupied_read, representing the events that have been given to a consumer but not yet ACKed (showing what proportion of the queue we're trying to send).
  • occupied_retry, representing the events that have been read and failed to send at least once.

These numbers would give clear criteria to distinguish between the three types of congestion. I'm not sure all of them should be implemented by the queue itself (I expect our ACK handling to evolve and that will affect where it makes sense to monitor such things), and maybe we can get away with fewer values... but my hope is that whatever we put here will give as much clarity as possible about the conceptual state of the queue, rather than things like segment round-off that to me feels more like an implementation detail.
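
To summarize, a sketch of how the occupied portion of the queue might break down; the names follow the list above, but the struct itself is only an assumption about how it could be reported:

```go
package queuemetrics

// occupancy splits the events currently occupying the queue into the
// proposed buckets. retry is a subset of read, so it is not counted
// separately in the total.
type occupancy struct {
	inactive uint64 // occupied_inactive: ACKed but still occupying the queue
	read     uint64 // occupied_read: handed to a consumer, not yet ACKed
	retry    uint64 // occupied_retry: read and failed to send at least once
	unread   uint64 // never handed to a consumer
}

// occupied is the total number of events currently in the queue.
func (o occupancy) occupied() uint64 {
	return o.inactive + o.read + o.unread
}
```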

@fearful-symmetry
Contributor

I really like @faec's idea of the occupied_* metrics; I think that makes a lot of sense as far as reporting the causes of congestion in the queue.

I'm not sure all of them should be implemented by the queue itself (I expect our ACK handling to evolve and that will affect where it makes sense to monitor such things), and maybe we can get away with fewer values

Yah, right now so much of the queue, and shipper, is in such an evolving state with so many missing components that I'm not really sure if it's wise to try to implement stuff like this unless we're 100% sure about the APIs and designs, which it seems like we're not. The interface is all opt values, so we can implement stuff piecemeal as needed.

@cmacknz
Member Author

cmacknz commented May 16, 2022

Yes, excellent summary @faec, thanks! There are a lot of implementation details to take into account with this.

My mental model of lag is based on Kafka topic lag, where the consumer position is based on the last committed offset of a consumer for a given topic. Committing an offset in Kafka means the consumer has marked the event with that offset as read and does not want to consume it again. It is also decoupled from the deletion time of the event, which may be much later or never.

I think this maps to the point where the event has been successfully acknowledged by the output in the beats system.

It seems to me what we're trying to surface is, "if the queue is highly congested, why?" The obvious top-level reasons are:

  • everything is "fine" but the input's throughput is higher than the output's.
  • the output's throughput is low because we can't successfully communicate with the server
  • the output's throughput is low because we are communicating with the server but some events are getting rejected

Yes, exactly. Particularly for the "everything is 'fine' but the input's throughput is higher than the output's" case, this is extremely difficult to figure out right now. For the other two you would ideally see errors in the logs, but you may not be able to tell whether they are transient or backing up the system.

My biggest question to @faec from this is what is the right point in time to implement these metrics given all the changes that are going on?

@faec
Contributor

faec commented May 16, 2022

My biggest question to @faec from this is what is the right point in time to implement these metrics given all the changes that are going on?

It would make sense to think about it once there are real ACKs flowing in the shipper (which I hope will be this week). I expect occupied_read will be pretty straightforward in both queues and could be started anytime. inactive might take longer -- the memory queue in particular can't currently see out-of-order ACKs internally, so it would require some new mechanism for its ACK bookkeeping. It could be started anytime, but it's a bigger task.

retry is hard for different reasons -- last I knew, we're expecting outputs to be responsible for retries, and if they do end up reporting when that happens I don't really think it should be to the queue; it would require significant new APIs and metadata for something that's only indirectly related to queue operation. [Related: currently, acknowledgement is all-or-nothing for whatever batch the queue provided, so until we dig into ACK cleanups we have limited granularity, especially in the memory queue.] So retry, I think, might need to wait longer, until we figure out what the contract between the shipper and its outputs is -- to me it feels more like an output metric than a queue metric.

Part of my hope for simplifying this whole process is to move the logic of batch assembly and ACK bookkeeping outside the queue. Those are the cause of most of the memory queue's complexity and rigidity. Splitting them out would let us keep the current (quite opinionated) beats pipeline model while adopting a simpler and more trackable data flow in the shipper, and if this happened it would be an excellent place to add these kinds of monitoring hooks.

@fearful-symmetry
Contributor

It would make sense to think about it once there are real ACKs flowing in the shipper (which I hope will be this week).

Agreed. I think that once the shipper is connected to some kind of external component and does something conceptually useful, a lot more of the pieces will fall into place.

Part of my hope for simplifying this whole process is to move the logic of batch assembly and ACK bookkeeping outside the queue. Those are the cause of most of the memory queue's complexity and rigidity. Splitting them out would let us keep the current (quite opinionated) beats pipeline model while adopting a simpler and more trackable data flow in the shipper, and if this happened it would be an excellent place to add these kinds of monitoring hooks.

Is there an issue/PR for this? Been slowly trying to understand how the memory queue actually works, so I'm kinda interested.

@faec
Contributor

faec commented May 17, 2022

Is there an issue/PR for this? Been slowly trying to understand how the memory queue actually works, so I'm kinda interested.

No, this was initially an independent cleanup that I was hoping to do during ON week but it got delayed by test failures in an earlier stage. But I learned enough in the process that it's probably worth writing out a plan, since one way or another we have to integrate with this code soon. I'll write up a summary of the design I had in mind.

@cmacknz
Member Author

cmacknz commented May 17, 2022

@fearful-symmetry I spoke with @faec on the best path forward for this.

occupied_read, representing the events that have been given to a consumer but not yet ACKed (showing what proportion of the queue we're trying to send).

I expect occupied_read will be pretty straightforward in both queues and could be started anytime.

I would pursue the occupied_read queue lag metric first since it should be the simplest. We just need to get something implemented and experiment with how valuable it is. We can improve the accuracy of the metrics over time as we learn more about how they behave.

I strongly suspect we will also need a metric for the age of the oldest item in the queue to go along with this. This would allow us to distinguish the case where the queue is full but events are still flowing through it (high latency) from the case where the queue is full and nothing is flowing through it (zero throughput).

For an oldest-item-in-the-queue metric, there may be some trickiness around ensuring the event always has a timestamp. At least for recent versions of beats and agent, we are always writing data streams that must have a timestamp.
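
A sketch of the oldest-item age measurement, assuming the queue records when each event was enqueued (using the enqueue time rather than the event's own @timestamp would sidestep the missing-timestamp concern, though that is an assumption about the eventual design):

```go
package queuemetrics

import "time"

// oldestAge reports how long the oldest entry has been sitting in the queue.
// A full queue with a small, stable age means events are flowing but with
// high latency; a full queue with an ever-growing age means nothing is
// flowing (zero throughput).
func oldestAge(oldestEnqueuedAt, now time.Time) time.Duration {
	if oldestEnqueuedAt.IsZero() {
		return 0 // empty queue, or enqueue time not recorded
	}
	return now.Sub(oldestEnqueuedAt)
}
```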

I would recommend creating two separate issues to track both of these metric implementations separately.

@fearful-symmetry
Contributor

@cmacknz So, my plan was to actually start with the super basic metrics, the count and limit values, and hope that once I did that, I'd feel a little more confident diving into the more complex derived metrics. I'm not sure how many of the other metrics, like "oldest in queue", are soft-blocked by other refactoring work, or by elastic/elastic-agent-shipper#27.

@cmacknz
Member Author

cmacknz commented May 17, 2022

Ah yes, I forgot about those. Then let's go with the order:

  1. count & limit
  2. occupied_read
  3. oldest item in queue

@cmacknz
Member Author

cmacknz commented Oct 13, 2022

Closing in favour of the top level shipper metrics issue: elastic/elastic-agent-shipper#11

We will simplify the implementation to just tracking the age of the oldest item in the queue and add more detail as needed.
