Report detailed queue metrics #31113
Comments
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Is there a reason we're linking this back to the shipper? Is the plan to more or less import the memory queue into the shipper? Also, maybe a dumb question, but is this getting wrapped into whatever we have planned for APM-style tracing in the shipper, or is this considered a distinct metrics/debugging component that just serves the queues?
Yes, I expect we'll import the beats queues initially which allows us to improve them for both projects. I don't see a reason to do otherwise yet.
The second one. It is related in that it improves our observability, but you can't get queue/consumer lag via traces for example. You have to instrument the queue itself to obtain this. I want to start with these metrics to see how much value we can get out of just having the queue lag metric for detecting back pressure and input/output rate mismatches without tracing. It might make sense to prioritize being able to visualize the metrics from the logs over adding tracing everywhere initially.
@cmacknz
Yes, they should be available under
@joshdover Do we need a feature request (ER) for Fleet to have the beats HTTP endpoint stats option enabled in the policy?
@faec with all the changes merged, should we actually begin work on implementing the interface we got in #31289? Also, when we merged #31289 one thing @faec and I weren't sure about (and thus didn't add) was the concept of queue lag, since the way we both understood it, it seemed equivalent to the queue size, as it was just the difference between the newest versus oldest event in the queue. @cmacknz can you elaborate a bit more on what that should look like versus the queue size/queue event count?
It depends on how queue size is defined. The queue lag is the difference between a consumer's position in the queue and the position of the newest item in the queue. In the case where there is only ever a single consumer, and items are immediately deleted from the queue when they are written to the output, then the queue size and the lag are the same thing. The two cases I can think of where queue lag and queue size would diverge are:
I would think the queue size is probably equal to the lag for the memory queue but not the disk queue.
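To make the distinction concrete, here is a minimal sketch assuming a queue that tracks three positions; all of the names below are hypothetical and not taken from the beats implementation:

```go
package main

import "fmt"

// queuePositions is a hypothetical view of a queue's internal positions; the
// field names are illustrative and do not come from the beats queue code.
type queuePositions struct {
	newestWritten  uint64 // position of the most recently produced event
	oldestRetained uint64 // position of the oldest event not yet deleted
	consumerRead   uint64 // position of the next event the consumer will read
}

// size counts every event still held by the queue, including events already
// consumed and acknowledged but not yet deleted (e.g. a disk queue segment
// awaiting cleanup).
func (p queuePositions) size() uint64 { return p.newestWritten - p.oldestRetained }

// lag counts only the distance between the consumer and the newest event.
func (p queuePositions) lag() uint64 { return p.newestWritten - p.consumerRead }

func main() {
	p := queuePositions{newestWritten: 1000, oldestRetained: 900, consumerRead: 980}
	fmt.Println("size:", p.size(), "lag:", p.lag()) // size: 100 lag: 20
}
```

Under this framing, size and lag only diverge when deletion trails consumption, which is exactly the disk queue segment case mentioned above.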
@faec ^ does that make sense? I know we removed the ability to have more than one consumer, but I'm not sure how much of a possibility there is for that kind of lag between acking(?) and deleting an event.
Well, we didn't really remove any abilities -- there can still be multiple "consumers" in multiple threads, but now they can all make their polling calls against the same queue object instead of needing to allocate additional helpers. This was possible because the consumers carried no state of their own (although that wasn't obvious from the previous interface).

I think @cmacknz's description of lag makes more sense to me, though it's relying on some rather subtle order-of-operations for a user-visible metric, and it might be hard to interpret in isolation. As one example, besides asynchronous event deletion there is also a gap between the "position" as in the next event that will be read by a consumer and "position" as in the most recent fully-acknowledged event (which lags behind the former, sometimes by a lot), and this would be missed by just looking at events pending deletion. There are three states a queue event could be in -- never been read, read but awaiting ack, and acked but awaiting cleanup. (Maybe four if we also distinguish events that have been read but not yet sent from events awaiting retry after a failed send.)

The event cleanup issue certainly applies to the disk queue, but arguably also to the memory queue, since out-of-order ACKs are possible but will not result in event cleanup until every event prior to it is also acknowledged. In this case the metrics might look like the entire queue is still awaiting acknowledgment when really there is just one stuck event. Tracking out-of-order ACKs would require some work in both queues, and might not be worth it; but explaining why we don't track it when we describe a metric as excluding already-acked events also doesn't sound fun.

I wonder if a different framing would make sense. What questions are we trying to answer with the metric that aren't already answered by occupied vs total size? Occupied vs total can tell us roughly how constrained the throughput is relative to the input volume. But knowing whether the last 5-10% has already been acknowledged and is awaiting cleanup (as with a segment file) doesn't change that much -- those numbers will basically always look the same. The main failure mode it might detect is "nothing has been acknowledged ever" (and even that would require other metrics to confirm since all segment files start out that way).

It seems to me what we're trying to surface is, "if the queue is highly congested, why?" The obvious top-level reasons are:
An example from a current SDH is a Kafka case where consistent failure and retry of one event ends up blocking the entire pipeline because that event is never acknowledged, so the "oldest" entry in the queue can never move beyond that point -- the rest of the queue is acknowledged but idle. I think what I'd like in a perfect world would be three metrics:
These numbers would give clear criteria to distinguish between the three types of congestion. I'm not sure all of them should be implemented by the queue itself (I expect our ACK handling to evolve and that will affect where it makes sense to monitor such things), and maybe we can get away with fewer values... but my hope is that whatever we put here will give as much clarity as possible about the conceptual state of the queue, rather than things like segment round-off that to me feel more like implementation details.
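The concrete list of three metrics did not survive extraction above, but the three event states described earlier (never read, read but awaiting ACK, acked but awaiting cleanup) suggest one possible shape. Everything below is a hypothetical sketch, not an existing interface:

```go
package queuemetrics

// queueStateCounts is one hypothetical way to expose the three conceptual
// event states discussed above. None of these fields exist in the queues today.
type queueStateCounts struct {
	Unread          int // produced but not yet read by any consumer
	AwaitingACK     int // read (and possibly sent) but not yet acknowledged
	AwaitingCleanup int // acknowledged but not yet deleted (e.g. segment round-off)
}

// congestionHint gives a rough, human-readable reason for a congested queue,
// loosely matching the failure modes described in the discussion.
func (c queueStateCounts) congestionHint() string {
	switch {
	case c.AwaitingACK > 0 && c.Unread == 0:
		return "output is not acknowledging events (stalled or stuck retry)"
	case c.Unread > 0 && c.AwaitingACK == 0 && c.AwaitingCleanup == 0:
		return "nothing has been read yet (output idle or misconfigured)"
	default:
		return "input volume exceeds output throughput"
	}
}
```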
I really like @faec's idea of the
Yah, right now so much of the queue, and shipper, is in such an evolving state with so many missing components that I'm not really sure if it's wise to try to implement stuff like this unless we're 100% sure about the APIs and designs, which it seems like we're not. The interface is all
Yes, excellent summary @faec, thanks! There are a lot of implementation details to take into account with this. My mental model of lag is based on Kafka topic lag, where the consumer position is based on the last committed offset of a consumer for a given topic. Committing an offset in Kafka means the consumer has marked the event with that offset as read and does not want to consume it again. It is also decoupled from the deletion time of the event, which may be much later or never. I think this maps to the point where the event has been successfully acknowledged by the output in the beats system.
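For reference, the Kafka notion of lag reduces to a simple offset difference; in the mapping described here, the "committed" position would correspond to the last event acknowledged by the output. A conceptual sketch (not beats or Kafka client code):

```go
package queuemetrics

// consumerLag follows the Kafka definition: the distance between the newest
// offset in the partition (the high watermark) and the consumer's last
// committed offset. Retention/deletion of old events does not affect it.
func consumerLag(highWatermark, committedOffset int64) int64 {
	return highWatermark - committedOffset
}
```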
Yes, exactly. Particularly for the "everything is fine, but the input's throughput is higher than the output's" case, this is extremely difficult to figure out right now. For the other two you would ideally see errors in the logs, but you may not be able to tell whether they are transient or backing up the system. My biggest question to @faec from this is what is the right point in time to implement these metrics given all the changes that are going on?
It would make sense to think about it once there are real ACKs flowing in the shipper (which I hope will be this week). I expect
Part of my hope for simplifying this whole process is to move the logic of batch assembly and ACK bookkeeping outside the queue. Those are the cause of most of the memory queue's complexity and rigidity. Splitting them out would let us keep the current (quite opinionated) beats pipeline model while adopting a simpler and more trackable data flow in the shipper, and if this happened it would be an excellent place to add these kinds of monitoring hooks.
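As a very rough illustration of where such hooks might live once batch assembly and ACK bookkeeping sit outside the queue, here is a hypothetical observer interface; none of these names exist in the shipper or beats today:

```go
package queuemetrics

import "time"

// ackObserver is a hypothetical hook an ACK-bookkeeping component could call
// as events move through their lifecycle, keeping metric collection out of
// the queue implementation itself.
type ackObserver interface {
	OnPublish(position uint64, ts time.Time) // event written to the queue
	OnRead(position uint64)                  // event handed to the output
	OnACK(position uint64)                   // output confirmed delivery
	OnDelete(position uint64)                // event removed from storage
}
```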
Agreed. I think that once the shipper is connected to some kind of external component and does something conceptually useful, a lot more of the pieces will fall into place.
Is there an issue/PR for this? Been slowly trying to understand how the memory queue actually works, so I'm kinda interested.
No, this was initially an independent cleanup that I was hoping to do during ON week but it got delayed by test failures in an earlier stage. But I learned enough in the process that it's probably worth writing out a plan, since one way or another we have to integrate with this code soon. I'll write up a summary of the design I had in mind.
@fearful-symmetry I spoke with @faec about the best path forward for this.

I would pursue the

I strongly suspect we will also need a metric for the age of the oldest item in the queue to go along with this. This is to allow us to distinguish the case where the queue is full, but events are flowing through it (there is high latency), from the case where the queue is full but nothing is flowing through it (zero throughput).

For an oldest-item-in-the-queue metric, there may be some trickiness around ensuring the event always has a timestamp. At least for recent versions of beats and agent, we are always writing data streams that must have a timestamp.

I would recommend creating two separate issues to track both of these metric implementations separately.
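A minimal sketch of the oldest-item-age idea, assuming the queue can surface the timestamp of its oldest retained event (the function and its inputs are made up for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// oldestEventAge reports how long the oldest event has been sitting in the
// queue. A full queue with a small, stable age means events are flowing but
// with high latency; a full queue with a steadily growing age means nothing
// is moving through at all.
func oldestEventAge(oldestTimestamp, now time.Time) time.Duration {
	if oldestTimestamp.IsZero() {
		return 0 // queue is empty, or the event carried no timestamp
	}
	return now.Sub(oldestTimestamp)
}

func main() {
	oldest := time.Now().Add(-45 * time.Second)
	fmt.Println("oldest event age:", oldestEventAge(oldest, time.Now()))
}
```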
@cmacknz so, my plan was to actually start with the super basic metrics, the
Ah yes, I forgot about those. Then let's go with the order:
Closing in favour of the top level shipper metrics issue: elastic/elastic-agent-shipper#11. We will simplify the implementation to just tracking the age of the oldest item in the queue and add more detail as needed.
The memory and disk queues implemented by beats report minimal metrics about their state, which makes it difficult to understand how the queues are changing over time. We should expand the set of queue metrics that are reported, as the queue is an excellent place to observe whether the beats system is experiencing backpressure.
For both queue types we should report at least the following additional metrics:
Of those metrics, the queue lag is the most useful for diagnosing backpressure. In systems built with Apache Kafka the equivalent consumer lag measurement is a key performance indicator of the system. See example uses and graphs of consumer lag here: https://www.confluent.io/blog/kafka-lag-monitoring-and-metrics-at-appsflyer/ (only slightly relevant; ideally we want to easily generate and obtain the same types of graphs for the beats pipeline).
Queue or consumer lag trending up indicates the output cannot keep up with the produced data volume. Ideally we want the ability to easily graph this value over time, but having the instantaneous value is a starting point. Perhaps we can try to calculate the derivative as a metric to see the trend more easily. This should help address #7743, but for all beats, since it depends only on the queue state.
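A sketch of the derivative idea, computing a lag growth rate from two periodic samples; the sampling plumbing is assumed, not existing beats code:

```go
package main

import (
	"fmt"
	"time"
)

// lagSample is one periodic observation of the queue (or consumer) lag.
type lagSample struct {
	lag int64
	at  time.Time
}

// lagRate returns the change in lag per second between two samples. A
// persistently positive value means the output is falling behind the input.
func lagRate(prev, cur lagSample) float64 {
	dt := cur.at.Sub(prev.at).Seconds()
	if dt <= 0 {
		return 0
	}
	return float64(cur.lag-prev.lag) / dt
}

func main() {
	prev := lagSample{lag: 1000, at: time.Now().Add(-30 * time.Second)}
	cur := lagSample{lag: 2500, at: time.Now()}
	fmt.Printf("lag growth: %.1f events/sec\n", lagRate(prev, cur)) // ~50 events/sec
}
```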
The metrics should be periodically published into the beat log, as is currently done here: https://github.com/elastic/beats/blob/main/libbeat/monitoring/report/log/log.go#L188. See #24700 for an example of how to add new beats metrics.
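For illustration, a standard-library-only sketch of periodic reporting; a real change would register these values with libbeat's monitoring registry and reuse the existing periodic logger linked above, so treat the names here as placeholders:

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// Placeholder gauges; in beats these would be registered metrics rather than
// package-level atomics.
var (
	queueLag      atomic.Int64
	oldestAgeSecs atomic.Int64
)

// reportLoop periodically writes the current values to the log, mirroring the
// cadence of the periodic metrics log line linked above.
func reportLoop(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			log.Printf("queue metrics: lag=%d oldest_age_s=%d",
				queueLag.Load(), oldestAgeSecs.Load())
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go reportLoop(30*time.Second, stop)
	// ... the queue would update queueLag / oldestAgeSecs as events move ...
	time.Sleep(50 * time.Millisecond)
	close(stop)
}
```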
We can split this issue up for each of the two queue types if needed.
Note: The existing beats have an active counter (https://github.com/elastic/beats/blob/main/libbeat/publisher/pipeline/monitoring.go#L99) and a done/published counter. The difference between these two might give you a similar measurement to the queue lag described above. However, where the existing metrics are incremented and decremented will likely need to change when they are used in the shipper.