Define a queue metrics reporter interface #31289

Merged
fearful-symmetry merged 9 commits into elastic:main on Apr 15, 2022

Conversation

fearful-symmetry
Contributor

This is the result of a brief conversation with @faec about how to add metrics reporting to the queue. After going over some of the requirements in #31113, I just made a first attempt at defining an interface for what metrics would actually be reported.

I added a simple "reporter" to the Queue interface, but I'm not sure we actually want to do it that way. Whatever calls that Metrics() function will have to be a metrics reporter that lives in its own thread, so maybe the Queue interface should return some kind of handler or something, but I'm not sure.
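
As a rough illustration of the idea (the reportLoop helper and the Metrics field shown here are hypothetical placeholders, not part of this change):

package queue

import "time"

// Metrics is a snapshot of queue state; the single field is a placeholder.
type Metrics struct {
    EventCount uint64
}

// Queue is reduced here to the one method under discussion.
type Queue interface {
    // Metrics returns a point-in-time snapshot of the queue's internal state.
    Metrics() (Metrics, error)
}

// reportLoop is a hypothetical reporter that polls the queue from its own goroutine.
func reportLoop(q Queue, interval time.Duration, done <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-done:
            return
        case <-ticker.C:
            if m, err := q.Metrics(); err == nil {
                _ = m // hand the snapshot off to the monitoring registry here
            }
        }
    }
}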

@fearful-symmetry added the Team:Elastic-Agent-Data-Plane and v8.3.0 labels on Apr 13, 2022
@fearful-symmetry self-assigned this on Apr 13, 2022
@fearful-symmetry requested a review from a team as a code owner on Apr 13, 2022 19:31
@botelastic (bot) added the needs_team label (indicates that the issue/PR needs a Team:* label) on Apr 13, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic (bot) removed the needs_team label on Apr 13, 2022
@elasticmachine
Collaborator

elasticmachine commented Apr 13, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-04-15T19:08:48.064+0000

  • Duration: 68 min 40 sec

Test stats 🧪

  • Failed: 0
  • Passed: 22317
  • Skipped: 1935
  • Total: 24252

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

//Metrics is a set of basic, user-friendly metrics that report the current state of the queue. These metrics are meant to be relatively generic and high-level, and when reported directly, can be comprehensible to a user.
type Metrics struct {
//QueueLimitCount is the count of times that the queue has reached its user-configured limit, either in terms of storage space or count of events.
QueueLimitCount uint64
Member

Suggested change:
- QueueLimitCount uint64
+ QueueFullCount uint64

The one below this is QueueIsFull so this seems more consistent. Either Limit or Full works as long as it is used consistently.

@@ -34,6 +34,21 @@ type ACKListener interface {
OnACK(eventCount int) // number of consecutively published events acked by producers
}

//Metrics is a set of basic, user-friendly metrics that report the current state of the queue. These metrics are meant to be relatively generic and high-level, and when reported directly, can be comprehensible to a user.
type Metrics struct {
Contributor

I'd like it if this were more explicit / opinionated about the resource types -- e.g. EventCount, ByteCount, EventLimit, and ByteLimit could all be fields (with zero indicating not-applicable).

I'd also rather leave off full and percent: "full" is a relative concept (a queue with 1MB free is "full" if it's 100MB and most events are 2MB; it's not "full" if it's 5MB and most events are 10KB) so let's not try to guess at this granularity. Percent can be derived for the respective types from the count and limit values.
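
A sketch of that field layout (plain uint64 fields with zero meaning "not applicable"; the later revision in this PR uses opt.Uint instead, as seen further down):

// Sketch of the suggested layout; not code from this PR.
type Metrics struct {
    EventCount uint64 // events currently held in the queue
    EventLimit uint64 // configured maximum number of events; 0 if not applicable
    ByteCount  uint64 // bytes currently held in the queue's main buffer
    ByteLimit  uint64 // configured maximum byte size; 0 if no byte limit is enforced
}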

Member

This is a good point about what full means. It seems like just having a counter for events that were rejected because they wouldn't fit is what we'd want.

Contributor Author

Yah, that's a good point about "full" too. Perhaps leave that to the metrics code to derive, and let this interface just report counts and byte values, as you suggested.
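
For example, the metrics code could derive a fill percentage on its own along these lines (a hypothetical helper, not code from this PR):

// fillPercent derives "percent full" from a count and its configured limit;
// it reports false when no limit is set, since the ratio is undefined then.
// (Hypothetical helper for illustration.)
func fillPercent(count, limit uint64) (float64, bool) {
    if limit == 0 {
        return 0, false
    }
    return float64(count) / float64(limit) * 100, true
}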

Member
@cmacknz Apr 14, 2022

I think that makes sense. The queue itself doesn't care whether something can fit into it; that is a concern of whatever is using the Producer interface.
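
As a hedged sketch of that producer-side tracking -- the Producer interface here is a simplification rather than the real libbeat interface, and countingProducer is purely hypothetical:

package queue

import "sync/atomic"

// Producer is a deliberately simplified stand-in for the real producer interface.
type Producer interface {
    // TryPublish returns false when the event would not fit in the queue.
    TryPublish(event interface{}) bool
}

// countingProducer wraps a Producer and counts rejected publish attempts.
type countingProducer struct {
    inner    Producer
    rejected uint64
}

func (p *countingProducer) TryPublish(event interface{}) bool {
    ok := p.inner.TryPublish(event)
    if !ok {
        atomic.AddUint64(&p.rejected, 1)
    }
    return ok
}

// Rejected reports how many events were turned away because the queue was full.
func (p *countingProducer) Rejected() uint64 {
    return atomic.LoadUint64(&p.rejected)
}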

@@ -34,6 +34,21 @@ type ACKListener interface {
OnACK(eventCount int) // number of consecutively published events acked by producers
}

//Metrics is a set of basic, user-friendly metrics that report the current state of the queue. These metrics are meant to be relatively generic and high-level, and when reported directly, can be comprehensible to a user.
Contributor

Actually I'd rather these metrics be unambiguous and consistently implementable. Generic / high-level / comprehensible ones should ideally happen in the metrics code in a queue-agnostic way (rather than being duplicated in different queues where the semantics can subtly diverge). So my main comment below is about making the metrics more concrete and low-level 😅

Contributor Author

Yah, that's a good point, I think. When I started I was super on the fence about how much of the work of "deriving" certain values should be done by the queue, or by the metrics code.

@fearful-symmetry
Contributor Author

So, made some super basic changes based on what @faec said.

There are still a few missing pieces here that I want input on before we expand the metrics. The first is how aggressive we want to get with the metrics reported through this particular interface: there are a lot of queue-implementation-specific metrics we could report, but I'm not sure which additional metrics would make sense for both the memory and disk queues. The memory queue isn't particularly well documented, so I'm still figuring out how it works and trying to compare it to the disk queue.

A lot of metrics we might want would also require us to actually instrument the code with additional counters, as opposed to just passing along channel sizes or user configs. I'm not sure how much of that we need/want to tackle now.

I'm also not sure if we want to implement the "queue lag" and "oldest item in queue" metrics in the queue, or if the queues should pass along some kind of internal event channel to the metrics code that we can use to track what's moving through the queue at various intervals. It's easy to imagine some kind of channel that reports things like "event with this ID and timestamp just passed through the consumer/producer" down to the metrics code, but that might be more complex than having the queue figure that out itself.

@fearful-symmetry requested a review from faec on Apr 14, 2022 15:40
@cmacknz
Member

cmacknz commented Apr 14, 2022

I'm also not sure if we want to implement the "queue lag" and "oldest item in queue" metrics in the queue, or if the queues should pass along some kind of internal event channel to the metrics code that we can use to track what's moving through the queue at various intervals

One thing we will need to do is assign elements in the queue a monotonic identifier or sequence number. This will be necessary to correctly acknowledge messages between the shipper and input process (see elastic/elastic-agent-shipper#9 for this work). Once we have that, queue lag is just the difference between the sequence numbers of the newest and oldest item in the queue.

It may make sense to add those sequence numbers first if it makes the metrics easier to implement. I feel that a queue lag metric is very important, as it will allow us to immediately see if a beat or shipper is being backpressured by looking at the lag metric over time. Identifying backpressure is something we really struggle with right now.
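
Once those sequence numbers exist, the lag computation itself is trivial; a sketch (the function and parameter names are hypothetical):

// queueLag is the distance, in events, between the newest and oldest items
// still in the queue, given monotonically increasing per-event sequence numbers.
// (Hypothetical helper for illustration.)
func queueLag(oldestSeq, newestSeq uint64) uint64 {
    if newestSeq < oldestSeq { // empty queue, or counters not initialized yet
        return 0
    }
    return newestSeq - oldestSeq
}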

@fearful-symmetry
Contributor Author

Once we have that, queue lag is just the difference between the sequence numbers of the newest and oldest item in the queue.

Yah, an incrementing queue ID value would simplify a lot of this, agreed; it also means (at least to me) that deriving some of these values is simple enough that it can be done in the queue, and the metrics code shouldn't need to worry about handling some kind of event stream, or...something.

@cmacknz
Member

cmacknz commented Apr 14, 2022

I've split off adding the queue ID / sequence numbers into a separate ticket: elastic/elastic-agent-shipper#27

It seems like it will make sense to add that first since it simplifies the metrics implementation.

//ByteCount is the total byte size of the queue
ByteCount opt.Uint
//ByteLimit is the user-configured byte limit of the queue
ByteLimit opt.Uint
Member

This probably requires us to implement #30069. Doing this for the memory queue might be tricky.

I like these and think they are a much more intuitive way to think about the queue size, but I'm not convinced adding this is more important than a lot of the other work we are already doing.

Contributor Author

Yah, that's why I made all of these opt values; the thinking was that some queues might care more about bytes and others would be more event-based, so the opt type lets this one struct forward whatever metric types are available.
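
To illustrate, with a tiny local optional type standing in for libbeat's opt.Uint (the real opt API may differ):

// OptUint is a stand-in for the real optional type: a value plus an "is set" flag.
type OptUint struct {
    set   bool
    value uint64
}

func OptUintWith(v uint64) OptUint { return OptUint{set: true, value: v} }

type Metrics struct {
    EventCount OptUint // meaningful for both the memory and disk queues
    ByteCount  OptUint // disk queue only, for now
    ByteLimit  OptUint // disk queue only, for now
}

// A memory-queue-style implementation reports only event counts and leaves
// the byte fields unset.
func memQueueSnapshot(occupiedEvents uint64) Metrics {
    return Metrics{EventCount: OptUintWith(occupiedEvents)}
}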

Member

OK cool, they can just stay unimplemented then until we add support for them.

Contributor

I was thinking of these fields as referring to the queue's main buffer, which for the disk queue would mean disk usage. That data is already available in the current code; there's just no one to report it to. The memory queue could leave those fields zero for now, since it doesn't enforce a byte limit.

(We could consider adding metrics about memory overhead outside the main buffer too, but like you point out nobody tracks that yet.)

Contributor Author

@faec Do you think this is something we should merge now? Should this or a follow-up PR add a Metrics() method to the actual interfaces?

@faec
Contributor

faec commented Apr 14, 2022

It seems like it will make sense to add that first since it simplifies the metrics implementation.

I'm not sure adding that first is best, partly because I'm already working on that part and it is painfully complicated 😅 I've added an outline of the relevant state to elastic/elastic-agent-shipper#27.

It seems like there are still a fair number of things we can immediately report or get started on, so I think we should proceed with those while the memory queue is refactored into a shape where it will handle our data. Once #31307 is checked in, the ID change should become a lot more feasible.

@cmacknz
Member

cmacknz commented Apr 14, 2022

Thanks for clarifying. Let's split each of the metrics into separate issues since some of them are blocked and some of them aren't:

  • One issue for the event count and limit metrics. Unblocked.
  • One issue for the queue lag. Requires elastic-agent-shipper#27 ("The disk queue must assign an auto-incrementing ID to each element"), which is currently blocked.
  • One issue for the longest waiting item metric. Unblocked.
  • One issue for the byte count and limit metrics, noting that the disk queue can implement these reasonably but the memory queue will be left unimplemented initially. Unblocked for the disk queue.
  • One issue for the “full queue” metric describing that it needs to be tracked by the producer (for the shipper this is the gRPC handler). Probably blocked until the gRPC handlers exist.

EventLimit opt.Uint

//LongestWaitingItem is the timestamp of the oldest item in the queue.
LongestWaitingItem common.Time
Contributor

I like nouns to match up with datatypes; how about OldestActiveTimestamp?


//LongestWaitingItem is the timestamp of the oldest item in the queue.
LongestWaitingItem common.Time
//QueueLag is the difference between the consumer and producer position in the queue.
Contributor

What does "position" mean here? If it means byte ( / event) index in the queue buffer then this is very similar to ByteCount / EventCount (maybe modulo some rounding depending on how e.g. the disk queue reports events that are acknowledged but not yet deleted).

I'd rather leave this one off for now. I'd like things in the global interface to have an unambiguous meaning that would be consistently derivable for any queue, at least in principle, and I don't think this one is yet. If we could define it in language that is readily convertible to a unit test I'd feel better :-)

Contributor Author

So, my understanding is that this will be a measure of the "delay" involved in an event entering and then exiting a queue, as told by the event IDs. Since the IDs increment, if the event leaving the queue for the output has ID = 10 and the event entering has ID = 15, the QueueLag is 5.

Contributor

Hmm, but if the event leaving the queue has id 10 and the event entering has id 15, then isn't that also the same as EventCount being 5? (±1 anyway) It seems redundant with that definition.

Contributor Author

Yah, I'm probably describing it wrong, and discussion of this is spread out across like, 3 different issues, which isn't helping. From Craig's earlier comment:

queue lag is just the difference between the sequence numbers of the newest and oldest item in the queue.

Contributor

I don't know, to me that still sounds the same? I don't see how it isn't redundant with EventCount. Is there any situation where this should be different than "the number of items in the queue"?


// Leave this commented out for now so we don't break the interfaces
// Should the main Queue implementation itself expose the metrics, or some other component in the interfaces?
//Metrics() (Metrics, error)
Contributor

I think this is a reasonable place to put it.

Contributor Author

Alright. Should this PR actually add this to the interface and add some dummy methods for the queues, or should we save that for when we actually implement them?

Contributor

I'd be fine either way, as long as the comments explain the intention. Dummy methods that just return an "unimplemented" error would be fine in this PR so you can start connecting the reporting components, and we can flesh out the queue-specific implementations as more pieces get connected.
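
A minimal sketch of such a stub (the package, receiver type, and error variable here are placeholders, not the merged code):

package diskqueue

import "errors"

// ErrMetricsNotImplemented is a placeholder for the "not implemented yet" error.
var ErrMetricsNotImplemented = errors.New("queue metrics not implemented for this queue type")

// Metrics and diskQueue are declared locally so the sketch is self-contained;
// the real types live in the shared queue package and the disk queue itself.
type Metrics struct{}
type diskQueue struct{}

// Metrics satisfies the queue interface but reports nothing until the real
// implementation lands.
func (q *diskQueue) Metrics() (Metrics, error) {
    return Metrics{}, ErrMetricsNotImplemented
}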

@fearful-symmetry requested a review from faec on Apr 15, 2022 17:56
@fearful-symmetry
Contributor Author

@faec added bare methods to the queues. Hopefully we're about done.

@fearful-symmetry merged commit 296f010 into elastic:main on Apr 15, 2022
v1v added a commit to v1v/beats that referenced this pull request Apr 18, 2022
…er-tar-gz

* upstream/main: (139 commits)
  [Automation] Update elastic stack version to 8.3.0-c655cda8 for testing (elastic#31322)
  Define a queue metrics reporter interface  (elastic#31289)
  [Oracle Module] Change tablespace metricset collection period (elastic#31259)
  libbeat/reader/syslog: relax timestamp parsing to allow leading zero (elastic#31254)
  [Automation] Update elastic stack version to 8.3.0-55ba6f37 for testing (elastic#31311)
  [libbeat] Remove unused fields and functions in the memory queue (elastic#31302)
  [libbeat] Cleaning up some unneeded helper types (elastic#31290)
  Readme for kibana module (elastic#31276)
  [Automation] Update elastic stack version to 8.3.0-4be61f32 for testing (elastic#31296)
  x-pack/winlogbeat/module/routing/ingest: fix typo for channel name (elastic#31291)
  Small pipeline cleanup removing some unused data fields (elastic#31288)
  removing info log (elastic#30971)
  Simplify TLS config deserialization (elastic#31168)
  Detect new files under known paths in filestream input (elastic#31268)
  Add support for port mapping in docker hints (elastic#31243)
  Update qa-labels.yml (elastic#31260)
  libbeat: log debug for `proxy_url` and fixed docs (elastic#31130)
  [heartbeat][docs] Add note about ensuring correct index settings for uptime (elastic#31146)
  [Automation] Update elastic stack version to 8.3.0-2c8f9574 for testing (elastic#31256)
  [Filebeat] fix m365_defender pipeline bug (elastic#31227)
  ...
kush-elastic pushed a commit to kush-elastic/beats that referenced this pull request May 2, 2022
* first pass at a queue metrics reporter

* change interface

* change monitored types

* change timestamp field name

* add bare queue implementations

* make linter happy

* fix linter ignore

* remove QueueLag for now
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* first pass at a queue metrics reporter

* change interface

* change monitored types

* change timestamp field name

* add bare queue implementations

* make linter happy

* fix linter ignore

* remove QueueLag for now