
Add a proxy queue to avoid double-queueing every event when using the shipper output #34377

Merged: 35 commits merged into elastic:main on Feb 23, 2023

Conversation

@faec (Contributor) commented Jan 24, 2023

Add a "proxy queue" which tracks acknowledgment callbacks for events but does not let events accumulate or keep its own copy of the event data for batches that have been read. This is for use in the shipper output, since events sent to it will be sent and queued in the shipper, and thus don't need to be queued in Beats as well while they wait for upstream acknowledgment.

This also includes significant changes to the shipper output, since there were various race conditions and bugs that interfered with the chain of acknowledgments, and we need precise handling to make sure the events are freed without losing the acknowledgment data.

Currently this change is internal-only; there will be a followup PR to enable the proxy queue when the shipper output is active.

Resolves elastic/elastic-agent-shipper#97

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@faec faec self-assigned this Jan 24, 2023
@botelastic botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) Jan 24, 2023
mergify bot commented Jan 24, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏 To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch (/d is the digit)

@pierrehilbert pierrehilbert added the Team:Elastic-Agent label (Label for the Agent team) Jan 25, 2023
@botelastic botelastic bot removed the needs_team label (Indicates that the issue/PR needs a Team:* label) Jan 25, 2023
@faec faec marked this pull request as ready for review January 25, 2023 20:12
@faec faec requested a review from a team as a code owner January 25, 2023 20:12
@faec faec requested review from cmacknz and leehinman and removed request for a team January 25, 2023 20:12
@elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@faec faec changed the title from "(Draft) Add a proxy queue to avoid double-queueing every event when using the shipper output" to "Add a proxy queue to avoid double-queueing every event when using the shipper output" Jan 25, 2023
@elasticmachine (Collaborator) commented Jan 25, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-02-23T21:29:07.535+0000

  • Duration: 67 min 40 sec

Test stats 🧪

Test Results
  • Failed: 0
  • Passed: 25983
  • Skipped: 1962
  • Total: 27945

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@faec faec requested a review from fearful-symmetry January 25, 2023 20:47

// ackListener's only job is to listen to the persisted index RPC stream
// and forward its values to the ack worker.
func (s *shipper) ackListener(ctx context.Context) error {
Contributor:

Is there a reason to have a dedicated little listener-thread-thing that just forwards events from that RPC stream? Are we just trying to make the select statement in ackWorker cleaner?

Contributor Author:

I can go into more detail in the shipper sync, but yes: because we can't directly select on the result of this call, making it commits us to a 30+ second window where we can't handle signals from publish calls, which would mean using a very large channel buffer to avoid spurious blocking (there's no particular limit on how many batches could go through, and they send fast). I'm unhappy with the extra goroutine conceptually, but it is cheap and robust to bad scheduling and other config interactions. What I'd really like is to fix the shipper API so this is all unnecessary, so let's talk about that at the sync...
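
For illustration, a rough sketch of the pattern being discussed (hypothetical names and fields, not the exact shipper code): a small listener goroutine blocks on the long-lived RPC stream and forwards each value over a channel, so the ack worker's select loop stays responsive to publish and shutdown signals without needing a huge buffer.

```go
// Hypothetical sketch: the listener is the only goroutine that blocks on the
// RPC stream; it forwards persisted indexes to the worker over a channel.
func (s *shipper) ackListener(ctx context.Context, indexes chan<- uint64) error {
	for {
		reply, err := s.ackStream.Recv() // may block for 30+ seconds
		if err != nil {
			return fmt.Errorf("acknowledgment stream closed: %w", err)
		}
		select {
		case indexes <- reply.PersistedIndex:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// The worker can now select over shutdown, acknowledgments, and newly
// published batches without ever blocking on the stream itself.
func (s *shipper) ackWorker(ctx context.Context, indexes <-chan uint64) {
	for {
		select {
		case <-ctx.Done():
			return
		case idx := <-indexes:
			s.handlePersistedIndex(idx) // hypothetical helper
		case batch := <-s.pendingBatches: // hypothetical channel of published batches
			s.trackBatch(batch) // hypothetical helper
		}
	}
}
```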

// specific language governing permissions and limitations
// under the License.

package proxyqueue
Contributor:

I assume something is supposed to be in the README.md file? Regardless, we may want some kind of package-level comment here describing what each queue type does, since there are a lot of them now...

@leehinman (Contributor) left a comment:

Looking good, I'm OK with the overall direction/design. A couple of things I think we need before merging:

  • Readme filled in with design & purpose of queue
  • diagram of how broker works
  • diagram of data structure backing the queue
  • more tests, especially around partial acks of batches
  • a fix for propagating ackLoop errors in the shipper client, or at least an issue filed to track it

entries []queueEntry

// Original number of entries (persists even if entries are freed).
entryCount int
Contributor:

I think this needs a rename to make it more explicit that this is the original count.
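
For example, something along these lines (the name here is purely illustrative, not necessarily what the PR settled on):

```go
// originalEntryCount is the number of entries the batch was created with;
// it persists even after the entries themselves are freed, so the producer
// acknowledgment count stays correct.
originalEntryCount int
```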

// try to reconnect.
// (Note: this case would be much easier if the persisted index RPC
// were not a stream.)
s.log.Errorf("acknowledgment listener stopped: %s", err)
Contributor:

Maybe we should store the status of the ackLoop in the shipper struct. That way we can check it before a publish.

Contributor Author:

I handled this by instead calling s.Close(), which will produce an error the next time Publish is called.
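
A minimal sketch of that approach, under the assumption (hypothetical names, not the exact output code) that the shipper tracks a closed flag which Publish checks before doing any work:

```go
package shipperexample

import (
	"errors"
	"sync"
)

// shipper here is a simplified stand-in for the real shipper output.
type shipper struct {
	mu     sync.Mutex
	closed bool
}

// Close is called when the ack listener stops; after it returns, any further
// Publish call fails instead of queueing events whose acks can never arrive.
func (s *shipper) Close() error {
	s.mu.Lock()
	s.closed = true
	s.mu.Unlock()
	return nil
}

func (s *shipper) Publish(events []interface{}) error {
	s.mu.Lock()
	closed := s.closed
	s.mu.Unlock()
	if closed {
		return errors.New("shipper output is closed")
	}
	// ...send the batch to the shipper service here...
	return nil
}
```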

@faec (Contributor Author) commented Feb 2, 2023

@leehinman Makes sense for the most part, but note that there's no mechanism for partial acks of batches. (This is mostly true in the other queues as well -- "partial acks" are implemented in the outputs and don't propagate back to the queue or producer until the full batch has been processed -- but it's especially true here, since in the shipper output the publish call itself blocks on the batch it's given rather than retrying through the pipeline.)

@leehinman (Contributor) left a comment:

Code LGTM. Thank you for adding the README and diagram; those are very helpful.

Requesting a few more tests around the done channel, plus a suggested fix to make the rendered SVG more robust.


queueReader {
explanation: |md
`queueReader` is a worker that reads raw batches (satisfying the<br>
Contributor:

Suggested change:
- `queueReader` is a worker that reads raw batches (satisfying the<br>
+ `queueReader` is a worker that reads raw batches (satisfying the

Replace "<br>" with 2 spaces at the end of the line. Annoying, but neither GitHub nor Firefox will render the SVG that is produced with "<br>". This needs to be done for all the "<br>" tags.

@alixander commented Mar 6, 2023:

\ also works btw (https://commonmark.org/help/tutorial/03-paragraphs.html#:~:text=For%20a%20line%20break%2C%20add,the%20end%20of%20the%20line.)

or <br /> (they won't render it because <br> is not semantic xml)

return fmt.Errorf("timed out waiting for acknowledgments: have %d, wanted %d", l.ackedCount, targetCount)
}
}
}
Contributor:

Can you add a few more tests around shutdown, e.g. ensuring that if the queue is shutting down we don't publish?

Contributor Author:

I will think about what tests might be appropriate, but:

  • There's on some level a fundamental nondeterminism: it's always possible for the shutdown signal to come in simultaneously with a publish request, and what breaks the tie isn't which was sent first but which the queue loop receives first (which we can't tell from outside). We can be certain that no publish requests will go through after Close returns (or... at least now we can, since I just added the mistakenly-omitted Wait call in response to this comment 😅), but we can't make any guarantees based on the close channel alone, or while Close is still in progress. I can, however, add tests to make sure that write-after-close fails. Sadly, though:
  • queue.Close is never called during Beats shutdown, so it's nice to know that it would work correctly if we used it, but as of right now none of the queues are ever shut down properly.

Contributor:

A test that shows write-after-close fails would satisfy me. I think that goes a long way toward exercising that code path.
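
A minimal, self-contained sketch of what such a write-after-close test could look like (the queue and producer here are simplified stand-ins, not the real proxy queue API):

```go
package proxyqueue

import "testing"

// toyQueue is a stand-in for the proxy queue: Close marks it unusable and
// Publish reports failure once the queue has been closed.
type toyQueue struct {
	closed chan struct{}
}

func newToyQueue() *toyQueue { return &toyQueue{closed: make(chan struct{})} }

func (q *toyQueue) Close() error {
	close(q.closed)
	return nil
}

func (q *toyQueue) Publish(event interface{}) bool {
	select {
	case <-q.closed:
		return false
	default:
		// a real queue would hand the event to its run loop here
		return true
	}
}

func TestPublishAfterCloseFails(t *testing.T) {
	q := newToyQueue()
	if err := q.Close(); err != nil {
		t.Fatalf("closing queue: %v", err)
	}
	if q.Publish("some event") {
		t.Fatal("expected publish after Close to fail")
	}
}
```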

@faec faec merged commit 4164cf6 into elastic:main Feb 23, 2023
@faec faec deleted the proxy-queue branch February 23, 2023 22:40
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
Labels: Team:Elastic-Agent (Label for the Agent team)

Linked issue: Beats shipper output shouldn't keep pending events in the queue