This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

Look into changing how the shipper receives gRPC connection config #225

Closed
Tracked by #226 ...
fearful-symmetry opened this issue Jan 19, 2023 · 4 comments
Labels
enhancement (New feature or request) · Team:Elastic-Agent (Label for the Agent team)

Comments

@fearful-symmetry
Contributor

Right now, the shipper receives its initial config in two pieces (sketched below):

  • Output config (ES, queue settings) arrives on a single output unit.
  • gRPC settings (primarily the endpoint to use for the unit socket, plus TLS) are duplicated across multiple input units.
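For concreteness, here is a minimal sketch of that split, using hypothetical Go types and field names (the real shipper structs differ):

```go
// Hypothetical types only, for illustration of the current config split.
package shipper

// OutputConfig arrives on the single output unit.
type OutputConfig struct {
	Elasticsearch map[string]interface{} // ES connection settings
	Queue         QueueConfig            // event queue settings
}

// QueueConfig holds the queue settings that ride along with the output.
type QueueConfig struct {
	MaxEvents int
}

// GRPCConfig arrives duplicated across every input unit, even though it
// is global and static for the shipper's lifetime.
type GRPCConfig struct {
	SocketAddr string     // endpoint for the unit socket
	TLS        *TLSConfig // optional TLS settings
}

// TLSConfig is a stand-in for the real TLS options.
type TLSConfig struct {
	CertFile string
	KeyFile  string
}
```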

This is a bit of an awkward process for two reasons:

  • The shipper's event pipeline operates mostly as a single component; i.e., there's no reason for us to start the gRPC endpoint before we even have the config for the queue or the output, or to shut down gRPC while we keep the output working. This means that before we start the shipper, we have to wait for the config to arrive in pieces and synchronize events across an otherwise asynchronous event channel (see the sketch after this list).
  • We use the input units in a rather unconventional way. The gRPC endpoint settings we get from the input units are global and static for the lifetime of the shipper, which is at odds with how the elastic-agent client API works, where the config attached to a unit has a lifetime explicitly stated by the unit's expected state; the shipper must take a "set and forget" approach to the gRPC config, which clashes with how the V2 API otherwise works.
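A minimal sketch of the startup synchronization this forces, with hypothetical type and function names:

```go
package shipper

// Hypothetical stand-ins for the two config halves and the unit stream.
type outputConfig struct{ raw map[string]interface{} }
type grpcConfig struct{ raw map[string]interface{} }

type unitKind int

const (
	outputUnit unitKind = iota
	inputUnit
)

type unitUpdate struct {
	Kind   unitKind
	Output outputConfig
	GRPC   grpcConfig
}

// waitForInitialConfig blocks until both halves of the pipeline config have
// arrived: the cross-unit synchronization the current design forces on startup.
func waitForInitialConfig(units <-chan unitUpdate) (outputConfig, grpcConfig) {
	var out outputConfig
	var grpc grpcConfig
	haveOut, haveGRPC := false, false
	for u := range units {
		switch u.Kind {
		case outputUnit:
			out, haveOut = u.Output, true
		case inputUnit:
			// Every input unit repeats the same global gRPC settings,
			// so the first one seen is sufficient.
			grpc, haveGRPC = u.GRPC, true
		}
		if haveOut && haveGRPC {
			break
		}
	}
	return out, grpc
}
```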

There are two ways to make this a little cleaner:

  • Get all the event pipeline config from the output unit. This makes the shipper init code less complex and makes it clearer which unit is responsible for the state of the shipper's event pipeline. Input configs would still be used for things like processors.
  • If we really thought the shipper's main event pipeline needed to be spread across multiple units, the gRPC settings should have their own dedicated unit. That would at least let us use the V2 API in a more orthodox way, as opposed to grabbing an arbitrary input unit and treating its config as a static value.

This will probably generate some discussion on whether the shipper should receive pipeline config across multiple units. As far as I can tell, there's only one use case for this: restarting the output while the server and queue remain up. Right now the shipper server isn't designed to operate like this, and I'm not sure how we should prioritize such a feature. It also doesn't necessarily require multiple units, as the shipper could just diff the subsections of the config to see if only the output has changed (see the sketch below). Spreading the shipper pipeline config across two units also raises a number of state questions: should we shut down the queue and output while the gRPC connection remains up? Does a failure of the queue/output also imply that gRPC is in a failed state, since no events can be ACKed? From the context of error reporting, will a user find this distinction meaningful?
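A sketch of that diffing approach, assuming a hypothetical single-unit config shape and restart helpers:

```go
package shipper

import "reflect"

// pipelineConfig sketches the single-unit alternative: the whole pipeline
// config delivered as one blob with named subsections.
type pipelineConfig struct {
	GRPC   map[string]interface{}
	Queue  map[string]interface{}
	Output map[string]interface{}
}

// applyUpdate diffs the subsections and restarts only what changed, so an
// output-only change can leave the gRPC server and queue running without
// needing a separate unit to scope the restart.
func applyUpdate(prev, next pipelineConfig) {
	outputChanged := !reflect.DeepEqual(prev.Output, next.Output)
	restChanged := !reflect.DeepEqual(prev.GRPC, next.GRPC) ||
		!reflect.DeepEqual(prev.Queue, next.Queue)

	switch {
	case outputChanged && !restChanged:
		restartOutput(next.Output) // server and queue stay up
	case restChanged:
		restartPipeline(next) // anything else changed: full restart
	}
}

func restartOutput(out map[string]interface{}) { /* hypothetical */ }
func restartPipeline(cfg pipelineConfig)       { /* hypothetical */ }
```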

@cmacknz
Member

cmacknz commented Jan 24, 2023

> As far as I can tell, there's only one use case for this: restarting the output while the server and queue remain up.

I think you are missing a reason, and it is that status is reported per unit. The current set of units provided to the shipper is perhaps awkward from the perspective of configuring the shipper, but seems ideal for reporting status back to the agent and user.

We can report the state of the output (ES, LS, Kafka) separately, which we always need, but we could also report the state of each individual input's connection to the shipper. This lets us report only the unit whose connection has timed out as failed, rather than reporting the shipper gRPC server as a whole as failed.

For status reporting reasons I think we always want at least two units: one for the input side of the shipper and one for the output side. As you suggested, we could combine the input units together, but I would only do this if we are convinced it is not valuable or possible to report the status of each expected input connection back to the agent. I, at least, am not convinced this has no value, because quickly scanning the state.yaml file in the diagnostics or the output of elastic-agent status for errors is far easier than searching every log line to find what may be wrong.
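A sketch of that per-input reporting, using illustrative state names rather than the real elastic-agent-client API:

```go
package shipper

import "fmt"

// unitState mirrors the agent's per-unit state vocabulary (HEALTHY,
// DEGRADED, FAILED); the names here are illustrative, not the real API.
type unitState string

const (
	stateHealthy unitState = "HEALTHY"
	stateFailed  unitState = "FAILED"
)

// reportInputStates shows the benefit of one unit per input: a single
// timed-out connection marks only its own unit FAILED, while the gRPC
// server and the remaining inputs keep reporting HEALTHY.
func reportInputStates(connected map[string]bool, report func(unitID string, s unitState, msg string)) {
	for unitID, ok := range connected {
		if ok {
			report(unitID, stateHealthy, "connected and publishing")
		} else {
			report(unitID, stateFailed, fmt.Sprintf("input %s: connection timed out", unitID))
		}
	}
}
```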

> should we shut down the queue and output while the gRPC connection remains up?

The shipper's only job is to be a queue and publish events to an output, so the majority of output failures should be transient; we should let the queue fill up and report the output unit as FAILED or DEGRADED.

> Does a failure of the queue/output also imply that gRPC is in a failed state, since no events can be ACKed?

In my opinion the gRPC state is only about connectivity. Is an input successfully authenticated, connected, and trying to publish events? If yes, then the gRPC interface is healthy regardless of whether the events are accepted into the queue, and the output unit can capture the rest.
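A sketch of that rule, with hypothetical types:

```go
package shipper

// connState is a hypothetical per-connection view: the gRPC unit's health
// depends only on authentication and connectivity, never on whether the
// queue or output downstream is accepting events.
type connState struct {
	authenticated bool
	connected     bool
}

// grpcHealthy reports the gRPC interface as healthy whenever every input
// is authenticated and connected; backpressure from a failed output is
// reported on the output unit instead.
func grpcHealthy(conns []connState) bool {
	for _, c := range conns {
		if !c.authenticated || !c.connected {
			return false
		}
	}
	return true
}
```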

> From the context of error reporting, will a user find this distinction meaningful?

As developers, I believe we will. Fine-grained status reporting makes the state of the agent much easier to debug, as you can scan a single file and see what the current error condition is.

@cmacknz
Member

cmacknz commented Jan 24, 2023

The team spoke today and our current conclusion is that we want to simplify to a single input unit that represents the shipper's gRPC server. That would leave the shipper with one input unit (for the gRPC server) and one output unit (for ES/LS/Kafka).

@pierrehilbert
Contributor

pierrehilbert commented May 4, 2023

@cmacknz / @leehinman could we close this one with the new shipper plan?

@cmacknz
Member

cmacknz commented May 4, 2023

This can be closed. The new design works with the existing configuration strategy. elastic/beats#35135 (comment)

We can open a follow up if this needs to change later.

@cmacknz cmacknz closed this as completed May 4, 2023