This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

Look into changing how the shipper receives gRPC connection config #225

Closed
Tracked by #226 ...
fearful-symmetry opened this issue Jan 19, 2023 · 4 comments
Labels
enhancement (New feature or request) · Team:Elastic-Agent (Label for the Agent team)

Comments

@fearful-symmetry
Contributor

Right now, the shipper receives its initial config in two pieces (sketched below):

  • Output config (ES, queue settings) arrives on a single output unit.
  • gRPC settings (primarily the endpoint to use for the unit socket, plus TLS) are duplicated across multiple input units.
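For concreteness, here is a minimal sketch of that split, using hypothetical Go types and field names (the real shipper structs differ):

```go
// Hypothetical types only, for illustration of the current config split.
package shipper

// OutputConfig arrives on the single output unit.
type OutputConfig struct {
	Elasticsearch map[string]interface{} // ES connection settings
	Queue         QueueConfig            // event queue settings
}

// QueueConfig holds the queue settings that ride along with the output.
type QueueConfig struct {
	MaxEvents int
}

// GRPCConfig arrives duplicated across every input unit, even though it
// is global and static for the shipper's lifetime.
type GRPCConfig struct {
	SocketAddr string     // endpoint for the unit socket
	TLS        *TLSConfig // optional TLS settings
}

// TLSConfig is a stand-in for the real TLS options.
type TLSConfig struct {
	CertFile string
	KeyFile  string
}
```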

This is a bit of an awkward process for two reasons:

  • The shipper's event pipeline operates mostly as a single component; i.e., there's no reason for us to start the gRPC endpoint before we even have the config for the queue or the output, or to shut down gRPC while we keep the output working. This means that before we start the shipper, we have to wait for the config to arrive in pieces and synchronize events across an otherwise asynchronous event channel (see the sketch after this list).
  • We use the input units in a rather unconventional way. The gRPC endpoint settings we get from the input units are global and static for the lifetime of the shipper, which is at odds with how the elastic-agent client API works, where the config attached to a unit has a lifetime explicitly stated by the unit's expected state; the shipper must take a "set and forget" approach to the gRPC config, which clashes with how the V2 API otherwise works.
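A minimal sketch of the startup synchronization this forces, with hypothetical type and function names:

```go
package shipper

// Hypothetical stand-ins for the two config halves and the unit stream.
type outputConfig struct{ raw map[string]interface{} }
type grpcConfig struct{ raw map[string]interface{} }

type unitKind int

const (
	outputUnit unitKind = iota
	inputUnit
)

type unitUpdate struct {
	Kind   unitKind
	Output outputConfig
	GRPC   grpcConfig
}

// waitForInitialConfig blocks until both halves of the pipeline config have
// arrived: the cross-unit synchronization the current design forces on startup.
func waitForInitialConfig(units <-chan unitUpdate) (outputConfig, grpcConfig) {
	var out outputConfig
	var grpc grpcConfig
	haveOut, haveGRPC := false, false
	for u := range units {
		switch u.Kind {
		case outputUnit:
			out, haveOut = u.Output, true
		case inputUnit:
			// Every input unit repeats the same global gRPC settings,
			// so the first one seen is sufficient.
			grpc, haveGRPC = u.GRPC, true
		}
		if haveOut && haveGRPC {
			break
		}
	}
	return out, grpc
}
```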

There are two ways to make this a little cleaner:

  • Get all the event pipeline config from the output unit. This makes the shipper init code less complex and makes it clearer which unit is responsible for the state of the shipper's event pipeline. Input configs would still be used for things like processors.
  • If we really thought the shipper's main event pipeline needed to be spread across multiple units, the gRPC settings should have their own dedicated unit. That would at least let us use the V2 API in a more orthodox way, as opposed to grabbing an arbitrary input unit and treating its config as a static value.

This will probably generate some discussion on whether the shipper should receive pipeline config across multiple units. As far as I can tell, there's only one use case for this: restarting the output while the server and queue remain up. Right now the shipper server isn't designed to operate like this, and I'm not sure how we should prioritize such a feature. It also doesn't necessarily require multiple units, as the shipper could just diff the subsections of the config to see if only the output has changed (see the sketch below). Spreading the shipper pipeline config across two units also raises a number of state questions: should we shut down the queue and output while the gRPC connection remains up? Does a failure of the queue/output also imply that gRPC is in a failed state, since no events can be ACKed? From the context of error reporting, will a user find this distinction meaningful?
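A sketch of that diffing approach, assuming a hypothetical single-unit config shape and restart helpers:

```go
package shipper

import "reflect"

// pipelineConfig sketches the single-unit alternative: the whole pipeline
// config delivered as one blob with named subsections.
type pipelineConfig struct {
	GRPC   map[string]interface{}
	Queue  map[string]interface{}
	Output map[string]interface{}
}

// applyUpdate diffs the subsections and restarts only what changed, so an
// output-only change can leave the gRPC server and queue running without
// needing a separate unit to scope the restart.
func applyUpdate(prev, next pipelineConfig) {
	outputChanged := !reflect.DeepEqual(prev.Output, next.Output)
	restChanged := !reflect.DeepEqual(prev.GRPC, next.GRPC) ||
		!reflect.DeepEqual(prev.Queue, next.Queue)

	switch {
	case outputChanged && !restChanged:
		restartOutput(next.Output) // server and queue stay up
	case restChanged:
		restartPipeline(next) // anything else changed: full restart
	}
}

func restartOutput(out map[string]interface{}) { /* hypothetical */ }
func restartPipeline(cfg pipelineConfig)       { /* hypothetical */ }
```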

@cmacknz
Member

cmacknz commented Jan 24, 2023

> As far as I can tell, there's only one use case for this: restarting the output while the server and queue remain up.

I think you are missing a reason, and it is that status is reported per unit. The current set of units provided to the shipper is perhaps awkward from the perspective of configuring the shipper, but seems ideal for reporting status back to the agent and user.

We can report the state of the output (ES, LS, Kafka) separately, which we always need, but we could also report the state of each individual input's connection to the shipper. This lets us report only the unit whose connection has timed out as failed, rather than reporting the shipper gRPC server as a whole as failed.

For status reporting reasons I think we always want at least two units: one for the input side of the shipper and one for the output side. As you suggested, we could combine the input units together, but I would only do this if we are convinced it is not valuable or possible to report the status of each expected input connection back to the agent. I, at least, am not convinced this has no value, because quickly scanning the state.yaml file in the diagnostics or the output of elastic-agent status for errors is far easier than searching every log line to find what may be wrong.
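A sketch of that per-input reporting, using illustrative state names rather than the real elastic-agent-client API:

```go
package shipper

import "fmt"

// unitState mirrors the agent's per-unit state vocabulary (HEALTHY,
// DEGRADED, FAILED); the names here are illustrative, not the real API.
type unitState string

const (
	stateHealthy unitState = "HEALTHY"
	stateFailed  unitState = "FAILED"
)

// reportInputStates shows the benefit of one unit per input: a single
// timed-out connection marks only its own unit FAILED, while the gRPC
// server and the remaining inputs keep reporting HEALTHY.
func reportInputStates(connected map[string]bool, report func(unitID string, s unitState, msg string)) {
	for unitID, ok := range connected {
		if ok {
			report(unitID, stateHealthy, "connected and publishing")
		} else {
			report(unitID, stateFailed, fmt.Sprintf("input %s: connection timed out", unitID))
		}
	}
}
```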

> should we shut down the queue and output while the gRPC connection remains up?

The shipper's only job is to be a queue and publish events to an output, so the majority of output failures should be transient; we should let the queue fill up and report the output unit as FAILED or DEGRADED.

> Does a failure of the queue/output also imply that gRPC is in a failed state, since no events can be ACKed?

In my opinion the gRPC state is only about connectivity. Is an input successfully authenticated, connected, and trying to publish events? If yes, then the gRPC interface is healthy regardless of whether the events are accepted into the queue, and the output unit can capture the rest.
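A sketch of that rule, with hypothetical types:

```go
package shipper

// connState is a hypothetical per-connection view: the gRPC unit's health
// depends only on authentication and connectivity, never on whether the
// queue or output downstream is accepting events.
type connState struct {
	authenticated bool
	connected     bool
}

// grpcHealthy reports the gRPC interface as healthy whenever every input
// is authenticated and connected; backpressure from a failed output is
// reported on the output unit instead.
func grpcHealthy(conns []connState) bool {
	for _, c := range conns {
		if !c.authenticated || !c.connected {
			return false
		}
	}
	return true
}
```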

> From the context of error reporting, will a user find this distinction meaningful?

As developers, I believe we will. Fine-grained status reporting makes the state of the agent much easier to debug, as you can scan a single file and see what the current error condition is.

@cmacknz
Member

cmacknz commented Jan 24, 2023

The team spoke today and our current conclusion is that we want to simplify to a single input unit that represents the shipper's gRPC server. That would leave the shipper with one input unit (for the gRPC server) and one output unit (for ES/LS/Kafka).

@pierrehilbert
Contributor

pierrehilbert commented May 4, 2023

@cmacknz / @leehinman could we close this one with the new shipper plan?

@cmacknz
Member

cmacknz commented May 4, 2023

This can be closed. The new design works with the existing configuration strategy. elastic/beats#35135 (comment)

We can open a follow up if this needs to change later.

@cmacknz cmacknz closed this as completed May 4, 2023