This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

[Meta][Feature] Enable filebeat and metricbeat to publish data to the shipper #8

Closed
4 tasks done
Tracked by #15
cmacknz opened this issue Mar 16, 2022 · 11 comments
Labels
estimation:Week Task that represents a week of work. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team v8.4.0

Comments

@cmacknz
Member

cmacknz commented Mar 16, 2022

This is a feature meta issue to allow filebeat and metricbeat to publish data to the shipper when run under Elastic agent. All other beats are out of scope.

An output for existing beats should be implemented that publishes to the shipper gRPC interface. When the shipper gRPC output is used, the beat output pipeline should be configured to be as simple as possible. Using a per beat disk queue with the shipper is forbidden. A memory queue may be used with the shipper output, but how it should be configured by users will require careful consideration. Ideally any necessary queue configuration can be made automatic.

Removing processors from beats is out of scope for this issue. Processors will be removed in a later issue.

image

This feature is considered complete when at least the following criteria are satisfied for both filebeat and metricbeat:

  • A test exists proving data ingested by the beat is published to the shipper.
  • A test exists proving there is no data loss when the shipper process restarts while the beat is publishing.
  • A test exists proving there is no data loss when the shipper backpressures the beat (because the shipper queue is full for example).
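
The backpressure criterion could be exercised with a test along the following lines. This is only a minimal sketch: `fakeShipper`, `errQueueFull`, and `publishAll` are hypothetical names invented for illustration and do not come from the beats or shipper code. The fake shipper has a bounded queue that rejects events when full, and the "beat" retries until the event is accepted.

```go
package main

import (
	"errors"
	"fmt"
)

// errQueueFull is a hypothetical error the fake shipper returns under backpressure.
var errQueueFull = errors.New("shipper queue is full")

// fakeShipper is a stand-in for the real shipper: a bounded queue plus a
// record of everything that has been drained to the "output".
type fakeShipper struct {
	capacity int
	queue    []string
	output   []string
}

// Publish accepts an event unless the queue is full.
func (s *fakeShipper) Publish(event string) error {
	if len(s.queue) >= s.capacity {
		return errQueueFull
	}
	s.queue = append(s.queue, event)
	return nil
}

// Drain moves up to n queued events to the output, freeing queue space.
func (s *fakeShipper) Drain(n int) {
	if n > len(s.queue) {
		n = len(s.queue)
	}
	s.output = append(s.output, s.queue[:n]...)
	s.queue = s.queue[n:]
}

// publishAll plays the role of the beat: it retries a rejected event after
// the shipper has drained, so backpressure delays events but never drops them.
// It returns how many times the shipper pushed back.
func publishAll(s *fakeShipper, events []string) int {
	rejected := 0
	for _, e := range events {
		for s.Publish(e) != nil {
			rejected++
			s.Drain(1) // simulate the shipper making progress
		}
	}
	return rejected
}

func main() {
	s := &fakeShipper{capacity: 2}
	events := []string{"e1", "e2", "e3", "e4", "e5"}
	rejected := publishAll(s, events)
	s.Drain(len(s.queue))
	fmt.Println("rejected:", rejected, "delivered:", len(s.output))
}
```

A real test would replace the fake with a running shipper process and compare the events read from its output against the events the beat ingested.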

The assignee of this issue is expected to create the development plan with all child issues for this feature. The following set of tasks should be included in the initial issues at a minimum:

  • Creating a beats output that publishes to the shipper gRPC interface.
  • Defining a standard configuration for using a beat with the shipper that the control plane can easily apply: processors disabled, queues disabled, etc.
  • Creating an integration test suite for the beat and shipper interactions.

UPD by @rdner

I split this into the following steps:

@cmacknz cmacknz changed the title [META][Feature] Enable filebeat and metricbeat to publish data to the shipper [Meta][Feature] Enable filebeat and metricbeat to publish data to the shipper Mar 18, 2022
@jlind23 jlind23 added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team v8.3.0 labels Mar 21, 2022
@rdner
Member

rdner commented Mar 30, 2022

Creating a beats output that publishes to the shipper gRPC interface.

@cmacknz I'm a bit confused about this sentence.

When we talked 1 on 1, we agreed that during the very first iteration the gRPC server will be one of the output options along with Elasticsearch, File output, Kafka, etc.

Later at the team call I asked the same question to widen the discussion circle but then you answered something different about having a feature flag and switching some logic in the code.

I think we have some miscommunication about this.

I see 2 options for approaching this task:

Option 1

We have it as a new experimental output type which we could configure like this:

output:
  shipper:
    server: "localhost:50051" # The server address in the format of host:port
    tls: true # Connection uses TLS if true, else plain TCP
    ca_file: "/home/cert" # The file containing the CA root cert
    server_host_override: "x.test.example.com" # The server name used to verify the hostname returned by the TLS handshake

This can be achieved with the following steps:

  1. We create a new package shipper in here https://github.com/elastic/beats/tree/main/libbeat/outputs (or perhaps in elastic-agent-libs)
  2. We implement the Client interface
  3. We implement the new shipper output type factory.
  4. We use the existing pipeline without any changes.

In this case, changes to the existing code are minimal or nonexistent, and we can start working with the new setup, debugging it, and running tests. The new output type can be excluded from the documentation if needed. Later we can just replace the whole pipeline implementation when we feel the shipper is ready.
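
Steps 2 and 3 of option 1 might look roughly like the sketch below. It is not the libbeat API: `Batch`, `Client`, `Config`, `sender`, and `newShipperOutput` are simplified local stand-ins defined here so the example compiles on its own, and the real interfaces are assumed to be richer. The gRPC call is hidden behind the hypothetical `sender` interface so an in-memory double can take its place.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Batch and Client are simplified local stand-ins for the libbeat types.
// Assumption: the real interfaces have more methods and richer event types.
type Batch interface {
	Events() []string
	ACK()   // every event in the batch was delivered
	Retry() // return the batch to the pipeline for another attempt
}

type Client interface {
	Publish(ctx context.Context, batch Batch) error
	Close() error
	String() string
}

// Config holds only the field this sketch uses from the proposed YAML.
type Config struct {
	Server string
}

// sender is a hypothetical abstraction over the shipper gRPC call.
type sender interface {
	Send(ctx context.Context, events []string) error
}

// shipperClient forwards batches from the unchanged pipeline to the shipper.
type shipperClient struct {
	cfg  Config
	conn sender
}

// newShipperOutput plays the role of the output factory (step 3).
func newShipperOutput(cfg Config, conn sender) (Client, error) {
	if cfg.Server == "" {
		return nil, errors.New("shipper output: server must be set")
	}
	return &shipperClient{cfg: cfg, conn: conn}, nil
}

func (c *shipperClient) Publish(ctx context.Context, batch Batch) error {
	if err := c.conn.Send(ctx, batch.Events()); err != nil {
		batch.Retry() // keep the events in the pipeline instead of dropping them
		return fmt.Errorf("publishing to shipper at %s: %w", c.cfg.Server, err)
	}
	batch.ACK()
	return nil
}

func (c *shipperClient) Close() error   { return nil }
func (c *shipperClient) String() string { return "shipper(" + c.cfg.Server + ")" }

// memBatch and memSender are in-memory test doubles.
type memBatch struct {
	events         []string
	acked, retried int
}

func (b *memBatch) Events() []string { return b.events }
func (b *memBatch) ACK()             { b.acked++ }
func (b *memBatch) Retry()           { b.retried++ }

type memSender struct {
	fail bool
	got  []string
}

func (m *memSender) Send(_ context.Context, events []string) error {
	if m.fail {
		return errors.New("shipper unavailable")
	}
	m.got = append(m.got, events...)
	return nil
}

// publishOnce wires the output to the doubles and publishes a single batch.
func publishOnce(fail bool) (*memBatch, *memSender, error) {
	conn := &memSender{fail: fail}
	client, err := newShipperOutput(Config{Server: "localhost:50051"}, conn)
	if err != nil {
		return nil, nil, err
	}
	batch := &memBatch{events: []string{"e1", "e2"}}
	return batch, conn, client.Publish(context.Background(), batch)
}

func main() {
	batch, conn, err := publishOnce(false)
	fmt.Println(err, batch.acked, batch.retried, conn.got)
}
```

Because the client only sees batches handed to it by the existing pipeline, processors and the beat's memory queue still run before the shipper receives anything.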

Option 2

We have a feature flag to switch the pipeline to a separate implementation that starts sending events to the shipper instead of configured outputs.

This will require us to:

  1. Refactor the current pipeline implementation so it's an interface that can have 2 different implementations instead of a struct
  2. Add support for a new configuration section at the root level where we can configure a shipper, e.g.:
shipper:
  server: "localhost:50051" # The server address in the format of host:port
  tls: true # Connection uses TLS if true, else plain TCP
  ca_file: "/home/cert" # The file containing the CA root cert
  server_host_override: "x.test.example.com" # The server name used to verify the hostname returned by the TLS handshake
  3. If the configuration section exists, the pipeline implementation is switched to the ShipperPipeline and the beat's output configuration is ignored

The major drawback here is that we would need more time and would have to make a lot of changes to the existing code, which can affect stability, instead of just adding new code. On the other hand, we would need to do that at some point anyway.
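
The switch described in option 2 could be sketched as follows. Everything here is illustrative: `Pipeline`, `defaultPipeline`, `Config`, `ShipperConfig`, and `newPipeline` are hypothetical names (only `ShipperPipeline` appears in the steps above), and the `Publish` methods just describe the route an event would take rather than doing real work.

```go
package main

import "fmt"

// ShipperConfig mirrors the proposed root-level `shipper` section.
type ShipperConfig struct {
	Server string
}

// Config is a reduced stand-in for the beat configuration.
type Config struct {
	Shipper *ShipperConfig // nil when the section is absent
	Output  string         // name of the configured output, e.g. "elasticsearch"
}

// Pipeline is the interface that step 1 would extract from the current struct.
type Pipeline interface {
	Publish(event string) string
}

// defaultPipeline stands for the existing behaviour.
type defaultPipeline struct{ output string }

func (p defaultPipeline) Publish(event string) string {
	return fmt.Sprintf("%s -> processors -> queue -> %s", event, p.output)
}

// ShipperPipeline bypasses the beat's queue and configured output.
type ShipperPipeline struct{ server string }

func (p ShipperPipeline) Publish(event string) string {
	return fmt.Sprintf("%s -> shipper at %s", event, p.server)
}

// newPipeline picks the implementation: if the shipper section exists,
// the beat's output configuration is ignored.
func newPipeline(cfg Config) Pipeline {
	if cfg.Shipper != nil {
		return ShipperPipeline{server: cfg.Shipper.Server}
	}
	return defaultPipeline{output: cfg.Output}
}

func main() {
	withShipper := Config{
		Output:  "elasticsearch",
		Shipper: &ShipperConfig{Server: "localhost:50051"},
	}
	fmt.Println(newPipeline(Config{Output: "elasticsearch"}).Publish("e1"))
	fmt.Println(newPipeline(withShipper).Publish("e1"))
}
```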

@cmacknz
Member Author

cmacknz commented Mar 30, 2022

I recommend option 1 as it will be simpler to implement and maintain in the long term. It follows the model currently used by Elastic agent to configure outputs for beats.

@ph

ph commented Mar 30, 2022

I also prefer option 1, so we don't have a special case or transformation to do.

@faec
Contributor

faec commented Mar 30, 2022

I'm not sure how option 1 fits with the other pending pieces. I think perhaps there's been some confusion with the "output" language that is being used for two different stages of processing: (1) sending data from the input to the processor / shipper before it enters the queue, and (2) sending final event data from the shipper to the upstream target (elasticsearch, logstash etc) after it exits the queue.

So I'm not sure how option 1 would fit right now -- the Client interface is the final link of the Beats pipeline that hands off to the upstream, so if we connect this output there, then events would go through the whole current pipeline (including processors and the memory queue) before being sent to the shipper, which is also supposed to handle the memory queue. So to me, option 2 makes more sense, since it diverts to the shipper before hitting the queue.

I wonder if the confusion about approaches comes from the use of "output" to refer to both of those components? Because option 1 sounds to me like a reasonable sketch of the output of the shipper, but as I understand it in the first pass we're just handling that with a placeholder raw-file output.

@cmacknz
Member Author

cmacknz commented Mar 30, 2022

Yes, the language isn't precise enough, and the fact that the beat pipeline and the shipper will have overlapping functionality doesn't help either.

My view is that the development needs to be an iterative process where we start with some duplication between the beat and shipper just to get them connected to each other, and then slowly migrate functionality from the beat side into the shipper when run under agent.

I think initially we start with option 1, where we just make it possible for a beat to communicate with the shipper over gRPC. Both the beat and the shipper at this stage have a memory queue, and the processors only exist on the beat side. This is what the diagram in the issue description is trying to show :)

Once we have that, we next work on trying to remove the queuing from the beat side, followed by processing. At this point we may need to consider something like option 2 to try to strip down what the beat/input needs to run.

I like starting with Denis' option 1 to get a faster end to end prototype. Once we have that and can test the interaction between the beats and shipper we will likely need to consider something like option 2. I think we'll be better positioned to make design adjustments after we have a quick prototype than pursuing larger changes from the beginning. I could be convinced otherwise though.

@faec
Contributor

faec commented Mar 30, 2022

Ah ok, so the redundancy in the memory queue is an intentional temporary workaround? In that case fair enough, let's continue :-)

@kvch

kvch commented Apr 6, 2022

Does adding a feature flag make sense in beats? It is basically just a setting that enables or disables features. How is that different from setting output.elasticsearch instead of output.shipper (by Agent) if we want to fall back to the old way of sending events?

@rdner
Member

rdner commented Apr 13, 2022

I've updated the description and added a checklist for tracking the progress.

One thing which is not 100% clear to me is input and data stream options. I could not find a simple way to propagate these parameters through the event batches so I'm going to address this as a separate issue after the initial implementation is there, so it's not blocking any experiments with the new shipper architecture.

The same goes for the integration tests; they will be implemented separately.

@cmacknz
Member Author

cmacknz commented Apr 13, 2022

Thanks! I have a separate issue already for returning acknowledgements from the shipper: #9. I expected that would be too much work to fold into this issue.

The input and data stream will have to be propagated from the agent policy, which we may not do yet. We may not need the data stream until we implement processors in the shipper, at which point we'll need a way to apply the correct processors to events based on the input and data stream.
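
One way to picture "applying the correct processors based on the input and data stream" is a lookup keyed by both values. This is purely a sketch under assumptions: `Source`, `Event`, `Processor`, `registry`, and `upper` are hypothetical names, and nothing here reflects how the shipper actually represents events or processors.

```go
package main

import (
	"fmt"
	"strings"
)

// Source identifies where an event came from. Both fields are assumptions
// about what would be propagated from the agent policy.
type Source struct {
	InputID    string
	DataStream string
}

// Event is a reduced stand-in for a published event.
type Event struct {
	Source  Source
	Message string
}

// Processor transforms an event and returns the result.
type Processor func(Event) Event

// registry maps a source to the processors configured for it.
type registry map[Source][]Processor

// Apply runs the processors registered for the event's source, in order.
// Events from an unknown source pass through unchanged.
func (r registry) Apply(e Event) Event {
	for _, p := range r[e.Source] {
		e = p(e)
	}
	return e
}

// upper is a toy processor used only for the demonstration.
func upper(e Event) Event {
	e.Message = strings.ToUpper(e.Message)
	return e
}

func main() {
	src := Source{InputID: "logfile-1", DataStream: "logs-generic-default"}
	r := registry{src: {upper}}
	fmt.Println(r.Apply(Event{Source: src, Message: "hello"}).Message)
	fmt.Println(r.Apply(Event{Source: Source{InputID: "other"}, Message: "hello"}).Message)
}
```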

@cmacknz
Member Author

cmacknz commented May 11, 2022

Added #34 as part of this work.

@jlind23 jlind23 added estimation:Week Task that represents a week of work. and removed v8.3.0 8.4-candidate labels May 24, 2022
@rdner rdner assigned leehinman and unassigned rdner Sep 6, 2022
@cmacknz
Member Author

cmacknz commented Sep 14, 2022

All tasks complete, closing.

@cmacknz cmacknz closed this as completed Sep 14, 2022