This repository has been archived by the owner on Sep 21, 2023. It is now read-only.

Implement more efficient output tuning parameters to manage throughput #28

Closed
nimarezainia opened this issue Apr 20, 2022 · 17 comments · Fixed by #227

Comments

@nimarezainia

nimarezainia commented Apr 20, 2022

Beats have many knobs and dials that let users modify output-related parameters to increase throughput. These parameters are convoluted and sometimes contradict one another. With the new shipper design, we have the opportunity to simplify them and create more meaningful parameters.

Performance Tuning Proposal

  1. Rename bulk_max_size to maximum_batch_size, which is more meaningful. maximum_batch_size is the total batch size in bytes.
  2. Allow the user to modify the maximum_batch_size in the UI. Specify maximum_batch_size to be in bytes rather than events.
    a. Bytes are easier to mentally consume
    b. It’s also easier to map to data seen on the wire
    c. On the Elasticsearch ingest side, the max document size is configured in bytes
  3. Introduce a NEW variable output_queue_flush_timeout
    a. Upon expiry the output queue is flushed and data written to the output
    b. Users can lower this timeout to reduce the delay in collecting data

In summary, tuning the output will now involve two variables: maximum_batch_size and output_queue_flush_timeout.
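
To make the interaction between the two proposed variables concrete, here is a minimal sketch of a batcher that flushes either when the queued bytes reach maximum_batch_size or when the oldest queued event has waited output_queue_flush_timeout seconds. The class name `OutputBatcher`, the `send` callback, the `tick` method, and the injectable `clock` are all hypothetical and only for illustration; this is not the shipper's implementation.

```python
import time
from typing import Callable, List, Optional


class OutputBatcher:
    """Hypothetical batcher: flushes on size (bytes) or on timeout."""

    def __init__(
        self,
        send: Callable[[List[bytes]], None],
        maximum_batch_size: int,
        output_queue_flush_timeout: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._max_bytes = maximum_batch_size
        self._timeout = output_queue_flush_timeout
        self._clock = clock
        self._events: List[bytes] = []
        self._pending_bytes = 0
        # Arrival time of the oldest event still waiting in the batch.
        self._oldest: Optional[float] = None

    def add(self, event: bytes) -> None:
        # Flush first if this event would push the batch over the byte limit.
        if self._events and self._pending_bytes + len(event) > self._max_bytes:
            self.flush()
        if not self._events:
            self._oldest = self._clock()
        self._events.append(event)
        self._pending_bytes += len(event)
        # A batch that has reached the limit is sent immediately.
        if self._pending_bytes >= self._max_bytes:
            self.flush()

    def tick(self) -> None:
        # Call periodically: flushes once the oldest event has waited too long.
        if self._events and self._clock() - self._oldest >= self._timeout:
            self.flush()

    def flush(self) -> None:
        if not self._events:
            return
        batch, self._events = self._events, []
        self._pending_bytes = 0
        self._oldest = None
        self._send(batch)
```

In this sketch, lowering output_queue_flush_timeout reduces how long a small batch can sit in the queue, while raising maximum_batch_size trades latency for fewer, larger requests.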

@jlind23 jlind23 added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Apr 27, 2022
@jlind23 jlind23 added estimation:Week Task that represents a week of work. v8.4.0 and removed 8.4-candidate labels May 24, 2022
@jlind23 jlind23 changed the title Implementing more efficient output queue parameters to manage throughput [DESIGN] Implementing more efficient output queue parameters to manage throughput Jun 1, 2022
@jlind23
Contributor

jlind23 commented Jun 1, 2022

@nimarezainia did you have a chance to work on the requirements for this?

@nimarezainia
Author

> @nimarezainia did you have a chance to work on the requirements for this?

I'm still working on defining these.

@joshdover

Interested to see the list of what we want to support, but we also need to consider that we need to avoid any breaking changes here. Today, we allow the user to use any Elasticsearch output setting from the UI, kibana.yml configuration, and API (though this isn't GA yet).

Migrating this would be quite painful, mostly because we need to send a valid configuration to any agents that are not running the shipper (we support any agent >= 7.17.0 to work with any 8.x version of the Stack). Otherwise, Kibana does have the ability to run migrations during upgrades, which would allow us to do transformations to the user's YAML. This would also need to be done on the API and kibana.yml configuration code.

@jlind23 jlind23 assigned faec and unassigned nimarezainia Jul 13, 2022
@joshdover

Thinking about breaking changes some more, I'm curious if we need to consider the following when switching from beats outputs to the shipper:

  1. User configures worker: 2 in the Elasticsearch output
  2. User runs a single logs integration and a single metrics integration
  3. With beats-specific outputs, Filebeat and Metricbeat would each create 2 workers, totaling 4 workers
  4. With shipper output, the shipper creates 2 workers, totaling 2 workers

This is just one example; there could be other related configs, like queue, that don't translate exactly 1:1 and could result in degraded performance or higher resource usage when switching to the shipper. I don't anticipate this impacting customers who haven't touched these settings, but for those who have carefully tuned them, this will likely cause problems.
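
The worker arithmetic in the example above can be sketched as follows. The function name `total_workers` is hypothetical; the numbers are the ones from the example (worker: 2, one Filebeat plus one Metricbeat versus a single shipper).

```python
def total_workers(worker_setting: int, output_processes: int) -> int:
    """Each process that owns an output creates `worker_setting` workers."""
    return worker_setting * output_processes


# Beats-specific outputs: Filebeat and Metricbeat each apply worker: 2.
beats_total = total_workers(worker_setting=2, output_processes=2)

# Shipper output: a single process applies worker: 2.
shipper_total = total_workers(worker_setting=2, output_processes=1)

print(beats_total, shipper_total)
```

Under these assumptions, a user who wants the same number of concurrent workers after switching would need to raise the worker setting, though equal counts still would not guarantee equal behavior.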

@cmacknz
Member

cmacknz commented Jul 14, 2022

@joshdover yes, that is definitely a possible problem when enabling the shipper. Users may need to retune their worker and bulk_max_size configurations if they were using them before.

Even if we tried to apply the same configuration as before, it may not behave equivalently, because the data flowing through each worker will have changed. For example, Filebeat workers would likely only write to log-* data streams, whereas the shipper will write to every data stream defined by an active integration.

There is no way to configure the underlying beat queue from an agent policy right now so that at least isn't a concern.

@jlind23
Contributor

jlind23 commented Jul 18, 2022

@nimarezainia Do we have a requirements doc for that? Otherwise it is going to be hard to design.

@nimarezainia
Author

@jlind23 I'll share the requirements doc shortly.

@cmacknz cmacknz changed the title [DESIGN] Implementing more efficient output queue parameters to manage throughput Implement more efficient output tunring parameters to manage throughput Oct 13, 2022
@cmacknz
Member

cmacknz commented Oct 13, 2022

I've updated the description here to reflect the proposed changes to the output configuration, which I believe are the most impactful.

We will likely want follow up issues about:

  1. Load balancing configuration.
  2. Queue configuration in the agent policy and UI. The memory queue parameters can already be specified in the shipper configuration file; the disk queue configuration will be available after "add disk queue configuration to shipper configuration" (#119) is implemented.

@cmacknz cmacknz changed the title Implement more efficient output tunring parameters to manage throughput Implement more efficient output tuning parameters to manage throughput Oct 13, 2022
@cmacknz
Member

cmacknz commented Oct 18, 2022

We will also need to consider how to handle existing agent policies that specify the existing worker and bulk_max_size parameters as advanced YAML configuration. We will likely need to handle both the old and new set of parameters. Fleet could migrate the policy for us, but that won't help standalone agents.
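
One way to picture "handling both the old and new set of parameters" is a small resolution step applied to a policy's output settings. This is only a sketch under stated assumptions: the function name `resolve_output_settings` and the precedence rule (the new parameter wins when both are present) are hypothetical and not something the thread has decided. Because bulk_max_size counts events and maximum_batch_size counts bytes, the sketch does not attempt to convert one into the other.

```python
from typing import Any, Dict, List, Tuple

# Setting names come from the thread; the pairing and precedence are assumptions.
LEGACY_TO_NEW = {"bulk_max_size": "maximum_batch_size"}


def resolve_output_settings(raw: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (effective settings, warnings) for a policy's output section."""
    effective = dict(raw)
    warnings: List[str] = []
    for legacy, new in LEGACY_TO_NEW.items():
        if legacy in effective and new in effective:
            # Assumed rule: the new parameter wins and the legacy one is dropped.
            del effective[legacy]
            warnings.append(f"{legacy} ignored because {new} is set")
        elif legacy in effective:
            # Events cannot be converted to bytes exactly, so the legacy
            # value is kept untouched and only flagged.
            warnings.append(f"{legacy} is a legacy setting; consider {new}")
    return effective, warnings
```

A standalone agent could apply such a step locally when reading its policy, which matters because a Fleet-side migration would not reach it.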

@cmacknz cmacknz removed the estimation:Week Task that represents a week of work. label Oct 20, 2022
@cmacknz
Member

cmacknz commented Oct 20, 2022

Given this will affect the agent policy and the Fleet UI, we should probably convert this to (or create) a cross-team feature issue for this work. We will likely want to break each of the changes in the proposal into individual issues so they can be investigated and implemented incrementally.

@cmacknz
Member

cmacknz commented Oct 31, 2022

If we were to switch to using the go-elasticsearch client's BulkIndexer we would get this change essentially for free. BulkIndexer allows specifying a flush threshold in bytes and a minimum flush duration. https://pkg.go.dev/github.com/elastic/go-elasticsearch/v8/esutil#BulkIndexerConfig

@jlind23
Contributor

jlind23 commented Nov 3, 2022

@cmacknz shouldn't we switch to the go-elasticsearch client for good, then?

@cmacknz
Member

cmacknz commented Nov 3, 2022

Yes, I have prioritized the switch with #14 as the next task for the shipper.

@amitkanfer

@alexsapran - put this on your radar

@jlind23
Contributor

jlind23 commented Nov 21, 2022

@cmacknz shouldn't I close this issue, since @faec is currently working on the migration to the go-elasticsearch client?

@cmacknz
Member

cmacknz commented Nov 21, 2022

I would close this once we have proven the go-elasticsearch client behaves the way we want, and that there will be no additional changes required.

I'll also have to confirm that we have the Fleet UI changes tracked separately since they are mentioned here.

@jlind23
Contributor

jlind23 commented Jan 4, 2023

@cmacknz @leehinman shall we keep this one for the next sprint, or have we had enough time to double-check that go-elasticsearch behaves as expected?


7 participants