Describe the enhancement:
When running Elastic Agents in a highly available (HA) environment such as Elastic Cloud or K8s, we often need a way to allocate jobs to a subset of agents, particularly for use cases where the agent polls data from an external source. One example: we need to identify a single agent to collect cluster-level k8s stats (elastic/beats#19731). Another: we need to allocate an agent to run uptime monitors in a geographic region. We'd like to offer fault tolerance so that another agent can take over automatically, without downtime, if one is stopped or fails.
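One way to get "exactly one agent does the work, and another takes over on failure" is lease-based leader election. The Go sketch below is only an illustration of the idea, not anything that exists in Elastic Agent or Fleet today: `LeaseStore`, `TryAcquire`, and `RunIfLeader` are hypothetical names, and the coordination backend (a Kubernetes Lease, a Fleet-managed document, etc.) is assumed.

```go
// Hypothetical sketch: one agent per policy group holds a lease and runs the
// workload; the others keep retrying and take over if the lease expires.
package leaderelect

import (
	"context"
	"log"
	"time"
)

// LeaseStore is an assumed coordination backend. TryAcquire must be atomic:
// it succeeds only if the lease is free or already held by holderID, and it
// extends the lease by ttl on success.
type LeaseStore interface {
	TryAcquire(ctx context.Context, leaseName, holderID string, ttl time.Duration) (bool, error)
}

// RunIfLeader runs work() only while this agent holds the lease. If the lease
// is lost (agent stopped, failed, partitioned), the workload is cancelled and
// another agent's loop acquires the lease and takes over.
func RunIfLeader(ctx context.Context, store LeaseStore, leaseName, agentID string,
	ttl time.Duration, work func(ctx context.Context)) {
	var cancel context.CancelFunc
	ticker := time.NewTicker(ttl / 3) // renew well before the lease expires
	defer ticker.Stop()

	for {
		held, err := store.TryAcquire(ctx, leaseName, agentID, ttl)
		if err != nil {
			log.Printf("lease check failed: %v", err)
			held = false
		}
		switch {
		case held && cancel == nil: // just became leader: start the workload
			var workCtx context.Context
			workCtx, cancel = context.WithCancel(ctx)
			go work(workCtx)
		case !held && cancel != nil: // lost the lease: stop the workload
			cancel()
			cancel = nil
		}

		select {
		case <-ctx.Done():
			if cancel != nil {
				cancel()
			}
			return
		case <-ticker.C:
		}
	}
}
```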
Here are some notes from Shay on Slack:
I see a few modes:
collect locally: what we have today: works.
poll/queue collection (SQS, Kafka): should work with the existing structure; a policy defines the "group" of agents that will poll for data and bring it in. Queue semantics allow for N agents running concurrently.
poll collection without queue semantics: so we need to do it. I think within a policy, which represents a group of agents, only one will "do" the work, and if it fails, another will pick up. This gives us HA for this situation as well.
poll collection with job allocation: This falls under uptime, a policy with N agents running, and they divide up the "work" between them, like the list of hosts to ping. This has HA built in as well, since job (re)allocation needs to be implemented anyway (a rough sketch of one way to do this follows after these notes).
I think if we chat about these types and formalize them in Fleet, then workloads fall naturally into one of them, and we don't build one-offs. Moreover, all Cloud needs to do is "run agents" (or a self-managed customer, or k8s, or ECE), and hopefully worry less about "dedicated" agent tiers, if that makes sense. This ties into how easy it is to know about groups of agents per policy. And we can have this segmentation, especially if it can be changed at "runtime".
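For the last mode (dividing a list of uptime hosts among the N agents attached to a policy), one coordination-free option is rendezvous (highest-random-weight) hashing: every agent computes the same deterministic assignment from the current member list, so when an agent drops out only its share of hosts moves to the survivors. Again, this is just a sketch under assumptions; `assignedTo`, `MyJobs`, and the agent IDs are made up for illustration and are not a Fleet API.

```go
// Hypothetical sketch: each agent in a policy sees the same list of live agent
// IDs (e.g. derived from Fleet check-ins) and the same list of jobs, and
// independently computes which jobs it owns.
package joballoc

import (
	"fmt"
	"hash/fnv"
)

// score ranks an (agent, job) pair; rendezvous hashing assigns each job to the
// agent with the highest score, so all agents agree without coordination.
func score(agentID, job string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(agentID))
	h.Write([]byte{0}) // separator so "ab"+"c" != "a"+"bc"
	h.Write([]byte(job))
	return h.Sum64()
}

// assignedTo returns the agent that should run the given job.
func assignedTo(job string, agents []string) string {
	var best string
	var bestScore uint64
	for _, a := range agents {
		if s := score(a, job); best == "" || s > bestScore {
			best, bestScore = a, s
		}
	}
	return best
}

// MyJobs filters the full job list down to the jobs this agent owns.
func MyJobs(self string, agents, jobs []string) []string {
	var mine []string
	for _, j := range jobs {
		if assignedTo(j, agents) == self {
			mine = append(mine, j)
		}
	}
	return mine
}

func Example() {
	agents := []string{"agent-a", "agent-b", "agent-c"}
	hosts := []string{"https://example.org", "https://example.com", "https://example.net"}
	for _, a := range agents {
		fmt.Println(a, "monitors", MyJobs(a, agents, hosts))
	}
}
```

A nice property of this scheme is that reallocation is implicit: as soon as the visible agent list changes, every remaining agent recomputes `MyJobs` and picks up exactly the orphaned monitors, with no central scheduler involved.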