Describe the enhancement:
When running Elastic Agents in a highly available (HA) environment such as Elastic Cloud or K8s, we often need a way to allocate jobs to a subset of agents, particularly for use cases where the agent polls data from an external source. One example: we need to identify a single agent to collect cluster-level k8s stats (elastic/beats#19731). Another: we need to allocate an agent to run uptime monitors in a geographic region. We'd like to offer fault tolerance so that another agent can take over automatically, without downtime, if one is stopped or fails.
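One way to get "exactly one agent does the work, and another takes over on failure" is lease-based leader election. The Go sketch below is only an illustration of the idea, not anything that exists in Elastic Agent or Fleet today: `LeaseStore`, `TryAcquire`, and `RunIfLeader` are hypothetical names, and the coordination backend (a Kubernetes Lease, a Fleet-managed document, etc.) is assumed.

```go
// Hypothetical sketch: one agent per policy group holds a lease and runs the
// workload; the others keep retrying and take over if the lease expires.
package leaderelect

import (
	"context"
	"log"
	"time"
)

// LeaseStore is an assumed coordination backend. TryAcquire must be atomic:
// it succeeds only if the lease is free or already held by holderID, and it
// extends the lease by ttl on success.
type LeaseStore interface {
	TryAcquire(ctx context.Context, leaseName, holderID string, ttl time.Duration) (bool, error)
}

// RunIfLeader runs work() only while this agent holds the lease. If the lease
// is lost (agent stopped, failed, partitioned), the workload is cancelled and
// another agent's loop acquires the lease and takes over.
func RunIfLeader(ctx context.Context, store LeaseStore, leaseName, agentID string,
	ttl time.Duration, work func(ctx context.Context)) {
	var cancel context.CancelFunc
	ticker := time.NewTicker(ttl / 3) // renew well before the lease expires
	defer ticker.Stop()

	for {
		held, err := store.TryAcquire(ctx, leaseName, agentID, ttl)
		if err != nil {
			log.Printf("lease check failed: %v", err)
			held = false
		}
		switch {
		case held && cancel == nil: // just became leader: start the workload
			var workCtx context.Context
			workCtx, cancel = context.WithCancel(ctx)
			go work(workCtx)
		case !held && cancel != nil: // lost the lease: stop the workload
			cancel()
			cancel = nil
		}

		select {
		case <-ctx.Done():
			if cancel != nil {
				cancel()
			}
			return
		case <-ticker.C:
		}
	}
}
```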
Here are some notes from Shay on Slack:
I see a few modes:
collect locally: what we have today: works.
poll/queue collection (SQS, Kafka): should work with the existing structure; a policy defines the "group" of agents that will poll for data and bring it in. Queue semantics allow for N agents running concurrently.
poll collection without queue semantics: so we need to do it. I think within a policy, which represents a group of agents, only one will "do" the work, and if it fails, another will pick up. This gives us HA for this situation as well.
poll collection with job allocation: This falls under uptime, a policy with N agents running, and they divide up the "work" between them, like the list of hosts to ping. This has HA built in as well, since job (re)allocation needs to be implemented anyway (a rough sketch of one way to do this follows after these notes).
I think if we chat about these types and formalize them in Fleet, then workloads fall naturally into one of them, and we don't build one-offs. Moreover, all Cloud needs to do is "run agents" (or a self-managed customer, or k8s, or ECE), and hopefully worry less about "dedicated" agent tiers, if that makes sense. This ties into how easy it is to know about groups of agents per policy. And we can have this segmentation, especially if it can be changed at "runtime".
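For the last mode (dividing a list of uptime hosts among the N agents attached to a policy), one coordination-free option is rendezvous (highest-random-weight) hashing: every agent computes the same deterministic assignment from the current member list, so when an agent drops out only its share of hosts moves to the survivors. Again, this is just a sketch under assumptions; `assignedTo`, `MyJobs`, and the agent IDs are made up for illustration and are not a Fleet API.

```go
// Hypothetical sketch: each agent in a policy sees the same list of live agent
// IDs (e.g. derived from Fleet check-ins) and the same list of jobs, and
// independently computes which jobs it owns.
package joballoc

import (
	"fmt"
	"hash/fnv"
)

// score ranks an (agent, job) pair; rendezvous hashing assigns each job to the
// agent with the highest score, so all agents agree without coordination.
func score(agentID, job string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(agentID))
	h.Write([]byte{0}) // separator so "ab"+"c" != "a"+"bc"
	h.Write([]byte(job))
	return h.Sum64()
}

// assignedTo returns the agent that should run the given job.
func assignedTo(job string, agents []string) string {
	var best string
	var bestScore uint64
	for _, a := range agents {
		if s := score(a, job); best == "" || s > bestScore {
			best, bestScore = a, s
		}
	}
	return best
}

// MyJobs filters the full job list down to the jobs this agent owns.
func MyJobs(self string, agents, jobs []string) []string {
	var mine []string
	for _, j := range jobs {
		if assignedTo(j, agents) == self {
			mine = append(mine, j)
		}
	}
	return mine
}

func Example() {
	agents := []string{"agent-a", "agent-b", "agent-c"}
	hosts := []string{"https://example.org", "https://example.com", "https://example.net"}
	for _, a := range agents {
		fmt.Println(a, "monitors", MyJobs(a, agents, hosts))
	}
}
```

A nice property of this scheme is that reallocation is implicit: as soon as the visible agent list changes, every remaining agent recomputes `MyJobs` and picks up exactly the orphaned monitors, with no central scheduler involved.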