[Fleet] Automatic job allocation for HA agents (K8, Cloud) #75559

Open
mostlyjason opened this issue Aug 19, 2020 · 2 comments
Labels
Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@mostlyjason
Contributor

Describe the enhancement:
When running Elastic Agents in a highly available (HA) environment such as Elastic Cloud or Kubernetes, we often need a way to allocate jobs to a subset of agents, particularly when the agent polls data from an external source. One example is identifying a single agent to collect cluster-level k8s stats (elastic/beats#19731). Another is allocating an agent to run uptime monitors in a geographic region. We'd like to offer fault tolerance so that another agent takes over automatically, without downtime, if one is stopped or fails.

Here are some notes from Shay on Slack:

I see a few modes:

  1. Collect locally: what we have today; it works.
  2. Poll/queue collection (SQS, Kafka): should work with the existing structure. A policy defines the "group" of agents that will poll for data and bring it in. Queue semantics allow N agents to run concurrently.
  3. Poll collection without queue semantics: this needs to be built. I think that within a policy, which represents a group of agents, only one agent will "do" the work, and if it fails, another will pick it up. This gives us HA for this situation as well.
  4. Poll collection with job allocation: this covers uptime. A policy has N agents running, and they divide up the "work" between them, such as the list of hosts to ping. This has HA built in as well, since job (re)allocation needs to be implemented anyway.
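As a rough illustration of modes 3 and 4 (not something Fleet or Elastic Agent implements today), the sketch below uses rendezvous (highest-random-weight) hashing, one of several possible allocation strategies. It assumes every agent in a policy group can see the same list of live agent IDs, for example from check-ins; how liveness is detected is out of scope here. The names `owner` and `allocate`, the agent IDs, and the job IDs are all hypothetical. Mode 3 falls out as the special case of a single job.

```python
import hashlib


def _score(agent_id: str, job_id: str) -> int:
    """Deterministic pseudo-random weight for an (agent, job) pair."""
    digest = hashlib.sha256(f"{agent_id}|{job_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(job_id: str, agents: list[str]) -> str:
    """Return the live agent that should run job_id.

    Every agent can evaluate this locally and reach the same answer,
    as long as they agree on the list of live agents (an assumption).
    """
    if not agents:
        raise ValueError("no live agents in the policy group")
    # The agent ID breaks ties so the result never depends on list order.
    return max(agents, key=lambda agent: (_score(agent, job_id), agent))


def allocate(jobs: list[str], agents: list[str]) -> dict[str, list[str]]:
    """Mode 4: divide jobs (e.g. hosts to ping) among the live agents."""
    live = sorted(set(agents))
    plan: dict[str, list[str]] = {agent: [] for agent in live}
    for job in jobs:
        plan[owner(job, live)].append(job)
    return plan


if __name__ == "__main__":
    agents = ["agent-a", "agent-b", "agent-c"]  # hypothetical agent IDs
    hosts = [f"host-{i}.example.com" for i in range(12)]

    # Mode 4: uptime-style allocation across all live agents.
    print(allocate(hosts, agents))

    # Mode 3: one cluster-level job, exactly one active agent.
    active = owner("k8s-cluster-stats", agents)
    print("active:", active)

    # If the active agent stops, the remaining agents agree on a successor.
    survivors = [a for a in agents if a != active]
    print("successor:", owner("k8s-cluster-stats", survivors))
```

With this strategy, removing an agent only moves the jobs that agent owned; every other job stays where it was, which keeps reallocation churn low when agents stop or fail.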

I think if we talk through these types and formalize them in Fleet, workloads will fall naturally into one of them, and we avoid one-off solutions. Moreover, all Cloud then needs to do is "run agents" (the same goes for a self-managed customer, k8s, or ECE), and hopefully worry less about "dedicated" agent tiers where that makes sense. This ties into how easy it is to know about the groups of agents per policy. And we can have this segmentation, especially if it can be changed at runtime.

@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@mostlyjason
Contributor Author

Pinging @andrewvc for overlap with uptime use cases, @exekias and @masci for k8s cluster-level stats, and @ruflin for tracking on the ingest management side.

@ph transferred this issue from elastic/beats on Aug 20, 2020
@ph changed the title from "[Elastic Agent] Automatic job allocation for HA agents" to "[Ingest manager] Automatic job allocation for HA agents" on Aug 20, 2020
@ph added the Team:Fleet label on Aug 20, 2020
@ruflin removed their assignment on Nov 3, 2020
@jen-huang changed the title from "[Ingest manager] Automatic job allocation for HA agents" to "[Fleet] Automatic job allocation for HA agents" on Apr 28, 2021
@jen-huang changed the title from "[Fleet] Automatic job allocation for HA agents" to "[Fleet] Automatic job allocation for HA agents (K8, Cloud)" on Apr 28, 2021
4 participants