Clusters-keeper: if a machine is pending for too long, it should terminate and retry a few times before giving up #5436

Closed
Tracked by #950
sanderegg opened this issue Mar 7, 2024 · 0 comments · Fixed by #5851
Labels
a:clusters-keeper t:enhancement Improvement or request on an existing feature


sanderegg commented Mar 7, 2024

If an EC2 instance is "broken", i.e. not reachable, it might never be able to run a docker stack.
In that case the clusters-keeper should give up on it after some amount of time X, terminate the instance, and try again up to Y times.
This use case sometimes happens, and the current behavior is that the machine is kept up forever.
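
A rough sketch of the requested behavior, not the actual clusters-keeper code: `start_with_retries`, `RetryPolicy`, `InstanceNeverReadyError`, and the three callables (`launch`, `is_ready`, `terminate`) are all hypothetical names made up for illustration. `max_pending_seconds` plays the role of X, and `max_attempts` the role of Y (assumed here to include the first launch).

```python
import time
from dataclasses import dataclass
from typing import Callable


class InstanceNeverReadyError(RuntimeError):
    """Raised when every launch attempt timed out (hypothetical error type)."""


@dataclass(frozen=True)
class RetryPolicy:
    max_pending_seconds: float  # "X": how long one instance may stay pending
    max_attempts: int  # "Y": total launches before giving up (assumption)
    poll_interval_seconds: float = 5.0


def start_with_retries(
    launch: Callable[[], str],
    is_ready: Callable[[str], bool],
    terminate: Callable[[str], None],
    policy: RetryPolicy,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Launch an instance, wait for it, and replace it if it stays pending."""
    for _attempt in range(policy.max_attempts):
        instance_id = launch()
        deadline = clock() + policy.max_pending_seconds
        while clock() < deadline:
            if is_ready(instance_id):
                return instance_id
            sleep(policy.poll_interval_seconds)
        # Still pending after the deadline: stop paying for it, then retry.
        terminate(instance_id)
    raise InstanceNeverReadyError(
        f"no instance became ready after {policy.max_attempts} attempts"
    )
```

In this sketch every timed-out instance is terminated before the next launch, including the last one, so nothing is left running once the keeper gives up.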

Another variation is when the dask-scheduler on the primary machine somehow uses up all the available disk space; the docker swarm then goes boom and the machine is lost.
In that case the clusters-keeper should also terminate the machines to save money.
