Clusters-keeper: if a machine is pending for too long, it should terminate and retry a few times before giving up #5436

sanderegg · 2024-03-07T16:48:13Z

If an EC2 instance is "broken", i.e. not reachable, it might never be able to run a docker stack.
In this case the clusters-keeper should give up on it after some amount of time X, terminate the instance and try again up to Y times.
This use case sometimes happens, and the current behavior is that the machine is kept up forever.

Another variation is when the dask-scheduler on the primary machine somehow uses all the available disk space; the docker swarm then goes boom and the machine is lost.
In that case the clusters-keeper should also terminate the machines to save money.
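
A minimal sketch of the requested decision logic, for discussion only. It is not the clusters-keeper implementation: `PendingInstance`, `Action`, `decide`, `max_pending_time` (the "X" above) and `max_retries` (the "Y" above) are hypothetical names, and the actual EC2 termination and relaunch calls are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Action(Enum):
    WAIT = "wait"  # still within the allowed pending time
    TERMINATE_AND_RETRY = "terminate_and_retry"  # kill it, launch a replacement
    TERMINATE_AND_GIVE_UP = "terminate_and_give_up"  # kill it, stop retrying


@dataclass(frozen=True)
class PendingInstance:
    # Hypothetical record; not an actual clusters-keeper model.
    instance_id: str
    launch_time: datetime
    retries_done: int  # replacements already launched for this cluster


def decide(
    instance: PendingInstance,
    now: datetime,
    *,
    max_pending_time: timedelta,  # "X" in the description above (assumed name)
    max_retries: int,  # "Y" in the description above (assumed name)
) -> Action:
    """Decide what to do with an instance that is not (yet) usable."""
    if now - instance.launch_time < max_pending_time:
        return Action.WAIT
    if instance.retries_done < max_retries:
        return Action.TERMINATE_AND_RETRY
    # Out of retries: the instance is still terminated so it stops costing money.
    return Action.TERMINATE_AND_GIVE_UP


if __name__ == "__main__":
    launched = datetime(2024, 3, 7, 16, 0, tzinfo=timezone.utc)
    stuck = PendingInstance("i-hypothetical", launched, retries_done=0)
    print(
        decide(
            stuck,
            launched + timedelta(minutes=15),
            max_pending_time=timedelta(minutes=10),
            max_retries=2,
        )
    )
```

Keeping the decision a pure function of the instance record and the current time makes the timeout and retry limits easy to unit-test without touching AWS. In both terminate cases the broken machine is removed, which also covers the "lost machine" variation as long as it is detected as unreachable.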

sanderegg mentioned this issue Mar 7, 2024

sim4life.io - WP4: Computational backend ITISFoundation/osparc-issues#950

Open

sanderegg transferred this issue from ITISFoundation/osparc-issues Mar 7, 2024

sanderegg self-assigned this Mar 7, 2024

sanderegg added the a:clusters-keeper label Mar 7, 2024

sanderegg added the t:enhancement Improvement or request on an existing feature label Apr 2, 2024

sanderegg mentioned this issue Apr 2, 2024

Maintenance / Dev Issues ITISFoundation/osparc-issues#1328

Open

sanderegg added this to the Enchanted Odyssey milestone Apr 8, 2024

sanderegg modified the milestones: Enchanted Odyssey, The Next One May 6, 2024

sanderegg mentioned this issue May 24, 2024

✨Clusters-keeper: terminate broken EC2s🚨 #5851

Merged

5 tasks

sanderegg closed this as completed in #5851 May 26, 2024
