Autoscaling: use labels instead of draining nodes #5339

sanderegg · 2024-02-15T15:00:55Z

After extensive testing by @YuryHrytsuk, it appears that the slow start of sim4life could be related to nodes being drained and undrained.

The current explanation is that when a node is drained, all the containers running there are removed, and all the networks are also removed.

An alternative would be to keep the node active all the time and use a docker label/docker constraint to prevent unwanted containers from starting on these machines.

Autoscaling currently works like so:

it runs on a manager docker node
it starts machines via AWS EC2
these machines automatically connect in drain mode to the main docker swarm
once the new machine connects, it is labelled as needed and made active if there are unrunnable services at the moment
once these machines are not running anything they are drained again
machines that are drained for more than some X amount of time are terminated via AWS EC2
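
The lifecycle described above can be sketched as a small decision function. This is only an illustration of the steps listed, not the actual autoscaling code: the `Node` class, the `next_action` function, the action names and the `MAX_DRAINED_TIME` value (standing in for "some X amount of time") are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical value standing in for "some X amount of time".
MAX_DRAINED_TIME = timedelta(minutes=5)


@dataclass
class Node:
    availability: str  # "drain" or "active", as in docker swarm
    running_tasks: int
    drained_since: Optional[datetime] = None


def next_action(node: Node, has_unrunnable_services: bool, now: datetime) -> str:
    """Return what the autoscaler would do with this node next."""
    if node.availability == "drain":
        if has_unrunnable_services:
            # a drained machine is made active when services are waiting
            return "activate"
        if (
            node.drained_since is not None
            and now - node.drained_since > MAX_DRAINED_TIME
        ):
            # drained for too long: terminate via AWS EC2
            return "terminate"
        return "wait"
    if node.running_tasks == 0:
        # an active machine that runs nothing is drained again
        return "drain"
    return "keep"
```

Every pass through "drain" and "activate" is where containers and networks get removed and recreated on the node, which is the cost this issue is about.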

The change would affect the later steps, where machines are activated and drained:

after a machine is correctly labeled it would be made active anyway but with a docker label such as is-drained=true
when a machine is needed that label would be removed

Advantages:

the monitoring services would not be removed every time the node is drained, so the docker engine would be under less strain
this might improve the starting time of the dynamic-sidecar as it would not need to wait for the monitoring networks to be up before it can start.
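
As a minimal sketch of the proposed change, the function below builds a node spec that always keeps the node active and only toggles the `is-drained` label. The dictionary keys `Availability` and `Labels` follow the shape of a swarm node spec; the function name `mark_drained` is hypothetical, and applying the spec to a real node is left out.

```python
import copy

DRAINED_LABEL = "is-drained"  # label key suggested in this issue


def mark_drained(node_spec: dict, drained: bool) -> dict:
    """Return a copy of node_spec that stays active and carries the drain label."""
    new_spec = copy.deepcopy(node_spec)
    # the node is never really drained, so its containers and networks stay
    new_spec["Availability"] = "active"
    labels = new_spec.setdefault("Labels", {})
    if drained:
        labels[DRAINED_LABEL] = "true"  # docker label values are strings
    else:
        # the label is removed when the machine is needed
        labels.pop(DRAINED_LABEL, None)
    return new_spec
```

With this approach, "draining" and "undraining" become label updates only, so the monitoring services keep running on the node throughout.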

Requirements:

changes in autoscaling, to use a label instead of really draining the node
changes in director-v2, as the services must not go to a node where is-drained=true is set
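
On the director-v2 side, one way to keep services away from such nodes would be a swarm placement constraint on the label. The sketch below only builds and evaluates that constraint in plain Python; the helper names are hypothetical, and treating a node without the label as "not drained" is an assumption of this sketch.

```python
from typing import Dict, List

# Placement constraint in docker swarm syntax, using the label from this issue.
DRAINED_CONSTRAINT = "node.labels.is-drained!=true"


def add_drained_constraint(constraints: List[str]) -> List[str]:
    """Return the service constraints with the drain constraint appended once."""
    if DRAINED_CONSTRAINT in constraints:
        return list(constraints)
    return [*constraints, DRAINED_CONSTRAINT]


def can_schedule_on(node_labels: Dict[str, str]) -> bool:
    """Mimic the constraint: a node is usable unless is-drained is 'true'."""
    # assumption: a missing label counts as not drained
    return node_labels.get("is-drained") != "true"
```

The same constraint string could be passed wherever the service's placement constraints are defined, for example with `--constraint` on `docker service create`.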


YuryHrytsuk · 2024-02-15T15:13:47Z

Thank you @sanderegg 👍 I would further state that if we want to completely get rid of docker network sync timeouts (30 sec / 1 min / 1.5 min delays), one solution is to make sure all the necessary networks are present on the autoscaled node, even before the dy-sidecar starts. These networks are: monitored, interactive services, simcore agent and portainer agent (to the best of my knowledge).

sanderegg mentioned this issue Feb 15, 2024

sim4life.io - WP4: Computational backend ITISFoundation/osparc-issues#950

Open

sanderegg self-assigned this Feb 15, 2024

sanderegg transferred this issue from ITISFoundation/osparc-issues Feb 15, 2024

sanderegg added the labels a:director-v2 (issue related with the director-v2 service) and a:autoscaling (autoscaling service in simcore's stack) Feb 15, 2024

sanderegg added this to the Schoggilebe milestone Feb 15, 2024

sanderegg mentioned this issue Feb 15, 2024

✨Autoscaling: use label instead of draining machine #5340

Merged

1 task

YuryHrytsuk mentioned this issue Feb 16, 2024

Slow starting time compared with s4l-lite ITISFoundation/osparc-ops-environments#552

Closed

sanderegg closed this as completed in #5340 Feb 19, 2024

sanderegg mentioned this issue Feb 20, 2024

Slow starting time compared with lite ITISFoundation/osparc-issues#1262

Closed
