Autoscaling: use labels instead of draining nodes #5339

Closed
Tracked by #1262 ...
sanderegg opened this issue Feb 15, 2024 · 1 comment · Fixed by #5340
Labels: a:autoscaling (autoscaling service in simcore's stack), a:director-v2 (issue related with the director-v2 service)

Comments

@sanderegg
Member

sanderegg commented Feb 15, 2024

Extensive testing by @YuryHrytsuk suggests that the slow startup of sim4life could be related to the fact that nodes are drained and undrained.

The current explanation is that when a node is drained, all the containers running there are removed, and all the networks are also removed.

An alternative would be to keep the node active all the time and use a docker label/docker constraint to prevent unwanted containers from starting on these machines.

Autoscaling currently works like so:

  • it runs on a manager docker node
  • it starts machines via AWS EC2
  • these machines automatically connect in drain mode to the main docker swarm
  • once the new machine connects, it is labelled as needed and made active if there are unrunnable services at the moment
  • once these machines are not running anything they are drained again
  • machines that are drained for more than some X amount of time are terminated via AWS EC2
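
The current flow above can be sketched as a small decision function. This is only an illustration, not the autoscaling service's actual code: the names `Node`, `Action`, `next_action`, and `max_drained_time` (the "X amount of time") are hypothetical, and the rule that an idle node is only drained when nothing is pending is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(str, Enum):
    ACTIVATE = "activate"    # drained node is needed: make it active
    DRAIN = "drain"          # active node is idle: drain it again
    TERMINATE = "terminate"  # drained for too long: terminate via AWS EC2
    NONE = "none"


@dataclass
class Node:
    availability: str                      # "active" or "drain"
    running_tasks: int
    drained_since: Optional[datetime] = None


def next_action(node: Node, *, pending_services: int, now: datetime,
                max_drained_time: timedelta) -> Action:
    """Decide what to do with one node (hypothetical sketch of the current flow)."""
    if node.availability == "drain":
        if pending_services > 0:
            return Action.ACTIVATE
        if node.drained_since is not None and now - node.drained_since > max_drained_time:
            return Action.TERMINATE
        return Action.NONE
    # Active node: drain it once it runs nothing (assumed: and nothing is pending).
    if node.running_tasks == 0 and pending_services == 0:
        return Action.DRAIN
    return Action.NONE
```

With the label-based proposal, only what `ACTIVATE` and `DRAIN` do to the node would change; the decision logic itself could stay the same.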

This change would affect the second half of that flow:

  • after a machine is correctly labeled, it would be made active anyway, but with a docker label such as is-drained=true
  • when a machine is needed that label would be removed

Advantages:

  • the monitoring services would not be removed every time the node is drained, which would put less strain on the docker engine
  • this might improve the starting time of the dynamic-sidecar as it would not need to wait for the monitoring networks to be up before it can start.
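
A minimal sketch of the label toggle, assuming it is driven through the `docker node update` CLI on a manager node. The helper names are hypothetical; `--availability`, `--label-add`, and `--label-rm` are existing `docker node update` flags, and the label key `is-drained` is the example from this proposal.

```python
from typing import List

DRAINED_LABEL = "is-drained"  # example label key from the proposal


def mark_drained_cmd(node_name: str) -> List[str]:
    """Keep the node active, but flag it as logically drained via a label."""
    return [
        "docker", "node", "update",
        "--availability", "active",
        "--label-add", f"{DRAINED_LABEL}=true",
        node_name,
    ]


def mark_needed_cmd(node_name: str) -> List[str]:
    """Remove the label so that services may be scheduled on the node again."""
    return ["docker", "node", "update", "--label-rm", DRAINED_LABEL, node_name]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)`; building the command separately keeps the sketch testable without a swarm.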

Requirements:

  • changes in autoscaling, to use a label instead of really draining the node
  • changes in director-v2, as the services must not go to a node where is-drained=true is set
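
On the director-v2 side, one way to keep services off such nodes is a swarm placement constraint. The sketch below is an assumption about how this could look, not the actual director-v2 code: `with_drain_constraint` is a hypothetical helper that adds `node.labels.is-drained!=true` to a service spec dict shaped like the Docker Engine API (`TaskTemplate.Placement.Constraints`). To my understanding, a `!=` constraint also matches nodes that do not carry the label at all, which is the behaviour wanted here.

```python
import copy

DRAINED_CONSTRAINT = "node.labels.is-drained!=true"


def with_drain_constraint(service_spec: dict) -> dict:
    """Return a copy of the spec that refuses nodes labelled is-drained=true."""
    spec = copy.deepcopy(service_spec)  # leave the caller's spec untouched
    placement = spec.setdefault("TaskTemplate", {}).setdefault("Placement", {})
    constraints = placement.setdefault("Constraints", [])
    if DRAINED_CONSTRAINT not in constraints:
        constraints.append(DRAINED_CONSTRAINT)
    return spec
```

Existing constraints are preserved, and applying the helper twice does not duplicate the entry.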
@sanderegg sanderegg self-assigned this Feb 15, 2024
@sanderegg sanderegg transferred this issue from ITISFoundation/osparc-issues Feb 15, 2024
@sanderegg sanderegg added a:director-v2 issue related with the director-v2 service a:autoscaling autoscaling service in simcore's stack labels Feb 15, 2024
@sanderegg sanderegg added this to the Schoggilebe milestone Feb 15, 2024
@YuryHrytsuk
Contributor

Thank you @sanderegg 👍 I would further add that if we want to get rid of docker network sync timeouts completely (30 sec / 1 min / 1.5 min delays), one solution is to enforce all the necessary networks on the autoscaled node, even before the dy-sidecar starts. To the best of my knowledge, these networks are: monitored, interactive services, simcore agent, and portainer agent.
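
A readiness check for that idea could look like the sketch below. The network names are placeholders derived from the list above (the real swarm network names are not given here), and `missing_networks` is a hypothetical helper; how the set of networks present on a node is obtained, and how missing ones are forced onto it, is left out.

```python
from typing import Iterable, List

# Placeholder names, NOT the actual network names used in the deployment.
REQUIRED_NETWORKS = frozenset(
    {"monitored", "interactive-services", "simcore-agent", "portainer-agent"}
)


def missing_networks(present: Iterable[str],
                     required: Iterable[str] = REQUIRED_NETWORKS) -> List[str]:
    """Return the required networks that are not yet available on the node."""
    return sorted(set(required) - set(present))
```

A node would only be considered ready for the dy-sidecar once this returns an empty list.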
