
🐛🎨Computational autoscaling: allow multi-machining/processing #5203

Conversation

sanderegg
Member

@sanderegg sanderegg commented Dec 21, 2023

What do these changes do?

This PR stems from the following issue:
  • send multiple jobs to a cluster
  • N workers are created
  • they connect randomly to the dask-scheduler, then the first connected worker takes all the jobs
  • the workers connecting after the first one do not see any job (or it takes some time for them to steal these jobs)
  • during that time the autoscaling service removes these workers (so they were created for naught and do not participate in the work)
  • the lone worker (or sometimes more than one) then processes the jobs sequentially, which is suboptimal
  • in case more jobs of the same EC2 type come within 5 minutes, the cluster is not upscaled: the current worker takes all the jobs, and the dask-scheduler does not report these jobs as queued but as processing (the dask-scheduler does not know whether a job is actively being worked on or just queued; to it, these jobs are all in the processing state)
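The failure mode above can be sketched with a toy model (all names are hypothetical; this simulates the scheduler's bookkeeping, it is not the real dask API): because every assigned task counts as "processing", an autoscaler that only looks at the queued count sees 0 and scales the idle-looking workers down.

```python
# Toy simulation of the reported failure mode (hypothetical, not the real dask API).
# The first worker to connect grabs all tasks; the scheduler marks them all as
# "processing", so an autoscaler that only looks at the queued count sees 0 and
# removes the other freshly created workers.

from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    tasks: list[str]
    assignments: dict[str, list[str]] = field(default_factory=dict)

    def connect_worker(self, name: str) -> None:
        self.assignments[name] = []
        if len(self.assignments) == 1:
            # first connected worker takes ALL the jobs
            self.assignments[name].extend(self.tasks)
            self.tasks.clear()

    def queued(self) -> int:
        # dask-scheduler-like view: assigned tasks are "processing", never "queued"
        return len(self.tasks)


scheduler = ToyScheduler(tasks=[f"job-{i}" for i in range(10)])
for worker in ("w1", "w2", "w3"):
    scheduler.connect_worker(worker)

idle_workers = [w for w, jobs in scheduler.assignments.items() if not jobs]
assert scheduler.queued() == 0       # autoscaler sees no queue...
assert idle_workers == ["w2", "w3"]  # ...and scales these workers down
```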

Sadly many changes were necessary, but it should be for the greater good. Here is a summary.

Main changes:

  1. refactored/redesigned how the service discovers what needs to be scaled up/down, to (hopefully) make it simpler:
    Principle
  • the cluster is analyzed (which nodes are available and their states)
  • find tasks that are unrunnable
  • if there are unrunnable tasks
    • activate drained nodes if possible
    • scale up if necessary by virtually assigning tasks to the machines
  • if there are no unrunnable tasks
    • drain empty active nodes
    • scale down drained nodes until we reach the reserve of drained nodes (can be 0)
      Changes
  • the way tasks are virtually assigned to the different types of nodes (active, drained, pending, ...) is now done similarly to the analysis of the infrastructure, making it simpler.
  2. whether a node is active (i.e. a machine that can accept jobs) now depends not only on its Docker availability (drain vs. active) but also on whether the dask-worker is running and connected to the dask-scheduler
  3. the autoscaling service now computes an estimated job queue as the number of jobs minus the number of workers; this way the autoscaling service does not remove workers before the queue is back to 0, and the number of workers operating on the jobs is therefore defined by ENV (as it should have been originally) - this is the actual fix for the original issue
  4. the autoscaling service now actively asks the dask-scheduler to retire workers. The dask-scheduler knows which workers might be unnecessary and takes care of moving anything in their respective memory to another worker. Once the memory is freed, the autoscaling service can drain the workers and then remove them as usual
  5. improved logging information altogether
  6. added debugpy facilities in autoscaling for debugging in vscode
  7. added a jupyter notebook to facilitate manual testing directly on EC2 via vscode remote access (see tests/manual folder)
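The decision principle and the estimated-queue fix described above can be sketched roughly as follows (all names are hypothetical; the real service inspects EC2 instances and dask-scheduler state, not plain lists):

```python
# Rough sketch of the autoscaling decision loop and the estimated job queue
# (hypothetical names; the real service inspects EC2 nodes and the dask-scheduler).


def estimated_queue(num_jobs: int, num_workers: int) -> int:
    # jobs minus workers; the dask-scheduler reports everything as "processing",
    # so the queue must be estimated instead of read from the scheduler
    return max(0, num_jobs - num_workers)


def autoscale_once(
    unrunnable_tasks: list[str],
    active_empty_nodes: list[str],
    drained_nodes: list[str],
    reserve_drained: int,
) -> dict[str, list[str]]:
    actions: dict[str, list[str]] = {
        "activate": [], "scale_up": [], "drain": [], "terminate": []
    }
    if unrunnable_tasks:
        # 1) activate drained nodes if possible
        usable = drained_nodes[: len(unrunnable_tasks)]
        actions["activate"] = usable
        # 2) scale up for tasks that still cannot be virtually assigned
        actions["scale_up"] = unrunnable_tasks[len(usable):]
    else:
        # no unrunnable tasks: drain empty active nodes...
        actions["drain"] = active_empty_nodes
        # ...and scale down drained nodes beyond the reserve
        actions["terminate"] = drained_nodes[reserve_drained:]
    return actions


assert estimated_queue(num_jobs=500, num_workers=40) == 460
assert autoscale_once([], ["n1"], ["d1", "d2"], reserve_drained=1) == {
    "activate": [],
    "scale_up": [],
    "drain": ["n1"],
    "terminate": ["d2"],
}
```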

Minor changes:

  • some refactoring of aws-library; in particular, usage of the Resources class is now generalised
  • fixes some weird behavior in the dask-scheduler where consumed resources sometimes go to negative values
  • fixes some weird behavior in the dask-scheduler where idle dask-workers seemingly still consume a large amount of resources after they were retired
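The negative-consumed-resources fix suggests clamping resource subtraction at zero. A minimal sketch of a Resources value type along those lines (hypothetical field set and semantics; the real class lives in aws-library and may differ):

```python
# Minimal sketch of a Resources value type with subtraction clamped at zero,
# in the spirit of the generalised aws-library Resources class (hypothetical
# field set; the real class may differ).

from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: float
    ram: int  # bytes

    def __sub__(self, other: "Resources") -> "Resources":
        # clamp at 0 so releasing more than was accounted for can never
        # drive consumed resources negative
        return Resources(
            cpus=max(0.0, self.cpus - other.cpus),
            ram=max(0, self.ram - other.ram),
        )


total = Resources(cpus=4.0, ram=8 * 1024**3)
released = Resources(cpus=6.0, ram=1024**3)
assert total - released == Resources(cpus=0.0, ram=7 * 1024**3)
```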

Here is a video that may make this clearer; it simulates jobs arriving at an external cluster. Go to minute 3 to start seeing the interesting part. Basically, 500 jobs were sent to the cluster, which can create up to 40 workers (visible on the upper left; the jobs are shown on the upper right):

ext_cluster.mp4

Related issue/s

How to test

Dev Checklist

DevOps Checklist

@sanderegg sanderegg added the a:autoscaling autoscaling service in simcore's stack label Dec 21, 2023
@sanderegg sanderegg added this to the Kobayashi Maru milestone Dec 21, 2023
@sanderegg sanderegg self-assigned this Dec 21, 2023

codecov bot commented Dec 21, 2023

Codecov Report

Attention: 32 lines in your changes are missing coverage. Please review.

Comparison is base (0cdb35b) 87.2% compared to head (5eb5c36) 87.3%.

Additional details and impacted files


@@          Coverage Diff           @@
##           master   #5203   +/-   ##
======================================
  Coverage    87.2%   87.3%           
======================================
  Files        1295    1295           
  Lines       53066   53095   +29     
  Branches     1164    1164           
======================================
+ Hits        46307   46379   +72     
+ Misses       6509    6466   -43     
  Partials      250     250           
Flag Coverage Δ
integrationtests 65.0% <ø> (+1.3%) ⬆️
unittests 85.2% <89.0%> (-0.1%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
packages/aws-library/src/aws_library/ec2/client.py 100.0% <ø> (ø)
packages/aws-library/src/aws_library/ec2/models.py 88.4% <100.0%> (+0.4%) ⬆️
...brary/api_schemas_clusters_keeper/ec2_instances.py 100.0% <100.0%> (ø)
...g/src/simcore_service_autoscaling/core/settings.py 97.8% <100.0%> (+<0.1%) ⬆️
...oscaling/src/simcore_service_autoscaling/models.py 100.0% <100.0%> (ø)
...e_autoscaling/modules/auto_scaling_mode_dynamic.py 100.0% <100.0%> (ø)
...ore_service_autoscaling/utils/auto_scaling_core.py 91.8% <100.0%> (-2.4%) ⬇️
...service_autoscaling/utils/computational_scaling.py 100.0% <100.0%> (ø)
.../simcore_service_autoscaling/utils/utils_docker.py 98.9% <100.0%> (+<0.1%) ⬆️
...src/simcore_service_autoscaling/utils/utils_ec2.py 100.0% <100.0%> (ø)
... and 8 more

... and 13 files with indirect coverage changes

@sanderegg sanderegg modified the milestones: Kobayashi Maru, This is Sparta! Jan 9, 2024
@sanderegg sanderegg force-pushed the comp-autoscaling/fixes-and-mutli-processing branch from 4c2509e to ef882d6 Compare January 9, 2024 10:12

sonarqubecloud bot commented Jan 9, 2024

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@sanderegg sanderegg force-pushed the comp-autoscaling/fixes-and-mutli-processing branch 5 times, most recently from 7efab92 to bbe5725 Compare January 16, 2024 12:36
@sanderegg sanderegg marked this pull request as ready for review January 16, 2024 14:12
@sanderegg sanderegg changed the title 🐛Computational autoscaling: fixes and multi-processing 🐛Computational autoscaling: allow multi-machining/processing Jan 16, 2024
Contributor

@GitHK GitHK left a comment


👍

Contributor

@matusdrobuliak66 matusdrobuliak66 left a comment


At some point it would be nice if you could walk me through the code of this service in person, but from the top it looks good 👍 thanks

@pcrespov pcrespov requested a review from wvangeit January 16, 2024 20:30
@sanderegg sanderegg force-pushed the comp-autoscaling/fixes-and-mutli-processing branch from d16e7c4 to 3c8d3ed Compare January 16, 2024 20:50
Member

@pcrespov pcrespov left a comment


Nice! Added some minor suggestions.


Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@sanderegg sanderegg changed the title 🐛Computational autoscaling: allow multi-machining/processing 🐛🎨Computational autoscaling: allow multi-machining/processing Jan 17, 2024
@sanderegg sanderegg merged commit 1dc7040 into ITISFoundation:master Jan 17, 2024
55 checks passed
@sanderegg sanderegg deleted the comp-autoscaling/fixes-and-mutli-processing branch January 17, 2024 09:53
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Feb 14, 2024
39 tasks