
🐛🎨Computational autoscaling: allow multi-machining/processing #5203

Conversation

sanderegg
Member

@sanderegg sanderegg commented Dec 21, 2023

What do these changes do?

This PR stems from the following issue:
  • send multiple jobs to a cluster
  • N workers are created
  • they connect randomly to the dask-scheduler, then the first connected worker takes all the jobs
  • the workers connecting after the first one do not see any job (or it takes some time for them to steal these jobs)
  • during that time the autoscaling service removes these workers (so they were created for naught and do not participate in the work)
  • the lone worker (or sometimes more than one) then processes the jobs sequentially, which is suboptimal
  • in case more jobs of the same EC2 type come within 5 minutes, the cluster is not upscaled: the current worker takes all the jobs, and the dask-scheduler does not report these jobs as queued but as processing (the dask-scheduler does not know whether a job is actively being worked on or just queued; to it, these jobs are all in the processing state)
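The failure mode above can be sketched with a toy model (all names are hypothetical; this simulates the scheduler's bookkeeping, it is not the real dask API): because every assigned task counts as "processing", an autoscaler that only looks at the queued count sees 0 and scales the idle-looking workers down.

```python
# Toy simulation of the reported failure mode (hypothetical, not the real dask API).
# The first worker to connect grabs all tasks; the scheduler marks them all as
# "processing", so an autoscaler that only looks at the queued count sees 0 and
# removes the other freshly created workers.

from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    tasks: list[str]
    assignments: dict[str, list[str]] = field(default_factory=dict)

    def connect_worker(self, name: str) -> None:
        self.assignments[name] = []
        if len(self.assignments) == 1:
            # first connected worker takes ALL the jobs
            self.assignments[name].extend(self.tasks)
            self.tasks.clear()

    def queued(self) -> int:
        # dask-scheduler-like view: assigned tasks are "processing", never "queued"
        return len(self.tasks)


scheduler = ToyScheduler(tasks=[f"job-{i}" for i in range(10)])
for worker in ("w1", "w2", "w3"):
    scheduler.connect_worker(worker)

idle_workers = [w for w, jobs in scheduler.assignments.items() if not jobs]
assert scheduler.queued() == 0       # autoscaler sees no queue...
assert idle_workers == ["w2", "w3"]  # ...and scales these workers down
```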

Sadly many changes were necessary, but it should be for the greater good. Here is a summary.

Main changes:

  1. refactored/redesigned how the service discovers what needs to be scaled up/down, to (hopefully) make it simpler:
    Principle
  • the cluster is analyzed (which nodes are available and their states)
  • find tasks that are unrunnable
  • if there are unrunnable tasks
    • activate drained nodes if possible
    • scale up if necessary by virtually assigning tasks to the machines
  • if there are no unrunnable tasks
    • drain empty active nodes
    • scale down drained nodes until we reach the reserve of drained nodes (can be 0)
      Changes
  • the way tasks are virtually assigned to the different types of nodes (active, drained, pending, ...) is now done similarly to the analysis of the infrastructure, making it simpler.
  2. whether a node is active (i.e. a machine that can accept jobs) now depends not only on its Docker availability (drain vs. active) but also on whether the dask-worker is running and connected to the dask-scheduler
  3. the autoscaling service now computes an estimated job queue as the number of jobs minus the number of workers; this way the autoscaling service does not remove workers before the queue is back to 0, and the number of workers operating on the jobs is therefore defined by ENV (as it should have been originally) - this is the actual fix for the original issue
  4. the autoscaling service now actively asks the dask-scheduler to retire workers. The dask-scheduler knows which workers might be unnecessary and takes care of moving anything in their respective memory to another worker. Once the memory is freed, the autoscaling service can drain the workers and then remove them as usual
  5. improved logging information altogether
  6. added debugpy facilities in autoscaling for debugging in vscode
  7. added a jupyter notebook to facilitate manual testing directly on EC2 via vscode remote access (see tests/manual folder)
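The decision principle and the estimated-queue fix described above can be sketched roughly as follows (all names are hypothetical; the real service inspects EC2 instances and dask-scheduler state, not plain lists):

```python
# Rough sketch of the autoscaling decision loop and the estimated job queue
# (hypothetical names; the real service inspects EC2 nodes and the dask-scheduler).


def estimated_queue(num_jobs: int, num_workers: int) -> int:
    # jobs minus workers; the dask-scheduler reports everything as "processing",
    # so the queue must be estimated instead of read from the scheduler
    return max(0, num_jobs - num_workers)


def autoscale_once(
    unrunnable_tasks: list[str],
    active_empty_nodes: list[str],
    drained_nodes: list[str],
    reserve_drained: int,
) -> dict[str, list[str]]:
    actions: dict[str, list[str]] = {
        "activate": [], "scale_up": [], "drain": [], "terminate": []
    }
    if unrunnable_tasks:
        # 1) activate drained nodes if possible
        usable = drained_nodes[: len(unrunnable_tasks)]
        actions["activate"] = usable
        # 2) scale up for tasks that still cannot be virtually assigned
        actions["scale_up"] = unrunnable_tasks[len(usable):]
    else:
        # no unrunnable tasks: drain empty active nodes...
        actions["drain"] = active_empty_nodes
        # ...and scale down drained nodes beyond the reserve
        actions["terminate"] = drained_nodes[reserve_drained:]
    return actions


assert estimated_queue(num_jobs=500, num_workers=40) == 460
assert autoscale_once([], ["n1"], ["d1", "d2"], reserve_drained=1) == {
    "activate": [],
    "scale_up": [],
    "drain": ["n1"],
    "terminate": ["d2"],
}
```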

Minor changes:

  • some refactoring of aws-library; in particular, usage of the Resources class is now generalised
  • fixes some weird behavior in the dask-scheduler where consumed resources sometimes go to negative values
  • fixes some weird behavior in the dask-scheduler where idle dask-workers seemingly still consume a large amount of resources after they were retired
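The negative-consumed-resources fix suggests clamping resource subtraction at zero. A minimal sketch of a Resources value type along those lines (hypothetical field set and semantics; the real class lives in aws-library and may differ):

```python
# Minimal sketch of a Resources value type with subtraction clamped at zero,
# in the spirit of the generalised aws-library Resources class (hypothetical
# field set; the real class may differ).

from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: float
    ram: int  # bytes

    def __sub__(self, other: "Resources") -> "Resources":
        # clamp at 0 so releasing more than was accounted for can never
        # drive consumed resources negative
        return Resources(
            cpus=max(0.0, self.cpus - other.cpus),
            ram=max(0, self.ram - other.ram),
        )


total = Resources(cpus=4.0, ram=8 * 1024**3)
released = Resources(cpus=6.0, ram=1024**3)
assert total - released == Resources(cpus=0.0, ram=7 * 1024**3)
```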

Here is a video that may make this clearer; it simulates jobs arriving at an external cluster. Go to minute 3 to start seeing the interesting part. Basically, 500 jobs were sent to the cluster, which can create up to 40 workers (visible on the upper left; the jobs are shown on the upper right):

ext_cluster.mp4

Related issue/s

How to test

Dev Checklist

DevOps Checklist

@sanderegg sanderegg added the a:autoscaling autoscaling service in simcore's stack label Dec 21, 2023
@sanderegg sanderegg added this to the Kobayashi Maru milestone Dec 21, 2023
@sanderegg sanderegg self-assigned this Dec 21, 2023

codecov bot commented Dec 21, 2023

Codecov Report

Attention: 32 lines in your changes are missing coverage. Please review.

Comparison is base (0cdb35b) 87.2% compared to head (5eb5c36) 87.3%.

Additional details and impacted files


@@          Coverage Diff           @@
##           master   #5203   +/-   ##
======================================
  Coverage    87.2%   87.3%           
======================================
  Files        1295    1295           
  Lines       53066   53095   +29     
  Branches     1164    1164           
======================================
+ Hits        46307   46379   +72     
+ Misses       6509    6466   -43     
  Partials      250     250           
Flag Coverage Δ
integrationtests 65.0% <ø> (+1.3%) ⬆️
unittests 85.2% <89.0%> (-0.1%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
packages/aws-library/src/aws_library/ec2/client.py 100.0% <ø> (ø)
packages/aws-library/src/aws_library/ec2/models.py 88.4% <100.0%> (+0.4%) ⬆️
...brary/api_schemas_clusters_keeper/ec2_instances.py 100.0% <100.0%> (ø)
...g/src/simcore_service_autoscaling/core/settings.py 97.8% <100.0%> (+<0.1%) ⬆️
...oscaling/src/simcore_service_autoscaling/models.py 100.0% <100.0%> (ø)
...e_autoscaling/modules/auto_scaling_mode_dynamic.py 100.0% <100.0%> (ø)
...ore_service_autoscaling/utils/auto_scaling_core.py 91.8% <100.0%> (-2.4%) ⬇️
...service_autoscaling/utils/computational_scaling.py 100.0% <100.0%> (ø)
.../simcore_service_autoscaling/utils/utils_docker.py 98.9% <100.0%> (+<0.1%) ⬆️
...src/simcore_service_autoscaling/utils/utils_ec2.py 100.0% <100.0%> (ø)
... and 8 more

... and 13 files with indirect coverage changes

@sanderegg sanderegg modified the milestones: Kobayashi Maru, This is Sparta! Jan 9, 2024
@sanderegg sanderegg force-pushed the comp-autoscaling/fixes-and-mutli-processing branch from 4c2509e to ef882d6 Compare January 9, 2024 10:12

sonarqubecloud bot commented Jan 9, 2024

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@sanderegg sanderegg force-pushed the comp-autoscaling/fixes-and-mutli-processing branch 5 times, most recently from 7efab92 to bbe5725 Compare January 16, 2024 12:36
@sanderegg sanderegg marked this pull request as ready for review January 16, 2024 14:12
@sanderegg sanderegg changed the title 🐛Computational autoscaling: fixes and multi-processing 🐛Computational autoscaling: allow multi-machining/processing Jan 16, 2024
Contributor

@GitHK GitHK left a comment


👍

Contributor

@matusdrobuliak66 matusdrobuliak66 left a comment


At some point it would be nice if you could walk me through the code of this service in person, but from the top it looks good 👍 thanks

@pcrespov pcrespov requested a review from wvangeit January 16, 2024 20:30
@sanderegg sanderegg force-pushed the comp-autoscaling/fixes-and-mutli-processing branch from d16e7c4 to 3c8d3ed Compare January 16, 2024 20:50
Member

@pcrespov pcrespov left a comment


Nice! Added some minor suggestions.


Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@sanderegg sanderegg changed the title 🐛Computational autoscaling: allow multi-machining/processing 🐛🎨Computational autoscaling: allow multi-machining/processing Jan 17, 2024
@sanderegg sanderegg merged commit 1dc7040 into ITISFoundation:master Jan 17, 2024
55 checks passed
@sanderegg sanderegg deleted the comp-autoscaling/fixes-and-mutli-processing branch January 17, 2024 09:53
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Feb 14, 2024
39 tasks