✨Autoscaling: 1st draft on auto-scaling computational clusters #4711

Merged


sanderegg
Member

@sanderegg sanderegg commented Sep 5, 2023

What do these changes do?

This allows the autoscaling service to optionally connect to a dask-scheduler to find out whether it is in need of dask-workers to complete pending jobs.

For context:

  • a user starts a job (through the publicAPI or the osparc UI) that should run on an on-demand cluster,
  • the clusters-keeper service creates a primary machine on AWS EC2 where a dask-scheduler and the autoscaling service run [upcoming PRs],
  • the autoscaling service connects to the dask-scheduler and monitors it for unrunnable jobs; if it finds any, it tries to create a machine that accommodates the resources those jobs require [THIS PR],
  • the newly created machine auto-starts a dask-sidecar, which connects to the dask-scheduler as soon as the machine is available [THIS PR],
  • the job runs, and the results are fetched by the director-v2,
  • the autoscaling service then removes the machine once it has been unused for some time [THIS PR].
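The scale-up and scale-down decisions in the steps above can be illustrated with a minimal, stdlib-only sketch. It assumes that each unrunnable job can be reduced to a CPU and RAM requirement, that candidate EC2 instance types are known with their capacity and price, and that "cheapest type that fits" is the selection rule; `PendingTask`, `InstanceType`, `pick_instance_type` and `is_idle_too_long` are hypothetical names, not the service's actual API, and how the service actually queries the dask-scheduler is not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PendingTask:
    """Hypothetical view of an unrunnable dask task's resource requirements."""

    task_id: str
    cpus: float
    ram_bytes: int


@dataclass(frozen=True)
class InstanceType:
    """Hypothetical description of an EC2 instance type (capacity and price)."""

    name: str
    cpus: float
    ram_bytes: int
    hourly_price: float


def pick_instance_type(
    task: PendingTask, candidates: list[InstanceType]
) -> InstanceType | None:
    """Scale-up rule (assumed): cheapest instance type that fits the task, else None."""
    fitting = [
        it
        for it in candidates
        if it.cpus >= task.cpus and it.ram_bytes >= task.ram_bytes
    ]
    # min(..., default=None) returns None when no instance type is large enough
    return min(fitting, key=lambda it: it.hourly_price, default=None)


def is_idle_too_long(last_used: datetime, now: datetime, max_idle: timedelta) -> bool:
    """Scale-down rule (assumed): a machine unused for longer than max_idle is removable."""
    return now - last_used > max_idle
```

For example, with hypothetical "small" (2 CPUs, 4 GiB), "medium" (8 CPUs, 32 GiB) and "large" (16 CPUs, 64 GiB) types, a job requiring 4 CPUs and 16 GiB would be placed on "medium", while a job no type can hold yields `None` and would stay pending. Real implementations also have to batch several pending jobs per machine, which this sketch leaves out.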

details

  • adds dask distributed dependencies
  • removes the outdated script.py from the sandbox
  • adds a computational-mode banner to the autoscaling logs on startup
  • new ENV DASK_MONITORING_URL, used to run the autoscaling service in computational mode (NOTE: DASK_MONITORING_URL and the NODE_MONITORING settings cannot both be set)
  • new ENV EC2_INSTANCES_NAME_PREFIX (defaults to "autoscaling") that prefixes the names of machines created by the autoscaling service
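The two configuration rules above (computational mode is selected by DASK_MONITORING_URL and must not be combined with the NODE_MONITORING settings; EC2_INSTANCES_NAME_PREFIX falls back to "autoscaling") can be sketched as follows. This is a hypothetical stdlib-only stand-in, not the service's actual settings module: `AutoscalingModeSettings` and `from_env` are illustrative names, and NODE_MONITORING is simplified to a single variable whose presence means "dynamic mode configured".

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscalingModeSettings:
    """Hypothetical settings sketch mirroring the rules stated in this PR."""

    dask_monitoring_url: str | None
    # hypothetical simplification: True when the NODE_MONITORING settings are present
    node_monitoring_configured: bool
    ec2_instances_name_prefix: str = "autoscaling"

    def __post_init__(self) -> None:
        # Both monitoring modes must not be configured at the same time.
        if self.dask_monitoring_url and self.node_monitoring_configured:
            raise ValueError(
                "DASK_MONITORING_URL and NODE_MONITORING cannot both be set"
            )

    @property
    def computational_mode(self) -> bool:
        # A non-empty DASK_MONITORING_URL switches on computational mode.
        return bool(self.dask_monitoring_url)

    @classmethod
    def from_env(cls, env: Mapping[str, str] | None = None) -> AutoscalingModeSettings:
        env = os.environ if env is None else env
        return cls(
            dask_monitoring_url=env.get("DASK_MONITORING_URL"),
            node_monitoring_configured="NODE_MONITORING" in env,
            # default keeps the previous behaviour for dynamic-services autoscaling
            ec2_instances_name_prefix=env.get("EC2_INSTANCES_NAME_PREFIX", "autoscaling"),
        )
```

Failing fast at startup, as sketched here, means a misconfigured deployment is rejected before any EC2 machine is created.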

Related issue/s

How to test

make test-dev-unit

For manual testing, see tests/manual/README.md

DevOps Checklist

  • new optional ENV EC2_INSTANCES_NAME_PREFIX, which defaults to "autoscaling" to keep backwards compatibility with the dynamic-services autoscaling.

@sanderegg sanderegg added the a:autoscaling autoscaling service in simcore's stack label Sep 5, 2023
@sanderegg sanderegg added this to the Baklava milestone Sep 5, 2023
@sanderegg sanderegg self-assigned this Sep 5, 2023
@codecov

codecov bot commented Sep 5, 2023

Codecov Report

Merging #4711 (b149147) into master (ff62322) will increase coverage by 0.2%.
The diff coverage is 98.8%.


@@           Coverage Diff            @@
##           master   #4711     +/-   ##
========================================
+ Coverage    86.6%   86.8%   +0.2%     
========================================
  Files        1188    1191      +3     
  Lines       49402   49656    +254     
  Branches     1072    1072             
========================================
+ Hits        42825   43145    +320     
+ Misses       6342    6276     -66     
  Partials      235     235             
Flag Coverage Δ
integrationtests 65.1% <ø> (+1.3%) ⬆️
unittests 84.5% <98.8%> (+<0.1%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...toscaling/src/simcore_service_autoscaling/_meta.py 100.0% <100.0%> (ø)
...rc/simcore_service_autoscaling/core/application.py 100.0% <100.0%> (ø)
...ing/src/simcore_service_autoscaling/core/errors.py 100.0% <100.0%> (ø)
...g/src/simcore_service_autoscaling/core/settings.py 100.0% <100.0%> (ø)
...oscaling/src/simcore_service_autoscaling/models.py 100.0% <100.0%> (ø)
...e_service_autoscaling/modules/auto_scaling_task.py 100.0% <100.0%> (ø)
...ore_service_autoscaling/utils/auto_scaling_core.py 90.4% <100.0%> (-9.6%) ⬇️
...service_autoscaling/utils/computational_scaling.py 100.0% <100.0%> (ø)
...aling/src/simcore_service_autoscaling/utils/ec2.py 100.0% <100.0%> (ø)
.../src/simcore_service_autoscaling/utils/rabbitmq.py 100.0% <100.0%> (ø)
... and 3 more

... and 12 files with indirect coverage changes

@sanderegg sanderegg force-pushed the autoscaling/comp-autoscaling branch from ed4587c to 7f96b24 Compare September 5, 2023 14:48
@sanderegg sanderegg force-pushed the autoscaling/comp-autoscaling branch 4 times, most recently from 987c0f3 to 8abcb54 Compare September 6, 2023 20:55
@sanderegg sanderegg changed the title ✨Autoscaling: dynamically auto-scale nodes on computational cluster ✨Autoscaling: auto-scale nodes on computational cluster Sep 7, 2023
@sanderegg sanderegg force-pushed the autoscaling/comp-autoscaling branch 2 times, most recently from 483bf56 to de969ca Compare September 8, 2023 13:58
@sanderegg sanderegg force-pushed the autoscaling/comp-autoscaling branch 3 times, most recently from 3d27a17 to 27acc27 Compare September 22, 2023 11:32
@sanderegg sanderegg force-pushed the autoscaling/comp-autoscaling branch 2 times, most recently from 5f315c5 to fd81962 Compare October 17, 2023 06:29
@sanderegg sanderegg force-pushed the autoscaling/comp-autoscaling branch from fd81962 to 829cf50 Compare October 17, 2023 09:36
@sanderegg sanderegg changed the title ✨Autoscaling: auto-scale nodes on computational cluster ✨Autoscaling: 1st draft on auto-scaling computational clusters Oct 17, 2023
@sanderegg sanderegg force-pushed the autoscaling/comp-autoscaling branch 8 times, most recently from b8309a9 to c76f37e Compare October 20, 2023 06:54
@sanderegg sanderegg force-pushed the autoscaling/comp-autoscaling branch from 7f65c3e to ac00578 Compare October 20, 2023 09:44
Contributor

@GitHK GitHK left a comment


Looks good from what I can tell

Member

@pcrespov pcrespov left a comment


Great job! Just left some comments and suggestions. I am curious about a few things

@codeclimate

codeclimate bot commented Oct 20, 2023

Code Climate has analyzed commit b149147 and detected 0 issues on this pull request.


@sonarqubecloud

SonarCloud Quality Gate failed.

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information
4.1% Duplication


@sanderegg sanderegg merged commit 60dbc85 into ITISFoundation:master Oct 20, 2023
52 of 53 checks passed
@sanderegg sanderegg deleted the autoscaling/comp-autoscaling branch October 20, 2023 13:52
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Oct 31, 2023
30 tasks
3 participants