
AppTeam Std Simulations on S4L/AWS #885

drniiken commented Feb 27, 2023

Background

As mentioned to some members of the Team, in today's 2 PM meeting Erdem and I faced the following situation and possible use case:

  • Erdem's Team (in this particular case Cosimo) has to support standards activities (Mark Douglas) and for this purpose run a huge number of simulations - more than 10k simulations in total.
  • These are simulations of dipoles in combination with cSAR3Ds, in various combinations (angles, distances, positions, etc.).
  • EM-FDTD/iSolve simulations that must be run on GPU; sim size = ca. 150-200 MCells, sim duration = a few hours each.
  • All of it is fully scriptable (pre-processing, simulation, and post-processing); Cosimo is working on the "big script" right now.
  • Niels suggested (or rather demanded) - as an experiment - to run all of this on AWS, using S4L and the "cloud infrastructure" ... as a proof of concept / feasibility test, and in particular to get a feeling for the cost of "such large studies".
  • Cost, as opposed to buying a new, up-to-date on-premise HPC facility that we/Z43 would own (cost in the range of CHF 50k).
  • The AWS costs must be kept separate from NIH/STRIDES, since this is purely Z43 in-house work that we obviously can't charge to the NIH :-) (we can use, e.g., the IT'IS AWS account, for which billing goes to Niels' credit card).

ToDo

The task for the Team would thus be to set up the infrastructure that allows the AppTeam (Cosimo) to run the simulations on AWS:
IMPORTANT: please use as much "existing infrastructure and technology" as possible. We do not want to interfere with the already very tight timeline for S4L web (full).

  • Deploy on "IT'IS AWS".
  • Could be the S4L service on oSPARC, or something like S4L lite but without the "lite", i.e., larger simulations and GPU support.
  • We could, e.g., book 10 (or more) appropriate machines with GPUs that run the tasks in parallel, script-based (Cosimo).
  • GUI access is actually not needed, assuming everything can be scripted.
  • Output files are not excessively big.

Thanks and best, Nik

@drniiken drniiken added the "PO issue" (created by Product Owners) and "s4l:web" (sim4life product in osparc.io) labels Feb 27, 2023
@drniiken

Dear @eofli, please extend/correct, if needed. Thanks.

@sanderegg sanderegg added the "Epic" (Zenhub) label Feb 28, 2023
@sanderegg sanderegg added this to the Mithril milestone Feb 28, 2023
@mrnicegyu11

After an initial chat with Cosimo, we established:

  • We need preliminary runtime benchmarks, at least on Cosimo's machine and on a g4dn AWS instance, before starting the big run.
  • The worst-case time estimate as of today is 40 min * 25 (points) * 24 (angles) * 28 (antennas) * 3 (distances), i.e., 50,400 simulations or roughly 33,600 GPU-hours (see the sketch below).
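
For orientation, a minimal back-of-the-envelope sketch of that worst-case estimate; the hourly GPU price and the number of parallel machines below are placeholder assumptions, not actual quotes or decisions:

```python
# Back-of-the-envelope version of the worst-case estimate above.
minutes_per_sim = 40
n_sims = 25 * 24 * 28 * 3              # points * angles * antennas * distances
total_gpu_hours = n_sims * minutes_per_sim / 60

# Placeholder assumptions (NOT actual quotes): an illustrative GPU price
# and an example degree of parallelism (e.g. 10 booked machines).
usd_per_gpu_hour = 0.5
n_parallel_gpus = 10

print(f"simulations:       {n_sims}")                                      # 50400
print(f"total GPU-hours:   {total_gpu_hours:,.0f}")                        # 33,600
print(f"wall-clock days:   {total_gpu_hours / n_parallel_gpus / 24:.0f}")  # 140
print(f"rough cost in USD: {total_gpu_hours * usd_per_gpu_hour:,.0f}")
```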

After an initial chat with @sanderegg:

  • A sync with @mguidon / @colinRawlings once they are back is needed on the state of S4L-full-web (any blockers?) and on multi-GPU support.
  • Using the computational gateway is likely not feasible in this sprint; a second autoscaler or permanent personalized machines is likely the way to go.

After an initial chat with @Surfict :

  • Connecting two VPCs of different AWS accounts is straightforward; no special handling for the license server should be required. Nota bene: both VPCs need separate IP ranges that do not overlap (see the sketch below).
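
As a reference for that point, a minimal boto3 sketch of cross-account VPC peering; all IDs, the account number, and the CIDR block are placeholders, and the actual setup (e.g., via Terraform or the console) may well look different:

```python
# Minimal sketch of cross-account VPC peering with boto3.
# All IDs, the account number and the CIDR block below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request peering from the requester VPC to the VPC in the other account.
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-requester-id",
    PeerVpcId="vpc-accepter-id",
    PeerOwnerId="123456789012",       # the other AWS account
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# The owner of the peer VPC must accept the request (from their account):
# peer_ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Route traffic destined for the peer's (non-overlapping!) CIDR through
# the peering connection; the same has to be done on the other side.
ec2.create_route(
    RouteTableId="rtb-requester-id",
    DestinationCidrBlock="10.1.0.0/16",
    VpcPeeringConnectionId=pcx_id,
)
```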

Additional considerations:

  • What happens in case of maintenance/releases/(un)scheduled downtimes?
  • Prepare a cost estimate after Cosimo's initial benchmark.

sanderegg commented Mar 2, 2023

Goal for sprint Mithril

Use-case:

  • cluster connected to osparc.io, not paid by STRIDES (on the IT'IS account)
  • user starts jupyter-smash on osparc.io and generates input files
  • in jupyter-smash, use the osparc API to create jobs using isolve-gpu (see the sketch after this list)
  • these jobs should go to the separate dask cluster
  • user logs out
  • user logs in later and gets progress
  • user gets the results
  • get an estimate of the costs
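
A minimal sketch of how the job-submission step could look from jupyter-smash, assuming the public osparc Python client; the host, credentials, solver key/version, and input file name are placeholders, and the actual helper wrapper built for this use case may differ:

```python
# Sketch using the osparc Python client; host, credentials, solver
# key/version and the input file name are placeholders.
import time
import osparc

cfg = osparc.Configuration(
    host="https://api.osparc.io",
    username="MY_API_KEY",
    password="MY_API_SECRET",
)

with osparc.ApiClient(cfg) as api_client:
    files_api = osparc.FilesApi(api_client)
    solvers_api = osparc.SolversApi(api_client)

    # Upload an input file generated in jupyter-smash.
    input_file = files_api.upload_file(file="simulation_input.h5")

    # The isolve-gpu key/version here are assumptions for illustration only.
    solver = solvers_api.get_solver_release(
        "simcore/services/comp/isolve-gpu", "2.0.0"
    )

    job = solvers_api.create_job(
        solver.id, solver.version, osparc.JobInputs({"input_1": input_file})
    )
    status = solvers_api.start_job(solver.id, solver.version, job.id)

    # Poll for progress; this could equally happen in a later session.
    while not status.stopped_at:
        time.sleep(10)
        status = solvers_api.inspect_job(solver.id, solver.version, job.id)

    outputs = solvers_api.get_job_outputs(solver.id, solver.version, job.id)
    print(status.state, outputs.results)
```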

Initial plan

We do it on staging-AWS (faster deployment of bugfixes/changes).

@sanderegg

Created a separate cluster with 2 g4dn.xlarge machines running in a separate AWS account, connected to staging.osparc.io.
Deployed an osparc-dask-gateway and ran many sleepers; the run succeeded (see the sketch below).
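
For reference, submitting such test tasks through a gateway looks roughly like this with the standard dask-gateway client; the address, credentials, and the sleeper stand-in are placeholders, and the osparc-dask-gateway specifics may differ:

```python
# Rough sketch with the standard dask-gateway client; address, credentials
# and the sleeper stand-in below are placeholders.
import time
from dask_gateway import Gateway, BasicAuth

gateway = Gateway(
    address="http://osparc-dask-gateway.example.com:8000",
    auth=BasicAuth(username="user", password="secret"),
)

cluster = gateway.new_cluster()
cluster.scale(2)                     # e.g. the two g4dn.xlarge workers
client = cluster.get_client()

def sleeper(seconds: int) -> int:
    """Stand-in for the sleeper test service."""
    time.sleep(seconds)
    return seconds

# pure=False gives each of the 200 identical calls its own task key.
futures = client.map(sleeper, [1] * 200, pure=False)
print(sum(client.gather(futures)))   # expect 200 if all tasks succeed

client.close()
cluster.shutdown()
```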

mguidon commented Mar 29, 2023

Update on sprint Mithril

Done

  • Setup of the workflow for Cosimo (Python scripts to create/analyze simulations)
  • Helper wrapper for submitting jobs to the cloud and getting progress information about running solvers
  • Definition of the hardware specs for the cluster
  • Cost/time estimate for solving
  • Initial testing (POC) thanks to personalizable resource limits #618:
    • 200 jobs, each running one sleepers:2.0.2 with resources overridden to use 0.1 CPU and 100 MB RAM
    • 22 t2.large EC2 instances (8 GiB RAM, 2 vCPUs), which translates to max. 22 computational sidecars (dynamically adapted for now), each able to run up to 10 jobs depending on the resources needed
      • creation of the jobs took about 2.5 minutes (~0.7 s per job)
      • running the jobs took about 3 minutes
    • Testing with 2000 sleepers and then 16000 sleepers is planned, also with longer-running ones.

Todo

  • Cost/time estimate for post-processing
  • Publishing release 7.2 based services
  • Execute the whole thing

sanderegg commented Apr 6, 2023

Goal for sprint Jelly Beans

  • Test & Fix osparc-dask-gateway until we can reliably run:
    • external dask-gateway with >20 machines,
    • run batches of 1000 jobs in parallel without issues,
  • Run multiple instances of iSolve on the same node
  • Cost/time estimate for post-processing
  • Publishing release 7.2 based services
  • Execute the whole thing

sanderegg commented Apr 25, 2023

Update on sprint Jelly Beans

Done

  • Improved osparc-dask-gateway stability to successfully run 3x 2200 sleeper computational services in a row on an external, separate cluster
  • Planning for the S4L big use case

Ongoing

  • Improving the oSPARC platform to cope with 1000s of computational services running and streaming logs/progress, etc.
  • Preparing S4L to submit jobs into oSPARC

Open

  • Execute the whole thing

@sanderegg

  • isolve 7.2 with VRAM
  • run a subgroup of iSolve simulations
  • ideally run the whole thing

@pcrespov pcrespov assigned cosfor1 and unassigned colinRawlings, drniiken and eofli Jun 12, 2023
@pcrespov pcrespov modified the milestones: Pastel de Nata, Watermelon Jun 12, 2023

mguidon commented Jul 6, 2023

Update Watermelon

  • Paused for now

@matusdrobuliak66 matusdrobuliak66 modified the milestones: Watermelon, Sundae Jul 24, 2023
@sanderegg

Notes:

  • It was communicated that running on AWS would be too expensive? Needs clarification.
  • We still want to run some parts for testing / POCs.

@drniiken

I would still like to learn from feedback regarding the "spirit" of this use case:

  • large amount of simulations
  • scaling
  • cost
  • ...

However, this does not have to be based on this specific case; it can be another use case (or example) that gives more insight.

@mguidon

@matusdrobuliak66 matusdrobuliak66 modified the milestones: Sundae, Baklava Aug 22, 2023