
AppTeam Std Simulations on S4L/AWS #885

drniiken commented Feb 27, 2023

Background

As mentioned to some members of the Team, in today's 2 PM meeting Erdem and I faced the following situation and possible use case:

  • Erdem's Team (in this particular case Cosimo) has to support standards activities (Mark Douglas) and for this purpose run a huge number of simulations - more than 10k simulations in total.
  • These are simulations of dipoles in combination with cSAR3Ds, in various combinations (angles, distances, positions, etc.).
  • EM-FDTD/iSolve simulations that must be run on GPU; sim size = ca. 150-200 MCells, sim duration = a few hours each.
  • All of it is fully scriptable (pre-processing, simulation, and post-processing); Cosimo is working on the "big script" right now.
  • Niels suggested (or rather demanded) - as an experiment - to run all of this on AWS, using S4L and the "cloud infrastructure" ... as a proof of concept / feasibility test, and in particular to get a feeling for the cost of "such large studies".
  • Cost, as opposed to buying a new, up-to-date on-premise HPC facility that we/Z43 would own (cost in the range of CHF 50k).
  • The AWS costs must be kept separate from NIH/STRIDES, since this is purely Z43 in-house work that we obviously can't charge to the NIH :-) (we can use, e.g., the IT'IS AWS account, for which billing goes to Niels' credit card).

ToDo

The task for the Team would thus be to set up the infrastructure that allows the AppTeam (Cosimo) to run the simulations on AWS:
IMPORTANT: please use as much "existing infrastructure and technology" as possible. We do not want to interfere with the already very tight timeline for S4L web (full).

  • Deploy on "IT'IS AWS".
  • Could be the S4L service on oSPARC, or something like S4L lite but without the "lite", i.e., larger simulations and GPU support.
  • We could, e.g., book 10 (or more) appropriate machines with GPUs that run the tasks in parallel, script-based (Cosimo).
  • GUI access is actually not needed, assuming everything can be scripted.
  • Output files are not excessively big.

Thanks and best, Nik

@drniiken drniiken added the "PO issue" (created by Product Owners) and "s4l:web" (sim4life product in osparc.io) labels Feb 27, 2023
@drniiken

Dear @eofli, please extend/correct, if needed. Thanks.

@sanderegg sanderegg added the "Epic" (Zenhub) label Feb 28, 2023
@sanderegg sanderegg added this to the Mithril milestone Feb 28, 2023
@mrnicegyu11

After an initial chat with Cosimo, we established:

  • We need preliminary runtime benchmarks, at least on Cosimo's machine and on a g4dn AWS instance, before starting the big run.
  • The worst-case time estimate as of today is 40 min * 25 (points) * 24 (angles) * 28 (antennas) * 3 (distances), i.e., 50,400 simulations or roughly 33,600 GPU-hours (see the sketch below).
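
For orientation, a minimal back-of-the-envelope sketch of that worst-case estimate; the hourly GPU price and the number of parallel machines below are placeholder assumptions, not actual quotes or decisions:

```python
# Back-of-the-envelope version of the worst-case estimate above.
minutes_per_sim = 40
n_sims = 25 * 24 * 28 * 3              # points * angles * antennas * distances
total_gpu_hours = n_sims * minutes_per_sim / 60

# Placeholder assumptions (NOT actual quotes): an illustrative GPU price
# and an example degree of parallelism (e.g. 10 booked machines).
usd_per_gpu_hour = 0.5
n_parallel_gpus = 10

print(f"simulations:       {n_sims}")                                      # 50400
print(f"total GPU-hours:   {total_gpu_hours:,.0f}")                        # 33,600
print(f"wall-clock days:   {total_gpu_hours / n_parallel_gpus / 24:.0f}")  # 140
print(f"rough cost in USD: {total_gpu_hours * usd_per_gpu_hour:,.0f}")
```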

After an initial chat with @sanderegg:

  • A sync with @mguidon / @colinRawlings once they are back is needed on the state of S4L-full-web (any blockers?) and on multi-GPU support.
  • Using the computational gateway is likely not feasible in this sprint; a second autoscaler or permanent personalized machines is likely the way to go.

After an initial chat with @Surfict :

  • Connecting two VPCs of different AWS accounts is straightforward; no special handling for the license server should be required. Nota bene: both VPCs need separate IP ranges that do not overlap (see the sketch below).
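
As a reference for that point, a minimal boto3 sketch of cross-account VPC peering; all IDs, the account number, and the CIDR block are placeholders, and the actual setup (e.g., via Terraform or the console) may well look different:

```python
# Minimal sketch of cross-account VPC peering with boto3.
# All IDs, the account number and the CIDR block below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request peering from the requester VPC to the VPC in the other account.
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-requester-id",
    PeerVpcId="vpc-accepter-id",
    PeerOwnerId="123456789012",       # the other AWS account
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# The owner of the peer VPC must accept the request (from their account):
# peer_ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

# Route traffic destined for the peer's (non-overlapping!) CIDR through
# the peering connection; the same has to be done on the other side.
ec2.create_route(
    RouteTableId="rtb-requester-id",
    DestinationCidrBlock="10.1.0.0/16",
    VpcPeeringConnectionId=pcx_id,
)
```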

Additional considerations:

  • What happens in case of maintenance/releases/(un)scheduled downtimes?
  • Prepare a cost estimate after Cosimo's initial benchmark.

sanderegg commented Mar 2, 2023

Goal for sprint Mithril

Use-case:

  • cluster connected to osparc.io, not paid by STRIDES (on the IT'IS account)
  • user starts jupyter-smash on osparc.io and generates input files
  • in jupyter-smash, use the osparc API to create jobs using isolve-gpu (see the sketch after this list)
  • these jobs should go to the separate dask cluster
  • user logs out
  • user logs in later and gets progress
  • user gets the results
  • get an estimate of the costs
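
A minimal sketch of how the job-submission step could look from jupyter-smash, assuming the public osparc Python client; the host, credentials, solver key/version, and input file name are placeholders, and the actual helper wrapper built for this use case may differ:

```python
# Sketch using the osparc Python client; host, credentials, solver
# key/version and the input file name are placeholders.
import time
import osparc

cfg = osparc.Configuration(
    host="https://api.osparc.io",
    username="MY_API_KEY",
    password="MY_API_SECRET",
)

with osparc.ApiClient(cfg) as api_client:
    files_api = osparc.FilesApi(api_client)
    solvers_api = osparc.SolversApi(api_client)

    # Upload an input file generated in jupyter-smash.
    input_file = files_api.upload_file(file="simulation_input.h5")

    # The isolve-gpu key/version here are assumptions for illustration only.
    solver = solvers_api.get_solver_release(
        "simcore/services/comp/isolve-gpu", "2.0.0"
    )

    job = solvers_api.create_job(
        solver.id, solver.version, osparc.JobInputs({"input_1": input_file})
    )
    status = solvers_api.start_job(solver.id, solver.version, job.id)

    # Poll for progress; this could equally happen in a later session.
    while not status.stopped_at:
        time.sleep(10)
        status = solvers_api.inspect_job(solver.id, solver.version, job.id)

    outputs = solvers_api.get_job_outputs(solver.id, solver.version, job.id)
    print(status.state, outputs.results)
```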

Initial plan

We do it on staging-AWS (faster deployment of bugfixes/changes).

@sanderegg

Created a separate cluster with 2 g4dn.xlarge machines running in a separate AWS account, connected to staging.osparc.io.
Deployed an osparc-dask-gateway and ran many sleepers; the run succeeded (see the sketch below).
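
For reference, submitting such test tasks through a gateway looks roughly like this with the standard dask-gateway client; the address, credentials, and the sleeper stand-in are placeholders, and the osparc-dask-gateway specifics may differ:

```python
# Rough sketch with the standard dask-gateway client; address, credentials
# and the sleeper stand-in below are placeholders.
import time
from dask_gateway import Gateway, BasicAuth

gateway = Gateway(
    address="http://osparc-dask-gateway.example.com:8000",
    auth=BasicAuth(username="user", password="secret"),
)

cluster = gateway.new_cluster()
cluster.scale(2)                     # e.g. the two g4dn.xlarge workers
client = cluster.get_client()

def sleeper(seconds: int) -> int:
    """Stand-in for the sleeper test service."""
    time.sleep(seconds)
    return seconds

# pure=False gives each of the 200 identical calls its own task key.
futures = client.map(sleeper, [1] * 200, pure=False)
print(sum(client.gather(futures)))   # expect 200 if all tasks succeed

client.close()
cluster.shutdown()
```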

mguidon commented Mar 29, 2023

Update on sprint Mithril

Done

  • Setup of the workflow for Cosimo (Python scripts to create/analyze simulations)
  • Helper wrapper for submitting jobs to the cloud and getting progress information about running solvers
  • Definition of the hardware specs for the cluster
  • Cost/time estimate for solving
  • Initial testing (POC) thanks to personalizable resource limits #618:
    • 200 jobs, each running one sleepers:2.0.2 with resources overridden to use 0.1 CPU and 100 MB RAM
    • 22 t2.large EC2 instances (8 GiB RAM, 2 vCPUs), which translates to max. 22 computational sidecars (dynamically adapted for now), each able to run up to 10 jobs depending on the resources needed
      • creation of the jobs took about 2.5 minutes (~0.7 s per job)
      • running the jobs took about 3 minutes
    • Testing with 2000 sleepers and then 16000 sleepers is planned, also with longer-running ones.

Todo

  • Cost/time estimate for post-processing
  • Publishing release 7.2 based services
  • Execute the whole thing

sanderegg commented Apr 6, 2023

Goal for sprint Jelly Beans

  • Test & Fix osparc-dask-gateway until we can reliably run:
    • external dask-gateway with >20 machines,
    • run batches of 1000 jobs in parallel without issues,
  • Run multiple instances of iSolve on the same node
  • Cost/time estimate for post-processing
  • Publishing release 7.2 based services
  • Execute the whole thing

sanderegg commented Apr 25, 2023

Update on sprint Jelly Beans

Done

  • Improved osparc-dask-gateway stability to successfully run 3x 2200 sleeper computational services in a row on an external, separate cluster
  • Planning for the S4L big use case

Ongoing

  • Improving the oSPARC platform to cope with 1000s of computational services running and streaming logs/progress, etc.
  • Preparing S4L to submit jobs into oSPARC

Open

  • Execute the whole thing

@sanderegg

  • isolve 7.2 with VRAM
  • run a subgroup of iSolve simulations
  • ideally run the whole thing

@pcrespov pcrespov assigned cosfor1 and unassigned colinRawlings, drniiken and eofli Jun 12, 2023
@pcrespov pcrespov modified the milestones: Pastel de Nata, Watermelon Jun 12, 2023

mguidon commented Jul 6, 2023

Update Watermelon

  • Paused for now

@matusdrobuliak66 matusdrobuliak66 modified the milestones: Watermelon, Sundae Jul 24, 2023
@sanderegg

Notes:

  • It was communicated that running on AWS would be too expensive? Needs clarification.
  • We still want to run some parts for testing / POCs.

@drniiken

I would still like to learn from feedback regarding the "spirit" of this use case:

  • large amount of simulations
  • scaling
  • cost
  • ...

However, this does not have to be based on this specific case; it can be another use case (or example) that gives more insight.

@mguidon

@matusdrobuliak66 matusdrobuliak66 modified the milestones: Sundae, Baklava Aug 22, 2023