MPI celery queue on dalco cluster #1672
Proposed solution:
Because the machine(s) used to run MPI tasks have a specific total number of available CPUs, and because this number differs from the other machine types in the cluster, I would propose the following: detect the MPI node by its CPU count and bootstrap exactly one MPI sidecar on it.
The above will guarantee that in every given cluster there will be only one node dedicated to running a single MPI sidecar. The rest of the sidecars scheduled on that node will be either GPU sidecars or CPU sidecars, based on the node's available configuration/resources.
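A minimal sketch of how a sidecar could decide at boot time which celery queue it serves. The environment variable, queue names, and the GPU check below are illustrative assumptions, not the actual implementation:

```python
import multiprocessing
import os
import shutil


def detect_sidecar_mode() -> str:
    """Decide which hardware flavour this sidecar serves.

    Hypothetical logic: an explicit SIDECAR_HOST_HARDWARE variable
    (set only on the dedicated dalco node) wins; otherwise fall back
    to what the node actually exposes.
    """
    explicit = os.environ.get("SIDECAR_HOST_HARDWARE", "").upper()
    if explicit in {"MPI", "GPU", "CPU"}:
        return explicit
    # nvidia-smi on the PATH is used as a rough stand-in for
    # "this node has a usable GPU"
    if shutil.which("nvidia-smi"):
        return "GPU"
    return "CPU"


def queue_for_mode(mode: str) -> str:
    # one celery queue per hardware flavour; names are assumptions
    return {"MPI": "mpi", "GPU": "gpu", "CPU": "cpu"}[mode]


if __name__ == "__main__":
    mode = detect_sidecar_mode()
    print(f"node has {multiprocessing.cpu_count()} CPUs, "
          f"sidecar will consume from queue '{queue_for_mode(mode)}'")
```

With this kind of bootstrapping, any additional sidecars configured on the MPI node would simply fall through to the GPU or CPU branch, which is the behaviour discussed in the comments below.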
re (trying to start): I guess you will take care that this is configurable? (i.e. optionally having other sidecars running on the MPI node)
What I mean is that if more than one sidecar is configured on that node, the rest will become either GPU sidecars or CPU sidecars, based on how the node was configured.
That's what I had in mind. Thanks.
I will proceed with implementing this and have the issue linked to a PR. The implementation will also bring a
USER STORY:
We have 5 nodes on the dalco cluster. One of them has 48 CPUs and roughly 700 GB of RAM. I want to run iSolve (a computational service) on that machine using MPI parallelism. I will tag the corresponding image with an "MPI" label. I want exactly one sidecar running there so that different jobs do not fight over the resources.
Since it is not yet clear how many users will use that feature, and since we do not want to waste too many computational resources, I would also like the possibility to optionally run "normal" sidecars on that node.
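As a hedged sketch of the queue side of this: computational tasks tagged "MPI" would be routed to a dedicated celery queue that only the sidecar on the 48-core node consumes, while everything else stays on the normal CPU queue. The task names, queue names, and broker URL below are placeholders:

```python
from celery import Celery

app = Celery("sidecar", broker="amqp://guest@rabbit//")

# Route computational services by hardware requirement: anything whose
# image carries the "MPI" label goes to the queue served only by the
# sidecar on the dedicated node; everything else stays on the CPU queue.
app.conf.task_routes = {
    "simcore.run_computational_task_mpi": {"queue": "mpi"},
    "simcore.run_computational_task": {"queue": "cpu"},
}


def submit(service_key: str, requires_mpi: bool):
    task_name = (
        "simcore.run_computational_task_mpi"
        if requires_mpi
        else "simcore.run_computational_task"
    )
    # send_task dispatches by name without importing the sidecar code;
    # the MPI job only ever lands on the dedicated node's queue
    return app.send_task(task_name, kwargs={"service_key": service_key})
```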
DEFINITION OF DONE:
I create 2 projects filepicker -> iSolve(MPI). When I run the two projects, the solvers run on the MPI node one after the other, each using all the cores.
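For illustration, a single celery worker bound to the MPI queue with concurrency 1 would give exactly this behaviour: the second iSolve job starts only after the first one finishes, so each run has all 48 cores to itself. Module, queue, and broker names are again assumptions:

```python
from celery import Celery

app = Celery("mpi-sidecar", broker="amqp://guest@rabbit//")

# One worker process on the "mpi" queue: the two iSolve jobs from the
# user story are picked up one after the other, never concurrently.
app.conf.worker_concurrency = 1
app.conf.task_acks_late = True            # do not prefetch the second job
app.conf.worker_prefetch_multiplier = 1

if __name__ == "__main__":
    # equivalent to: celery -A this_module worker -Q mpi -c 1
    app.worker_main(["worker", "--queues=mpi", "--concurrency=1",
                     "--loglevel=info"])
```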