
MPI celery queue on dalco cluster #1672

Closed
mguidon opened this issue Aug 5, 2020 · 5 comments · Fixed by #1673
Labels
a:sidecar issue related with the sidecar worker service

Comments


mguidon commented Aug 5, 2020

USER STORY:

We have 5 nodes on the dalco cluster. One of them has 48 CPUs and roughly 700 GB of RAM. I want to run iSolve (a comp. service) on that machine using MPI parallelism. I will tag the corresponding image with an "MPI" label. I want exactly one sidecar running there so that different jobs do not fight over the resources.

Since it is not quite clear yet how many users will use that feature, and since we do not want to waste too many computational resources, I would also like the possibility to optionally run "normal" sidecars on that node.

DEFINITION OF DONE:

I create 2 projects of the form filepicker->iSolve(MPI). When I run the two projects, the solvers run on the MPI node one after the other, using all the cores.

mguidon added the a:sidecar label on Aug 5, 2020

GitHK commented Aug 5, 2020

Proposed solution

Because the machine(s) used to run MPI tasks have a specific number of total available CPUs, and because this number differs from the other machine types in the cluster, I propose the following solution:

  • When the sidecar container starts, it will check in sequence whether IS_MPI_NODE and IS_GPU_NODE apply. It will become the first available type of sidecar in the following order: [MPI, GPU, CPU]. If all checks fail, the sidecar ends up as a CPU sidecar.
  • MPI check (see the sketch below this list):
    • an environment variable will be passed to the sidecar service containing the number of CPUs needed to become an MPI node. The container will run cat /proc/cpuinfo | grep processor | wc -l to determine the CPU count.
    • If a node can be MPI, it will acquire a Redlock whose name contains MPI and the number of CPUs for this specific MPI node. This guarantees that no other sidecar can become an MPI node.
    • If the node also has a GPU, all other sidecars (trying to start) will become GPU sidecars.
    • If the node has no GPU, all other sidecars (trying to start) will become CPU sidecars.

The above guarantees that in any given cluster there will be only one node dedicated to running a single MPI sidecar. The rest of the sidecars scheduled on that node will become either GPU sidecars or CPU sidecars, based on the node's available configuration/resources.
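
A minimal sketch of how this boot-time decision could look (assuming a Python sidecar; the env var names TARGET_MPI_CPU_COUNT / IS_GPU_NODE / REDIS_HOST, the BootMode enum and the use of a plain single-instance redis lock in place of a full Redlock are assumptions for illustration, not the actual implementation):

```python
# Illustrative sketch only: env var names, the BootMode enum and the use of a
# plain single-instance redis lock (instead of a proper Redlock) are assumptions.
import os
from enum import Enum

import redis


class BootMode(Enum):
    MPI = "MPI"
    GPU = "GPU"
    CPU = "CPU"


def _cpu_count() -> int:
    # equivalent of: cat /proc/cpuinfo | grep processor | wc -l
    with open("/proc/cpuinfo") as f:
        return sum(1 for line in f if line.startswith("processor"))


def _try_become_mpi(required_cpus: int) -> bool:
    if _cpu_count() != required_cpus:
        return False
    # The lock name encodes MPI plus the CPU count, so at most one sidecar in
    # the whole cluster can hold it; it is kept for the sidecar's lifetime.
    client = redis.Redis(host=os.environ.get("REDIS_HOST", "redis"))
    lock = client.lock(f"MPI_LOCK_{required_cpus}_cpus", timeout=None)
    return lock.acquire(blocking=False)


def detect_boot_mode() -> BootMode:
    required_cpus = int(os.environ.get("TARGET_MPI_CPU_COUNT", "0"))
    if required_cpus and _try_become_mpi(required_cpus):
        return BootMode.MPI
    if os.environ.get("IS_GPU_NODE") == "1":
        return BootMode.GPU
    return BootMode.CPU
```

Whichever sidecar wins the lock keeps it for as long as it runs, which is what guarantees a single MPI sidecar per cluster; every other sidecar on the node falls through to the GPU or CPU branch.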

Observations

  • The placement of the MPI label will be assumed to be at the same level as the VRAM label for the GPU services.
  • In development mode the sidecar service will start with 3 copies (up from 1). Only one of these will become an MPI node.
  • The implementation and dispatching will be similar to the GPU solution.
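
On the dispatching side, a rough sketch of how a job could be routed to a dedicated MPI Celery queue based on an image label, analogous to the existing GPU handling (the label keys, queue names, broker URL and task name below are placeholders, not the real ones):

```python
# Illustrative sketch only: label keys, queue names, broker URL and task name
# are placeholders; the real dispatching mirrors the existing GPU queue handling.
from celery import Celery

celery_app = Celery("comp_backend", broker="amqp://rabbit")


def queue_for_service(image_labels: dict) -> str:
    """Pick the Celery queue from the labels attached to the service image."""
    if "MPI" in image_labels:   # hypothetical placement, same level as the VRAM label
        return "mpi"            # consumed only by the single MPI sidecar
    if "VRAM" in image_labels:
        return "gpu"
    return "cpu"


def dispatch(project_id: str, node_id: str, image_labels: dict) -> None:
    # send_task with an explicit queue is standard Celery routing
    celery_app.send_task(
        "comp.task.run",  # hypothetical task name
        kwargs={"project_id": project_id, "node_id": node_id},
        queue=queue_for_service(image_labels),
    )
```

With only one sidecar ever consuming from the mpi queue, two projects submitted at the same time are processed one after the other, which matches the definition of done above.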


mguidon commented Aug 5, 2020

re (trying to start): I guess you will take care that this is configurable? (i.e. optionally having other sidecars running on the MPI node)


GitHK commented Aug 5, 2020

> re (trying to start): I guess you will take care that this is configurable? (i.e. optionally having other sidecars running on the MPI node)

What I mean is that if more than one sidecar is configured on that node, the rest will become either GPU sidecars or CPU sidecars, based on how the node was configured.


mguidon commented Aug 5, 2020

> re (trying to start): I guess you will take care that this is configurable? (i.e. optionally having other sidecars running on the MPI node)

> What I mean is that if more than one sidecar is configured on that node, the rest will become either GPU sidecars or CPU sidecars, based on how the node was configured.

That's what I had in mind. Thanks.


GitHK commented Aug 5, 2020

I will proceed with implementing this and link the issue to a PR. The implementation will also bring a sleeper-mpi service.
