
Launch jobs on HPC (SLURM) #71

Closed
jirikuncar opened this issue Jun 26, 2018 · 9 comments · Fixed by #199
Comments

@jirikuncar
Member

Would it be possible to prepare a prototype to launch jobs on HPC cluster?

@lukasheinrich
Member

@jirikuncar I've looked into this as well -- how would you run containers? Via a runtime like Singularity? SLURM doesn't seem to have a nice (Python) API though. If you know of one, can you point to it?

@jirikuncar
Member Author

@lukasheinrich
Member

Hi Jiri, yeah, Singularity and Shifter were the runtimes I was thinking about, but afaik they are a layer below SLURM (i.e. I need an API to SLURM to submit a Shifter job). I talked about having a Python API to SLURM+Shifter on NERSC with @iamholger as well; maybe we can hash one out and see how to best implement it.
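
For reference, a minimal sketch of what that lowest layer could look like in the absence of an official SLURM Python API: wrap the payload in `singularity exec` inside a batch script and hand it to the `sbatch` CLI. The partition, image and command below are hypothetical placeholders, not anything REANA ships.

```python
# Minimal sketch: submit a Singularity payload through SLURM's CLI,
# since SLURM has no official Python API. Partition, image path and
# command are hypothetical placeholders.
import subprocess
import tempfile

def submit_singularity_job(image, command, partition="batch"):
    """Write a batch script that wraps the payload in `singularity exec`
    and hand it to sbatch; return sbatch's stdout (the job id line)."""
    script = f"""#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --time=01:00:00
singularity exec {image} {command}
"""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        path = f.name
    result = subprocess.run(["sbatch", path], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()  # e.g. "Submitted batch job 12345"

# Hypothetical usage:
# submit_singularity_job("docker://python:3.8", "python -c 'print(42)'")
```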

@tiborsimko
Member

@jirikuncar A very light Singularity support may come in early autumn, see #62. The full HPC support would be great to have, but the timing is uncertain...

@tiborsimko
Member

Reviving the issue as we are about to address HPC backend integration through our NSF collaborations.

There are two basic directions for running REANA on HPC:

  • Run the REANA infrastructure on a small non-HPC deployment (using K8s, OpenShift, or whatever) and let only the heavy-lifting jobs run on HPC. This would mean adapting mostly the REANA-Job-Controller component. (Plus some shared storage etc. as discussed below.)

  • Run REANA itself completely on HPC. This would mean adapting many REANA components to prepare for non-Kubernetes deployments.

One can roughly distinguish five layers of the REANA platform:

  1. Infrastructure layer, i.e. pods such as REANA-Server (accepting REST API commands from people via their reana-client sessions) and REANA-Workflow-Controller (handling commands regarding preparation and managing people's individual workflow runs).

  2. Workflow runtime layer, i.e. pods orchestrating CWL/Serial/Yadage workflows for individual people, e.g. running John Doe's ttH batch analysis or Jane Doe's interactive Jupyter notebook.

  3. Job compute layer, i.e. running the task payload itself, e.g. this-and-this cmsRun command, this-and-this lightweight ROOT macro, or this-and-this mpirun call.

  4. ... plus some shared persistent volumes for file exchange within a workflow. (S3 buckets?)

  5. ... plus some REANA infrastructure services such as DB and MQ.

One can really take advantage of running different layers on different architectures. Ideally, one could keep the REANA service infrastructure and workflow orchestration running on Kubernetes and adapt only the job execution layers, i.e. layers 3-4, for the HPC backend scenario. That would be the quickest...

@tiborsimko
Member

Another consideration is the job execution API supporting various backends. Currently REANA supports job execution only on Kubernetes; however, HTCondor, Singularity, Slurm, etc. are all desirable options. There are two basic directions:

  1. Keep our internal job API and extend it with the functionalities necessary for HPC and other backends. This would permit keeping the rest of the REANA ecosystem unchanged and adapting only the REANA-Job-Controller component to plug in various backends.

  2. Change our internal job API to something like GA4GH TES and use some existing job tool such as funnel that can already speak to HTCondor or Slurm. This would require adapting REANA internal components to speak TES and adapting funnel to add Kubernetes support and other features it may not have.

The former is the fastest to start with; the latter may be interesting in the longer term, should there be more tools available around the GA4GH TES/WES ecosystem.
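
A minimal sketch of direction 1, assuming a plain abstract-base-class plugin pattern; the class and function names are illustrative, not the actual REANA-Job-Controller code:

```python
# Sketch of direction 1: keep the internal job API and plug in
# backends behind a common interface. Names are illustrative only.
from abc import ABC, abstractmethod

class JobBackend(ABC):
    """Contract every compute backend would have to satisfy."""

    @abstractmethod
    def submit(self, spec: dict) -> str:
        """Submit a job described by `spec` and return a backend job id."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return a normalised status: queued / running / finished / failed."""

class KubernetesBackend(JobBackend):
    def submit(self, spec): ...
    def status(self, job_id): ...

class SlurmBackend(JobBackend):
    def submit(self, spec): ...
    def status(self, job_id): ...

BACKENDS = {"kubernetes": KubernetesBackend, "slurm": SlurmBackend}

def create_job(backend_name: str, spec: dict) -> str:
    # The rest of the ecosystem keeps calling create_job();
    # only the backend mapping grows.
    return BACKENDS[backend_name]().submit(spec)
```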

@CodyKank

I can definitely see the benefit of launching most of the infrastructure on non-HPC resources and sending the heavy lifting to HPC clusters. Docker will not be supported at most, if not all, HPC resources, so Singularity or Shifter will be the popular alternatives. We have been attempting to stand up the REANA infrastructure components in pure Singularity and are facing some challenges. That said, if attempting the split infrastructure route I do have some concerns, mainly with user authentication:

  1. Automating authentication in a secure fashion could be tricky, depending on whether or not the HPC resource allows SSH keys. This is a bigger issue if the site requires multi-factor authentication, and the number of sites requiring this is increasing.

  2. If the workflow-execution layer stays within k8s or another orchestration service, would each step in the workflow need to authenticate with the HPC resource? Not a huge concern if SSH keys are allowed, but again this cannot be guaranteed with every system.

  3. There would be a need to submit the job (the mpirun call etc.), which can be handled by the REANA-Job-Controller, and some way to report back the completion or failure of the job, which would also require some sort of authentication to a head node. Also, most centers do not like long-running processes on head nodes (i.e. some sort of listener), so spawning a process to sit and listen most likely would not be feasible at large NSF resources. Some sort of emailing method could be created, but in practice that may not be the best way (a polling sketch follows below).
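
On point 3, a hedged sketch of the polling alternative to a head-node listener: the edge service reconnects over SSH at intervals (key-based auth, where the center allows it) and asks `sacct` for the job state, so nothing stays running on the head node between polls. Host, user and key path are hypothetical, and paramiko is assumed as the SSH client.

```python
# Sketch: poll SLURM job state over SSH instead of keeping a listener
# on the head node. Hostnames, users and key paths are hypothetical.
import time
import paramiko

def poll_slurm_job(host, user, key_path, job_id, interval=60):
    """Poll `sacct` over SSH until the job reaches a terminal state;
    no process is left running on the head node between polls."""
    terminal = ("COMPLETED", "FAILED", "CANCELLED", "TIMEOUT")
    while True:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=user, key_filename=key_path)
        _, stdout, _ = client.exec_command(
            f"sacct -j {job_id} --format=State --noheader --parsable2")
        lines = stdout.read().decode().splitlines()
        client.close()
        state = lines[0].strip() if lines else ""
        if state in terminal:
            return state
        time.sleep(interval)
```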

@khurtado
Contributor

khurtado commented Dec 21, 2018

To expand on the approach that keeps most of the infrastructure on some non-HPC edge service:
There is a DOE-funded project called VC3: https://www.virtualclusters.org that allows building a virtual cluster with different middlewares (right now: HTCondor, WorkQueue, SPARK) on different computing centers (campus clusters and HPC centers) like NERSC/Cori or Bridges, requiring only an allocation on these centers plus an SSH key.

The advantage here is that VC3 gives you a VM launched via OpenStack on the fly (the headnode is in principle created within a few clicks using a portal website), with HTCondor installed on it. All Condor jobs submitted from this VM are matched to glidein pilots running at the campus or HPC centers (SLURM, SGE, PBS, etc. supported), configured for the request by a VC3 factory. So, we could have all REANA components on a single machine, including the job controller, but submitting to the "local" Condor cluster rather than Kubernetes, and VC3 would do the job translation and submission to the remote clusters, regardless of whether the batch system there is HTCondor, PBS, SLURM, SGE or LSF.

The downside is that not all centers allow passwordless authentication (SSH keys). NERSC and Bridges have their own standard procedures for this, but other centers don't, so it would require case-by-case negotiation with the centers (I think ATLAS made this sort of arrangement with TACC, for example). This approach doesn't work with centers whose worker nodes have no outbound connection either (so ALCF resources wouldn't work with this at present).
I know CSCS is part of the scenario here. I'm not sure if they would allow SSH keys, but I think they have an HTCondor Compute Element that could be used to submit grid Condor jobs instead in that case.
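
A hedged sketch of the VC3-style submission path from the REANA side: write a minimal HTCondor submit description and hand it to `condor_submit` on the "local" pool, letting the glidein factory route the job to the remote SLURM/PBS/SGE cluster. The submit attributes below are a hypothetical minimum, not a VC3 or REANA recipe.

```python
# Sketch: submit to the local HTCondor pool on the VC3 headnode and
# let the glidein factory route the job onward. Attributes are a
# hypothetical minimal submit description.
import subprocess
import tempfile

def submit_to_local_condor(executable, arguments=""):
    submit_description = f"""
universe   = vanilla
executable = {executable}
arguments  = {arguments}
output     = job.out
error      = job.err
log        = job.log
queue
"""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(submit_description)
        path = f.name
    # condor_submit prints e.g. "1 job(s) submitted to cluster 42."
    return subprocess.run(["condor_submit", path],
                          capture_output=True, text=True, check=True).stdout
```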

@diegodelemos
Member

We shall implement a new job backend prototype for Slurm. It should aim at submitting a non-dockerised REANA job with no input/output files to Slurm (file management will be addressed by #143).
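
A minimal sketch of what such a prototype backend could look like, assuming SSH access to the SLURM submit node (via paramiko, as the commit notes below suggest) and `sbatch --wrap` for the bare, non-containerised command; class and method names are illustrative, not the actual REANA-Job-Controller code.

```python
# Sketch: prototype Slurm backend that SSHes to the submit node and
# runs the bare command with `sbatch --wrap`, no container and no
# file staging. Names are illustrative only.
import re
import paramiko

class SlurmJobManager:
    def __init__(self, host, user, key_path):
        self.client = paramiko.SSHClient()
        self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self.client.connect(host, username=user, key_filename=key_path)

    def execute(self, command):
        """Submit a plain, non-containerised command; return the SLURM job id."""
        _, stdout, _ = self.client.exec_command(f"sbatch --wrap '{command}'")
        out = stdout.read().decode()  # e.g. "Submitted batch job 12345"
        match = re.search(r"Submitted batch job (\d+)", out)
        return match.group(1) if match else None
```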

@roksys roksys self-assigned this Aug 5, 2019
roksys pushed commits to roksys/reana-job-controller, roksys/reana-client, roksys/reana-workflow-controller and roksys/reana that referenced this issue (Aug 8, 2019 – Dec 4, 2019), including:
* SSH client is needed for connecting to SLURM submit node. Connects reanahub#71
* Addresses reanahub#71

  Signed-off-by: Rokas Maciulaitis <[email protected]>
* Renames HTCONDORCERN_USERNAME,CERN_USERNAME to CERN_USER. Addresses reanahub/reana-job-controller#71
tiborsimko pushed a commit to roksys/reana that referenced this issue Dec 12, 2019