
Launch jobs on HPC (SLURM) #71

Closed
jirikuncar opened this issue Jun 26, 2018 · 9 comments · Fixed by #199
Comments

@jirikuncar
Member

Would it be possible to prepare a prototype to launch jobs on HPC cluster?

@lukasheinrich
Member

@jirikuncar I've looked into this as well -- how would you run containers? Via a runtime like Singularity? SLURM doesn't seem to have a nice (Python) API though. If you know of one, can you point to it?

@jirikuncar
Member Author

@lukasheinrich
Member

Hi Jiri, yeah, Singularity and Shifter were the runtimes I was thinking about, but afaik they are a layer below SLURM (i.e. I need an API to SLURM to submit a Shifter job). I talked about having a Python API to SLURM+Shifter on NERSC with @iamholger as well; maybe we can hash one out and see how to best implement it.
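
For reference, a minimal sketch of what that lowest layer could look like in the absence of an official SLURM Python API: wrap the payload in `singularity exec` inside a batch script and hand it to the `sbatch` CLI. The partition, image and command below are hypothetical placeholders, not anything REANA ships.

```python
# Minimal sketch: submit a Singularity payload through SLURM's CLI,
# since SLURM has no official Python API. Partition, image path and
# command are hypothetical placeholders.
import subprocess
import tempfile

def submit_singularity_job(image, command, partition="batch"):
    """Write a batch script that wraps the payload in `singularity exec`
    and hand it to sbatch; return sbatch's stdout (the job id line)."""
    script = f"""#!/bin/bash
#SBATCH --partition={partition}
#SBATCH --time=01:00:00
singularity exec {image} {command}
"""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        path = f.name
    result = subprocess.run(["sbatch", path], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()  # e.g. "Submitted batch job 12345"

# Hypothetical usage:
# submit_singularity_job("docker://python:3.8", "python -c 'print(42)'")
```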

@tiborsimko
Member

@jirikuncar A very light Singularity support may come in early autumn, see #62. The full HPC support would be great to have, but the timing is uncertain...

@tiborsimko
Member

Reviving the issue as we are about to address HPC backend integration through our NSF collaborations.

There are two basic directions for running REANA on HPC:

  • Run the REANA infrastructure on a small non-HPC deployment (using K8s, OpenShift, or whatever) and let only the heavy-lifting jobs run on HPC. This would mean adapting mostly the REANA-Job-Controller component. (Plus some shared storage etc. as discussed below.)

  • Run REANA itself completely on HPC. This would mean adapting many REANA components to prepare for non-Kubernetes deployments.

One can roughly distinguish five layers of the REANA platform:

  1. Infrastructure layer, i.e. pods such as REANA-Server (accepting REST API commands from people via their reana-client sessions) and REANA-Workflow-Controller (handling commands regarding preparation and managing people's individual workflow runs).

  2. Workflow runtime layer, i.e. pods orchestrating CWL/Serial/Yadage workflows for individual people, e.g. running John Doe's ttH batch analysis or Jane Doe's interactive Jupyter notebook.

  3. Job compute layer, i.e. running the task payload itself, e.g. this-and-this cmsRun command, this-and-this lightweight ROOT macro, or this-and-this mpirun call.

  4. ... plus some shared persistent volumes for file exchange within a workflow. (S3 buckets?)

  5. ... plus some REANA infrastructure services such as DB and MQ.

One can really take advantage of running different layers on different architectures. Ideally, one could keep the REANA service infrastructure and workflow orchestration running on Kubernetes and adapt only the job execution layers, i.e. layers 3-4, for the HPC backend scenario. That would be the quickest...

@tiborsimko
Member

Another consideration is the job execution API supporting various backends. Currently REANA supports job execution only on Kubernetes; however, HTCondor, Singularity, Slurm, etc. are all desirable options. There are two basic directions:

  1. Keep our internal job API and extend it with the functionalities necessary for HPC and other backends. This would permit keeping the rest of the REANA ecosystem unchanged and adapting only the REANA-Job-Controller component to plug in various backends.

  2. Change our internal job API to something like GA4GH TES and use some existing job tool such as funnel that can already speak to HTCondor or Slurm. This would require adapting REANA internal components to speak TES and adapting funnel to add Kubernetes support and other features it may not have.

The former is the fastest to start with; the latter may be interesting in the longer term, should there be more tools available around the GA4GH TES/WES ecosystem.
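
A minimal sketch of direction 1, assuming a plain abstract-base-class plugin pattern; the class and function names are illustrative, not the actual REANA-Job-Controller code:

```python
# Sketch of direction 1: keep the internal job API and plug in
# backends behind a common interface. Names are illustrative only.
from abc import ABC, abstractmethod

class JobBackend(ABC):
    """Contract every compute backend would have to satisfy."""

    @abstractmethod
    def submit(self, spec: dict) -> str:
        """Submit a job described by `spec` and return a backend job id."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return a normalised status: queued / running / finished / failed."""

class KubernetesBackend(JobBackend):
    def submit(self, spec): ...
    def status(self, job_id): ...

class SlurmBackend(JobBackend):
    def submit(self, spec): ...
    def status(self, job_id): ...

BACKENDS = {"kubernetes": KubernetesBackend, "slurm": SlurmBackend}

def create_job(backend_name: str, spec: dict) -> str:
    # The rest of the ecosystem keeps calling create_job();
    # only the backend mapping grows.
    return BACKENDS[backend_name]().submit(spec)
```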

@CodyKank

I can definitely see the benefit of launching most of the infrastructure on non-HPC resources and sending the heavy lifting to HPC clusters. Docker will not be supported at most, if not all, HPC resources, so Singularity or Shifter will be the popular alternatives. We have been attempting to stand up the REANA infrastructure components in pure Singularity and are facing some challenges. That said, if attempting the split infrastructure route I do have some concerns, mainly with user authentication:

  1. Automating authentication in a secure fashion could be tricky, depending on whether or not the HPC resource allows SSH keys. This is a bigger issue if the site requires multi-factor authentication, and the number of sites requiring this is increasing.

  2. If the workflow-execution layer stays within k8s or another orchestration service, would each step in the workflow need to authenticate with the HPC resource? Not a huge concern if SSH keys are allowed, but again this cannot be guaranteed with every system.

  3. There would be a need to submit the job (the mpirun call etc.), which can be handled by the REANA-Job-Controller, and some way to report back the completion or failure of the job, which would also require some sort of authentication to a head node. Also, most centers do not like long-running processes on head nodes (i.e. some sort of listener), so spawning a process to sit and listen most likely would not be feasible at large NSF resources. Some sort of emailing method could be created, but in practice that may not be the best way (a polling sketch follows below).
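
On point 3, a hedged sketch of the polling alternative to a head-node listener: the edge service reconnects over SSH at intervals (key-based auth, where the center allows it) and asks `sacct` for the job state, so nothing stays running on the head node between polls. Host, user and key path are hypothetical, and paramiko is assumed as the SSH client.

```python
# Sketch: poll SLURM job state over SSH instead of keeping a listener
# on the head node. Hostnames, users and key paths are hypothetical.
import time
import paramiko

def poll_slurm_job(host, user, key_path, job_id, interval=60):
    """Poll `sacct` over SSH until the job reaches a terminal state;
    no process is left running on the head node between polls."""
    terminal = ("COMPLETED", "FAILED", "CANCELLED", "TIMEOUT")
    while True:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=user, key_filename=key_path)
        _, stdout, _ = client.exec_command(
            f"sacct -j {job_id} --format=State --noheader --parsable2")
        lines = stdout.read().decode().splitlines()
        client.close()
        state = lines[0].strip() if lines else ""
        if state in terminal:
            return state
        time.sleep(interval)
```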

@khurtado
Contributor

khurtado commented Dec 21, 2018

To expand on the approach that keeps most of the infrastructure on some non-HPC edge service:
There is a DOE-funded project called VC3: https://www.virtualclusters.org that allows building a virtual cluster with different middlewares (right now: HTCondor, WorkQueue, SPARK) on different computing centers (campus clusters and HPC centers) like NERSC/Cori or Bridges, requiring only an allocation on these centers plus an SSH key.

The advantage here is that VC3 gives you a VM launched via OpenStack on the fly (the headnode is in principle created within a few clicks using a portal website), with HTCondor installed on it. All Condor jobs submitted from this VM are matched to glidein pilots running at the campus or HPC centers (SLURM, SGE, PBS, etc. supported), configured for the request by a VC3 factory. So, we could have all REANA components on a single machine, including the job controller, but submitting to the "local" Condor cluster rather than Kubernetes, and VC3 would do the job translation and submission to the remote clusters, regardless of whether the batch system there is HTCondor, PBS, SLURM, SGE or LSF.

The downside is that not all centers allow passwordless authentication (SSH keys). NERSC and Bridges have their own standard procedures for this, but other centers don't, so it would require case-by-case negotiation with the centers (I think ATLAS made this sort of arrangement with TACC, for example). This approach doesn't work with centers whose worker nodes have no outbound connection either (so ALCF resources wouldn't work with this at present).
I know CSCS is part of the scenario here. I'm not sure if they would allow SSH keys, but I think they have an HTCondor Compute Element that could be used to submit grid Condor jobs instead in that case.
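
A hedged sketch of the VC3-style submission path from the REANA side: write a minimal HTCondor submit description and hand it to `condor_submit` on the "local" pool, letting the glidein factory route the job to the remote SLURM/PBS/SGE cluster. The submit attributes below are a hypothetical minimum, not a VC3 or REANA recipe.

```python
# Sketch: submit to the local HTCondor pool on the VC3 headnode and
# let the glidein factory route the job onward. Attributes are a
# hypothetical minimal submit description.
import subprocess
import tempfile

def submit_to_local_condor(executable, arguments=""):
    submit_description = f"""
universe   = vanilla
executable = {executable}
arguments  = {arguments}
output     = job.out
error      = job.err
log        = job.log
queue
"""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(submit_description)
        path = f.name
    # condor_submit prints e.g. "1 job(s) submitted to cluster 42."
    return subprocess.run(["condor_submit", path],
                          capture_output=True, text=True, check=True).stdout
```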

@diegodelemos
Member

We shall implement a new job backend prototype for Slurm. It should aim at submitting a non-dockerised REANA job with no input/output files to Slurm (file management will be addressed by #143).
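
A minimal sketch of what such a prototype backend could look like, assuming SSH access to the SLURM submit node (via paramiko, as the commit notes below suggest) and `sbatch --wrap` for the bare, non-containerised command; class and method names are illustrative, not the actual REANA-Job-Controller code.

```python
# Sketch: prototype Slurm backend that SSHes to the submit node and
# runs the bare command with `sbatch --wrap`, no container and no
# file staging. Names are illustrative only.
import re
import paramiko

class SlurmJobManager:
    def __init__(self, host, user, key_path):
        self.client = paramiko.SSHClient()
        self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self.client.connect(host, username=user, key_filename=key_path)

    def execute(self, command):
        """Submit a plain, non-containerised command; return the SLURM job id."""
        _, stdout, _ = self.client.exec_command(f"sbatch --wrap '{command}'")
        out = stdout.read().decode()  # e.g. "Submitted batch job 12345"
        match = re.search(r"Submitted batch job (\d+)", out)
        return match.group(1) if match else None
```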

@roksys roksys self-assigned this Aug 5, 2019
roksys pushed commits to roksys/reana-job-controller, roksys/reana-client, roksys/reana-workflow-controller and roksys/reana that referenced this issue (Aug 8, 2019 – Dec 4, 2019), including:
* SSH client is needed for connecting to SLURM submit node. Connects reanahub#71
* Addresses reanahub#71

  Signed-off-by: Rokas Maciulaitis <[email protected]>
* Renames HTCONDORCERN_USERNAME,CERN_USERNAME to CERN_USER. Addresses reanahub/reana-job-controller#71
tiborsimko pushed a commit to roksys/reana that referenced this issue Dec 12, 2019