Launch jobs on HPC (SLURM) #71
Comments
@jirikuncar I've looked into this as well -- how would you run containers? Via a runtime like Singularity? SLURM doesn't seem to have a nice (Python) API though. If you know of one, can you point to it?
Hi Jiri, yeah, Singularity and Shifter were the runtimes I was thinking about, but AFAIK they are a layer below SLURM (i.e. I need an API to SLURM to submit a Shifter job). I talked about having a Python API to SLURM+Shifter on NERSC with @iamholger as well; maybe we can hash one out and see how to best implement it.
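For reference, in the absence of an official Python API, one pragmatic option is a thin wrapper around the `sbatch`/`squeue` command-line tools. The sketch below is only an illustration of that idea, assuming it runs on a node where the SLURM client tools are available; the function names and job name are made up for the example.

```python
import subprocess

def submit_slurm_job(script_path, job_name="reana-job"):
    """Submit a batch script to SLURM via sbatch and return the job id."""
    # --parsable makes sbatch print just the job id (optionally ";cluster")
    result = subprocess.run(
        ["sbatch", "--parsable", "--job-name", job_name, script_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().split(";")[0]

def job_state(job_id):
    """Query the job state with squeue (e.g. PENDING, RUNNING)."""
    result = subprocess.run(
        ["squeue", "--noheader", "--format=%T", "--job", job_id],
        capture_output=True, text=True, check=True,
    )
    # squeue prints nothing once the job has left the queue
    return result.stdout.strip() or "COMPLETED_OR_UNKNOWN"
```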
@jirikuncar Very light Singularity support may come in early autumn, see #62. Full HPC support would be great to have, but the timing is unsure...
Reviving the issue, as we are about to address HPC backend integration through our NSF collaborations. There are two basic directions for running REANA on HPC:
One can roughly distinguish five compute layers of the REANA platform:
One can really take advantage of running different layers on different architectures. Ideally, one could keep the REANA service infrastructure and workflow orchestration running on Kubernetes and adapt only the job execution layers, i.e. layers 3-4, for the HPC backend scenario. That would be the quickest...
Another consideration is the job execution API supporting various backends. Currently REANA supports job execution only on Kubernetes; however, HTCondor, Singularity, Slurm, etc. are all desirable options. There are two basic directions:
The former is the fastest to start with; the latter may be interesting in the longer term, should more tools become available around the GA4GH TES/WES ecosystem.
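To make the "various backends" idea concrete, a backend-agnostic job submission layer could be shaped roughly as follows. This is a hypothetical sketch, not the actual REANA job-controller code; the class and method names are illustrative only.

```python
from abc import ABC, abstractmethod

class JobManager(ABC):
    """Hypothetical backend-agnostic job submission interface."""

    @abstractmethod
    def submit(self, image, command, workspace):
        """Submit a job and return a backend-specific job id."""

    @abstractmethod
    def status(self, job_id):
        """Return a normalised status (queued/running/finished/failed)."""

    @abstractmethod
    def stop(self, job_id):
        """Cancel a running job."""

# Concrete adapters would then implement the same interface:
class KubernetesJobManager(JobManager):
    ...  # wraps the Kubernetes API (the current behaviour)

class SlurmJobManager(JobManager):
    ...  # wraps sbatch/squeue/scancel on an HPC submit node
```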
I can definitely see the benefit of launching most of the infrastructure on non-HPC resources and sending the heavy lifting to HPC clusters. Docker will not be supported at most, if not all, HPC resources, so Singularity or Shifter will be the popular alternatives. We have been attempting to stand up the REANA infrastructure components in pure Singularity and are facing some challenges. That said, if attempting the split infrastructure route, I do have some concerns, mainly with user authentication:
To expand on the approach that keeps most of the infrastructure on some non-HPC edge service: the advantage here is that VC3 gives you a VM launched via OpenStack on the fly (the headnode is in principle created within a few clicks using a portal website) with HTCondor installed on it. All Condor jobs submitted from this VM are matched to glidein pilots running at the campus or HPC centers (SLURM, SGE, PBS, etc. supported), configured on request by a VC3 factory. So, we could have all REANA components in a single machine, including the job controller, but submitting to the "local" Condor cluster rather than Kubernetes, and VC3 would do the job translation and submission to the remote clusters, regardless of whether the batch system there is HTCondor, PBS, SLURM, SGE or LSF. The downside is that not all centers allow passwordless authentication (SSH keys). NERSC and Bridges have their own standard procedures for this, but other centers don't, so it would require case-by-case negotiation with the centers (I think ATLAS made this sort of arrangement with TACC, for example). This approach doesn't work for centers whose worker nodes have no outbound connection either (so ALCF resources wouldn't work with this at present).
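For context, submitting from such a headnode to the local HTCondor pool could look roughly like the sketch below, which simply drives the `condor_submit` CLI (the `htcondor` Python bindings are an alternative). The executable and submit description here are placeholders, not REANA's actual job specification.

```python
import subprocess
import tempfile

# Minimal HTCondor submit description; run_analysis.sh is a placeholder.
SUBMIT_DESCRIPTION = """\
universe   = vanilla
executable = run_analysis.sh
output     = job.out
error      = job.err
log        = job.log
queue
"""

def submit_condor_job():
    """Write a submit file and hand it to condor_submit on the headnode."""
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT_DESCRIPTION)
        submit_file = f.name
    # condor_submit prints e.g. "1 job(s) submitted to cluster 42."
    result = subprocess.run(
        ["condor_submit", submit_file],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```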
We shall implement a new job backend prototype for Slurm. It should aim at submitting a non-dockerised REANA job, with no input/output files, to Slurm (file management will be addressed by #143).
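A minimal sketch of such a prototype, assuming SSH access to a Slurm submit node (the linked commits below note that an SSH client is needed); the host name, credentials and wrapped command are placeholders, not the actual REANA configuration.

```python
import paramiko

def submit_via_ssh(host, username, key_filename, command):
    """SSH to a SLURM submit node and submit a one-line job with sbatch --wrap."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=username, key_filename=key_filename)
    try:
        # --wrap lets us submit a simple command without shipping a batch script
        _, stdout, stderr = client.exec_command(
            f"sbatch --parsable --wrap '{command}'"
        )
        job_id = stdout.read().decode().strip()
        errors = stderr.read().decode().strip()
        if errors:
            raise RuntimeError(f"sbatch failed: {errors}")
        return job_id
    finally:
        client.close()

# Example usage (placeholder host and key):
# job_id = submit_via_ssh("slurm-login.example.org", "reana",
#                         "~/.ssh/id_rsa", "echo hello from REANA")
```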
* SSH client is needed for connecting to SLURM submit node. Connects reanahub#71
* Addresses reanahub#71 Signed-off-by: Rokas Maciulaitis <[email protected]>
* Renames HTCONDORCERN_USERNAME,CERN_USERNAME to CERN_USER Addresses reanahub/reana-job-controller#71
Would it be possible to prepare a prototype to launch jobs on an HPC cluster?