modifications for slurm version 17.11 #726

wrpscott · 2018-02-22T17:15:36Z

Because of memory leaks in the underlying OS, we have to upgrade our clusters to CentOS 7.4. This also means that the slurm version will be newer than the current 16.x version, which requires some changes to slurmlib in kive. The --uid and --gid options of sbatch are no longer possible, and as we no longer need them anyway, they will be removed.

wrpscott · 2018-02-27T22:34:03Z

For docker support, we had added the -sint -f options to scancel. This is required so that a running docker wrapper script cleans up after itself: without the -sint option, the running process gets a kill signal, which prevents cleanup. However, scancel with the -s option does not work (it hangs) if the job is not actually running but is still in the queue. This makes sense: a 'pending' job does not have a process to send the signal to. The envisaged solution: issue an scontrol hold jobid command to freeze the job in the pending state (a running job is unchanged), then read its state and issue the appropriate scancel command.
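A minimal sketch of the hold-then-cancel sequence described above, not the actual slurmlib code. The function names and the injectable `run` parameter are hypothetical, `-s INT -f` is the spelled-out form of the `-sint -f` options mentioned above, and reading the state with `squeue -h -j <jobid> -o %T` is an assumption about how the state would be queried.

```python
import subprocess


def build_cancel_command(job_id, state):
    """Pick the scancel invocation for a job in the given slurm state.

    Hypothetical helper: the names here are illustrative only.
    """
    if state == "RUNNING":
        # A running job has a process, so send it SIGINT and let it clean up.
        return ["scancel", "-s", "INT", "-f", str(job_id)]
    # A pending (held) job has no process to signal: cancel it outright.
    return ["scancel", str(job_id)]


def cancel_job(job_id, run=subprocess.run):
    """Hold the job, read its state, then issue the matching scancel."""
    # Freeze a pending job so it cannot start between the state check and
    # the cancel. The return code is not checked here (an assumption).
    run(["scontrol", "hold", str(job_id)])

    # Assumed state query: %T prints the extended state name, e.g. PENDING.
    result = run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%T"],
        check=True,
        capture_output=True,
        text=True,
    )
    state = result.stdout.strip()

    command = build_cancel_command(job_id, state)
    run(command, check=True)
    return command
```

Passing `run` as a parameter keeps the sketch testable without a slurm installation; in real use the default `subprocess.run` would be left in place.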

wrpscott · 2018-02-27T22:40:21Z

When a running job is sent a scancel -sint -f, the process in question receives an INT signal. In our case, this is the dockerwrapper script. The exit code of this script (that is, how it responds to the signal) determines the final slurm state of the job. Process returns 0 ==> slurm state COMPLETED. Process returns nonzero ==> slurm state FAILED. Under no circumstances is a slurm state CANCELLED recorded, in contrast with cancelling a pending job (which goes to the slurm state CANCELLED).
Currently, the docker_wrapper script returns a nonzero return code on SIGINT.
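
The state mapping above can be sketched as follows. This is an illustration only, not the wrapper's real code: `expected_slurm_state` and `make_sigint_handler` are hypothetical names, and the exit code of 1 is an assumed nonzero value.

```python
import signal
import sys


def expected_slurm_state(was_running, exit_code=None):
    """Final slurm state after a cancel, per the behaviour described above."""
    if not was_running:
        # A pending job is cancelled outright.
        return "CANCELLED"
    # A running job is judged only by the exit code of the signalled process.
    return "COMPLETED" if exit_code == 0 else "FAILED"


def make_sigint_handler(cleanup, exit_code=1):
    """Build a SIGINT handler that cleans up, then exits with exit_code."""
    def handler(signum, frame):
        cleanup()
        # Nonzero exit code, so slurm records the job as FAILED.
        sys.exit(exit_code)
    return handler


# Usage in a wrapper script (cleanup function is a placeholder):
# signal.signal(signal.SIGINT, make_sigint_handler(remove_container))
```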

wrpscott self-assigned this Feb 22, 2018

donkirkby added a commit that referenced this issue Feb 28, 2018

Convert SIGTERM to SIGINT in docker_wrap.py for #726.

8bb8fa0

donkirkby added a commit that referenced this issue Feb 28, 2018

Revert changes to scancel as part of #726.

9623ff3

This partially reverts commit 5a73c86

donkirkby added this to the 0.11 - Docker support milestone Feb 28, 2018

donkirkby closed this as completed Feb 28, 2018
