Because of memory leaks in the underlying OS, we have to upgrade our clusters to CentOS 7.4. This also means that the slurm version will be newer than the current 16.x version, which requires some changes to slurmlib in kive. The --uid and --gid options of sbatch are no longer available, and as we no longer need them anyway, they will be removed.
For docker support, I added the -sint -f options to scancel. These are required so that a running docker wrapper script can clean up after itself: without the -sint option, the running process receives a kill signal, which prevents cleanup. However, scancel with the -s option does not work (it hangs) if the job is not actually running but still in the queue. This makes sense, since a 'pending' job has no process to send the signal to. The envisaged solution is to issue an scontrol hold jobid command to freeze a job in the pending state (a running job is unchanged), then read the job's state and issue the appropriate scancel command.
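The hold-then-cancel sequence described above could look roughly like the following. This is a minimal sketch, not the actual slurmlib code: the function names `run_cmd` and `cancel_job` are hypothetical, reading the state via `squeue -h -j <jobid> -o %T` is an assumption about how the state would be queried, and error handling is deliberately minimal.

```python
import subprocess


def run_cmd(args):
    """Run a command and return its stdout as text (hypothetical helper).

    Return codes are not checked here; real code would want to inspect them.
    """
    return subprocess.run(args, capture_output=True, text=True).stdout


def cancel_job(jobid, run=run_cmd):
    """Hold the job, read its state, then issue the matching scancel.

    Returns the state string that was observed (empty if squeue printed nothing).
    """
    jobid = str(jobid)
    # Freeze the job if it is still pending; per the issue text,
    # a running job is unchanged by this.
    run(["scontrol", "hold", jobid])
    # Assumption: query the state with squeue (-h: no header, %T: job state).
    state = run(["squeue", "-h", "-j", jobid, "-o", "%T"]).strip()
    if state == "RUNNING":
        # Send INT so that the wrapper script can clean up after itself.
        run(["scancel", "-sint", "-f", jobid])
    elif state == "PENDING":
        # No process exists yet, so cancel without -s to avoid the hang.
        run(["scancel", jobid])
    # Any other state is left alone in this sketch.
    return state
```

The `run` parameter exists only so that the command runner can be replaced in tests without a slurm installation.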
When a running job is sent a scancel -sint -f, the process in question receives an INT signal. In our case, this will be the dockerwrapper script. The exit code of this script (that is, exactly how it responds to the signal) determines the final slurm state of the job. Process returns 0 ==> slurm state COMPLETED. Process returns nonzero ==> slurm state FAILED. Under no circumstances is a slurm state CANCELLED recorded, in contrast with cancelling a pending job (which goes to the slurm state CANCELLED).
Currently, the docker_wrapper script returns a nonzero return code on SIGINT.
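To make the resulting states explicit, the rules above can be written as a small lookup. This is only an illustrative sketch of the behaviour described in this thread; the function name `final_state_after_scancel` is hypothetical and is not part of slurmlib.

```python
def final_state_after_scancel(was_running, exit_code=None):
    """Return the slurm state expected after cancelling a job.

    was_running: True if the job was running when scancel -sint -f was sent,
                 False if it was still pending.
    exit_code:   exit code of the signalled process (ignored for pending jobs).
    """
    if not was_running:
        # A pending job has no process; cancelling it records CANCELLED.
        return "CANCELLED"
    # A running job ends according to how its process reacts to the INT signal.
    return "COMPLETED" if exit_code == 0 else "FAILED"
```

With the wrapper's current behaviour (nonzero exit on SIGINT), a cancelled running job would therefore end up as FAILED, never CANCELLED.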