modifications for slurm version 17.11 #726

Closed
wrpscott opened this issue Feb 22, 2018 · 2 comments

@wrpscott
Contributor

Because of memory leaks in the underlying OS, we have to upgrade our clusters to CentOS 7.4. This also means that the slurm version will be newer than the current 16.x version, which requires some changes to slurmlib in kive. The --uid and --gid options of sbatch can no longer be used, and as we no longer need them anyway, they will be removed.

@wrpscott self-assigned this Feb 22, 2018
@wrpscott
Contributor Author

For docker support, I had added the -sint -f options to scancel. This is required so that a running docker wrapper script cleans up after itself. Without the -sint option, the running process gets a kill signal, which prevents cleanup. However, scancel with the -s option will not work (it hangs) if the job is not actually running but is still in the queue. This makes sense: a 'pending' job does not have a process to send the signal to. The envisaged solution: issue an scontrol hold jobid command in order to freeze the job in the pending state (a running job is unchanged). Then read its state and issue the appropriate scancel command: with -sint -f for a running job, without -s for a pending one.
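
The hold-then-cancel sequence described above could be sketched roughly as follows. This is only an illustration, not the actual slurmlib code: the function names cancel_job and _run are hypothetical, and reading the state with squeue is an assumption (the comment only says "read its state"). The command runner is passed in as a parameter so the logic can be exercised without a slurm installation.

```python
import subprocess


def _run(args):
    """Run a command and return its standard output as text."""
    return subprocess.check_output(args, universal_newlines=True)


def cancel_job(job_id, run=_run):
    """Hold the job, read its state, then cancel it in a state-appropriate way.

    Hypothetical helper, not part of slurmlib. Returns the state that was read.
    """
    job_id = str(job_id)

    # Freeze a pending job in the queue so it cannot start between the
    # state check and the scancel call; a running job is left unchanged.
    run(["scontrol", "hold", job_id])

    # Assumption: squeue is used to read the state. %T prints the job state
    # in its long form, e.g. PENDING or RUNNING. If the job has already left
    # the queue, squeue may exit nonzero and the default runner will raise.
    state = run(["squeue", "--noheader", "--jobs", job_id, "--format", "%T"]).strip()

    if state == "RUNNING":
        # Send SIGINT (including to the batch script) so the wrapper can clean up.
        run(["scancel", "-s", "INT", "-f", job_id])
    elif state:
        # No process to signal, so use a plain cancel without -s.
        run(["scancel", job_id])
    return state
```

Calling cancel_job(1234) would then issue the three commands in order; a caller that needs to survive a job disappearing between the hold and the state check would have to catch subprocess.CalledProcessError.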

@wrpscott
Contributor Author

When a running job is sent a scancel -sint -f, the process in question will receive an INT signal. In our case, this will be the docker_wrapper script. The exit code of this script (that is, exactly how it responds to the signal) will determine the final slurm state of the job. Process returns 0 ==> slurm state COMPLETED. Process returns nonzero ==> slurm state FAILED. Under no circumstances is a slurm state CANCELLED recorded, in contrast with cancelling a pending job (which goes to the slurm state CANCELLED).
Currently, the docker_wrapper script returns a nonzero return code on SIGINT, so a running job cancelled this way ends up in the slurm state FAILED.
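
To make the relationship between the wrapper's exit code and the final slurm state concrete, here is a minimal sketch of a wrapper that cleans up on SIGINT and exits nonzero, together with the state mapping described above. It is not the real docker_wrapper script; all names (Interrupted, install_handler, run_with_cleanup, expected_slurm_state) are hypothetical, and the exit code 1 is an arbitrary nonzero choice.

```python
import signal


class Interrupted(Exception):
    """Raised from the SIGINT handler to unwind into the cleanup code."""


def _raise_interrupted(signum, frame):
    raise Interrupted()


def install_handler():
    """Route SIGINT (as sent by scancel -sint -f) to the Interrupted exception."""
    signal.signal(signal.SIGINT, _raise_interrupted)


def run_with_cleanup(work, cleanup):
    """Run work(), always run cleanup(), and return a process exit code."""
    try:
        work()
    except Interrupted:
        return 1  # nonzero: slurm would record FAILED
    finally:
        cleanup()
    return 0  # zero: slurm would record COMPLETED


def expected_slurm_state(was_running, exit_code=None):
    """Final slurm state after a cancel, following the comment above."""
    if not was_running:
        return "CANCELLED"  # a pending job has no process to signal
    return "COMPLETED" if exit_code == 0 else "FAILED"
```

A real wrapper would call install_handler() first and then pass the result of run_with_cleanup(...) to sys.exit(). Returning 0 from the Interrupted branch instead would make a cancelled running job show up as COMPLETED, so the nonzero choice at least keeps cancelled jobs distinguishable from successful ones.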

donkirkby added a commit that referenced this issue Feb 28, 2018
This partially reverts commit 5a73c86
@donkirkby added this to the 0.11 - Docker support milestone Feb 28, 2018