Runs cancelled for memory limits not marked #758

Closed
4 of 5 tasks
donkirkby opened this issue Jan 17, 2019 · 3 comments

donkirkby (Member) commented Jan 17, 2019

I launched a bunch of MiCall runs overnight on the test server, and five of the resistance runs got cancelled by Slurm for exceeding the 100MB memory limit. Because there's no fleet manager watching them, those runs just stayed in the running state.
When I was looking at the runs that failed, I found that they ran for two to four minutes on the compute nodes, while the head node ran them in less than ten seconds. I wonder if issue #240 has somehow reappeared. Could the Slurm jobs be running under the system Python instead of the virtualenv?

  • Record failure when Slurm kills a job, by catching KeyboardInterrupt. Distinguish it from a user cancelling a job in Kive. Slurm seems to always use SIGKILL, and that can't be caught. (A sketch of the signal handling follows this list.)
  • Fix the configuration so that jobs aren't so slow on the compute nodes. (They weren't using the system Python.)
  • Record the Slurm job ID from the call to sbatch.
  • Check that Slurm jobs are still active when container runs are queried.
  • Don't update the end time when the job has already ended.
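
A minimal sketch of the signal-handling idea from the first task, assuming a hypothetical run_step() entry point and mark_run_failed() helper (neither is Kive's actual API): SIGTERM can be converted into a KeyboardInterrupt so existing cleanup code records the failure, but SIGKILL can't be trapped at all, which is why the later tasks fall back to asking Slurm about job state.

```python
import signal
import time


def run_step():
    # Stand-in for the real pipeline step; here it just burns time.
    time.sleep(60)


def mark_run_failed(reason):
    # Stand-in for updating the run record in the database.
    print('marking run failed:', reason)


def handle_sigterm(signum, frame):
    # Re-raise as KeyboardInterrupt so the existing try/except cleanup runs.
    raise KeyboardInterrupt('received signal {}'.format(signum))


# SIGTERM can be trapped; SIGKILL cannot, so a job killed outright never
# reaches the except block below and has to be reconciled by polling Slurm.
signal.signal(signal.SIGTERM, handle_sigterm)

try:
    run_step()
except KeyboardInterrupt:
    mark_run_failed('killed by the scheduler, not cancelled by a user')
    raise
```
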
donkirkby (Member, Author) commented:

The Slurm jobs are not running under the system Python; it's just slow to import all of the Django framework. This will probably get better when we migrate to Python 3.
We should probably add a step to the purge command that checks for runs that got killed.

donkirkby (Member, Author) commented:

New plan: check Slurm job status when container runs are queried.
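
A rough sketch of that check, not Kive's actual code: the Slurm job ID is captured at submission time (sbatch --parsable prints just the ID) and stored on the run, then sacct is asked for the job's state whenever the run is queried. The run object and its field names here are placeholders.

```python
import subprocess

# sacct states that mean the job is no longer running on the cluster.
FINISHED_STATES = {'COMPLETED', 'FAILED', 'CANCELLED', 'TIMEOUT',
                   'OUT_OF_MEMORY', 'NODE_FAIL'}


def submit_job(script_path):
    """Submit a batch script and return the Slurm job ID to record on the run."""
    output = subprocess.check_output(['sbatch', '--parsable', script_path],
                                     universal_newlines=True)
    return output.strip().split(';')[0]  # --parsable prints "jobid[;cluster]"


def slurm_job_state(job_id):
    """Ask sacct for the top-level state of one job (RUNNING, FAILED, ...)."""
    output = subprocess.check_output(
        ['sacct', '-j', str(job_id), '-X', '-n', '-P', '-o', 'State'],
        universal_newlines=True)
    first_line = output.strip().splitlines()[0] if output.strip() else ''
    return first_line.split()[0] if first_line else ''  # "CANCELLED by 123" -> "CANCELLED"


def check_active_run(run):
    """Mark a still-running run as failed if its Slurm job has already ended."""
    if run.state == 'Running' and slurm_job_state(run.slurm_job_id) in FINISHED_STATES:
        run.state = 'Failed'  # the real code would also save and stamp an end time
```
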

donkirkby (Member, Author) commented:

ContainerRun.check_slurm_state() isn't checking the run state when it's given a run ID. That makes any completed run get marked as failed, with its end time reset to the current time, whenever you view its run page.
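
A sketch of the missing guard, with placeholder state names rather than Kive's real ContainerRun constants, and not the method's actual signature: a run that isn't in an active state should be left alone, so its original end time survives.

```python
from datetime import datetime

ACTIVE_STATES = {'New', 'Pending', 'Running'}  # placeholder names, not Kive's constants


def check_slurm_state(run, slurm_state):
    """Only a run that is still active may be failed because its Slurm job ended."""
    if run.state not in ACTIVE_STATES:
        # The run already finished; leave its state and end time untouched.
        return
    if slurm_state not in ('PENDING', 'RUNNING'):
        run.state = 'Failed'
        run.end_time = datetime.utcnow()
```
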
