Runs cancelled for memory limits not marked #758

Closed
4 of 5 tasks
donkirkby opened this issue Jan 17, 2019 · 3 comments

donkirkby (Member) commented Jan 17, 2019

I launched a bunch of MiCall runs overnight on the test server, and five of the resistance runs got cancelled by Slurm for exceeding the 100MB memory limit. Because there's no fleet manager watching them, those runs just stayed in the running state.
When I was looking at the runs that failed, I found that they ran for two to four minutes on the compute nodes, while the head node ran them in less than ten seconds. I wonder if issue #240 has somehow reappeared. Could the Slurm jobs be running under the system Python instead of the virtualenv?

  • Record failure when Slurm kills a job, by catching KeyboardInterrupt. Distinguish it from a user cancelling a job in Kive. Slurm seems to always use SIGKILL, and that can't be caught. (A sketch of the signal handling follows this list.)
  • Fix the configuration so that jobs aren't so slow on the compute nodes. (They weren't using the system Python.)
  • Record the Slurm job ID from the call to sbatch.
  • Check that Slurm jobs are still active when container runs are queried.
  • Don't update the end time when the job has already ended.
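
A minimal sketch of the signal-handling idea from the first task, assuming a hypothetical run_step() entry point and mark_run_failed() helper (neither is Kive's actual API): SIGTERM can be converted into a KeyboardInterrupt so existing cleanup code records the failure, but SIGKILL can't be trapped at all, which is why the later tasks fall back to asking Slurm about job state.

```python
import signal
import time


def run_step():
    # Stand-in for the real pipeline step; here it just burns time.
    time.sleep(60)


def mark_run_failed(reason):
    # Stand-in for updating the run record in the database.
    print('marking run failed:', reason)


def handle_sigterm(signum, frame):
    # Re-raise as KeyboardInterrupt so the existing try/except cleanup runs.
    raise KeyboardInterrupt('received signal {}'.format(signum))


# SIGTERM can be trapped; SIGKILL cannot, so a job killed outright never
# reaches the except block below and has to be reconciled by polling Slurm.
signal.signal(signal.SIGTERM, handle_sigterm)

try:
    run_step()
except KeyboardInterrupt:
    mark_run_failed('killed by the scheduler, not cancelled by a user')
    raise
```
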
donkirkby (Member, Author) commented:

The Slurm jobs are not running under the system Python; it's just slow to import all of the Django framework. This will probably get better when we migrate to Python 3.
We should probably add a step to the purge command that checks for runs that got killed.

donkirkby (Member, Author) commented:

New plan: check Slurm job status when container runs are queried.
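
A rough sketch of that check, not Kive's actual code: the Slurm job ID is captured at submission time (sbatch --parsable prints just the ID) and stored on the run, then sacct is asked for the job's state whenever the run is queried. The run object and its field names here are placeholders.

```python
import subprocess

# sacct states that mean the job is no longer running on the cluster.
FINISHED_STATES = {'COMPLETED', 'FAILED', 'CANCELLED', 'TIMEOUT',
                   'OUT_OF_MEMORY', 'NODE_FAIL'}


def submit_job(script_path):
    """Submit a batch script and return the Slurm job ID to record on the run."""
    output = subprocess.check_output(['sbatch', '--parsable', script_path],
                                     universal_newlines=True)
    return output.strip().split(';')[0]  # --parsable prints "jobid[;cluster]"


def slurm_job_state(job_id):
    """Ask sacct for the top-level state of one job (RUNNING, FAILED, ...)."""
    output = subprocess.check_output(
        ['sacct', '-j', str(job_id), '-X', '-n', '-P', '-o', 'State'],
        universal_newlines=True)
    first_line = output.strip().splitlines()[0] if output.strip() else ''
    return first_line.split()[0] if first_line else ''  # "CANCELLED by 123" -> "CANCELLED"


def check_active_run(run):
    """Mark a still-running run as failed if its Slurm job has already ended."""
    if run.state == 'Running' and slurm_job_state(run.slurm_job_id) in FINISHED_STATES:
        run.state = 'Failed'  # the real code would also save and stamp an end time
```
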

donkirkby (Member, Author) commented:

ContainerRun.check_slurm_state() isn't checking the run state when it's given a run ID. That makes any completed run get marked as failed, with its end time reset to the current time, whenever you view its run page.
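
A sketch of the missing guard, with placeholder state names rather than Kive's real ContainerRun constants, and not the method's actual signature: a run that isn't in an active state should be left alone, so its original end time survives.

```python
from datetime import datetime

ACTIVE_STATES = {'New', 'Pending', 'Running'}  # placeholder names, not Kive's constants


def check_slurm_state(run, slurm_state):
    """Only a run that is still active may be failed because its Slurm job ended."""
    if run.state not in ACTIVE_STATES:
        # The run already finished; leave its state and end time untouched.
        return
    if slurm_state not in ('PENDING', 'RUNNING'):
        run.state = 'Failed'
        run.end_time = datetime.utcnow()
```
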
