I launched a bunch of MiCall runs overnight on the test server, and five of the resistance runs got cancelled by Slurm for exceeding the 100MB memory limit. Because there's no fleet manager watching them, those runs just stayed in the running state.
When I was looking at the runs that failed, I found that they ran for two to four minutes on the compute nodes, while the head node ran them in less than ten seconds. I wonder if issue #240 has somehow reappeared. Could the Slurm jobs be running under the system Python instead of the virtualenv?
Record failure when Slurm kills a job, by catching KeyboardInterrupt (see the first sketch after this list). Distinguish that from a user cancelling the job in Kive. Slurm seems to always use SIGKILL, and that can't be caught.
Fix configuration so that jobs aren't so slow on the compute nodes. (Wasn't using system Python.)
Record the Slurm job id from the call to sbatch.
Check that Slurm jobs are still active when container runs are queried (see the second sketch after this list).
Don't update the end time when the job has already ended.
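For the first item, here's a minimal sketch of the kind of handler that could record the failure; it is not Kive's actual code. The mark_run_failed() helper is hypothetical, and the sketch assumes Slurm delivers a SIGTERM (as it does for scancel or a time limit with a grace period) that can be translated into a KeyboardInterrupt. An out-of-memory SIGKILL still can't be caught, so those runs have to be swept up by a separate state check.

```python
import signal
import sys


def mark_run_failed(run_id, reason):
    # Hypothetical helper; in Kive this would update the ContainerRun record.
    sys.stderr.write("run %s failed: %s\n" % (run_id, reason))


def install_termination_handler():
    """Turn SIGTERM into KeyboardInterrupt so the except clause below runs.

    This only helps when Slurm sends SIGTERM. An out-of-memory SIGKILL
    can't be caught, so those runs still need a separate state check.
    """
    def handler(signum, frame):
        raise KeyboardInterrupt("terminated by signal %d" % signum)

    signal.signal(signal.SIGTERM, handler)


def run_step(run_id, work):
    install_termination_handler()
    try:
        work()
    except KeyboardInterrupt as exc:
        # Record the failure before the process exits. A cancel issued from
        # Kive would have set its own state first, so it can be told apart
        # from a Slurm kill here.
        mark_run_failed(run_id, exc)
        raise
```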
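For the sbatch and status-check items, a sketch of how the job id could be captured and later polled with standard Slurm commands (`sbatch --parsable` and `squeue`); the function names and the 100MB default are illustrative, not Kive's API.

```python
import subprocess


def submit_job(script_path, mem_mb=100):
    """Submit a job and return its Slurm job id.

    sbatch --parsable prints just the id (optionally "id;cluster"), which is
    easier to record than parsing the "Submitted batch job <id>" message.
    """
    output = subprocess.check_output(
        ["sbatch", "--parsable", "--mem=%dM" % mem_mb, script_path])
    return int(output.decode().strip().split(";")[0])


def job_is_active(job_id):
    """Return True while squeue still lists the job (pending or running)."""
    try:
        output = subprocess.check_output(
            ["squeue", "--noheader", "--format=%T", "--jobs", str(job_id)])
    except subprocess.CalledProcessError:
        # squeue exits with an error for jobs it no longer knows about.
        return False
    return bool(output.strip())
```

Using `--parsable` keeps the job id capture free of string scraping; `sacct` would be the alternative if completed jobs also need their exit state recorded.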
The Slurm jobs are not running under the system Python; it's just slow to import all of the Django framework. This will probably get better when we migrate to Python 3.
We should probably add a step to the purge command that checks for runs that got killed.
ContainerRun.check_slurm_state() isn't checking the run state when it's given a run id. That makes any completed run get marked as failed with the current time when you go to view its run page.
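A minimal sketch of the missing guard, using an illustrative stand-in for ContainerRun rather than Kive's real model: return early if the run already has an end time or a finished state, and only mark it failed when it is still active but its Slurm job has disappeared.

```python
from datetime import datetime


class Run(object):
    """Illustrative stand-in for ContainerRun, not Kive's real model."""
    NEW, RUNNING, COMPLETE, FAILED = "N", "R", "C", "F"

    def __init__(self, slurm_job_id, state=RUNNING, end_time=None):
        self.slurm_job_id = slurm_job_id
        self.state = state
        self.end_time = end_time


def check_slurm_state(run, job_is_active):
    """Only mark a run failed if it's still active and its Slurm job is gone."""
    if run.end_time is not None or run.state not in (Run.NEW, Run.RUNNING):
        # The run already ended; leave its state and end time alone.
        return
    if not job_is_active(run.slurm_job_id):
        run.state = Run.FAILED
        run.end_time = datetime.now()
```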