On our cluster we run ariba in some pipelines, and I noticed a pattern where early samples do not finish properly. No output folders are produced, and the stderr of those early samples shows:
Stopping! Signal received: 18
Stopping! Signal received: 20
This is very likely due to our grid management setup with Slurm, where partitions are configured so that the partition a job is submitted to determines the job ordering (i.e. high-priority jobs take resources from low-priority jobs). When we run a batch of jobs on a per-sample basis, they launch alphabetically, which often puts higher-priority jobs later in the list. So the initial jobs are started, then suspended (held in memory), then resumed once the high-priority jobs are done. As a result, the first set of non-priority samples fails with the error above (18 being SIGCONT, likely telling ariba to resume, but I'm guessing ariba handles all signals by stopping). Re-running the commands when the queue is not busy fixes the issue, though that is unfriendly for automation.
Hope this information makes sense; if not, let me know and I'll try to describe it better.
Kind regards,
-Kim Ng
Hi @kimleeng. Interesting issue. I can replicate this running:
ariba test out
and sending a SIGTSTP signal to it. I haven't got time to fix it right now, but will look into it as there seems to be a batch of issues around signal handling.
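For anyone following along, here is a minimal Python sketch of the general idea being discussed: treat only real termination signals as fatal and leave the job-control signals (SIGTSTP, SIGCONT) at their OS defaults, so a suspend/resume cycle does not abort the run. This is not ariba's actual code; the names `FATAL_SIGNALS`, `JOB_CONTROL_SIGNALS`, `request_stop` and `install_handlers` are hypothetical, and which signals count as fatal is an assumption. Signal numbers are platform-dependent, so the sketch uses names rather than 18 and 20.

```python
import os
import signal

# Assumption for this sketch: only these signals should end the run.
FATAL_SIGNALS = (signal.SIGINT, signal.SIGTERM)
# Job-control signals a scheduler may send when suspending and resuming a job.
JOB_CONTROL_SIGNALS = (signal.SIGTSTP, signal.SIGCONT)

stop_requested = []  # numbers of fatal signals received so far


def request_stop(signum, frame):
    """Hypothetical handler: record the signal so the main loop can clean up."""
    stop_requested.append(signum)


def install_handlers():
    """Register the stop handler for fatal signals only."""
    for sig in FATAL_SIGNALS:
        signal.signal(sig, request_stop)
    for sig in JOB_CONTROL_SIGNALS:
        # Keep the OS default action so suspend/resume is not treated as fatal.
        signal.signal(sig, signal.SIG_DFL)


install_handlers()

# Simulate a resume: the default action for SIGCONT on a running process
# is to carry on, so nothing is recorded.
os.kill(os.getpid(), signal.SIGCONT)

# Simulate a real termination request: this one is recorded.
os.kill(os.getpid(), signal.SIGTERM)
```

With handlers set up like this, a long-running loop could check `stop_requested` between steps and shut down cleanly only when a fatal signal has arrived. Sending SIGTSTP to the process would still suspend it as usual, and it would continue where it left off on SIGCONT.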