Running PBS Pro 13.1.0.16056, having modified the submission templates as per #184 and my PR #185, I've noticed that the worker nodes don't clean up neatly when the master node terminates.
When submitting using drake, upon failure of a target the workflow is supposed to stop and the workers to be terminated.
When a target fails, I subsequently get the following error:
qdel: illegally formed job identifier: cmq7082
This corresponds to the job name for the job array, given by the socket of (I assume) the first worker in the array (or maybe the master).
When I examine the output of qstat -u mstr3336 -x, I see the following:

pbsserver:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3965541.pbsserv mstr3336 small    run_make_h  60985   1   1   16gb 23:59 F 00:44
3965544[].pbsse mstr3336 small    cmq7082        --   1   1    4gb 23:59 F    --
We see that the job ID of our batch array is 3965544[], and the job name was indeed given by our submission script.

Referring to the SGE child class (clustermq/R/qsys_sge.r, lines 26 to 38 at e7c68ed), we see that the finalize function calls qdel on job_id, which seems okay. Looking closer at the submit_jobs implementation (clustermq/R/qsys_sge.r, lines 14 to 17 at e7c68ed), however, job_id is simply given by job_name. (job_name is inherited from clustermq/R/qsys.r, lines 221 to 237 at e7c68ed.)
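In other words, a rough paraphrase of the logic described above (not the actual clustermq source; the function and argument names here are only illustrative): the identifier later handed to qdel is just the job name, while the real identifier printed by qsub is thrown away.

# Sketch only: submission discards qsub's stdout and keeps the job name
submit_jobs_sketch <- function(template_file, job_name) {
    system(paste("qsub", template_file))  # exit status only; the printed job ID is lost
    job_name                              # e.g. "cmq7082" is what gets stored as the "job id"
}

# ... so cleanup later tries to delete by name, which PBS rejects
finalize_sketch <- function(job_id) {
    system(paste("qdel", job_id))         # qdel: illegally formed job identifier: cmq7082
}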
Uh oh! This is not in concordance with the PBS specs (PBS Professional 18.2 User's Guide, UG-13):

After you submit a job, PBS returns a job identifier.
Format for a job: <sequence number>.<server name>
Format for a job array: <sequence number>[].<server name>.<domain>
You'll need the job identifier for any actions involving the job, such as checking job status, modifying the job, tracking the job, or deleting the job.
Additionally, the environment variable PBS_JOBID is exposed to the .pbs script.
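As a minimal sketch (assuming the worker's R process inherits the scheduler's environment, and that PBS sets PBS_JOBID for each sub-job of the array), the worker side could read it with:

job_id <- Sys.getenv("PBS_JOBID")   # e.g. something like "3965544[1].pbsserver" for a sub-job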
So it's clear that either:
the return value from qsub for the batch job is needed, or
PBS_JOBID somehow needs to be sent back to the master.
My intuition tells me that capturing the return of qsub is the simpler option, though given the current submission check (clustermq/R/qsys_sge.r, lines 19 to 22 at e7c68ed):
stop("Job submission failed with error code ", success)
the result of system(...) there is only the command's exit status. After checking the man page for system, we can see that by setting intern = TRUE, and then doing a little extra work to retrieve the command output, we are able to access both.
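Here is a minimal sketch of what that could look like, assuming a PBS-style qsub that prints the job identifier on stdout (the filename my_job.pbs is only a placeholder). With intern = TRUE, system() returns the captured output, and a non-zero exit code is attached as the "status" attribute:

out <- system("qsub my_job.pbs", intern = TRUE)
status <- attr(out, "status")            # NULL on success, exit code on failure
if (!is.null(status) && status != 0)
    stop("Job submission failed with error code ", status)
job_id <- out[1]                         # e.g. "3965544[].pbsserver"

Capturing stdout this way would also make it possible to store the full array identifier for finalize to pass to qdel.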
I'll experiment with this, and then put in a PR if all goes well.