Running PBS Pro 13.1.0.16056, having modified the submission templates as per #184 and my PR #185, I've noticed that the worker nodes don't clean up neatly when the master node terminates.
When submitting using drake, upon failure of a target the workflow is supposed to stop and the workers to be terminated.
When a target fails, I subsequently get the following error:
qdel: illegally formed job identifier: cmq7082
This corresponds to the job name for the job array, given by the socket of (I assume) the first worker in the array (or maybe the master).
When I examine the output of qstat -u mstr3336 -x, I see the following:

pbsserver:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3965541.pbsserv mstr3336 small    run_make_h  60985   1   1   16gb 23:59 F 00:44
3965544[].pbsse mstr3336 small    cmq7082        --   1   1    4gb 23:59 F    --
We see that the job ID of our batch array is 3965544[], and the job name was indeed given by our submission script.

Referring to the SGE child class (clustermq/R/qsys_sge.r, lines 26 to 38 at e7c68ed), we see that the finalize function calls qdel on job_id, which seems okay. Looking closer at the submit_jobs implementation (clustermq/R/qsys_sge.r, lines 14 to 17 at e7c68ed), however, job_id is simply given by job_name. (job_name is inherited from clustermq/R/qsys.r, lines 221 to 237 at e7c68ed.)
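In other words, a rough paraphrase of the logic described above (not the actual clustermq source; the function and argument names here are only illustrative): the identifier later handed to qdel is just the job name, while the real identifier printed by qsub is thrown away.

# Sketch only: submission discards qsub's stdout and keeps the job name
submit_jobs_sketch <- function(template_file, job_name) {
    system(paste("qsub", template_file))  # exit status only; the printed job ID is lost
    job_name                              # e.g. "cmq7082" is what gets stored as the "job id"
}

# ... so cleanup later tries to delete by name, which PBS rejects
finalize_sketch <- function(job_id) {
    system(paste("qdel", job_id))         # qdel: illegally formed job identifier: cmq7082
}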
Uh oh! This is not in concordance with the PBS specs (PBS Professional 18.2 User's Guide, UG-13):

After you submit a job, PBS returns a job identifier.
Format for a job: <sequence number>.<server name>
Format for a job array: <sequence number>[].<server name>.<domain>
You'll need the job identifier for any actions involving the job, such as checking job status, modifying the job, tracking the job, or deleting the job.
Additionally, the environment variable PBS_JOBID is exposed to the .pbs script.
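As a minimal sketch (assuming the worker's R process inherits the scheduler's environment, and that PBS sets PBS_JOBID for each sub-job of the array), the worker side could read it with:

job_id <- Sys.getenv("PBS_JOBID")   # e.g. something like "3965544[1].pbsserver" for a sub-job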
So it's clear that either:
the return value from qsub for the batch job is needed, or
PBS_JOBID somehow needs to be sent back to the master.
My intuition tells me that capturing the return of qsub is the simpler option, though given the current submission check (clustermq/R/qsys_sge.r, lines 19 to 22 at e7c68ed):
stop("Job submission failed with error code ", success)
the result of system(...) there is only the command's exit status. After checking the man page for system, we can see that by setting intern = TRUE, and then doing a little extra work to retrieve the command output, we are able to access both.
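Here is a minimal sketch of what that could look like, assuming a PBS-style qsub that prints the job identifier on stdout (the filename my_job.pbs is only a placeholder). With intern = TRUE, system() returns the captured output, and a non-zero exit code is attached as the "status" attribute:

out <- system("qsub my_job.pbs", intern = TRUE)
status <- attr(out, "status")            # NULL on success, exit code on failure
if (!is.null(status) && status != 0)
    stop("Job submission failed with error code ", status)
job_id <- out[1]                         # e.g. "3965544[].pbsserver"

Capturing stdout this way would also make it possible to store the full array identifier for finalize to pass to qdel.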
I'll experiment with this, and then put in a PR if all goes well.