
Prep job failures not captured on exit #691

Open
KateFriedman-NOAA opened this issue Mar 23, 2022 · 13 comments
Labels: bug (Something isn't working)

@KateFriedman-NOAA (Member)

Expected behavior

The job (prep.sh) would exit with the exit value returned by the obsproc package scripts.

Current behavior

The job exits with an exit code of 0 regardless of what happens within it.

Machines affected

All machines are affected.

To Reproduce

Point the workflow to a wrong obsproc package (e.g. one that does not exist).

Detailed Description

Here is the bottom of a gdasprep.log on Hera showing the job erroring in the obsproc package script but still exiting with 0:

+ 13.467s + /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314/err_exit
+ 13.469s + [ -n '' ]
+ 13.469s + set -e
+ 13.469s + kill -n 9 291727
/scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/scripts/exglobal_makeprepbufr.sh.ecf: line 81: 291727: Killed
+ 13.494s + errsc=265
+ 13.494s + [ 265 -ne 0 ]
+ 13.494s + exit 265
/scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/jobs/JGLOBAL_PREP: line 325: 291722 Killed                  $SCRIPTSobsproc_global/exglobal_makeprepbufr.sh.ecf
+ 14s + eval err_gdas_makeprepbufr=137
++ 14s + err_gdas_makeprepbufr=137
+ 14s + eval '[[' '$err_gdas_makeprepbufr' -ne 0 ']]'
++ 14s + [[ 137 -ne 0 ]]
+ 13.470s + exit 7
+ 14s + /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314/err_exit
++ 0s + '[' -n '' ']'
++ 0s + set -e
++ 0s + kill -n 9
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
++ 14s + hostname
++ 14s + date -u
+ 14s + echo ' h16c01  --  Wed Mar 16 17:52:41 UTC 2022'
+ 14s + '[' -n '' ']'
+ 14s + '[' -n '' ']'
+ 14s + '[' NO '!=' YES ']'
+ 14s + cd /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr
+ 14s + rm -rf /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314
+ 14s + date -u
Wed Mar 16 17:52:41 UTC 2022
+ 14s + exit
+ status=0
+ [[ 0 -ne 0 ]]
+ exit 0

This part of jobs/rocoto/prep.sh isn't getting the error code coming out of JGLOBAL_PREP:

105     $HOMEobsproc_network/jobs/JGLOBAL_PREP
106     status=$?
107     [[ $status -ne 0 ]] && exit $status

Possible Implementation

Change lines 106 and 107 in jobs/rocoto/prep.sh so that they pick up the correct error code variable coming out of JGLOBAL_PREP.
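A minimal sketch of that change (the variable name err_gdas_makeprepbufr is taken from the log above and may differ per RUN; it also assumes the j-job's error variable is made visible to prep.sh, e.g. by sourcing the j-job or writing the value to a small status file, since shell variables set in a child process do not propagate back automatically):

105     $HOMEobsproc_network/jobs/JGLOBAL_PREP
106     status=$?
107     # Sketch only: fall back to the obsproc error variable when the shell exit status is 0
108     (( status == 0 )) && status="${err_gdas_makeprepbufr:-0}"
109     [[ ${status} -ne 0 ]] && exit "${status}"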

KateFriedman-NOAA added the bug label Mar 23, 2022
lgannoaa self-assigned this Apr 20, 2022
@lgannoaa (Contributor) commented Apr 20, 2022

@KateFriedman-NOAA From what you reported, error code 137 was successfully caught and exit 7 was issued.
The issue here is that the "kill -n 9" instruction has a syntax error: it is invoked without a PID.

/scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/jobs/JGLOBAL_PREP: line 325: 291722 Killed $SCRIPTSobsproc_global/exglobal_makeprepbufr.sh.ecf

+ 14s + eval err_gdas_makeprepbufr=137
++ 14s + err_gdas_makeprepbufr=137
+ 14s + eval '[[' '$err_gdas_makeprepbufr' -ne 0 ']]'
++ 14s + [[ 137 -ne 0 ]]
+ 13.470s + exit 7
+ 14s + /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314/err_exit
++ 0s + '[' -n '' ']'
++ 0s + set -e
++ 0s + kill -n 9
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
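As a rough illustration only (the actual err_exit utility may be structured differently): kill -n 9 requires a PID argument, so when the saved PID variable is empty bash prints the usage message and the script carries on. A guarded form would look something like:

# Sketch with an assumed variable name; not the actual err_exit source.
pid_to_kill="${pid_to_kill:-}"
if [[ -n "${pid_to_kill}" ]]; then
    kill -n 9 "${pid_to_kill}"
else
    echo "err_exit: no PID recorded; exiting with failure status" >&2
    exit 1
fi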

@KateFriedman-NOAA (Member, Author)

@lgannoaa While an exit code was caught coming out of one of the OBSPROC scripts, the final exit code in our prep.sh was 0, which is incorrect. See the final lines of the gdasprep.log shown above.

@lgannoaa (Contributor)

@KateFriedman-NOAA The error code was caught and the error exit was executed; that is the correct behavior. The error-exit kill statement has the wrong syntax. I recommend contacting system support to find out why this error exit was not working.
++ 0s + kill -n 9
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]

@CoryMartin-NOAA (Contributor)

@kevindougherty-noaa found that this is indeed still an issue. The Hera stmp disk quota was exceeded earlier in the week, which caused the prep job to not produce correct results while still returning exit code 0. It was only four cycles later, when the fit2obs job ran, that it was discovered that the prepbufr file had not been generated.

@aerorahul (Contributor)

@CoryMartin-NOAA
We have identified the root cause of this issue. The failure is in the obsproc code base and even though that job fails, it returns with an exit code 0. The prep job in the global-workflow does not examine the prepbufr file contents and relies on the obsproc j-job JOBSPROC_GLOBAL_PREP to provide the correct exit code.
This has been raised with the obsproc developers.

@ilianagenkova (Contributor)

I am tagging myself here so it gets on my "to do" list @ilianagenkova

KateFriedman-NOAA mentioned this issue Dec 21, 2023
@DavidHuber-NOAA (Contributor)

On Hercules, the prepobs_prepdata executable is crashing because it needs the MKL library, which is only available on Orion, yet the gdasprep and gfsprep jobs continue processing thereafter. It seems like a bug that the failure of prepobs_prepdata does not stop the processing of the *prep jobs. An example log file is available here: /work/noaa/global/dhuber/SAVELOGS/cycled_herc2/2021110900/gdasprep.log.

@ilianagenkova (Contributor)

I started looking into this, but it's more complicated than simply checking the error status and not proceeding further. The code has some intentional "hard crashes" and "silent errors", so we need to understand the reasons for them before changing the code. For example, if a critical data set can't be processed in prepobs, the code crashes in order to get someone's attention - not an elegant solution, but that's how it's done now.

@DavidHuber-NOAA (Contributor)

@ilianagenkova That's good to know. I will mention that this particular problem caused a downstream failure of fit2obs, which fails because the prepbufr file is never generated (by prepobs_prepdata). Granted, fit2obs is a validation piece, but it does stop the cycling process, as archiving will not start until fit2obs finishes successfully. I wonder if fit2obs should be amended to finish its work if the prepbufr is missing.

@KateFriedman-NOAA (Member, Author)

@DavidHuber-NOAA In general we don't want the prepbufr file to be missing, but part of the problem is that the analysis runs without it (technically, we don't want it to), and the problem documented in this issue means that if the prepbufr file isn't created, no one knows without checking, since jobs in the cycle don't fail until fit2obs (if it's on). So at the very least, for this issue, we'll want to check for prepbufr existence at the end of the prep job (in the workflow scripts) while waiting for prepobs/obsproc to make updates.
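A minimal sketch of such a check at the end of the prep job, assuming the usual ${COMOUT}/${APREFIX}prepbufr naming (the exact variable names in the workflow scripts may differ):

# Sketch only: fail the prep job if the prepbufr file is missing or empty.
if [[ ! -s "${COMOUT}/${APREFIX}prepbufr" ]]; then
    echo "FATAL ERROR: prepbufr was not produced by JGLOBAL_PREP" >&2
    exit 1
fi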

@ilianagenkova (Contributor)

@KateFriedman-NOAA If this is only a dev-run issue, one could default to using the production prepbufr file (if you don't want an experiment to stop) and send a notification (mailx) to the developer that something needs to be looked at. Just a thought...

@WalterKolczynski-NOAA (Contributor)

We do want the experiment to stop. The issue is prepobs isn't exiting with a non-zero code, so the workflow thinks it was successful and continues on. This makes identifying the root cause of failures down the line difficult, because the issue was actually in prepobs.

@KateFriedman-NOAA (Member, Author)

one can default to using the production prepbufr file

So for development we can't use the production prepbufr, because then the output from the experiment's prior cycle won't be included and you'd be resetting the experiment. We need the generated prepbufr each cycle in our experiments.
