
Prep job failures not captured on exit #691

Open
KateFriedman-NOAA opened this issue Mar 23, 2022 · 13 comments
Labels: bug (Something isn't working)

@KateFriedman-NOAA (Member)

Expected behavior

The job (prep.sh) would exit with the exit value returned by the obsproc package scripts.

Current behavior

The job exits with an exit code of 0 regardless of what happens within it.

Machines affected

All machines are affected.

To Reproduce

Point the workflow to a wrong obsproc package (e.g. one that does not exist).

Detailed Description

Here is the bottom of a gdasprep.log on Hera showing the job erroring in the obsproc package script but still exiting with 0:

+ 13.467s + /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314/err_exit
+ 13.469s + [ -n '' ]
+ 13.469s + set -e
+ 13.469s + kill -n 9 291727
/scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/scripts/exglobal_makeprepbufr.sh.ecf: line 81: 291727: Killed
+ 13.494s + errsc=265
+ 13.494s + [ 265 -ne 0 ]
+ 13.494s + exit 265
/scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/jobs/JGLOBAL_PREP: line 325: 291722 Killed                  $SCRIPTSobsproc_global/exglobal_makeprepbufr.sh.ecf
+ 14s + eval err_gdas_makeprepbufr=137
++ 14s + err_gdas_makeprepbufr=137
+ 14s + eval '[[' '$err_gdas_makeprepbufr' -ne 0 ']]'
++ 14s + [[ 137 -ne 0 ]]
+ 13.470s + exit 7
+ 14s + /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314/err_exit
++ 0s + '[' -n '' ']'
++ 0s + set -e
++ 0s + kill -n 9
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
++ 14s + hostname
++ 14s + date -u
+ 14s + echo ' h16c01  --  Wed Mar 16 17:52:41 UTC 2022'
+ 14s + '[' -n '' ']'
+ 14s + '[' -n '' ']'
+ 14s + '[' NO '!=' YES ']'
+ 14s + cd /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr
+ 14s + rm -rf /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314
+ 14s + date -u
Wed Mar 16 17:52:41 UTC 2022
+ 14s + exit
+ status=0
+ [[ 0 -ne 0 ]]
+ exit 0

This part of jobs/rocoto/prep.sh isn't getting the error code coming out of JGLOBAL_PREP:

105     $HOMEobsproc_network/jobs/JGLOBAL_PREP
106     status=$?
107     [[ $status -ne 0 ]] && exit $status

Possible Implementation

Change lines 106 and 107 in jobs/rocoto/prep.sh so that they pick up the correct error code variable coming out of JGLOBAL_PREP.
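A minimal sketch of that change (the variable name err_gdas_makeprepbufr is taken from the log above and may differ per RUN; it also assumes the j-job's error variable is made visible to prep.sh, e.g. by sourcing the j-job or writing the value to a small status file, since shell variables set in a child process do not propagate back automatically):

105     $HOMEobsproc_network/jobs/JGLOBAL_PREP
106     status=$?
107     # Sketch only: fall back to the obsproc error variable when the shell exit status is 0
108     (( status == 0 )) && status="${err_gdas_makeprepbufr:-0}"
109     [[ ${status} -ne 0 ]] && exit "${status}"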

KateFriedman-NOAA added the bug label Mar 23, 2022
lgannoaa self-assigned this Apr 20, 2022
@lgannoaa (Contributor) commented Apr 20, 2022

@KateFriedman-NOAA From what you reported, error code 137 was successfully caught and exit 7 was issued.
The issue here is that the "kill -n 9" instruction has a syntax error: it is invoked without a PID.

/scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/jobs/JGLOBAL_PREP: line 325: 291722 Killed $SCRIPTSobsproc_global/exglobal_makeprepbufr.sh.ecf

+ 14s + eval err_gdas_makeprepbufr=137
++ 14s + err_gdas_makeprepbufr=137
+ 14s + eval '[[' '$err_gdas_makeprepbufr' -ne 0 ']]'
++ 14s + [[ 137 -ne 0 ]]
+ 13.470s + exit 7
+ 14s + /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/devv16cyc/2020090200/gdas/prepbufr/prep.291314/err_exit
++ 0s + '[' -n '' ']'
++ 0s + set -e
++ 0s + kill -n 9
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
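As a rough illustration only (the actual err_exit utility may be structured differently): kill -n 9 requires a PID argument, so when the saved PID variable is empty bash prints the usage message and the script carries on. A guarded form would look something like:

# Sketch with an assumed variable name; not the actual err_exit source.
pid_to_kill="${pid_to_kill:-}"
if [[ -n "${pid_to_kill}" ]]; then
    kill -n 9 "${pid_to_kill}"
else
    echo "err_exit: no PID recorded; exiting with failure status" >&2
    exit 1
fi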

@KateFriedman-NOAA (Member, Author)

@lgannoaa While an exit code was caught coming out of one of the OBSPROC scripts, the final exit code in our prep.sh was 0, which is incorrect. See the final lines of the gdasprep.log shown above.

@lgannoaa (Contributor)

@KateFriedman-NOAA The error code was caught and the error exit was executed; that is the correct behavior. The error-exit kill statement has the wrong syntax. I recommend contacting system support to find out why this error exit was not working.
++ 0s + kill -n 9
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]

@CoryMartin-NOAA (Contributor)

@kevindougherty-noaa found that this is indeed still an issue. The Hera stmp disk quota was exceeded earlier in the week, which caused the prep job to not produce correct results while still returning exit code 0. It was only four cycles later, when the fit2obs job ran, that it was discovered that the prepbufr file had not been generated.

@aerorahul (Contributor)

@CoryMartin-NOAA
We have identified the root cause of this issue. The failure is in the obsproc code base and even though that job fails, it returns with an exit code 0. The prep job in the global-workflow does not examine the prepbufr file contents and relies on the obsproc j-job JOBSPROC_GLOBAL_PREP to provide the correct exit code.
This has been raised with the obsproc developers.

@ilianagenkova (Contributor)

I am tagging myself here so it gets on my "to do" list @ilianagenkova

KateFriedman-NOAA mentioned this issue Dec 21, 2023
@DavidHuber-NOAA (Contributor)

On Hercules, the prepobs_prepdata executable is crashing because it needs the MKL library, which is only available on Orion, yet the gdasprep and gfsprep jobs continue processing thereafter. It seems like a bug that the failure of prepobs_prepdata does not stop the processing of the *prep jobs. An example log file is available here: /work/noaa/global/dhuber/SAVELOGS/cycled_herc2/2021110900/gdasprep.log.

@ilianagenkova (Contributor)

I started looking into this, but it's more complicated than simply checking the error status and not proceeding further. The code has some intentional "hard crashes" and "silent errors", so we need to understand the reasons for them before changing the code. For example, if a critical data set can't be processed in prepobs, the code crashes in order to get someone's attention - not an elegant solution, but that's how it's done now.

@DavidHuber-NOAA (Contributor)

@ilianagenkova That's good to know. I will mention that this particular problem caused a downstream failure of fit2obs, which fails because the prepbufr file is never generated (by prepobs_prepdata). Granted, fit2obs is a validation piece, but it does stop the cycling process, as archiving will not start until fit2obs finishes successfully. I wonder if fit2obs should be amended to finish its work if the prepbufr is missing.

@KateFriedman-NOAA (Member, Author)

@DavidHuber-NOAA In general we don't want the prepbufr file to be missing, but part of the problem is that the analysis runs without it (technically, we don't want it to), and the problem documented in this issue means that if the prepbufr file isn't created, no one knows without checking, since jobs in the cycle don't fail until fit2obs (if it's on). So at the very least, for this issue, we'll want to check for prepbufr existence at the end of the prep job (in the workflow scripts) while waiting for prepobs/obsproc to make updates.
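A minimal sketch of such a check at the end of the prep job, assuming the usual ${COMOUT}/${APREFIX}prepbufr naming (the exact variable names in the workflow scripts may differ):

# Sketch only: fail the prep job if the prepbufr file is missing or empty.
if [[ ! -s "${COMOUT}/${APREFIX}prepbufr" ]]; then
    echo "FATAL ERROR: prepbufr was not produced by JGLOBAL_PREP" >&2
    exit 1
fi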

@ilianagenkova (Contributor)

@KateFriedman-NOAA If this is only a dev-run issue, one could default to using the production prepbufr file (if you don't want an experiment to stop) and send a notification (mailx) to the developer that something needs to be looked at. Just a thought...

@WalterKolczynski-NOAA (Contributor)

We do want the experiment to stop. The issue is prepobs isn't exiting with a non-zero code, so the workflow thinks it was successful and continues on. This makes identifying the root cause of failures down the line difficult, because the issue was actually in prepobs.

@KateFriedman-NOAA (Member, Author)

one can default to using the production prepbufr file

So for development we can't use the production prepbufr, because then the output from the experiment's prior cycle won't be included and you'd be resetting the experiment. We need the generated prepbufr each cycle in our experiments.
