Prep job failures not captured on exit #691
Comments
@KateFriedman-NOAA from what you reported, error code 137 was successfully caught, the exit 7 was issued. /scratch1/NCEPDEV/global/glopara/git/obsproc/obsproc_global.v3.4.2/jobs/JGLOBAL_PREP: line 325: 291722 Killed $SCRIPTSobsproc_global/exglobal_makeprepbufr.sh.ecf
|
@lgannoaa While an exit code was caught coming out of one of the OBSPROC scripts, the final exit code in our |
@KateFriedman-NOAA, the error code was caught and the error exit was executed. That is correct behavior. The error exit kill statement has the wrong syntax. I recommend contacting system support to find out why this error exit was not working. |
@kevindougherty-noaa found that this is indeed still an issue. Hera stmp disk quota exceeded earlier in the week and caused the prep job to not produce correct results, but returned an exit code 0. It was only 4 cycles later when the fit2obs job ran that it was discovered that the prepbufr file was not generated. |
@CoryMartin-NOAA |
I am tagging myself here so it gets on my "to do" list @ilianagenkova |
On Hercules, |
I started looking into this but it's more complicated than simply checking error status and not proceeding further. The code has some intentional "hard crashes" and "silent errors", so we need to understand the reasons for it before changing the code. For example, if a critical data set can't be processed in prepobs, the code crashes in order to get someone's attention - not an elegant solution, but that's how it's done now. |
@ilianagenkova That's good to know. I will mention that this particular problem caused a downstream failure of fit2obs, which fails because the prepbufr file is never generated (by prepobs_prepdata). Granted, fit2obs is a validation piece, but it does stop the cycling process, as archiving will not start until fit2obs finishes successfully. I wonder if fit2obs should be amended to finish its work if the prepbufr is missing. |
@DavidHuber-NOAA In general we don't want the prepbufr file to be missing but part of the problem is that the analysis runs without it (technically, we don't want it to) and the problem documented in this issue means that if the prepbufr file isn't created no one knows without checking since jobs in the cycle don't fail until fit2obs (if it's on). So at the very least, for this issue, we'll want to check for prepbufr existence at the end of the prep job (in the workflow scripts) while waiting for prepobs/obsproc to make updates. |
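The existence check suggested above could look roughly like the following. This is only a sketch: `check_output_file` is a hypothetical helper name, and the real prepbufr path would have to come from the workflow's own variables, which are not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): fail if an expected
# output file is missing or empty.
check_output_file() {
    local file="$1"
    # -s is true only when the file exists and has a size greater than zero
    if [ ! -s "${file}" ]; then
        echo "FATAL: expected output ${file} is missing or empty" >&2
        return 1
    fi
    return 0
}

# Illustrative use at the end of a prep job; PREPBUFR_FILE is an assumed
# variable name, not one taken from prep.sh:
#   check_output_file "${PREPBUFR_FILE}" || exit 1
```

Testing for a non-empty file rather than mere existence also catches the case where a full disk leaves a zero-length file behind.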
@KateFriedman-NOAA, if this is only an issue for dev runs, one can default to using the production prepbufr file (if you don't want an experiment to stop) and send a notification (mailx) to the developer that something needs to be looked at. Just a thought... |
We do want the experiment to stop. The issue is prepobs isn't exiting with a non-zero code, so the workflow thinks it was successful and continues on. This makes identifying the root cause of failures down the line difficult, because the issue was actually in prepobs. |
So for development we can't use the production prepbufr because then the output from the experiment's prior cycle won't be included and you'd be resetting the experiment. We need the generated prepbufr each cycle in our experiments. |
Expected behavior
The job (prep.sh) should exit with the exit value returned by the obsproc package scripts.
Current behavior
Exits with a 0 error code regardless of what happens in the job.
Machines affected
All, doesn't matter the machine.
To Reproduce
Point to the wrong obsproc package (e.g. one that doesn't exist).
Detailed Description
Here is the bottom of a gdasprep.log on Hera that shows it erroring in the obsproc package script but still exiting with exit 0:
This part of jobs/rocoto/prep.sh isn't getting the error code coming out of JGLOBAL_PREP:
Possible Implementation
Change lines 106 and 107 in jobs/rocoto/prep.sh to get the correct error code variable that comes out of JGLOBAL_PREP.
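As a rough illustration of the idea behind this change, the sketch below captures a child script's exit status immediately and passes it on unchanged. The contents of lines 106 and 107 are not shown here, so this is not the actual fix; `run_and_capture` and `JOB_SCRIPT` are hypothetical names used only for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of prep.sh): run a command and return
# its exit status unchanged, logging a message on failure.
run_and_capture() {
    local status
    "$@"
    status=$?   # capture right away; any later command overwrites $?
    if [ "${status}" -ne 0 ]; then
        echo "FATAL: $1 exited with status ${status}" >&2
    fi
    return "${status}"
}

# Illustrative use in a driver script; JOB_SCRIPT is an assumed variable
# standing in for the path to the job being launched:
#   run_and_capture "${JOB_SCRIPT}"
#   status=$?
#   exit "${status}"
```

The key point is that `$?` is saved on the line directly after the call, so an intervening `echo` or test cannot reset it to 0 before the driver exits.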