Errors in one fm_dispatch reporter are not propagated to the other reporters; fm_dispatch just crashes instead #9706
Comments
This can be reproduced with

    $ git diff
    +++ b/src/_ert/forward_model_runner/reporting/file.py
    @@ -99,6 +99,7 @@ def report(self, msg: Message):
                 self._dump_error_file(msg.job, error_msg)
             elif isinstance(msg, Running):
    +            raise OSError("no space left on device")
                 fm_step_status.update(
                     max_memory_usage=msg.memory_status.max_rss,

In this scenario, there is no reason for
We have to be a little careful with the file reporter, which actually writes to disk, and with how we handle failures there.
We could maybe remove the reporter that got an Exception:

    for job_status in job_runner.run(parsed_args.job):
        logger.info(f"Job status: {job_status}")
        i = 0
        while i < len(reporters):
            reporter = reporters[i]
            try:
                reporter.report(job_status)
                i += 1
            except Exception as err:
                logger.exception(
                    f"Reporter {reporter} failed due to {err}. Removing the reporter."
                )
                if isinstance(reporter, reporting.Event):
                    reporter.stop()
                del reporters[i]
        if isinstance(job_status, Finish) and not job_status.success():
            _stop_reporters_and_sigkill(reporters)

What do you think @sondreso?
Think that would be a good solution, yes! (
The
Maybe like this then @berland:

    for job_status in job_runner.run(parsed_args.job):
        logger.info(f"Job status: {job_status}")
        i = 0
        while i < len(reporters):
            reporter = reporters[i]
            try:
                reporter.report(job_status)
                i += 1
            except Exception as err:
                with contextlib.suppress(Exception):
                    del reporters[i]
                    if isinstance(reporter, reporting.Event):
                        reporter.stop()
                    logger.exception(
                        f"Reporter {reporter} failed due to {err}. Removing the reporter."
                    )
        if isinstance(job_status, Finish) and not job_status.success():
            _stop_reporters_and_sigkill(reporters)
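For anyone who wants to try the idea outside ert, here is a minimal standalone sketch of the "drop the failing reporter and keep going" loop. It is not the ert implementation: `MemoryReporter`, `FullDiskReporter` and `dispatch` are hypothetical names, and plain strings stand in for ert's message objects. It also assumes that removing the reporter from the list should happen before any cleanup is attempted.

```python
import contextlib
import logging

logger = logging.getLogger("fm_dispatch_sketch")  # hypothetical logger name


class MemoryReporter:
    """Hypothetical reporter that records every status it receives."""

    def __init__(self):
        self.seen = []

    def report(self, status):
        self.seen.append(status)


class FullDiskReporter:
    """Hypothetical reporter that fails on 'running', like the patched file reporter."""

    def report(self, status):
        if status == "running":
            raise OSError(28, "No space left on device")


def dispatch(statuses, reporters):
    """Send each status to every reporter, dropping reporters that raise."""
    for status in statuses:
        i = 0
        while i < len(reporters):
            reporter = reporters[i]
            try:
                reporter.report(status)
                i += 1  # only advance when the reporter is kept
            except Exception as err:
                # Remove first, so a failing cleanup cannot leave the
                # broken reporter in the list.
                del reporters[i]
                with contextlib.suppress(Exception):
                    stop = getattr(reporter, "stop", None)
                    if stop is not None:
                        stop()
                logger.warning(
                    "Reporter %r failed due to %s. Removing the reporter.",
                    reporter,
                    err,
                )


memory = MemoryReporter()
reporters = [FullDiskReporter(), memory]
dispatch(["start", "running", "finish"], reporters)
```

After the run, only the healthy reporter is left in `reporters`, and it has received all three statuses, including the one that made the other reporter fail. Note that the index `i` is deliberately not incremented after a deletion, since the next reporter has shifted into position `i`.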
Nitpick, but maybe the
ert/src/_ert/forward_model_runner/cli.py
Lines 136 to 144 in d833851
For instance, if the file reporter crashes due to no space left on device, the error is never sent across the network. The message

    fm_dispatch failed due to [Errno 28] No space left on device. Stopping and cleaning up.

is placed in the LSF stdout if there is enough space for it.
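One way the propagation asked for here could look is sketched below. This is only an illustration under assumptions, not how fm_dispatch works today: `ListReporter`, `BrokenFileReporter` and `report_all` are hypothetical names, and the reporters take plain strings instead of ert messages. The idea is that when one reporter raises, the failure text is forwarded to the reporters that are still alive (for example one that sends over the network) instead of only ending up in stdout.

```python
import contextlib
import errno


class ListReporter:
    """Hypothetical stand-in for a network reporter; it only stores messages."""

    def __init__(self):
        self.messages = []

    def report(self, message):
        self.messages.append(message)


class BrokenFileReporter:
    """Hypothetical file reporter whose disk is full."""

    def report(self, message):
        raise OSError(errno.ENOSPC, "No space left on device")


def report_all(message, reporters):
    """Deliver message to all reporters; tell the survivors about failures."""
    failures = []
    for reporter in list(reporters):  # iterate over a copy while removing
        try:
            reporter.report(message)
        except Exception as err:
            reporters.remove(reporter)
            failures.append(
                f"Reporter {type(reporter).__name__} failed due to {err}."
            )
    for text in failures:
        for reporter in list(reporters):
            # Forwarding is best effort; a second failure here is ignored.
            with contextlib.suppress(Exception):
                reporter.report(text)
    return failures


network = ListReporter()
reporters = [BrokenFileReporter(), network]
failures = report_all("running", reporters)
```

Here `network.messages` ends up holding both the original status and the text describing why the file reporter was dropped, so the failure is visible to whoever listens on the other end.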