Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors in one reporter from fm_dispatch is not propagated to other reporters, instead just crash #9706

Open
eivindjahren opened this issue Jan 10, 2025 · 7 comments · May be fixed by #9890
Open
Assignees
Labels

Comments

@eivindjahren
Copy link
Contributor

eivindjahren commented Jan 10, 2025

for reporter in reporters:
try:
reporter.report(job_status)
except OSError as oserror:
print(f"fm_dispatch failed due to {oserror}. Stopping and cleaning up.")
_stop_reporters_and_sigkill(reporters)
if isinstance(job_status, Finish) and not job_status.success():
_stop_reporters_and_sigkill(reporters)

For instance, if file reporter crashes due to no space left on device, this is never sent across network. The message fm_dispatch failed due to [Errno 28] No space left on device. Stopping and cleaning up. is placed in lsf stdout if there is enough space for it.

@berland
Copy link
Contributor

berland commented Jan 10, 2025

This can be reproduced with

$ git diff
+++ b/src/_ert/forward_model_runner/reporting/file.py
@@ -99,6 +99,7 @@ def report(self, msg: Message):
                 self._dump_error_file(msg.job, error_msg)
 
         elif isinstance(msg, Running):
+            raise OSError("no space left on device")
             fm_step_status.update(
                 max_memory_usage=msg.memory_status.max_rss,

In this scenario, there is no reason for fm_dispatch.py to give up. It should propably keep calm and carry on, sending over network if possible, and let the the actual forward model step fail instead.

@berland berland added the bug label Jan 10, 2025
@berland berland moved this to Todo in SCOUT Jan 10, 2025
@sondreso
Copy link
Collaborator

We have to be a little careful with the file reporter that actually writes to disk and how we should handle failures there.

@eivindjahren
Copy link
Contributor Author

eivindjahren commented Jan 24, 2025

We could maybe remove the reporter that got an Exception:

    for job_status in job_runner.run(parsed_args.job):
        logger.info(f"Job status: {job_status}")
        i = 0
        while i < len(reporters):
            reporter = reporters[i]
            try:
                reporter.report(job_status)
                i += 1
            except Exception as err:
                logger.exception(
                    f"Reporter {reporter} failed due to {err}. Removing the reporter."
                )
                if isinstance(reporter, reporting.Event):
                    reporter.stop()
                    del reporters[i]

        if isinstance(job_status, Finish) and not job_status.success():
            _stop_reporters_and_sigkill(reporters)

What do you think @sondreso ?

@sondreso
Copy link
Collaborator

Think that would be a good solution yes! (del reporters[i] needs to have one less indentation)

@berland
Copy link
Contributor

berland commented Jan 24, 2025

The logger.exception in the catch might also fail due to no space left on device.

@eivindjahren
Copy link
Contributor Author

eivindjahren commented Jan 24, 2025

Maybe like this then @berland:

    for job_status in job_runner.run(parsed_args.job):
        logger.info(f"Job status: {job_status}")
        i = 0
        while i < len(reporters):
            reporter = reporters[i]
            try:
                reporter.report(job_status)
                i += 1
            except Exception as err:
                with contextlib.suppress(Exception):
                  del reporters[i]
                  if isinstance(reporter, reporting.Event):
                      reporter.stop()
                  logger.exception(
                      f"Reporter {reporter} failed due to {err}. Removing the reporter."
                  )
        if isinstance(job_status, Finish) and not job_status.success():
            _stop_reporters_and_sigkill(reporters)

@sondreso
Copy link
Collaborator

Nitpick, but maybe the i += 1is a candidate for an else clause. Either you remove the reporter or increment the counter

@eivindjahren eivindjahren moved this from Todo to In Progress in SCOUT Jan 28, 2025
@eivindjahren eivindjahren linked a pull request Jan 28, 2025 that will close this issue
9 tasks
@eivindjahren eivindjahren moved this from In Progress to Ready for Review in SCOUT Jan 28, 2025
@eivindjahren eivindjahren self-assigned this Jan 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Ready for Review
Development

Successfully merging a pull request may close this issue.

3 participants