Errors in one reporter from fm_dispatch is not propagated to other reporters, instead just crash #9706

eivindjahren · 2025-01-10T13:20:27Z

ert/src/_ert/forward_model_runner/cli.py

Lines 136 to 144 in d833851

    
    for reporter in reporters:
        try:
            reporter.report(job_status)
        except OSError as oserror:
            print(f"fm_dispatch failed due to {oserror}. Stopping and cleaning up.")
            _stop_reporters_and_sigkill(reporters)
    if isinstance(job_status, Finish) and not job_status.success():
        _stop_reporters_and_sigkill(reporters)

For instance, if the file reporter crashes because there is no space left on the device, the error is never sent across the network. The message "fm_dispatch failed due to [Errno 28] No space left on device. Stopping and cleaning up." is written to the LSF stdout, provided there is enough space for it.
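
The failure mode can be sketched outside of ert with stand-in classes. Everything below (FailingFileReporter, NetworkReporter, stop_and_exit, dispatch) is hypothetical and only mimics the shape of the loop quoted above. The real _stop_reporters_and_sigkill ends the process; here that is assumed to be replaceable by raising SystemExit so the snippet stays runnable.

```python
import errno
import os


class FailingFileReporter:
    """Hypothetical stand-in for a reporter that writes to a full disk."""

    def report(self, status):
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))


class NetworkReporter:
    """Hypothetical stand-in for a reporter that sends events over the network."""

    def __init__(self):
        self.received = []

    def report(self, status):
        self.received.append(status)


def stop_and_exit(reporters):
    # Assumption: stands in for _stop_reporters_and_sigkill, which ends the process.
    raise SystemExit(1)


def dispatch(reporters, status):
    # Same shape as the loop in cli.py quoted above.
    for reporter in reporters:
        try:
            reporter.report(status)
        except OSError as oserror:
            print(f"fm_dispatch failed due to {oserror}. Stopping and cleaning up.")
            stop_and_exit(reporters)


network = NetworkReporter()
try:
    dispatch([FailingFileReporter(), network], "Running")
except SystemExit:
    pass

# The network reporter saw neither the status nor the error.
print(network.received)
```

Because the file reporter comes first and its failure stops everything, the network reporter's list of received messages stays empty.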


berland · 2025-01-10T14:15:02Z

This can be reproduced with

$ git diff
+++ b/src/_ert/forward_model_runner/reporting/file.py
@@ -99,6 +99,7 @@ def report(self, msg: Message):
                 self._dump_error_file(msg.job, error_msg)
 
         elif isinstance(msg, Running):
+            raise OSError("no space left on device")
             fm_step_status.update(
                 max_memory_usage=msg.memory_status.max_rss,

In this scenario, there is no reason for fm_dispatch.py to give up. It should probably keep calm and carry on, sending over the network if possible, and let the actual forward model step fail instead.

sondreso · 2025-01-17T15:04:45Z

We have to be a little careful with the file reporter that actually writes to disk and how we should handle failures there.

eivindjahren · 2025-01-24T12:06:52Z

We could maybe remove the reporter that got an Exception:

    for job_status in job_runner.run(parsed_args.job):
        logger.info(f"Job status: {job_status}")
        i = 0
        while i < len(reporters):
            reporter = reporters[i]
            try:
                reporter.report(job_status)
                i += 1
            except Exception as err:
                logger.exception(
                    f"Reporter {reporter} failed due to {err}. Removing the reporter."
                )
                if isinstance(reporter, reporting.Event):
                    reporter.stop()
                    del reporters[i]

        if isinstance(job_status, Finish) and not job_status.success():
            _stop_reporters_and_sigkill(reporters)

What do you think @sondreso ?

sondreso · 2025-01-24T12:19:08Z

Think that would be a good solution yes! (del reporters[i] needs to have one less indentation)

berland · 2025-01-24T12:25:05Z

The logger.exception in the catch might also fail due to no space left on device.

eivindjahren · 2025-01-24T12:37:28Z

Maybe like this then @berland:

    for job_status in job_runner.run(parsed_args.job):
        logger.info(f"Job status: {job_status}")
        i = 0
        while i < len(reporters):
            reporter = reporters[i]
            try:
                reporter.report(job_status)
                i += 1
            except Exception as err:
                with contextlib.suppress(Exception):
                    del reporters[i]
                    if isinstance(reporter, reporting.Event):
                        reporter.stop()
                    logger.exception(
                        f"Reporter {reporter} failed due to {err}. Removing the reporter."
                    )
        if isinstance(job_status, Finish) and not job_status.success():
            _stop_reporters_and_sigkill(reporters)

sondreso · 2025-01-24T12:58:13Z

Nitpick, but maybe the i += 1 is a candidate for an else clause. Either you remove the reporter or increment the counter.
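
A minimal, self-contained sketch of the else-clause variant suggested here, not the code that ended up in ert. GoodReporter, BrokenReporter and report_to_all are hypothetical names, and the hasattr check is an assumed stand-in for the isinstance(reporter, reporting.Event) test in the snippets above.

```python
import contextlib
import logging

logger = logging.getLogger(__name__)


class GoodReporter:
    """Hypothetical reporter that records every status it receives."""

    def __init__(self):
        self.received = []

    def report(self, status):
        self.received.append(status)


class BrokenReporter:
    """Hypothetical reporter that always fails, as in the reproduction above."""

    def report(self, status):
        raise OSError("no space left on device")


def report_to_all(reporters, status):
    """Report status to each reporter, dropping any reporter that raises."""
    i = 0
    while i < len(reporters):
        reporter = reporters[i]
        try:
            reporter.report(status)
        except Exception as err:
            # Remove first, so the loop makes progress even if cleanup fails.
            del reporters[i]
            with contextlib.suppress(Exception):
                # Assumption: stands in for isinstance(reporter, reporting.Event).
                if hasattr(reporter, "stop"):
                    reporter.stop()
                logger.exception(
                    "Reporter %s failed due to %s. Removing the reporter.",
                    reporter,
                    err,
                )
        else:
            # Only advance when the current reporter was kept.
            i += 1


good_a, good_b = GoodReporter(), GoodReporter()
reporters = [good_a, BrokenReporter(), good_b]
report_to_all(reporters, "Running")
print(len(reporters))
```

With the increment in the else branch, each iteration either removes the failing reporter or moves on to the next one, so the reporter after a failing one is still visited.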

github-project-automation bot added this to SCOUT Jan 10, 2025

berland added the bug label Jan 10, 2025

berland moved this to Todo in SCOUT Jan 10, 2025

eivindjahren moved this from Todo to In Progress in SCOUT Jan 28, 2025

eivindjahren linked a pull request Jan 28, 2025 that will close this issue

Drop fm_dispatch reporters on error #9890

Open

9 tasks

eivindjahren moved this from In Progress to Ready for Review in SCOUT Jan 28, 2025

eivindjahren self-assigned this Jan 28, 2025
