Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix false success on check #168

Merged
merged 1 commit into from
Dec 13, 2021
Merged

Conversation

forsyth2
Copy link
Collaborator

Fix false success on zstash check. Resolves #167.

@forsyth2 forsyth2 added semver: bug Bug fix (will increment patch version) for next version labels Nov 19, 2021
@forsyth2 forsyth2 self-assigned this Nov 19, 2021
@forsyth2
Copy link
Collaborator Author

@golaz This change (adding daemon=True to the multiprocessing.Process call) at least seems to make zstash fail immediately. I tested it out and it exited after two errors Exception: Error=Transferring file from HPSS: 000001.tar and Exception: Error=Transferring file from HPSS: 000000.tar, as opposed to running through all tars. However, it still prints INFO: No failures detected when checking the files., so it doesn't actually fix the bug.

Line 59 (logger.info("No failures detected when {} the files.".format(verb))) is run > failures was an empty list > extract_database returned an empty list > multiprocess_extract returned an empty list > there must be some problem in this code block:

    failures: List[FilesRow] = []
    while any(p.is_alive() for p in processes):
        while not failure_queue.empty():
            failures.append(failure_queue.get())

@forsyth2
Copy link
Collaborator Author

Note that this bug only happens when there are multiple workers. In the one-worker case, ERROR: Error=Transferring file from HPSS: 000000.tar is printed without being followed by INFO: No failures detected when checking the files.. (Interestingly, the Exception: Error=Transferring file from HPSS: 000001.tar was not printed in the one-worker case.)

@golaz
Copy link
Collaborator

golaz commented Nov 22, 2021

@forsyth2 : thanks for investigating. Stopping immediately would at least be a step in the right direction.

The serial and parallel options use significantly different code, so it makes sense that they could behave differently. Based on what you have found, there might be an implementation issue with the failure queue. Unfortunately, I don't know how that code works.

Do you want to incorporate your change first as a partial solution?

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Dec 3, 2021

Stopping immediately would at least be a step in the right direction.

@golaz Alright, should we add something like Run "grep -i Exception" on the check log to double check to INFO: No failures detected when checking the files. until we get the issue fixed fully?

I tried debugging further, but the tars that were causing the Exception suddenly appear to not cause issues. In any case, code blocks to investigate would include the following (all in https://github.com/E3SM-Project/zstash/blob/master/zstash/extract.py):

In multiprocess_extract:

    failures: List[FilesRow] = []
    while any(p.is_alive() for p in processes):
        while not failure_queue.empty():
            failures.append(failure_queue.get())

In extractFiles:

        except Exception:
            # Catch all exceptions here.
            traceback.print_exc()
            logger.error("Retrieving {}".format(files_row.name))
            failures.append(files_row)
    if multiprocess_worker:
        # If there are things left to print, print them.
        multiprocess_worker.print_all_contents()

        # Add the failures to the queue.
        # When running with multiprocessing, the function multiprocess_extract()
        # that calls this extractFiles() function will return the failures as a list.
        for f in failures:
            multiprocess_worker.failure_queue.put(f)

@forsyth2 forsyth2 marked this pull request as ready for review December 13, 2021 23:25
@forsyth2 forsyth2 merged commit 5b58ed3 into E3SM-Project:master Dec 13, 2021
@forsyth2 forsyth2 deleted the fix-false-success branch December 13, 2021 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: bug Bug fix (will increment patch version)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

zstash check reports success even if exceptions occur
2 participants