[bugfix] Fix possible deadlock in PBS-based scheduler backends when a job is cancelled immediately after submission #1301

Merged

@teojgo teojgo commented May 7, 2020

  • Create the stdout and stderr files if they don't exist,
    so that the Torque scheduler backend treats the job as finished.
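As a rough illustration of this workaround, the sketch below creates empty output files when they are missing. The helper name `ensure_output_files` and the file names are hypothetical and not taken from the ReFrame code base; this is only one way the idea could be expressed.

```python
import os
import tempfile
from pathlib import Path


def ensure_output_files(stdout, stderr):
    """Create empty stdout/stderr files if they are missing (hypothetical helper)."""
    for name in (stdout, stderr):
        # touch() creates a missing file and leaves existing contents intact
        Path(name).touch(exist_ok=True)


with tempfile.TemporaryDirectory() as tmpdir:
    out = os.path.join(tmpdir, 'job.out')   # assumed file name
    err = os.path.join(tmpdir, 'job.err')   # assumed file name
    ensure_output_files(out, err)
    created = os.path.exists(out) and os.path.exists(err)
```

With both files present, a backend that waits for the job's output files before reporting completion would no longer block on a job that was cancelled before producing any output.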

Fixes #1298

@teojgo teojgo added this to the ReFrame sprint 20.07 milestone May 7, 2020
@teojgo teojgo requested a review from vkarak May 7, 2020 15:30
@teojgo teojgo self-assigned this May 7, 2020
teojgo added 2 commits May 7, 2020 17:41
* Create the stdout and stderr files if they don't exist,
  so that the Torque scheduler backend treats the job as finished.
@teojgo teojgo changed the title [bugfix] Fix deadlock in test_cancel for the torque scheduler [bugfix] Fix deadlock in test_cancel for pbs-based schedulers May 7, 2020
@teojgo teojgo force-pushed the bugfix/torque_test_cancel_deadlock branch from 28ca7e4 to bc47e46 Compare May 7, 2020 16:04
@vkarak vkarak left a comment

I think the problem is more fundamental, because this deadlock could happen in normal operation. I guess that a better solution would be to fix that in the PBS/Torque backends, such that if the job is cancelled, we should not do the additional check for its stdout/stderr files to mark it as done.

teojgo commented May 8, 2020

> I think the problem is more fundamental, because this deadlock could happen in normal operation. I guess that a better solution would be to fix that in the PBS/Torque backends, such that if the job is cancelled, we should not do the additional check for its stdout/stderr files to mark it as done.

Then maybe open a new issue to find a fundamental solution, while temporarily have this one so that it does not block the CI?

vkarak commented May 8, 2020

> Then maybe open a new issue to find a fundamental solution, while temporarily have this one so that it does not block the CI?

No, the solution is quite easy. Add an is_cancelling flag that is set when cancel() is called; finished() should then do the additional check only if this flag is not set.
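A minimal sketch of the flag-based approach described above, under stated assumptions: the class `SketchPbsJob` and its `left_queue` attribute are hypothetical stand-ins and do not reproduce ReFrame's actual PBS/Torque backend. Only the names `is_cancelling`, `cancel()` and `finished()` come from the comment itself.

```python
import os
import tempfile


class SketchPbsJob:
    """Hypothetical stand-in for a PBS/Torque job handle (not ReFrame's class)."""

    def __init__(self, stdout, stderr):
        self.stdout = stdout
        self.stderr = stderr
        self.is_cancelling = False
        # Assumption: a real backend would poll the scheduler for this state
        self.left_queue = False

    def cancel(self):
        # A real backend would also ask the scheduler to delete the job here
        self.is_cancelling = True
        self.left_queue = True

    def finished(self):
        if not self.left_queue:
            return False
        if self.is_cancelling:
            # A cancelled job may never write output files, so skip the check
            return True
        # Additional check: wait until both output files have appeared
        return os.path.exists(self.stdout) and os.path.exists(self.stderr)


with tempfile.TemporaryDirectory() as tmpdir:
    job = SketchPbsJob(os.path.join(tmpdir, 'job.out'),
                       os.path.join(tmpdir, 'job.err'))
    job.left_queue = True                 # job is gone, no output files exist
    done_before_cancel = job.finished()   # file check keeps the job "running"
    job.cancel()
    done_after_cancel = job.finished()    # flag bypasses the file check
```

In this sketch, a caller polling `finished()` would wait indefinitely on a job that left the queue without writing its output files, unless `cancel()` had set the flag first.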

@vkarak vkarak left a comment

lgtm

@vkarak vkarak changed the title [bugfix] Fix deadlock in test_cancel for pbs-based schedulers [bugfix] Fix possible deadlock in PBS-based scheduler backends when a job is cancelled immediately after submission May 8, 2020
@vkarak vkarak merged commit 1cd8237 into reframe-hpc:master May 8, 2020
@teojgo teojgo deleted the bugfix/torque_test_cancel_deadlock branch May 12, 2020 14:31
Successfully merging this pull request may close these issues.

Unittest test_cancel deadlocks for PBS-based schedulers