OpenPBS job queue hang #1930
Comments
Hi @cblackworth, I think I know where the problem comes from. ReFrame uses |
Here's the output with logging set to debug (running with -vvv didn't output any more details):
With job history turned off (we need it for other uses at the site), I get this error when the job hangs and never closes. ReFrame hangs at "Entering stage: run_complete" and runs indefinitely until it times out.
|
I think it is a bug of our PBS scheduler backend in the way we set the job final state. I'll propose a fix in a PR. |
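To illustrate the kind of final-state logic being discussed, here is a minimal sketch of how a poller could decide that a PBS job is done from a single `qstat -f <jobid>` result. This is not ReFrame's actual backend code. The function name `classify_job`, the state names, and the exit codes 153 (unknown job id) and 35 (job has finished, reported when job history is enabled) are assumptions that should be checked against the local PBS installation.

```python
import re

# Assumed qstat exit codes for a job that has left the queue:
# 153 = unknown job id, 35 = job has finished (job history enabled).
# Both values are assumptions; verify them on your PBS installation.
GONE_EXIT_CODES = {153, 35}

# Coarse mapping from PBS job_state letters to illustrative state names.
STATE_NAMES = {
    'Q': 'QUEUED',
    'H': 'HELD',
    'R': 'RUNNING',
    'E': 'EXITING',
    'F': 'COMPLETED',   # only reported when job history is queried
}


def classify_job(returncode, stdout):
    """Map one `qstat -f <jobid>` result to a coarse job state."""
    if returncode in GONE_EXIT_CODES:
        # The job is no longer in the queue, so treat it as finished
        # instead of polling forever.
        return 'COMPLETED'

    if returncode != 0:
        return 'UNKNOWN'

    match = re.search(r'^\s*job_state = (\S+)', stdout, re.MULTILINE)
    if match is None:
        return 'UNKNOWN'

    return STATE_NAMES.get(match.group(1), 'UNKNOWN')
```

The point of the sketch is that both "job not found" and "job finished, see history" have to end the polling loop. If only one of them is handled, the run can either error out or wait indefinitely, which matches the two symptoms reported in this issue.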
I pulled down your branch and put it on our test system. Using the same config on the same system that worked with 3.5.0, it now throws this error on the scheduler tests:
|
Hi @cblackworth, this should be just a warning... It simply says that you haven't set a modules system, so any attempt to load |
And you get this warning, because my branch that you pulled is beyond 3.5.3 :-) |
The modules system actually is set (the error wasn't there before the update). The error causes jobs that could otherwise run to fail, because my tests can no longer compile. To be clear, the only change has been updating the version of ReFrame; everything else is unchanged from when this test worked. Reverting the ReFrame version lets this test work again, but then it fails to run on PBS, so I am unable to verify whether this patch works.
|
This is interesting... It seems that ReFrame picks up the generic system, which is totally weird if you run exactly the same command. Could you rerun with |
Config file:
Command output:
|
Full output from -vvv run.
|
Hi @cblackworth, regarding the error of not recognising your system after you pulled in my branch, it looks like a regression in the framework (see #1948), but meanwhile you can try to explicitly pass |
That doesn't seem to work?
|
That last command should be "reframe -C ./settings.py -c ./hello/ --system=testbed -r" |
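As background on why the generic system can be picked up when `--system` is not given, here is a rough sketch of hostname-based system selection. It is not ReFrame's implementation and the configuration is heavily trimmed (a real settings file also needs partitions and environments). The hostname pattern `testbed-login\d+` and the helper `pick_system` are hypothetical; only the system name `testbed` and the use of lmod come from this thread.

```python
import re

# Hypothetical, trimmed-down settings in the spirit of a ReFrame 3.x
# configuration file. A real config also defines partitions/environments.
site_configuration = {
    'systems': [
        {
            'name': 'testbed',
            'hostnames': [r'testbed-login\d+'],   # hypothetical pattern
            'modules_system': 'lmod',
        },
        {
            'name': 'generic',
            'hostnames': [r'.*'],                 # catch-all fallback
            'modules_system': 'nomod',
        },
    ]
}


def pick_system(hostname, config):
    """Return the name of the first system whose pattern matches."""
    for system in config['systems']:
        if any(re.match(p, hostname) for p in system['hostnames']):
            return system['name']

    return None


# A login node matching the pattern selects 'testbed'; any other
# hostname falls through to the catch-all 'generic' entry.
print(pick_system('testbed-login1', site_configuration))
print(pick_system('nid00012', site_configuration))
```

Under these assumptions, a host whose name does not match the site's pattern silently falls back to the generic entry, which has no modules system. Passing `--system=testbed` explicitly sidesteps the hostname match altogether.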
Hmm, your case doesn't seem to be the same as in #1948. May I ask how you installed ReFrame? One strange thing that I see and can't understand is that your settings file does not define the |
It looks like there was a lingering module in Python somewhere. I reran this using just the bootstrap script and it worked. For the GitHub releases, is there a "correct" install that doesn't use bootstrap.sh? (This system can't access the internet.) |
I am now running into this odd behavior and I'm wondering if it's related. This error only shows up on this one job and only when used with PBS; the local-only run of the same test works fine. (Both the gnu and cray tests fail, hence the changing test name, but the error is the same.)
Further details:
|
Hi @cblackworth, is this failure persistent? It really looks like #1394. There, we were on a Lustre filesystem and I'm not sure whether there was some glitch, but we haven't seen that error for a while. |
Hi @cblackworth, I'll close this issue now. Feel free to reopen it, if it persists. |
When running jobs in OpenPBS 20.0.1, we have found an issue that looks partly related to #1301 and #1473.
Our PBS implementation currently uses job history output. When running jobs in PBS with ReFrame, the job errors out due to this:
If we disable job history support (not ideal), the job just hangs indefinitely, regardless of whether the PBS jobs fail or run successfully.
Reframe version:
reframe --version 3.5.0
SLE 15 SP2
lmod
OpenPBS 20.0.1