Inconsistent failures lurking in CI... #2251
Comments
Does this seem to be mostly Debian?
It may be related to disabling test retries. @KyleFromKitware, was there a specific reason for turning it off?
It seems that was cargo-culted from whatever script I copied from :) I'll turn it back on.
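For reference, a minimal sketch of how the retries could be re-enabled, assuming the CI drives the tests through a plain ctest invocation (the actual CI scripts may use a CTest dashboard script with the equivalent `REPEAT` option instead):

```sh
# Re-run each failing test up to 3 times and count it as passed if any
# attempt succeeds (requires CTest >= 3.17). This is the behavior that
# masks intermittent failures.
ctest --output-on-failure --repeat until-pass:3
```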
Ah, but that answers a pending question I had about turning off test repeats so that I could try to track down the rarer heisenbugs. I do still seem to have Debian builds failing in setup.
It still raises the question of why the failures occur at all, and they do seem to usually be associated with SST. It seems your suspicion of masked deadlocks / race conditions is likely correct, but enabling the retries for now should allow us to make forward progress on the release until they can be dealt with.
Agreed. I'm moving forward with a diagnostic PR that has test repeats turned off and SstVerbose turned on for a subset of the tests. The failures I've seen seem to be in the 1x1 versions. Whether the verbose output will change the timing and make the tests stop failing is anyone's guess, but it's what I can do now.
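For anyone trying to reproduce this locally, a rough sketch of that kind of diagnostic run; the test-name pattern is an assumption, but `SstVerbose` is the environment variable the SST engine checks for tracing:

```sh
# Run only the 1x1 SST staging tests with SST tracing enabled.
# SstVerbose is read from the environment by the SST engine; -VV makes
# ctest echo the test output so the trace lands in the log.
SstVerbose=1 ctest -R 'Staging\.1x1.*SST' -VV --output-on-failure
```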
Worked around for now so no longer needs the |
Based upon runs of PR #2252, which turns off the rerun-until-success feature, I'm going to expand the scope of this issue beyond SST. In particular, looking at several runs of that PR, I don't find any failures that are SST-specific, but rather a variety of other problems that we should probably deal with. For example, in the win2016_vs2017_msmpi_ninja build here: https://dev.azure.com/ornladios/adios2/_build/results?buildId=1913&view=logs&jobId=7f7df6e2-d428-52cf-2d19-85fa4b5b6db8

However, in the el7-gnu8-openmpi-ohpc build here:

SST is represented in these tests, but digging into the logs, it looks like all of these failures occur because the MPI job isn't actually starting (or one of the MPI jobs, in the case of the SST tests). This has some significant consequences, because we run the full 5-minute default test timeout for all of these except the staging tests, which are set to 30 seconds. Test rerun hides these odd MPI failure-to-execute problems, but this is likely why this build is incredibly slow in CI. SST's appearance here just looks to be due to the fact that it also uses MPI, so any MPI failure hits it proportionately.

Finally, in the suse-pgi-openmpi build here:

I'm going to keep running this PR (or a similar one) regularly to try to make sure that there aren't actually SST heisenbugs out there that the rerun facility is making less obvious, but it would probably be good to try to sort out the MPI issues, particularly for the el7-gnu8-openmpi-ohpc build, because of its impact on CI turnaround. @chuckatkins?
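Purely as an illustration of the timeout cost described above (a sketch, not the project's actual CI configuration), the damage from MPI jobs that never start could be capped by tightening the default per-test limit:

```sh
# Hypothetical mitigation: lower the default per-test limit (the 5-minute
# default mentioned above) to 60 s so a test whose MPI job never starts
# fails quickly. Tests with an explicit TIMEOUT property (like the 30 s
# staging tests) keep their own limit.
ctest --output-on-failure --timeout 60
```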
Update: I have seen what seems to be a race condition resulting in the delivery of incorrect data in the LockGeometry test. I'm trying to replicate this in such a way that I get some diagnostic info. I have PR #2264 that turns off this test so that it doesn't impede CI, but that PR is itself blocked on a problem with centos7. However, I probably need to sort out this bug before the release is finalized...
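I don't know exactly how PR #2264 disables the test, but the effect is roughly equivalent to excluding it at the ctest level, e.g.:

```sh
# Temporarily skip the suspect test so it doesn't block CI; the name
# pattern is an assumption about how the test is registered.
ctest --output-on-failure -E 'LockGeometry'
```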
@eisenhauer should we leave this open for now, or close it and just open issues for specific tests as they arise?
Happy to close this one in favor of more specific ones...
Some of the `Staging.*.SST` tests are failing or timing out with such frequency that they are blocking CI from proceeding. Just one example: https://open.cdash.org/viewTest.php?onlyfailed&buildid=6527861

But many of the open PRs are blocked on a similar failure.
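A sketch of one way to reproduce this kind of flakiness locally, assuming the test names match a `Staging.*SST` pattern; `--repeat-until-fail` keeps re-running a passing test so an intermittent hang eventually shows up:

```sh
# Hammer the SST staging tests: re-run each up to 50 times, stopping at
# the first failure, with a short limit so a hang fails fast.
ctest -R 'Staging.*SST' --repeat-until-fail 50 --timeout 30 --output-on-failure
```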