
Inconsistent failures lurking in CI... #2251

Closed

chuckatkins opened this issue May 14, 2020 · 11 comments

@chuckatkins
Contributor

Some of the Staging.*.SST tests are failing or timing out with such frequency they are blocking CI from proceeding.

Just one example: https://open.cdash.org/viewTest.php?onlyfailed&buildid=6527861
But many of the open PRs are blocked on a similar failure.

@chuckatkins chuckatkins added the triage: high This issue is a blocker and has to be addressed before the next release or milestone label May 14, 2020
@eisenhauer
Member

Does this seem to be mostly Debian?

@chuckatkins
Contributor Author

It may be related to disabling test retries. @KyleFromKitware was there a specific reason for turning it off?

@KyleFromKitware
Collaborator

> @KyleFromKitware was there a specific reason for turning it off?

It seems that was cargo-culted from whatever script I copied from :) I'll turn it back on.
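
For reference, a minimal sketch of what re-enabling the retries could look like in a CTest 3.17+ dashboard script; the flag spelling and retry count below are assumptions, not necessarily what our CI scripts actually use:

```cmake
# Hypothetical fragment of a ctest -S dashboard script.
# REPEAT UNTIL_PASS requires CMake/CTest 3.17 or newer; the count of 3
# is an assumption, not the value actually used in CI.
ctest_test(
  PARALLEL_LEVEL 4
  REPEAT UNTIL_PASS:3   # allow each test up to 3 attempts to pass
  RETURN_VALUE test_result
)
```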

@eisenhauer
Member

Ah, but that answers a pending question I had about turning off test repeats so I could try to track down the rarer heisenbugs. I do still seem to have Debian builds failing in setup.

@chuckatkins
Contributor Author

That still leaves the question of why the failures occur at all, and they do seem to usually be associated with SST. Your suspicion that the retries were masking deadlocks / race conditions is likely correct, but enabling retries for now should allow us to make forward progress on the release until the underlying issues can be dealt with.

@eisenhauer
Member

Agreed. I'm moving forward with a diagnostic PR that has test repeats turned off and SstVerbose turned on for some set of the tests. What failures I've seen seem to be in 1x1 versions. Whether or not the verbose output will change the timing and make the tests not fail is anyone's guess, but it's what I can do now.
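
For anyone following along, one way to get that verbose output in CI is to inject the SstVerbose environment variable into selected tests through CTest properties; this is only a sketch, with an illustrative test name and verbosity level rather than the exact set used in the diagnostic PR:

```cmake
# Hypothetical CMakeLists.txt fragment: export SstVerbose into the
# environment of one staging test so the SST engine logs its activity.
# The test name and the level of 5 are illustrative assumptions.
set_tests_properties(Staging.1x1.CommMin.FFS.SST
  PROPERTIES ENVIRONMENT "SstVerbose=5")
```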

@chuckatkins
Contributor Author

Worked around for now, so this no longer needs the triage: high label.

@chuckatkins chuckatkins removed the triage: high This issue is a blocker and has to be addressed before the next release or milestone label May 15, 2020
@chuckatkins chuckatkins changed the title SST tests are failing and holding up CI SST tests are inconsistently failing May 15, 2020
@eisenhauer
Member

eisenhauer commented May 15, 2020

Based upon runs of PR #2252, which turns off the rerun-until-success feature, I'm going to expand the scope of this issue beyond SST. In particular, looking at several runs of this PR, I don't find any failures that are SST-specific, but rather a variety of other problems that we should probably deal with.

For example, in the win2016_vs2017_msmpi_ninja build here: https://dev.azure.com/ornladios/adios2/_build/results?buildId=1913&view=logs&jobId=7f7df6e2-d428-52cf-2d19-85fa4b5b6db8
There are 9 failures, all of them BP, bindings, or interface tests where it looks like MPI failed (all fast failures, thankfully). No idea what this might represent, and it may just be the unreliability of MPI, but fortunately, since these failures are quick, test reruns would carry no big consequences.

However in the el7-gnu8-openmpi-ohpc build here:
https://github.com/ornladios/ADIOS2/pull/2252/checks?check_run_id=678394519
we have 16 test failures:

  • Interface.ADIOSInterfaceWriteTest.DefineVar_uint64_t_1x10.MPI
  • Interface.ADIOSDefineVariableTest.DefineGlobalArrayConstantDims.MPI
  • Engine.BP.BPWriteReadTestADIOS2stdio.OpenEngineTwice.BP3.MPI
  • Engine.BP.BPWriteReadTestADIOS2stdio.OpenEngineTwice.BP4.MPI
  • Engine.BP.BPWriteReadAsStreamTestADIOS2_Threads.ADIOS2BPWriteRead1D8.BP3.MPI
  • Engine.BP.BPWriteReadVector.ADIOS2BPWriteRead2D2x4.BP3.MPI
  • Engine.BP.BPWriteReadVector.ADIOS2BPWriteReadVector2D4x2_MultiSteps.BP4.MPI
  • Engine.BP./BPWRZFP.ADIOS2BPWRZFP1D/.BP4.MPI
  • Engine.BP./BPWriteReadZfpConfig.ADIOS2BPWriteReadZfp3DSel/.BP4.MPI
  • Engine.BP./BPWriteReadBZIP2.ADIOS2BPWriteReadBZIP21DSel/.BP4.MPI
  • Engine.SSC.SscEngineTest.TestSscReaderMultiblock.MPI
  • Staging.KillWriterTimeout_2x2.CommMin.FFS.SST
  • Staging.DelayedReader_3x5.CommPeer.FFS.SST
  • Staging.3x5.BP3
  • Staging.5x3.BP3
  • Staging.5x3.LocalVarying.BP3

SST is represented in these tests, but digging into the logs, it looks like all of these failures are because the MPI job isn't actually starting (or one of the MPI jobs, in the case of the SST tests). This actually has some significant consequences, because we run the full 5-minute default test timeout for all of these except the staging tests, which are set to 30 seconds. Test rerun hides these odd MPI failure-to-execute problems, but this is likely why this build is incredibly slow in CI. SST's appearance here just looks to be due to the fact that it also uses MPI, and any MPI failure will hit it proportionately.
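
Until the failed MPI launches themselves are understood, one possible mitigation (just a sketch, not something already in the tree) is to cap the affected tests well below the 5-minute default so a hung launch costs seconds instead of minutes:

```cmake
# Hypothetical fragment: shorten the timeout on an MPI test so a launch
# failure doesn't burn the full 5-minute default. The test name and the
# 60-second value are illustrative assumptions.
set_tests_properties(Engine.BP.BPWriteReadVector.ADIOS2BPWriteRead2D2x4.BP3.MPI
  PROPERTIES TIMEOUT 60)

# Alternatively, the default for a whole dashboard run could be lowered in the
# ctest -S script before ctest_test() is called:
#   set(CTEST_TEST_TIMEOUT 60)
```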

Finally, in the suse-pgi-openmpi build here:
https://github.com/ornladios/ADIOS2/pull/2252/checks?check_run_id=678394548
We have only one failure, in Staging.1x1.LocalVarying.BPS.BP4_stream, and I know @pnorbert is looking at the heisen-failures in BP4 streaming.

I'm going to keep running this PR (or a similar one) regularly to try to make sure that there aren't actually SST heisen-bugs out there that the rerun facility is making less obvious, but it would probably be good to try to sort out the MPI issues, particularly for the el7-gnu8-openmpi-ohpc build, because of its impact on CI turnaround. @chuckatkins?

@eisenhauer eisenhauer changed the title SST tests are inconsistently failing Inconsistent failures lurking in CI... May 15, 2020
@eisenhauer
Member

Update: I have seen what seems to be a race condition resulting in the delivery of incorrect data in the LockGeometry test. I'm trying to replicate this in such a way that I get some diagnostic info. I have PR #2264 that turns off this test so that it doesn't impede CI, but that PR is itself blocked on a problem with centos7. However, I probably need to sort out this bug before the release is finalized...
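
For context, one mechanism CTest offers for taking a flaky test out of the rotation without deleting it is the DISABLED property; this is a sketch only, it may not match what PR #2264 actually does, and the test name is illustrative:

```cmake
# Hypothetical fragment: mark the flaky LockGeometry staging test as
# DISABLED (CMake 3.9+) so it is reported as "Not Run" instead of gating CI.
set_tests_properties(Staging.1x1.LockGeometry.CommMin.FFS.SST
  PROPERTIES DISABLED TRUE)
```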

@chuckatkins
Contributor Author

@eisenhauer should we leave this open for now, or close it and just open issues for specific tests as they arise?

@eisenhauer
Member

Happy to close this one in favor of more specific ones...
