
Inconsistent failures lurking in CI... #2251

Closed

chuckatkins opened this issue May 14, 2020 · 11 comments

@chuckatkins
Contributor

Some of the Staging.*.SST tests are failing or timing out with such frequency they are blocking CI from proceeding.

Just one example: https://open.cdash.org/viewTest.php?onlyfailed&buildid=6527861
But many of the open PRs are blocked on a similar failure.

@chuckatkins chuckatkins added the triage: high This issue is a blocker and has to be addressed before the next release or milestone label May 14, 2020
@eisenhauer
Member

Does this seem to be mostly Debian?

@chuckatkins
Contributor Author

It may be related to disabling test retries. @KyleFromKitware was there a specific reason for turning it off?

@KyleFromKitware
Collaborator

> @KyleFromKitware was there a specific reason for turning it off?

It seems that was cargo-culted from whatever script I copied from :) I'll turn it back on.
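
For reference, a minimal sketch of what re-enabling the retries could look like in a CTest 3.17+ dashboard script; the flag spelling and retry count below are assumptions, not necessarily what our CI scripts actually use:

```cmake
# Hypothetical fragment of a ctest -S dashboard script.
# REPEAT UNTIL_PASS requires CMake/CTest 3.17 or newer; the count of 3
# is an assumption, not the value actually used in CI.
ctest_test(
  PARALLEL_LEVEL 4
  REPEAT UNTIL_PASS:3   # allow each test up to 3 attempts to pass
  RETURN_VALUE test_result
)
```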

@eisenhauer
Member

Ah, but that answers a pending question I had about turning off test repeats so I could try to track down the rarer heisenbugs. I do still seem to have Debian builds failing in setup.

@chuckatkins
Contributor Author

That still leaves the question of why the failures occur at all, and they do seem to usually be associated with SST. Your suspicion that the retries were masking deadlocks / race conditions is likely correct, but enabling retries for now should allow us to make forward progress on the release until the underlying issues can be dealt with.

@eisenhauer
Member

Agreed. I'm moving forward with a diagnostic PR that has test repeats turned off and SstVerbose turned on for some set of the tests. What failures I've seen seem to be in 1x1 versions. Whether or not the verbose output will change the timing and make the tests not fail is anyone's guess, but it's what I can do now.
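
For anyone following along, one way to get that verbose output in CI is to inject the SstVerbose environment variable into selected tests through CTest properties; this is only a sketch, with an illustrative test name and verbosity level rather than the exact set used in the diagnostic PR:

```cmake
# Hypothetical CMakeLists.txt fragment: export SstVerbose into the
# environment of one staging test so the SST engine logs its activity.
# The test name and the level of 5 are illustrative assumptions.
set_tests_properties(Staging.1x1.CommMin.FFS.SST
  PROPERTIES ENVIRONMENT "SstVerbose=5")
```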

@chuckatkins
Contributor Author

Worked around for now, so this no longer needs the triage: high label.

@chuckatkins chuckatkins removed the triage: high This issue is a blocker and has to be addressed before the next release or milestone label May 15, 2020
@chuckatkins chuckatkins changed the title SST tests are failing and holding up CI SST tests are inconsistently failing May 15, 2020
@eisenhauer
Member

eisenhauer commented May 15, 2020

Based upon runs of PR #2252, which turns off the rerun-until-success feature, I'm going to expand the scope of this issue beyond SST. In particular, looking at several runs of this PR, I don't find any failures that are SST-specific, but rather a variety of other problems that we should probably deal with.

For example, in the win2016_vs2017_msmpi_ninja build here: https://dev.azure.com/ornladios/adios2/_build/results?buildId=1913&view=logs&jobId=7f7df6e2-d428-52cf-2d19-85fa4b5b6db8
There are 9 failures, all of them BP, bindings, or interface tests where it looks like MPI failed (all fast failures, thankfully). No idea what this might represent, and it may just be the unreliability of MPI, but fortunately, since these failures are quick, test reruns would carry no big consequences.

However in the el7-gnu8-openmpi-ohpc build here:
https://github.com/ornladios/ADIOS2/pull/2252/checks?check_run_id=678394519
we have 16 test failures:

  • Interface.ADIOSInterfaceWriteTest.DefineVar_uint64_t_1x10.MPI
  • Interface.ADIOSDefineVariableTest.DefineGlobalArrayConstantDims.MPI
  • Engine.BP.BPWriteReadTestADIOS2stdio.OpenEngineTwice.BP3.MPI
  • Engine.BP.BPWriteReadTestADIOS2stdio.OpenEngineTwice.BP4.MPI
  • Engine.BP.BPWriteReadAsStreamTestADIOS2_Threads.ADIOS2BPWriteRead1D8.BP3.MPI
  • Engine.BP.BPWriteReadVector.ADIOS2BPWriteRead2D2x4.BP3.MPI
  • Engine.BP.BPWriteReadVector.ADIOS2BPWriteReadVector2D4x2_MultiSteps.BP4.MPI
  • Engine.BP./BPWRZFP.ADIOS2BPWRZFP1D/.BP4.MPI
  • Engine.BP./BPWriteReadZfpConfig.ADIOS2BPWriteReadZfp3DSel/.BP4.MPI
  • Engine.BP./BPWriteReadBZIP2.ADIOS2BPWriteReadBZIP21DSel/.BP4.MPI
  • Engine.SSC.SscEngineTest.TestSscReaderMultiblock.MPI
  • Staging.KillWriterTimeout_2x2.CommMin.FFS.SST
  • Staging.DelayedReader_3x5.CommPeer.FFS.SST
  • Staging.3x5.BP3
  • Staging.5x3.BP3
  • Staging.5x3.LocalVarying.BP3

SST is represented in these tests, but digging into the logs, it looks like all of these failures are because the MPI job isn't actually starting (or one of the MPI jobs, in the case of the SST tests). This actually has some significant consequences, because we run the full 5-minute default test timeout for all of these except the staging tests, which are set to 30 seconds. Test rerun hides these odd MPI failure-to-execute problems, but this is likely why this build is incredibly slow in CI. SST's appearance here just looks to be due to the fact that it also uses MPI, and any MPI failure will hit it proportionately.
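
Until the failed MPI launches themselves are understood, one possible mitigation (just a sketch, not something already in the tree) is to cap the affected tests well below the 5-minute default so a hung launch costs seconds instead of minutes:

```cmake
# Hypothetical fragment: shorten the timeout on an MPI test so a launch
# failure doesn't burn the full 5-minute default. The test name and the
# 60-second value are illustrative assumptions.
set_tests_properties(Engine.BP.BPWriteReadVector.ADIOS2BPWriteRead2D2x4.BP3.MPI
  PROPERTIES TIMEOUT 60)

# Alternatively, the default for a whole dashboard run could be lowered in the
# ctest -S script before ctest_test() is called:
#   set(CTEST_TEST_TIMEOUT 60)
```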

Finally, in the suse-pgi-openmpi build here:
https://github.com/ornladios/ADIOS2/pull/2252/checks?check_run_id=678394548
We have only one failure, in Staging.1x1.LocalVarying.BPS.BP4_stream, and I know @pnorbert is looking at the heisen-failures in BP4 streaming.

I'm going to keep running this PR (or a similar one) regularly to try to make sure that there aren't actually SST heisen-bugs out there that the rerun facility is making less obvious, but it would probably be good to try to sort out the MPI issues, particularly for the el7-gnu8-openmpi-ohpc build, because of its impact on CI turnaround. @chuckatkins?

@eisenhauer eisenhauer changed the title SST tests are inconsistently failing Inconsistent failures lurking in CI... May 15, 2020
@eisenhauer
Member

Update: I have seen what seems to be a race condition resulting in the delivery of incorrect data in the LockGeometry test. I'm trying to replicate this in such a way that I get some diagnostic info. I have PR #2264 that turns off this test so that it doesn't impede CI, but that PR is itself blocked on a problem with centos7. However, I probably need to sort out this bug before the release is finalized...
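
For context, one mechanism CTest offers for taking a flaky test out of the rotation without deleting it is the DISABLED property; this is a sketch only, it may not match what PR #2264 actually does, and the test name is illustrative:

```cmake
# Hypothetical fragment: mark the flaky LockGeometry staging test as
# DISABLED (CMake 3.9+) so it is reported as "Not Run" instead of gating CI.
set_tests_properties(Staging.1x1.LockGeometry.CommMin.FFS.SST
  PROPERTIES DISABLED TRUE)
```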

@chuckatkins
Contributor Author

@eisenhauer should we leave this open for now, or close it and just open issues for specific tests as they arise?

@eisenhauer
Member

Happy to close this one in favor of more specific ones...
