test battery: flaky tests #2894

Closed
kinow opened this issue Nov 27, 2018 · 34 comments
@kinow (Member) commented Nov 27, 2018

Just to record that the test battery can fail randomly or in certain scenarios (under a certain time zone, when an environment variable is present, etc.). This ticket is just to check whether it would be possible to make it more stable, perhaps avoiding the need to re-run it and, of course, the sporadic failures.

[Edited by @matthewrmshin ] Since #3224, sensitive tests have gained their own hierarchy in the source tree under flakytests/ and run under Travis CI in serial. New flaky tests should be moved to the hierarchy, with the view that it may be possible to move some back to live with the other non-flaky tests. (I have commented out the original list - as the source tree is now the master list.)

List (post #3224 and #3286):

2019-08-25T21:54:01Z DEBUG - zmq:send {'data': (True, 'Messages queued: 1')}
2019-08-25T21:54:02Z CRITICAL - [t2.1] status=running: (received)failed/XCPU at 2019-08-25T21:54:00Z for job(01)
2019-08-25T21:54:02Z DEBUG - [t2.1] -running => failed
2019-08-25T21:54:02Z CRITICAL - [t2.1] -job(01) failed
2019-08-25T21:54:02Z ERROR - AUTOMATIC(ON-TASK-FAILURE)
	Traceback (most recent call last):
	  File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 252, in start
	    self.run()
	  File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 1505, in run
	    self.suite_shutdown()
	  File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 1211, in suite_shutdown
	    raise SchedulerError(self.stop_mode.value)
	cylc.flow.scheduler.SchedulerError: AUTOMATIC(ON-TASK-FAILURE)
2019-08-25T21:54:02Z ERROR - Suite shutting down - AUTOMATIC(ON-TASK-FAILURE)
==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-stop.stderr ====
ClientTimeout: Timeout waiting for server response.
==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-stop.stdout ====
==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-2-validate.stdout-contains-ok.stderr ====
Missing lines:
REGISTERED cylctb-20190825T214903Z/02-on-the-fly -> /home/travis/cylc-run/cylctb-20190825T214903Z/02-on-the-fly

(When adding new ones, please bear in mind that the list is in alphabetical order.)

@kinow kinow added the small label Nov 27, 2018
@kinow kinow added this to the some-day milestone Nov 27, 2018
@kinow (Member, Author) commented Nov 27, 2018

Leaving this unassigned; it's a good issue for any newcomer to the project too 🎉

@matthewrmshin matthewrmshin changed the title 06-prereqs-outputs.t test can fail sometimes test battery: flaky tests Jan 16, 2019
@matthewrmshin matthewrmshin modified the milestones: some-day, soon Jan 16, 2019
@matthewrmshin matthewrmshin added the bug? Not sure if this is a bug or not label Jan 16, 2019
@matthewrmshin (Contributor) commented:

Sorry, I hijacked the issue to record other flaky tests as well.

@kinow (Member, Author) commented Jan 16, 2019

Thanks @matthewrmshin !!! With that list, working on this issue will be a thousand times easier. Thanks!!!

@matthewrmshin (Contributor) commented:

Yes, I also notice that tests/execution-time-limit/04-poll.t fails randomly at times. I'll take a look at these ones and the one currently being discussed in #2929.

@kinow kinow removed the small label Jan 17, 2019
@hjoliver (Member) commented:

Another vote for ./tests/jobscript/19-exit-script.t.

@sadielbartholomew (Collaborator) commented Jan 17, 2019

Could we (eventually) automate a procedure that quantifies 'flakiness', to highlight problematic individual tests?

Notably, we could create a script/workflow that runs the test battery &/or Travis CI a number of times (say, overnight) & compares the failures from each run to pick up any bad tests. Then we could send out emails prompting us to get the issues fixed.

There is actually some interesting literature about this, such as a 2014 study I just skim-read, if anyone is looking for some bed-time reading!
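For the record, a minimal sketch of what such a nightly "flakiness counter" could look like. The runner command (`etc/bin/run-functional-tests`), the prove-style "Failed" summary lines it parses, and the number of runs are all assumptions here rather than the real test-battery interface:

```python
#!/usr/bin/env python3
"""Hypothetical nightly flakiness counter (sketch only)."""
import collections
import re
import subprocess

RUNNER = ['etc/bin/run-functional-tests', 'tests']  # assumed entry point
RUNS = 10  # e.g. an overnight batch

failures = collections.Counter()
for _ in range(RUNS):
    proc = subprocess.run(RUNNER, capture_output=True, text=True)
    # Assume prove-style summary lines such as
    # "tests/foo/01-bar.t .. Failed 1/5 subtests".
    for match in re.finditer(r'^(\S+\.t)\b.*Failed', proc.stdout, re.M):
        failures[match.group(1)] += 1

for test, count in failures.most_common():
    if count < RUNS:  # failed sometimes but not always: a flaky candidate
        print(f'{test}: failed {count}/{RUNS} runs')
```

Comparing the tallies from successive nights would give the longer-baseline "flakiness" measure being suggested here.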

@hjoliver (Member) commented Jan 17, 2019

Travis CI already "knows" which tests are flaky (in its environment, at least) because it sometimes has to run those ones twice to get the test battery to pass. Maybe we can just get it to notify us of which tests it has to run twice - shouldn't be too hard to arrange?

@sadielbartholomew (Collaborator) commented:

it sometimes has to run those ones twice to get the test battery to pass

I am aware of that, but I think two is statistically not the greatest sample size for indication of test reliability. Though your comment made me wonder if there is a Travis CI configuration setting for the number of re-runs. If we up that, then only the relatively small number of possibly flaky tests will need to be re-run, instead of us having to manually restart whole jobs whenever flaky tests crop up in test battery runs.

Maybe we can just get it to notify us of which tests it has to run twice - shouldn't be too hard to arrange?

That would certainly be useful & hopefully simple to configure.

I'll have a look through the Travis CI docs to see what is possible.

@hjoliver (Member) commented:

if there is a Travis CI configuration setting for the number of re-runs?

I believe we manage the rerunning of failed tests ourselves, in our Travis CI config.

@hjoliver (Member) commented Jan 17, 2019

(By "knows" I meant our T-CI build script knows, so maybe we could just add a line to that to send a notification out somehow).
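A minimal sketch of the kind of line that could be added. It assumes the retry step looks roughly like the `call(fn_tests + ' --state=failed -j 5', shell=True)` call that shows up in a traceback later in this thread, and the way the re-run test names are discovered and reported here is purely a placeholder:

```python
import subprocess


def rerun_failed_and_notify(fn_tests: str) -> None:
    """Re-run failed tests and surface their names as flaky candidates."""
    # Same shape as the existing retry step (see the cover.py traceback below).
    result = subprocess.run(
        fn_tests + ' --state=failed -j 5',
        shell=True, capture_output=True, text=True)
    # Anything re-run at this point is, by definition, a flaky-test candidate;
    # printing to the build log is the simplest possible "notification".
    for line in result.stdout.splitlines():
        if line.strip().endswith('.t'):  # assumed: runner echoes one test path per line
            print(f'FLAKY CANDIDATE (needed a re-run): {line.strip()}')
```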

@hjoliver (Member) commented:

two is statistically not the greatest sample size for indication of test reliability

True! But we would gradually identify all flaky tests as they randomly fail over many separate T-CI runs.

@matthewrmshin (Contributor) commented:

The best outcome for this issue is that all tests are fixed and we'll no longer have to re-run them as we do now.

@sadielbartholomew (Collaborator) commented:

The best outcome for this issue

Sure. But thinking about the future, flaky tests do cause issues now & then, so having some procedure for monitoring & fixing unreliable ones would be nice.

@hjoliver (Member) commented:

A general comment - we should provide clear comments in our functional tests on exactly how they are supposed to work, because some of them do very strange things in order to set up test conditions that are difficult to reproduce "naturally". Case in point: #2929 (comment) - git blame fingers me, but I didn't explain it in the code and just had to waste time figuring out all over again how my own test worked!! 😬

@matthewrmshin (Contributor) commented:

@hjoliver Don't worry. I did manage to work out what was going on - and it was (sort of) my fault.

@kinow (Member, Author) commented Mar 28, 2019

Note on ulimits in the Travis distribution used in our builds:

0.00s$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 29790
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 30000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 29790
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Comparing with my local environment, it has the same or higher limits for memory/files/etc., except for max locked memory (kbytes, -l), which is 64 on Travis; in my environment it says max locked memory (kbytes, -l) 16384.

max locked memory (kbytes, -l)

The maximum size that may be locked into memory. Memory locking ensures the memory is always in RAM and never moved to the swap disk

But I do not believe this could cause random failures... I guess.
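(For anyone wanting to compare these limits programmatically rather than eyeballing `ulimit -a`, a quick sketch using Python's standard `resource` module; note that `resource` reports the memory limits in bytes, whereas `ulimit -l`/`-s` report kbytes:)

```python
import resource

# Soft/hard limits corresponding to ulimit -l, -n and -s above.
for name in ('RLIMIT_MEMLOCK', 'RLIMIT_NOFILE', 'RLIMIT_STACK'):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f'{name}: soft={soft} hard={hard}')
```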

@oliver-sanders (Member) commented May 14, 2019

OK, the test is failing because the "health check settings" log line is different:

2019-05-14T10:40:39Z INFO - [foo.1] -health check settings: submission timeout=None, polling intervals=PT2S,...

Running locally I see

2019-05-14T11:26:44+01:00 INFO - [foo.1] -health check settings: execution timeout=PT10S, polling intervals=PT10S,...

@hjoliver (Member) commented:

How to hammer flaky tests on Travis CI:

  1. Activate Travis CI in your own clone.
  2. Make a new branch with a single commit that changes .travis/cover.py to run just the flaky test(s) (see the sketch after this list).
  3. Push the branch to your clone, observe the build results, and re-trigger at will.
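A sketch of what step 2 might look like. Only the `call(... '--state=failed -j 5' ...)` pattern is taken from the real `.travis/cover.py` (it is visible in a traceback quoted further down this thread); the runner path and everything else here are assumptions:

```python
# Hypothetical cut-down .travis/cover.py for hammering one suspect test.
from subprocess import call

FLAKY = 'tests/cylc-poll/16-execution-time-limit.t'  # the test under suspicion


def main():
    # Assumed functional-test runner invocation, restricted to one test.
    fn_tests = 'etc/bin/run-functional-tests -v ' + FLAKY
    call(fn_tests, shell=True)  # nosec
    # Keep the existing retry step so the build still goes green if the
    # test only fails intermittently.
    call(fn_tests + ' --state=failed -j 5', shell=True)  # nosec


if __name__ == '__main__':
    main()
```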

@hjoliver (Member) commented:

(Having done that with cylc-poll/16-execution-time-limit.t, it always passes when run alone! 😬 )

@kinow (Member, Author) commented May 29, 2019

A bit off topic, I think, but here is an interesting post about a team that collected the reasons for their flaky tests and produced a nice summary of each issue, post mortem, etc.:

https://samsaffron.com/archive/2019/05/15/tests-that-sometimes-fail

I like the author's optimism (I think he's a co-founder of Discourse, BTW):

I would like to disagree a bit with Martin. Sometimes I find flaky tests are useful at finding underlying flaws in our application. In some cases when fixing a flaky test, the fix is in the app, not in the test.

@hjoliver (Member) commented:

Haha, the first paragraphs describe our situation exactly 😬

@sadielbartholomew (Collaborator) commented:

A few Travis CI runs of chunk 3/4 have failed on tests/job-submission/18-check-chunking.t (which possibly influences subsequent tests, if any still run), with some shared errors, including a repeated report of Cannot allocate memory: see for example this log & this one, which, after the other error messages, include a traceback as follows:

Traceback (most recent call last):
  File ".travis/cover.py", line 38, in <module>
    main()
  File ".travis/cover.py", line 33, in main
    call(fn_tests + ' --state=failed -j 5', shell=True)  # nosec
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 317, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 769, in __init__
    restore_signals, start_new_session)
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 1447, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

I've added this to the "flaky test" list above.

@kinow (Member, Author) commented Jul 24, 2019

Log of a failure in Travis where a test failed due to "Contact info not found for suite": https://gist.github.com/kinow/1429a911fae2e7cbbb7dcec63e64c653

It looks like when this error occurs, some tests that are not in the list of flaky tests also fail, which could indicate a resource unintentionally shared between tests?

@matthewrmshin matthewrmshin mentioned this issue Jul 25, 2019
@oliver-sanders (Member) commented:

May have a fix for tests/restart/21-task-elapsed.t

@kinow (Member, Author) commented Aug 25, 2019

@matthewrmshin I've added two tests that failed today to a trivial PR, but that was just so I don't forget about them.

Should I instead create a PR for each test that fails under Travis - one that is not related to the change and passes locally in the same branch - to move that test to the flaky tests folder?

@hjoliver (Member) commented:

Should I instead create a PR for each test that fails under Travis - one that is not related to the change and passes locally in the same branch - to move that test to the flaky tests folder?

👍

@hjoliver (Member) commented:

Then we can close this issue? - until someone dreams up a better way to handle flaky tests (which can be a new issue).

@kinow (Member, Author) commented Aug 25, 2019

Then we can close this issue? - until someone dreams up a better way to handle flaky tests (which can be a new issue).

+1 !

@kinow (Member, Author) commented Aug 25, 2019

Will raise a PR for the flaky tests in a few minutes. Closing this one for now 🎉
