test battery: flaky tests #2894

Closed
kinow opened this issue Nov 27, 2018 · 34 comments
@kinow (Member) commented Nov 27, 2018

Just to record that the test battery can fail randomly or in certain scenarios (under a certain time zone, when an environment variable is present, etc.). This ticket is just to check whether it would be possible to make it more stable, perhaps avoiding the need to re-run it and, of course, the sporadic failures.

[Edited by @matthewrmshin ] Since #3224, sensitive tests have gained their own hierarchy in the source tree under flakytests/ and run under Travis CI in serial. New flaky tests should be moved to the hierarchy, with the view that it may be possible to move some back to live with the other non-flaky tests. (I have commented out the original list - as the source tree is now the master list.)

List (post #3224 and #3286):

2019-08-25T21:54:01Z DEBUG - zmq:send {'data': (True, 'Messages queued: 1')}
2019-08-25T21:54:02Z CRITICAL - [t2.1] status=running: (received)failed/XCPU at 2019-08-25T21:54:00Z for job(01)
2019-08-25T21:54:02Z DEBUG - [t2.1] -running => failed
2019-08-25T21:54:02Z CRITICAL - [t2.1] -job(01) failed
2019-08-25T21:54:02Z ERROR - AUTOMATIC(ON-TASK-FAILURE)
	Traceback (most recent call last):
	  File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 252, in start
	    self.run()
	  File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 1505, in run
	    self.suite_shutdown()
	  File "/home/travis/build/cylc/cylc-flow/cylc/flow/scheduler.py", line 1211, in suite_shutdown
	    raise SchedulerError(self.stop_mode.value)
	cylc.flow.scheduler.SchedulerError: AUTOMATIC(ON-TASK-FAILURE)
2019-08-25T21:54:02Z ERROR - Suite shutting down - AUTOMATIC(ON-TASK-FAILURE)
==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-stop.stderr ====
ClientTimeout: Timeout waiting for server response.
==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-stop.stdout ====
==== /tmp/travis/cylctb-20190825T214903Z/registration/02-on-the-fly/02-on-the-fly-cylc-run-dir-2-validate.stdout-contains-ok.stderr ====
Missing lines:
REGISTERED cylctb-20190825T214903Z/02-on-the-fly -> /home/travis/cylc-run/cylctb-20190825T214903Z/02-on-the-fly

(When adding new ones, please bear in mind that the list is in alphabetical order.)

@kinow kinow added the small label Nov 27, 2018
@kinow kinow added this to the some-day milestone Nov 27, 2018
@kinow (Member, Author) commented Nov 27, 2018

Leaving this unassigned; it's a good issue for any newcomer to the project too 🎉

@matthewrmshin matthewrmshin changed the title 06-prereqs-outputs.t test can fail sometimes test battery: flaky tests Jan 16, 2019
@matthewrmshin matthewrmshin modified the milestones: some-day, soon Jan 16, 2019
@matthewrmshin matthewrmshin added the bug? Not sure if this is a bug or not label Jan 16, 2019
@matthewrmshin (Contributor) commented:

Sorry, I hijacked the issue to record other flaky tests as well.

@kinow (Member, Author) commented Jan 16, 2019

Thanks @matthewrmshin !!! With that list, working on this issue will be a thousand times easier. Thanks!!!

@matthewrmshin (Contributor) commented:

Yes, I also notice that tests/execution-time-limit/04-poll.t fails randomly at times. I'll take a look at these ones and the one currently being discussed in #2929.

@kinow kinow removed the small label Jan 17, 2019
@hjoliver (Member) commented:

Another vote for ./tests/jobscript/19-exit-script.t.

@sadielbartholomew (Collaborator) commented Jan 17, 2019

Could we (eventually) automate a procedure that quantifies 'flakiness', to highlight problematic individual tests?

Notably, we could create a script/workflow that runs the test battery &/or Travis CI a number of times (say, overnight) & compares the failures from each run to pick up any bad tests. Then we could send out emails prompting us to get the issues fixed.

There is actually some interesting literature about this, such as a 2014 study I just skim-read, if anyone is looking for some bed-time reading!
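For the record, a minimal sketch of what such a nightly "flakiness counter" could look like. The runner command (`etc/bin/run-functional-tests`), the prove-style "Failed" summary lines it parses, and the number of runs are all assumptions here rather than the real test-battery interface:

```python
#!/usr/bin/env python3
"""Hypothetical nightly flakiness counter (sketch only)."""
import collections
import re
import subprocess

RUNNER = ['etc/bin/run-functional-tests', 'tests']  # assumed entry point
RUNS = 10  # e.g. an overnight batch

failures = collections.Counter()
for _ in range(RUNS):
    proc = subprocess.run(RUNNER, capture_output=True, text=True)
    # Assume prove-style summary lines such as
    # "tests/foo/01-bar.t .. Failed 1/5 subtests".
    for match in re.finditer(r'^(\S+\.t)\b.*Failed', proc.stdout, re.M):
        failures[match.group(1)] += 1

for test, count in failures.most_common():
    if count < RUNS:  # failed sometimes but not always: a flaky candidate
        print(f'{test}: failed {count}/{RUNS} runs')
```

Comparing the tallies from successive nights would give the longer-baseline "flakiness" measure being suggested here.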

@hjoliver (Member) commented Jan 17, 2019

Travis CI already "knows" which tests are flaky (in its environment, at least) because it sometimes has to run those ones twice to get the test battery to pass. Maybe we can just get it to notify us of which tests it has to run twice - shouldn't be too hard to arrange?

@sadielbartholomew (Collaborator) commented:

it sometimes has to run those ones twice to get the test battery to pass

I am aware of that, but I think two is statistically not the greatest sample size for indication of test reliability. Though your comment made me wonder if there is a Travis CI configuration setting for the number of re-runs. If we up that, then only the relatively small number of possibly flaky tests will need to be re-run, instead of us having to manually restart whole jobs whenever flaky tests crop up in test battery runs.

Maybe we can just get it to notify us of which tests it has to run twice - shouldn't be too hard to arrange?

That would certainly be useful & hopefully simple to configure.

I'll have a look through the Travis CI docs to see what is possible.

@hjoliver (Member) commented:

if there is a Travis CI configuration setting for the number of re-runs?

I believe we manage the rerunning of failed tests ourselves, in our Travis CI config.

@hjoliver (Member) commented Jan 17, 2019

(By "knows" I meant our T-CI build script knows, so maybe we could just add a line to that to send a notification out somehow).
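A minimal sketch of the kind of line that could be added. It assumes the retry step looks roughly like the `call(fn_tests + ' --state=failed -j 5', shell=True)` call that shows up in a traceback later in this thread, and the way the re-run test names are discovered and reported here is purely a placeholder:

```python
import subprocess


def rerun_failed_and_notify(fn_tests: str) -> None:
    """Re-run failed tests and surface their names as flaky candidates."""
    # Same shape as the existing retry step (see the cover.py traceback below).
    result = subprocess.run(
        fn_tests + ' --state=failed -j 5',
        shell=True, capture_output=True, text=True)
    # Anything re-run at this point is, by definition, a flaky-test candidate;
    # printing to the build log is the simplest possible "notification".
    for line in result.stdout.splitlines():
        if line.strip().endswith('.t'):  # assumed: runner echoes one test path per line
            print(f'FLAKY CANDIDATE (needed a re-run): {line.strip()}')
```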

@hjoliver (Member) commented:

two is statistically not the greatest sample size for indication of test reliability

True! But we would gradually identify all flaky tests as they randomly fail over many separate T-CI runs.

@matthewrmshin (Contributor) commented:

The best outcome for this issue is that all tests are fixed and we'll no longer have to re-run them as we do now.

@sadielbartholomew (Collaborator) commented:

The best outcome for this issue

Sure. But thinking about the future, flaky tests do cause issues now & then, so having some procedure for monitoring & fixing unreliable ones would be nice.

@hjoliver (Member) commented:

A general comment - we should provide clear comments in our functional tests on exactly how they are supposed to work, because some of them do very strange things in order to set up test conditions that are difficult to reproduce "naturally". Case in point: #2929 (comment) - git blame fingers me, but I didn't explain it in the code and just had to waste time figuring out all over again how my own test worked!! 😬

@matthewrmshin (Contributor) commented:

@hjoliver Don't worry. I did manage to work out what was going on - and it was (sort of) my fault.

@kinow (Member, Author) commented Mar 28, 2019

Note on ulimits in the Travis distribution used in our builds:

0.00s$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 29790
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 30000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 29790
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Comparing with my local environment, it has the same or higher limits for memory/files/etc., except for max locked memory (kbytes, -l), which is 64 on Travis; in my environment it says max locked memory (kbytes, -l) 16384.

max locked memory (kbytes, -l)

The maximum size that may be locked into memory. Memory locking ensures the memory is always in RAM and never moved to the swap disk

But I do not believe this could cause random failures... I guess.
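(For anyone wanting to compare these limits programmatically rather than eyeballing `ulimit -a`, a quick sketch using Python's standard `resource` module; note that `resource` reports the memory limits in bytes, whereas `ulimit -l`/`-s` report kbytes:)

```python
import resource

# Soft/hard limits corresponding to ulimit -l, -n and -s above.
for name in ('RLIMIT_MEMLOCK', 'RLIMIT_NOFILE', 'RLIMIT_STACK'):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f'{name}: soft={soft} hard={hard}')
```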

@oliver-sanders (Member) commented May 14, 2019

OK, the test is failing because the "health check settings" log line is different:

2019-05-14T10:40:39Z INFO - [foo.1] -health check settings: submission timeout=None, polling intervals=PT2S,...

Running locally I see

2019-05-14T11:26:44+01:00 INFO - [foo.1] -health check settings: execution timeout=PT10S, polling intervals=PT10S,...

@hjoliver (Member) commented:

How to hammer flaky tests on Travis CI:

  1. Activate Travis CI in your own clone.
  2. Make a new branch with a single commit that changes .travis/cover.py to run just the flaky test(s) (see the sketch after this list).
  3. Push the branch to your clone, observe the build results, and re-trigger at will.
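A sketch of what step 2 might look like. Only the `call(... '--state=failed -j 5' ...)` pattern is taken from the real `.travis/cover.py` (it is visible in a traceback quoted further down this thread); the runner path and everything else here are assumptions:

```python
# Hypothetical cut-down .travis/cover.py for hammering one suspect test.
from subprocess import call

FLAKY = 'tests/cylc-poll/16-execution-time-limit.t'  # the test under suspicion


def main():
    # Assumed functional-test runner invocation, restricted to one test.
    fn_tests = 'etc/bin/run-functional-tests -v ' + FLAKY
    call(fn_tests, shell=True)  # nosec
    # Keep the existing retry step so the build still goes green if the
    # test only fails intermittently.
    call(fn_tests + ' --state=failed -j 5', shell=True)  # nosec


if __name__ == '__main__':
    main()
```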

@hjoliver (Member) commented:

(Having done that with cylc-poll/16-execution-time-limit.t, it always passes when run alone! 😬 )

@kinow (Member, Author) commented May 29, 2019

A bit off topic, I think, but here is an interesting post about a team that collected the reasons for their flaky tests and produced a nice summary of each issue, post mortem, etc.:

https://samsaffron.com/archive/2019/05/15/tests-that-sometimes-fail

I like the author's optimism (I think he's a co-founder of Discourse, BTW):

I would like to disagree a bit with Martin. Sometimes I find flaky tests are useful at finding underlying flaws in our application. In some cases when fixing a flaky test, the fix is in the app, not in the test.

@hjoliver (Member) commented:

Haha, the first paragraphs describe our situation exactly 😬

@sadielbartholomew (Collaborator) commented:

A few Travis CI runs of chunk 3/4 have failed on tests/job-submission/18-check-chunking.t (which possibly influences subsequent tests, if any still run), with some shared errors, including a repeated report of Cannot allocate memory: see for example this log & this one, which, after the other error messages, include a traceback as follows:

Traceback (most recent call last):
  File ".travis/cover.py", line 38, in <module>
    main()
  File ".travis/cover.py", line 33, in main
    call(fn_tests + ' --state=failed -j 5', shell=True)  # nosec
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 317, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 769, in __init__
    restore_signals, start_new_session)
  File "/opt/python/3.7.1/lib/python3.7/subprocess.py", line 1447, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

I've added this to the "flaky test" list above.

@kinow (Member, Author) commented Jul 24, 2019

Log of a failure in Travis where a test failed due to "Contact info not found for suite": https://gist.github.com/kinow/1429a911fae2e7cbbb7dcec63e64c653

It looks like when this error occurs, some tests that are not in the list of flaky tests also fail, which could indicate a resource unintentionally shared between tests?

@matthewrmshin matthewrmshin mentioned this issue Jul 25, 2019
@oliver-sanders (Member) commented:

May have a fix for tests/restart/21-task-elapsed.t

@kinow (Member, Author) commented Aug 25, 2019

@matthewrmshin I've added two tests that failed today to a trivial PR, but that was just so I don't forget about them.

Should I instead create a PR for each test that fails under Travis - one that is not related to the change and passes locally in the same branch - to move that test to the flaky tests folder?

@hjoliver (Member) commented:

Should I instead create a PR for each test that fails under Travis - one that is not related to the change and passes locally in the same branch - to move that test to the flaky tests folder?

👍

@hjoliver (Member) commented:

Then we can close this issue? - until someone dreams up a better way to handle flaky tests (which can be a new issue).

@kinow (Member, Author) commented Aug 25, 2019

Then we can close this issue? - until someone dreams up a better way to handle flaky tests (which can be a new issue).

+1 !

@kinow (Member, Author) commented Aug 25, 2019

Will raise a PR for the flaky tests in a few minutes. Closing this one for now 🎉
