-
Notifications
You must be signed in to change notification settings - Fork 387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make tests less flaky on CI #716
Conversation
8a21392
to
93b7337
Compare
Codecov ReportPatch coverage has no change and project coverage change:
Additional details and impacted files@@ Coverage Diff @@
## master #716 +/- ##
==========================================
+ Coverage 96.28% 96.65% +0.36%
==========================================
Files 27 27
Lines 3554 3554
==========================================
+ Hits 3422 3435 +13
+ Misses 132 119 -13 see 7 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
092eabd
to
8e2e933
Compare
84a34a4
to
53bc8c1
Compare
cde26c3
to
a41f564
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments. Overall, I'm still 👍 on this, but realized there's a few more sections where it'd be helpful to have more of the history/rationale of the change documented, not only for me but also future git blame
spelunkers.
Would also appreciate if @ceache takes a look, although IMO we shouldn't wait for him before merging this if he doesn't have time.
This is fine for this PR, but is this something we should consider changing/fixing in |
I'll try to review this thoroughly this week. Sorry I have been quite
distracted recently.
Besides the flakiness within the tests themselves, i was looking at the
harness recently and I wanted to know how you felt about swapping our
custom zookeeper ensemble orchestration driver for some docker compose
driver.
This would remove a lot of complexity and allow us to know we are starting
the tests only after the ensemble is actually up and ready.
I know how to craft such a compose definition and used it numerous times
for my own testing.
The down side is introducing a dependency on docker for local testing (or
we could leave the "native" harness around for docker-less development)
What do you think?
…On Tue, Apr 11, 2023, 18:05 Stephen Sorriaux ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In kazoo/testing/harness.py
<#716 (comment)>:
> + client_cluster_health.start()
+ client_cluster_health.ensure_path("/")
+ client_cluster_health.stop()
+ self.log(logging.INFO, "cluster looks ready to go")
+ break
+ except Exception:
+ tries += 1
+ if tries >= MAX_INIT_TRIES:
+ raise
+ if tries > 0 and tries % 2 == 0:
+ self.log(
+ logging.WARNING,
+ "nuking current cluster and making another one",
+ )
+ self.cluster.terminate()
+ self.cluster.start()
Yes, there is already some sleep() hidden in cluster.start() (
https://github.com/python-zk/kazoo/blob/master/kazoo/testing/common.py#L370
)
—
Reply to this email directly, view it on GitHub
<#716 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIFTHUSQS3TH5DKY3SMGM3XAXIQ7ANCNFSM6AAAAAAWHB5WWQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Fixed the `test_read_only` test based on what is done in the official Java client, we were missing some things and that led the test to fail from time to time (see comments in the code for full story). Also renamed `test__connection.py` to `test_connection.py`, where it belongs.
Now performing a ZK cluster nuking at the end of each test case since keeping the same ZK state from test cases to test cases actually led to some tests randomly failing. Added `client_cluster_health` ZK client in the `setup_zookeeper()` step that is there to "guarantee" the cluster is up and running, without even considering the configuration that will be used by the test itself.
Activate `pytest`'s `logcli` setting to better read the client lifecycle during testing. Add `log()` function to KazooTestHarness class so that it is possible to log like crazy when something is not working Display the ZK client port in logs when starting a ZK server (useful for test "debug") Be able to get more than the last 100 lines of ZK logs (can be useful, believe me)
a41f564
to
64af698
Compare
Yes, I think we should consider looking into it, especially what to expect of the retry when used as a |
I'll be all in for a docker based harness! I personally have java installed on my desktop just for kazoo... |
I would 110% support a docker-based harness. No problem adding that as a test/dev dependency, it's almost certainly more accessible to drive-by devs as well who are going to be familiar with that from other libraries doing similar things. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good to me, thank you so much for the detailed explanations, not only for me but also on behalf of other folks who stumble across this later.
Probably best to wait to merge til Charles get a chance to look at it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Thanks a lot @StephenSorriaux !
Update: I think #685 tried to tackle the "issue" I mentionned |
I will try to rebase my PR, now that this is merged.
Would there be anything holding it back ?
I understood Jeff's comment to mean he thought it was ok.
…On Mon, Apr 24, 2023, 21:55 Stephen Sorriaux ***@***.***> wrote:
- I added a "custom" retry logic without using a KazooRetry since
KazooRetry does not handle refused/dropped connections like I would
want too: even if a connection_retry is given to KazooClient(), the
retry configuration is actually not respected because the interrupt()
end up being triggered anyway (ie. the client._stopped event)
This is fine for this PR, but is this something we should consider
changing/fixing in KazooClient/KazooRetry long-term?
Yes, I think we should consider looking into it, especially what to expect
of the retry when used as a connection_retry parameter for the client
Update: I think #685 <#685> tried
to tackle the "issue" I mentionned
—
Reply to this email directly, view it on GitHub
<#716 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIFTHVM7SYTPRBYKHIHEWTXC3LERANCNFSM6AAAAAAWHB5WWQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Sorry for the rebase, I think I owe you a review on #685, will take the time now! |
Why is this needed?
This makes the tests way less flaky, I hope (and first results confirmed it... but you never know).
Proposed Changes
test__connection.py
totest_connection.py
, where it belongsclient_cluster_health
in thesetup_zookeeper
step that is there to "guarantee" the cluster is up and running, without even considering the configuration that will be used by the test itself. I added a "custom" retry logic without using aKazooRetry
sinceKazooRetry
does not handle refused/dropped connections like I would want too: even if aconnection_retry
is given toKazooClient()
, the retry configuration is actually not respected because theinterrupt()
end up being triggered anyway (ie. theclient._stopped
event)test_read_only
test based on what is done in the official java client (https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/test/java/org/apache/zookeeper/test/ReadOnlyModeTest.java), we were missing some things and that led the test to fail from time to time (see comments in the code)log_cli
configuration, as it gives more logs for each tests and I find it usefulDoes this PR introduce any breaking change?
Nope.