(Flaky-tests) Reduce flakiness of tests #6202
It would be easier to fix more of these in a single PR because otherwise, each of the smaller PRs would need to be re-run many times just to get all of the flaky tests to pass.
Increased timeouts for state tests. apache#6200 apache#6198
Increased timeouts to testSimpleConsumerEventsWithoutPartition and introduced await to poll on assertions to eliminate use of Thread.sleep in several places. (apache#6014)
Attempting to fix testPulsarKafkaProducerWithSerializer issue by adding await to test. (apache#6137)
Attempt to fix apache#6207 and add more debugging information by pruning docker containers.
Fixed typo in docker commands for getting debug info. apache#6207.
Removing timeouts as per comments in apache#5333. This is for apache#6202.
Fixed timeout issues for CPP tests. apache#6202 and apache#6137
Increased more timeouts. apache#6202 and apache#6137
Fixed typo in CPP test timeout fix. apache#6202 apache#4884
Edited comment to trigger build apache#6202
…y broke two of the tests that depend on timeout configurations. Those changes will require more investigation. apache#6202
Rolled back changes to PulsarSpoutTest because fixing some instability broke two of the tests that depend on timeout configurations. Those changes will require more investigation. apache#6202
…at was accidentally included in ReaderTest. apache#6202
@tuteng I see that you beat me to fixing the GitHub YAML files that were causing the 2 GB build error. I had some of that code in this PR, but I hadn't gotten all of the tests to pass yet, so I wasn't able to merge. Your change was more robust anyway.
Added timeouts back in places where required. Increased timeouts though. apache#6202
Fixed timeouts for Storm and Kafka tests. Also removed debug block that was accidentally included in ReaderTest. apache#6202
Editing comment to trigger new build. apache#6202
Attempt to workaround test failure. apache#6202
Adding some timeouts back to get beyond hanging tests. apache#6202
Increased sleep value as temporary workaround for thread timeout. apache#6202
Added back timeouts to fix hang but increased timeouts from 1s to 5s. apache#6202
Added back timeout (but made it longer) to prevent hanging test. apache#6202
Fixed formatting since it was breaking the build. apache#6202
My main concern is changing receive to receive(x, TimeUnit.SECONDS). I actually think we should do the reverse: changing receive(x, TimeUnit.SECONDS) to receive().
@@ -1646,7 +1646,7 @@ public void persistentTopicsCursorResetAfterReset(String topicName) throws Excep
         });

         for (int i = 0; i < 10; i++) {
-            Message<byte[]> message = consumer.receive();
+            Message<byte[]> message = consumer.receive(5, TimeUnit.SECONDS);
I am not sure it is a good idea to change this to 5 seconds, because it can cause flakiness if the JVM pauses for 5 seconds.
@sijie I'll double-check on this specific test, but on a lot of the tests, what was happening is that the receive() call was either hanging or timing out too soon (before all the messages were received). The timeouts that were too short involved consumer.receive(1, TimeUnit.SECONDS); (if I remember correctly), and for one of them, consumer.receive(3, TimeUnit.SECONDS); was still too short. So, I increased it to consumer.receive(5, TimeUnit.SECONDS); which seemed to get the tests to pass consistently.
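To make the trade-off in this thread concrete, here is a minimal sketch that uses a plain java.util.concurrent.BlockingQueue as a stand-in for the consumer. This is an assumption for illustration only; it is not the Pulsar client, and ReceiveTimeoutSketch and receiveAll are hypothetical names. In the sketch, poll(timeout, unit) plays the role of consumer.receive(x, TimeUnit.SECONDS) and returns null on timeout, while take() would correspond to the untimed receive().

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in: a BlockingQueue plays the role of the consumer.
public class ReceiveTimeoutSketch {

    // Receives up to `expected` messages, waiting at most `timeout` for each one.
    // Stops at the first timeout and returns how many messages arrived.
    static int receiveAll(BlockingQueue<String> queue, int expected, long timeout, TimeUnit unit) {
        int received = 0;
        for (int i = 0; i < expected; i++) {
            String message;
            try {
                // Like a timed receive: returns null if nothing arrives in time.
                // queue.take() would be the untimed variant and could block forever.
                message = queue.poll(timeout, unit);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            if (message == null) {
                break; // timed out: report a short count instead of hanging
            }
            received++;
        }
        return received;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 10; i++) {
            queue.add("val1-" + i);
        }
        int received = receiveAll(queue, 10, 5, TimeUnit.SECONDS);
        System.out.println("received " + received + " of 10");
    }
}
```

With a timeout that is too short, the loop stops early and the test sees fewer messages than were sent; with no timeout at all, a missing message makes the test hang instead of fail.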
…sorResetAfterReset(..) test. apache#6202
Increased more test timeouts to get them to pass on slow hardware. apache#6202
Increased more test timeouts to get them to pass on slow hardware. apache#6202
Edited more test timeouts to get them to pass on slow hardware. apache#6202
Triggering tests due to 'Could not transfer artifact' maven issue. apache#6202
Increased or edited timeouts to get more tests to pass. apache#6202
Triggering new build by changing comment. apache#6202
Fixed timeouts (to short timeouts) when null message is expected. apache#6202
Triggering new build by changing comment. apache#6202
Increased timeout. apache#6202
Increased sleep as temporary workaround. apache#6202
Tuned timeouts more. apache#6202
Widening time to force timeout in timeout test. apache#6202
Fixed spelling typo. apache#6202
Added randomization of namespace name. apache#6202
Added random name generator to names of producers, subscriptions, and topics in ClientDeduplicationTest to fix duplicate name conflicts. apache#6202
Fixed issues with duplicate namespaces with repeated test runs. apache#6202
Added randomization to topic name to prevent potential conflicts that might be causing non-determinism in test. apache#6202
Added randomization to namespace name to prevent issues with topics not clearing out before second run of tests. apache#6202
Attempt to get C++ test fixed. It's not clear if this commit will build though... apache#6202
Replaced snake_case with camelCase to try to get c++ format to pass the build. apache#6202
Adding random name to subscription to see if that resolves the fact that this test only fails on the second subsequent run. apache#6202
Fixed timeout issues. apache#6202
Attempting fix of testPerTopicStats() by addressing race condition. apache#6202
Adding some debugging to help troubleshoot flaky test. apache#6202
Removing code that wasn't building anyway. apache#6202
Changed how we're testing Prometheus by filtering the topic name to fix race conditions between test runs and sharing broker state. apache#6202
Added more debugging information and fixed assertion apache#6202
Trigger new build apache#6202
Added long timeouts to ensure that broker tests do timeout instead of hanging but without timing out too soon apache#6202
Fixed imports for TimeUnit apache#6202
Fixed imports for TimeUnit apache#6202
Pushing changes to allow discussion on what's happening. apache#6202
Fixed timeouts for the testSharedSingleAckedPartitionedTopic() test. apache#6202
Fixed issue with Prometheus test. apache#6202
Can't use receive with timeout, if the queue size is 0. Fixed InterceptorsTest. apache#6202
Can't use receive with timeout, if the queue size is 0. apache#6202
Fixed Can't use receive with timeout, if the queue size is 0. apache#6202
Edited comment to trigger re-run of all tests to find more flaky tests. apache#6202
Fixed more of the concurrency issue in testPerTopicStats that was causing namespace conflicts. apache#6202
Fixed something I missed during rebasing. apache#6202
Fixed issues with Prometheus tests. apache#6256
Changed MessageId.latest to MessageId.earliest to fix apache#6224
Fixes issue apache#6352
Triggering build to inspect test results. apache#6202
Added timeouts to fix hanging tests. apache#6202
Triggering new build. apache#6202
Updating Github workflow to build surefire artifacts if previous step was cancelled, not just failed. apache#6202
Changing CI Unit Action to always build surefire artifacts to help with debugging hanging test. apache#6202
Triggering new build with arbitrary edit. apache#6202
Triggering build with arbitrary change to comment apache#6202
Triggering new build with arbitrary code change. apache#6202
Triggering new build with arbitrary code change. apache#6202
Changing surefire trigger back to failure() apache#6202
Added surefire artifacts to run always again. apache#6202
Triggering new build. apache#6202
Added condition to make testPartitions() more robust during repeated runs apache#6202
Implementing Sijie's suggestion about timeout for persistentTopicsCursorResetAfterReset(..) test. apache#6202
@sijie Since it might be a while before I'm able to get back to this, we probably should get it reviewed and merged so at least some of the flaky tests are fixed. I'll go ahead and rebase it now.
Added awaitility to two pom files.
Fixed file that I forgot to merge. apache#6202
Increased robustness of testPartitions() for repeated execution. apache#6202
Added more debugging to ParserProxyHandler's channelRead, changed test from private to public, and decreased test noise. apache#6332
Trying to get more debug info apache#6332
Added more debugging log statements to try to pinpoint where the failure happens. apache#6332
Added more debugging log statements to try to pinpoint where the failure happens. apache#6332
Added even more debugging for tracing purposes. apache#6332
Added even more debugging for tracing purposes. apache#6332
Rolling back unnecessary changes. apache#6202
Rolling back unnecessary changes. apache#6202
Fixed issue with testDeadLetterTopic() where redelivery was getting triggered. apache#6202
Adding more debug information and methods to test hypothesis. apache#6332
Adding keepAlive to ServerConnection to see what that does. apache#6332
Increasing ProxyServer keepAliveInterval to 90 seconds in case it is timing out during server tests. apache#6332
Rolling back changes. apache#6332
/pulsarbot run-failure-checks
@sijie On second thought, it looks like the remaining test failures are rather stubborn, so this PR might just need to stay on hold until I'm able to resume working on it.
We have lots of CI failures. Is it worth rebasing onto the current master?
@devinbost: Thanks for your contribution. For this PR, do we need to update docs?
Closed as stale. There are too many conflicts to continue :/
The related work, if still valuable, can be rebased on master and submitted one piece at a time so that we can merge it quickly.
Increased test timeouts and made other changes to tests.
Fixes #2647
Fixes #2651
Fixes #6014
Fixes #6198
Fixes #6224
Fixes #6232
Fixes #6254
Fixes #6256
Fixes #6299
Fixes #6304
Fixes #6306
Fixes #6352
Fixes other flaky tests that still need issues created for them.
Master Issue: #6137
Motivation
Need to resolve these flaky tests.
Modifications
Among other things, this PR fixes all issues like this:
java.lang.AssertionError: expected [val1-9] but found [val1-6]
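The commits in this PR mention adding awaitility and introducing await to poll on assertions instead of relying on Thread.sleep. As a rough illustration of that idea, here is a minimal sketch that uses only the JDK rather than the Awaitility library itself. PollingAssertSketch and awaitCondition are hypothetical names, and the producer thread is an assumed stand-in for whatever asynchronous work a real test waits on.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper: a JDK-only stand-in for polling on an assertion.
public class PollingAssertSketch {

    // Re-checks `condition` every `intervalMillis` until it holds or `timeoutMillis` elapses.
    // Returns true if the condition became true in time, false on timeout or interrupt.
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger lastSeen = new AtomicInteger(-1);

        // Assumed stand-in for asynchronous delivery of val1-0 .. val1-9.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 10; i++) {
                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                lastSeen.set(i);
            }
        });
        producer.start();

        // Poll for the final value instead of sleeping a fixed time and asserting once.
        boolean ok = awaitCondition(() -> lastSeen.get() == 9, 5_000, 10);
        producer.join();
        System.out.println(ok ? "saw val1-9" : "timed out at val1-" + lastSeen.get());
    }
}
```

A fixed sleep followed by a single assertion can observe an intermediate value on slow hardware; polling returns as soon as the condition holds and only fails once the full timeout has passed.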