Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Flaky-tests) Reduce flakiness of tests #6202

Closed
wants to merge 1 commit into from

Conversation

devinbost
Copy link
Contributor

@devinbost devinbost commented Feb 4, 2020

Increased test timeouts and made other changes to tests.

Fixes #2647
Fixes #2651
Fixes #6014
Fixes #6198
Fixes #6224
Fixes #6232
Fixes #6254
Fixes #6256
Fixes #6299
Fixes #6304
Fixes #6306
Fixes #6352

Fixes other flaky tests that still need issues created for them.

Master Issue: #6137

Motivation

Need to resolve these flaky tests.

Modifications

  1. Fixes issues caused by test timeouts being too short. (The majority of changes in this PR are in this category.)
  2. Fixes issues caused by hanging tests resulting from no timeouts.
  3. Fixes Prometheus test issues (since these required more extensive changes to the tests.) There were multiple race conditions and concurrency issues in how we were getting data from Prometheus. (A more robust change would migrate our use of Prometheus to the new approach used in the Go Functions, but I don't think that change is necessary at this time.)
  4. Changes that introduced awaitility library. (Only used in a couple of places to avoid the use of Thread.sleep.)
  5. Fixes involving concurrency issues on topic names. (There were a few in this category.)

Among other things, this PR fixes all issues like this: java.lang.AssertionError: expected [val1-9] but found [val1-6]

@devinbost devinbost force-pushed the fix-flaky-tests-attempt-2 branch from 92c0acd to 301f4b8 Compare February 4, 2020 06:21
@devinbost devinbost changed the title [Issue 6200] [Issue 6198] (WIP) Fix flaky state tests [Issue 6200] [Issue 6198] (WIP) Fix flaky state and E2E tests Feb 4, 2020
@devinbost
Copy link
Contributor Author

@merlimat It looks like there might be some overlap between this PR and the one you were working on (#5328). It looks like that PR is stale. Would you like me to merge your changes into this one and pick up responsibility on it?

@devinbost
Copy link
Contributor Author

@merlimat Same with this one: #5333

@devinbost
Copy link
Contributor Author

It would be easier to fix more of these in a single PR because otherwise, each of the smaller PRs would need to be re-run many times just to get all of the flaky tests to pass.

devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 4, 2020
@devinbost
Copy link
Contributor Author

Looks like a couple of the commits from this PR (e292cba and 79afaee) fixed the Docker disk space availability issue.

devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
@devinbost devinbost force-pushed the fix-flaky-tests-attempt-2 branch from d941c5c to db9c80b Compare February 5, 2020 01:09
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
Increased timeouts for state tests. apache#6200 apache#6198

Increased timeouts to testSimpleConsumerEventsWithoutPartition and introduced await to poll on assertions to eliminate use of Thread.sleep in several places. (apache#6014)

Attempting to fix testPulsarKafkaProducerWithSerializer issue by adding await to test. (apache#6137)

Attempt to fix apache#6207 and add more debugging information by pruning docker containers.

Fixed typo in docker commands for getting debug info. apache#6207.

Removing timeouts as per comments in apache#5333. This is for apache#6202.

Fixed timeout issues for CPP tests. apache#6202 and apache#6137

Increased more timeouts. apache#6202 and apache#6137

Fixed typo in CPP test timeout fix. apache#6202  apache#4884

Edited comment to trigger build apache#6202
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
…y broke two of the tests that depend on timeout configurations. Those changes will require more investigation. apache#6202
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
Increased timeouts for state tests. apache#6200 apache#6198

Increased timeouts to testSimpleConsumerEventsWithoutPartition and introduced await to poll on assertions to eliminate use of Thread.sleep in several places. (apache#6014)

Attempting to fix testPulsarKafkaProducerWithSerializer issue by adding await to test. (apache#6137)

Attempt to fix apache#6207 and add more debugging information by pruning docker containers.

Fixed typo in docker commands for getting debug info. apache#6207.

Removing timeouts as per comments in apache#5333. This is for apache#6202.

Fixed timeout issues for CPP tests. apache#6202 and apache#6137

Increased more timeouts. apache#6202 and apache#6137

Fixed typo in CPP test timeout fix. apache#6202  apache#4884

Edited comment to trigger build apache#6202

Rolled back changes to PulsarSpoutTest because fixing some instability broke two of the tests that depend on timeout configurations. Those changes will require more investigation. apache#6202
@devinbost devinbost force-pushed the fix-flaky-tests-attempt-2 branch from 1b5b262 to 2be4cc9 Compare February 5, 2020 04:16
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 5, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 6, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 6, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 6, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 6, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 6, 2020
@devinbost
Copy link
Contributor Author

devinbost commented Feb 7, 2020

@tuteng I see that you beat me to fixing the Github YAML files that were causing the 2 GB build error. I had some of that code in this PR, but I hadn't gotten all of the tests to pass yet, so I wasn't able to merge. Your change was more robust anyway.

@devinbost devinbost force-pushed the fix-flaky-tests-attempt-2 branch from 6272ea5 to 2393d08 Compare February 7, 2020 00:06
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 7, 2020
Increased timeouts for state tests. apache#6200 apache#6198

Increased timeouts to testSimpleConsumerEventsWithoutPartition and introduced await to poll on assertions to eliminate use of Thread.sleep in several places. (apache#6014)

Attempting to fix testPulsarKafkaProducerWithSerializer issue by adding await to test. (apache#6137)

Attempt to fix apache#6207 and add more debugging information by pruning docker containers.

Fixed typo in docker commands for getting debug info. apache#6207.

Removing timeouts as per comments in apache#5333. This is for apache#6202.

Fixed timeout issues for CPP tests. apache#6202 and apache#6137

Increased more timeouts. apache#6202 and apache#6137

Fixed typo in CPP test timeout fix. apache#6202  apache#4884

Edited comment to trigger build apache#6202

Rolled back changes to PulsarSpoutTest because fixing some instability broke two of the tests that depend on timeout configurations. Those changes will require more investigation. apache#6202

Added timeouts back in places where required. Increased timeouts though. apache#6202

Fixed timeouts for Storm and Kafka tests. Also removed debug block that was accidentially included in ReaderTest. apache#6202

Editing comment to trigger new build. apache#6202

Attempt to workaround test failure. apache#6202

Adding some timeouts back to get beyond hanging tests. apache#6202

Increased sleep value as temporary workaround for thread timeout. apache#6202

Added back timeouts to fix hang but increased timeouts from 1s to 5s. apache#6202

Added back timeout (but made it longer) to prevent hanging test. apache#6202

Fixed formatting since it was breaking the build. apache#6202
@devinbost devinbost changed the title [Issue 6200] [Issue 6198] (WIP) Fix flaky state and E2E tests [Issue 6200] [Issue 6198] (WIP) Reduce flakiness of tests Feb 7, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 7, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 22, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 22, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 22, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 22, 2020
Copy link
Member

@sijie sijie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is changing receive to receive(x, TimeUnit.SECONDS). I actually think we should do the reverse - changing receive(x, TimeUnit.SECONDS) to receive()

@@ -1646,7 +1646,7 @@ public void persistentTopicsCursorResetAfterReset(String topicName) throws Excep
});

for (int i = 0; i < 10; i++) {
Message<byte[]> message = consumer.receive();
Message<byte[]> message = consumer.receive(5, TimeUnit.SECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it is a good idea to change to 5 seconds. Because it can cause flakiness if JVM pauses for 5 seconds.

Copy link
Contributor Author

@devinbost devinbost Feb 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sijie I'll double-check on this specific test, but on a lot of the tests, what was happening is that either the receive() call was hanging or timing out too soon (before all the messages were received.) The timeouts that were too short involved consumer.receive(1, TimeUnit.SECONDS); (if I remember correctly), and for one of them, consumer.receive(3, TimeUnit.SECONDS); was still too short. So, I increased it to consumer.receive(5, TimeUnit.SECONDS); which seemed to get the tests to pass consistently.

devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 24, 2020
@devinbost devinbost force-pushed the fix-flaky-tests-attempt-2 branch from ba1d77b to e768d50 Compare February 24, 2020 20:25
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 24, 2020
Increased timeouts for state tests. apache#6200 apache#6198

Increased timeouts to testSimpleConsumerEventsWithoutPartition and introduced await to poll on assertions to eliminate use of Thread.sleep in several places. (apache#6014)

Attempting to fix testPulsarKafkaProducerWithSerializer issue by adding await to test. (apache#6137)

Attempt to fix apache#6207 and add more debugging information by pruning docker containers.

Fixed typo in docker commands for getting debug info. apache#6207.

Removing timeouts as per comments in apache#5333. This is for apache#6202.

Fixed timeout issues for CPP tests. apache#6202 and apache#6137

Increased more timeouts. apache#6202 and apache#6137

Fixed typo in CPP test timeout fix. apache#6202  apache#4884

Edited comment to trigger build apache#6202

Rolled back changes to PulsarSpoutTest because fixing some instability broke two of the tests that depend on timeout configurations. Those changes will require more investigation. apache#6202

Added timeouts back in places where required. Increased timeouts though. apache#6202

Fixed timeouts for Storm and Kafka tests. Also removed debug block that was accidentially included in ReaderTest. apache#6202

Editing comment to trigger new build. apache#6202

Attempt to workaround test failure. apache#6202

Adding some timeouts back to get beyond hanging tests. apache#6202

Increased sleep value as temporary workaround for thread timeout. apache#6202

Added back timeouts to fix hang but increased timeouts from 1s to 5s. apache#6202

Added back timeout (but made it longer) to prevent hanging test. apache#6202

Fixed formatting since it was breaking the build. apache#6202

Increased more test timeouts to get them to pass on slow hardware. apache#6202

Increased more test timeouts to get them to pass on slow hardware. apache#6202

Edited more test timeouts to get them to pass on slow hardware. apache#6202

Triggering tests due to 'Could not transfer artifact' maven issue. apache#6202

Increased or edited timeouts to get more tests to pass. apache#6202

Triggering new build by changing comment. apache#6202

Fixed timeouts (to short timeouts) when null message is expected. apache#6202

Triggering new build by changing comment. apache#6202

Increased timeout. apache#6202

Increased sleep as temporary workaround. apache#6202

Tuned timeouts more. apache#6202

Widening time to force timeout in timeout test. apache#6202

Fixed spelling typo. apache#6202

Added randomization of namespace name. apache#6202

Added random name generator to names of producers, subscriptions, and topics in ClientDeduplicationTest to fix duplicate name conflicts. apache#6202

Fixed issues with duplicate namespaces with repeated test runs. apache#6202

Added randomization to topic name to prevent potential conflicts that might be causing non-determinism in test. apache#6202

Added randomization to namespace name to prevent issues with topics not clearing out before second run of tests. apache#6202

Attempt to get C++ test fixed. It's not clear if this commit will build though... apache#6202

Replaced snake_case with camelCase to try to get c++ format to pass the build. apache#6202

Adding random name to subscription to see if that resolves the fact that this test only fails on the second subsequent run. apache#6202

Fixed timeout issues. apache#6202

Attempting fix of testPerTopicStats() by addressing race condition. apache#6202

Adding some debugging to help troubleshoot flaky test. apache#6202

Removing code that wasn't building anyway. apache#6202

Changed how we're testing Prometheus by filtering the topic name to fix race conditions between test runs and sharing broker state. apache#6202

Added more debugging information and fixed assertion apache#6202

Trigger new build apache#6202

Added long timeouts to ensure that broker tests do timeout instead of hanging but without timing out too soon apache#6202

Fixed imports for TimeUnit apache#6202

Fixed imports for TimeUnit apache#6202

Pushing changes to allow discussion on what's happening. apache#6202

Fixed timeouts for the testSharedSingleAckedPartitionedTopic() test. apache#6202

Fixed issue with Prometheus test. apache#6202

Can't use receive with timeout, if the queue size is 0. Fixed InterceptorsTest. apache#6202

Can't use receive with timeout, if the queue size is 0. apache#6202

Fixed Can't use receive with timeout, if the queue size is 0. apache#6202

Edited comment to trigger re-run of all tests to find more flaky tests. apache#6202

Fixed more of the concurrency issue in testPerTopicStats that was causing namespace conflicts. apache#6202

Fixed something I missed during rebasing. apache#6202

Fixed issues with Prometheus tests. apache#6256

Changed MessageId.latest to MessageId.earliest to fix apache#6224

Fixes issue apache#6352

Triggering build to inspect test results. apache#6202

Added timeouts to fix hanging tests. apache#6202

Triggering new build. apache#6202

Updating Github workflow to build surefire artifacts if previous step was cancelled, not just failed. apache#6202

Changing CI Unit Action to always build surefire artifacts to help with debugging hanging test. apache#6202

Triggering new build with arbitrary edit. apache#6202

Triggering build with arbitrary change to comment apache#6202

Triggering new build with arbitrary code change. apache#6202

Triggering new build with arbitrary code change. apache#6202

Changing surefire trigger back to failure() apache#6202

Added surefire artifacts to run always again. apache#6202

Triggering new build. apache#6202

Added condition to make testPartitions() more robust during repeated runs apache#6202

Implementing Sijie's suggestion about timeout for persistentTopicsCursorResetAfterReset(..) test. apache#6202
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 24, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 24, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 25, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 25, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this pull request Feb 25, 2020
@devinbost devinbost requested a review from sijie March 13, 2020 17:33
@devinbost
Copy link
Contributor Author

devinbost commented Mar 13, 2020

@sijie Since it might be a while before I'm able to get back to this, we probably should get it reviewed and merged so at least some of the flaky tests are fixed. I'll go ahead and rebase it now.

Added awaitility to two pom files.

Increased timeouts for state tests. apache#6200 apache#6198

Increased timeouts to testSimpleConsumerEventsWithoutPartition and introduced await to poll on assertions to eliminate use of Thread.sleep in several places. (apache#6014)

Attempting to fix testPulsarKafkaProducerWithSerializer issue by adding await to test. (apache#6137)

Attempt to fix apache#6207 and add more debugging information by pruning docker containers.

Fixed typo in docker commands for getting debug info. apache#6207.

Removing timeouts as per comments in apache#5333. This is for apache#6202.

Fixed timeout issues for CPP tests. apache#6202 and apache#6137

Increased more timeouts. apache#6202 and apache#6137

Fixed typo in CPP test timeout fix. apache#6202  apache#4884

Edited comment to trigger build apache#6202

Rolled back changes to PulsarSpoutTest because fixing some instability broke two of the tests that depend on timeout configurations. Those changes will require more investigation. apache#6202

Added timeouts back in places where required. Increased timeouts though. apache#6202

Fixed timeouts for Storm and Kafka tests. Also removed debug block that was accidentially included in ReaderTest. apache#6202

Editing comment to trigger new build. apache#6202

Attempt to workaround test failure. apache#6202

Adding some timeouts back to get beyond hanging tests. apache#6202

Increased sleep value as temporary workaround for thread timeout. apache#6202

Added back timeouts to fix hang but increased timeouts from 1s to 5s. apache#6202

Added back timeout (but made it longer) to prevent hanging test. apache#6202

Fixed formatting since it was breaking the build. apache#6202

Increased more test timeouts to get them to pass on slow hardware. apache#6202

Increased more test timeouts to get them to pass on slow hardware. apache#6202

Edited more test timeouts to get them to pass on slow hardware. apache#6202

Triggering tests due to 'Could not transfer artifact' maven issue. apache#6202

Increased or edited timeouts to get more tests to pass. apache#6202

Triggering new build by changing comment. apache#6202

Fixed timeouts (to short timeouts) when null message is expected. apache#6202

Triggering new build by changing comment. apache#6202

Increased timeout. apache#6202

Increased sleep as temporary workaround. apache#6202

Tuned timeouts more. apache#6202

Widening time to force timeout in timeout test. apache#6202

Fixed spelling typo. apache#6202

Added randomization of namespace name. apache#6202

Added random name generator to names of producers, subscriptions, and topics in ClientDeduplicationTest to fix duplicate name conflicts. apache#6202

Fixed issues with duplicate namespaces with repeated test runs. apache#6202

Added randomization to topic name to prevent potential conflicts that might be causing non-determinism in test. apache#6202

Added randomization to namespace name to prevent issues with topics not clearing out before second run of tests. apache#6202

Attempt to get C++ test fixed. It's not clear if this commit will build though... apache#6202

Replaced snake_case with camelCase to try to get c++ format to pass the build. apache#6202

Adding random name to subscription to see if that resolves the fact that this test only fails on the second subsequent run. apache#6202

Fixed timeout issues. apache#6202

Attempting fix of testPerTopicStats() by addressing race condition. apache#6202

Adding some debugging to help troubleshoot flaky test. apache#6202

Removing code that wasn't building anyway. apache#6202

Changed how we're testing Prometheus by filtering the topic name to fix race conditions between test runs and sharing broker state. apache#6202

Added more debugging information and fixed assertion apache#6202

Trigger new build apache#6202

Added long timeouts to ensure that broker tests do timeout instead of hanging but without timing out too soon apache#6202

Fixed imports for TimeUnit apache#6202

Fixed imports for TimeUnit apache#6202

Pushing changes to allow discussion on what's happening. apache#6202

Fixed timeouts for the testSharedSingleAckedPartitionedTopic() test. apache#6202

Fixed issue with Prometheus test. apache#6202

Can't use receive with timeout, if the queue size is 0. Fixed InterceptorsTest. apache#6202

Can't use receive with timeout, if the queue size is 0. apache#6202

Fixed Can't use receive with timeout, if the queue size is 0. apache#6202

Edited comment to trigger re-run of all tests to find more flaky tests. apache#6202

Fixed more of the concurrency issue in testPerTopicStats that was causing namespace conflicts. apache#6202

Fixed something I missed during rebasing. apache#6202

Fixed issues with Prometheus tests. apache#6256

Changed MessageId.latest to MessageId.earliest to fix apache#6224

Fixes issue apache#6352

Triggering build to inspect test results. apache#6202

Added timeouts to fix hanging tests. apache#6202

Triggering new build. apache#6202

Updating Github workflow to build surefire artifacts if previous step was cancelled, not just failed. apache#6202

Changing CI Unit Action to always build surefire artifacts to help with debugging hanging test. apache#6202

Triggering new build with arbitrary edit. apache#6202

Triggering build with arbitrary change to comment apache#6202

Triggering new build with arbitrary code change. apache#6202

Triggering new build with arbitrary code change. apache#6202

Changing surefire trigger back to failure() apache#6202

Added surefire artifacts to run always again. apache#6202

Triggering new build. apache#6202

Added condition to make testPartitions() more robust during repeated runs apache#6202

Implementing Sijie's suggestion about timeout for persistentTopicsCursorResetAfterReset(..) test. apache#6202

Fixed file that I forgot to merge. apache#6202

Increased robustness of testPartitions() for repeated execution. apache#6202

Added more debugging to ParserProxyHandler's channelRead, changed test from private to public, and decreased test noise. apache#6332

Trying to get more debug info apache#6332

Added more debugging log statements to try to pinpoint where the failure happens. apache#6332

Added more debugging log statements to try to pinpoint where the failure happens. apache#6332

Added even more debugging for tracing purposes. apache#6332

Added even more debugging for tracing purposes. apache#6332

Rolling back unnecessary changes. apache#6202

Rolling back unnecessary changes. apache#6202

Fixed issue with testDeadLetterTopic() where redelivery was getting triggered. apache#6202

Adding more debug information and methods to test hypothesis. apache#6332

Adding keepAlive to ServerConnection to see what that does. apache#6332

Increasing ProxyServer keepAliveInterval to 90 seconds in case it is timing out during server tests. apache#6332

Rolling back changes. apache#6332
@devinbost devinbost force-pushed the fix-flaky-tests-attempt-2 branch from 40a23c7 to 81e23eb Compare March 13, 2020 18:47
@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

2 similar comments
@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

@devinbost
Copy link
Contributor Author

/pulsarbot run-failure-checks

@devinbost
Copy link
Contributor Author

@sijie On second thought, it looks like the remaining test failures are rather stubborn, so this PR might just need to stay on hold until I'm able to resume working on it.

@eolivelli
Copy link
Contributor

We have lots of CI failure,
this work looks very valuable

is it worth to rebase to current master ?

@github-actions
Copy link

github-actions bot commented Mar 4, 2022

@devinbost:Thanks for your contribution. For this PR, do we need to update docs?
(The PR template contains info about doc, which helps others know more about the changes. Can you provide doc-related info in this and future PR descriptions? Thanks)

@tisonkun
Copy link
Member

Closed as stale. There's too many conflict to continue :/

@tisonkun tisonkun closed this Nov 14, 2022
@tisonkun
Copy link
Member

The related works, if still valuable, can be rebased on the master and submitted one by one so that we can merge it quickly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment