-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Watcher test framework resiliency #40658
Conversation
It is possible for the watches tracked by `ScheduleTriggerEngineMock` to get out of sync with the Watches in the `ScheduleTriggerEngine` production code, which can lead to watches failing to run. This commit adds some additional checks and locking to the test code (production code is unaffected) to eliminate these problems.
Pinging @elastic/es-core-features |
@elasticmachine run elasticsearch-ci/2 Just running the tests a few times to make sure this is stable on CI. |
I'm going to run these tests on CI several more times just to be sure, but I think this is ready for review and it's been very stable locally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are good improvements to the mock engine.
I think we should backport this change slowly, so that if unexpected failures occur in CI, it doesn't fail on all branches this change is targeted for.
Thanks for the review @martijnvg, and that's also a good idea to wait on the backports. I'll add the test you suggest and run the tests a couple more times before merging, then let it sit in master for a few days before backporting and check build-stats to make sure it's all good before backporting. |
...r/src/test/java/org/elasticsearch/xpack/watcher/test/AbstractWatcherIntegrationTestCase.java
Outdated
Show resolved
Hide resolved
@@ -31,7 +33,8 @@ | |||
public class ScheduleTriggerEngineMock extends ScheduleTriggerEngine { | |||
private static final Logger logger = LogManager.getLogger(ScheduleTriggerEngineMock.class); | |||
|
|||
private final ConcurrentMap<String, Watch> watches = new ConcurrentHashMap<>(); | |||
private final AtomicReference<Map<String, Watch>> watches = new AtomicReference<>(new ConcurrentHashMap<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why using an AtomicReference here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nvmd.. I see now that you are swapping out the instances below.
New question, why atomic swaps vs. shared lock (for this test code) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An earlier version of this used a lock and it made the tests very slow (this version takes about ~2-3m to run the whole suite, I killed the version with locks at ~10m), I'm not entirely sure why.
@@ -50,37 +53,42 @@ public ScheduleTriggerEvent parseTriggerEvent(TriggerService service, String wat | |||
|
|||
@Override | |||
public void start(Collection<Watch> jobs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method by itself is not thread safe. I would advise to synchronize this method, or use a shared lock between this start/stop/add/remove, which negates the need for the atomic reference, or throw an exception if attempted to start/stop twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I'll take another look at this. There's a few cases where this could potentially have issues now that I look at it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is invoked from TriggerService#start(...)
and that method is synchronized
, so I don't think we need to synchronize
this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added synchronized
to the methods which modify watches
to ensure this is thread safe. How does this look @jakelandis?
This was not the cause of the new failures This reverts commit d5dde77
I've run the most recent revision (with a very minor tweak over the last reviewed version) over the weekend and it hasn't failed once in ~4000 runs, so I'm pretty confident in this version and intend to merge this after a couple successful CI runs. |
@elasticmachine run elasticsearch-ci/2 Run was successful, just retriggering to get more confidence |
As suggested above, I'm going to merge this to master and wait a few days before backporting to make sure this is stable. |
…forced-unsafe-publication * elastic/master: Improve Watcher test framework resiliency (elastic#40658) Fix order of request body search parameter names in documentation (elastic#40777) Node repurpose tool docs (elastic#40525) [Docs] Delete explanation for completion suggester default analyzer choice (elastic#36720) Revert "Revert "Change HLRC CCR response tests to use AbstractResponseTestCase base class. (elastic#40257)"" (elastic#40971) Short-circuit rebalancing when disabled (elastic#40966)
It is possible for the watches tracked by ScheduleTriggerEngineMock to get out of sync with the Watches in the ScheduleTriggerEngine production code, which can lead to watches failing to run. This commit: 1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code. 2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change. 3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification. 4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
It is possible for the watches tracked by ScheduleTriggerEngineMock to get out of sync with the Watches in the ScheduleTriggerEngine production code, which can lead to watches failing to run. This commit: 1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code. 2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change. 3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification. 4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
It is possible for the watches tracked by ScheduleTriggerEngineMock to get out of sync with the Watches in the ScheduleTriggerEngine production code, which can lead to watches failing to run. This commit: 1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code. 2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change. 3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification. 4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
It is possible for the watches tracked by ScheduleTriggerEngineMock to get out of sync with the Watches in the ScheduleTriggerEngine production code, which can lead to watches failing to run. This commit: 1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code. 2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change. 3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification. 4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
It is possible for the watches tracked by ScheduleTriggerEngineMock to get out of sync with the Watches in the ScheduleTriggerEngine production code, which can lead to watches failing to run. This commit: 1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code. 2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change. 3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification. 4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
It is possible for the watches tracked by ScheduleTriggerEngineMock to get out of sync with the Watches in the ScheduleTriggerEngine production code, which can lead to watches failing to run. This commit: 1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code. 2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change. 3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification. 4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
These changes have been in master since the beginning of the week with no failures in the unmuted tests, so I've merged the backports as well. |
It is possible for the watches tracked by ScheduleTriggerEngineMock to get out of sync with the Watches in the ScheduleTriggerEngine production code, which can lead to watches failing to run. This commit: 1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code. 2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change. 3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification. 4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
It is possible for the watches tracked by
ScheduleTriggerEngineMock
toget out of sync with the Watches in the
ScheduleTriggerEngine
production code, which can lead to watches failing to run.
This commit:
TimeWarp
to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.ScheduleTriggerEngineMock
respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.ScheduleTriggerEngineMock
to prevent race conditions due to concurrent modification.WatcherConcreteIndexTests
to useTimeWarp
instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.This should fix:
#35503
#35506
#40587
#40631
#40682 (this one is only muted on 6.7 so I'll have to test + unmute on the backport)
I've run the entire Watcher test suite with these changes ~1000 times and have only seen two failures, which were unrelated (e.g.
SocketTimeoutException
) , which is much more reliable than before these changes.