Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Watcher test framework resiliency #40658

Merged
merged 18 commits into from
Apr 9, 2019

Conversation

gwbrown
Copy link
Contributor

@gwbrown gwbrown commented Mar 29, 2019

It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

  1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
  2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
  3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
  4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.

This should fix:
#35503
#35506
#40587
#40631
#40682 (this one is only muted on 6.7 so I'll have to test + unmute on the backport)


I've run the entire Watcher test suite with these changes ~1000 times and have only seen two failures, which were unrelated (e.g. SocketTimeoutException) , which is much more reliable than before these changes.

It is possible for the watches tracked by `ScheduleTriggerEngineMock` to
get out of sync with the Watches in the `ScheduleTriggerEngine`
production code, which can lead to watches failing to run.

This commit adds some additional checks and locking to the test code
(production code is unaffected) to eliminate these problems.
@gwbrown gwbrown added >test Issues or PRs that are addressing/adding tests :Data Management/Watcher labels Mar 29, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features

@gwbrown gwbrown marked this pull request as ready for review April 2, 2019 16:13
@gwbrown
Copy link
Contributor Author

gwbrown commented Apr 2, 2019

@elasticmachine run elasticsearch-ci/2

Just running the tests a few times to make sure this is stable on CI.

@gwbrown gwbrown requested a review from hub-cap April 2, 2019 20:49
@gwbrown gwbrown added the v7.0.0 label Apr 2, 2019
@gwbrown gwbrown requested a review from jakelandis April 2, 2019 20:53
@gwbrown
Copy link
Contributor Author

gwbrown commented Apr 2, 2019

I'm going to run these tests on CI several more times just to be sure, but I think this is ready for review and it's been very stable locally.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are good improvements to the mock engine.

I think we should backport this change slowly, so that if unexpected failures occur in CI, it doesn't fail on all branches this change is targeted for.

@martijnvg
Copy link
Member

@gwbrown I think this PR should also fix: #40631 Maybe unmute this test too in this pr?

@gwbrown
Copy link
Contributor Author

gwbrown commented Apr 3, 2019

Thanks for the review @martijnvg, and that's also a good idea to wait on the backports. I'll add the test you suggest and run the tests a couple more times before merging, then let it sit in master for a few days before backporting and check build-stats to make sure it's all good before backporting.

@@ -31,7 +33,8 @@
public class ScheduleTriggerEngineMock extends ScheduleTriggerEngine {
private static final Logger logger = LogManager.getLogger(ScheduleTriggerEngineMock.class);

private final ConcurrentMap<String, Watch> watches = new ConcurrentHashMap<>();
private final AtomicReference<Map<String, Watch>> watches = new AtomicReference<>(new ConcurrentHashMap<>());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why using an AtomicReference here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvmd.. I see now that you are swapping out the instances below.

New question, why atomic swaps vs. shared lock (for this test code) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An earlier version of this used a lock and it made the tests very slow (this version takes about ~2-3m to run the whole suite, I killed the version with locks at ~10m), I'm not entirely sure why.

@@ -50,37 +53,42 @@ public ScheduleTriggerEvent parseTriggerEvent(TriggerService service, String wat

@Override
public void start(Collection<Watch> jobs) {
Copy link
Contributor

@jakelandis jakelandis Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method by itself is not thread safe. I would advise to synchronize this method, or use a shared lock between this start/stop/add/remove, which negates the need for the atomic reference, or throw an exception if attempted to start/stop twice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll take another look at this. There's a few cases where this could potentially have issues now that I look at it again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is invoked from TriggerService#start(...) and that method is synchronized, so I don't think we need to synchronize this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added synchronized to the methods which modify watches to ensure this is thread safe. How does this look @jakelandis?

@gwbrown
Copy link
Contributor Author

gwbrown commented Apr 8, 2019

I've run the most recent revision (with a very minor tweak over the last reviewed version) over the weekend and it hasn't failed once in ~4000 runs, so I'm pretty confident in this version and intend to merge this after a couple successful CI runs.

@gwbrown
Copy link
Contributor Author

gwbrown commented Apr 8, 2019

@elasticmachine run elasticsearch-ci/2

Run was successful, just retriggering to get more confidence

@gwbrown
Copy link
Contributor Author

gwbrown commented Apr 9, 2019

As suggested above, I'm going to merge this to master and wait a few days before backporting to make sure this is stable.

@gwbrown gwbrown merged commit ff61bad into elastic:master Apr 9, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 9, 2019
…forced-unsafe-publication

* elastic/master:
  Improve Watcher test framework resiliency (elastic#40658)
  Fix order of request body search parameter names in documentation (elastic#40777)
  Node repurpose tool docs (elastic#40525)
  [Docs] Delete explanation for completion suggester default analyzer choice (elastic#36720)
  Revert "Revert "Change HLRC CCR response tests to use AbstractResponseTestCase base class. (elastic#40257)"" (elastic#40971)
  Short-circuit rebalancing when disabled (elastic#40966)
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Apr 9, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Apr 9, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
gwbrown added a commit to gwbrown/elasticsearch that referenced this pull request Apr 9, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
gwbrown added a commit that referenced this pull request Apr 12, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
gwbrown added a commit that referenced this pull request Apr 12, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
gwbrown added a commit that referenced this pull request Apr 12, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
@gwbrown
Copy link
Contributor Author

gwbrown commented Apr 12, 2019

These changes have been in master since the beginning of the week with no failures in the unmuted tests, so I've merged the backports as well.

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants