Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RoundRobinLoadBalancer: re-subscribe when all hosts become unhealthy #2514

Merged
merged 4 commits into from
Feb 24, 2023

Conversation

idelpivnitskiy
Copy link
Member

@idelpivnitskiy idelpivnitskiy commented Feb 21, 2023

Motivation:

It’s possible for the RRLB to turn into a state when all available
hosts become unhealthy. Possible scenarios:

  • DNS returned incorrect results;
  • All hosts were restarted and got different addresses.
    RRL will be in a dead state until TTL expires, which may take some time.

Modifications:

  • When RRLB detects that all hosts are unhealthy, it re-subscribes to the
    SD events publisher to trigger a new resolution;
  • Add RoundRobinLoadBalancerFactory.Builder#healthCheckResubscribeInterval
    option to configure the new feature;
  • Wrap backgroundExecutor with NormalizedTimeSourceExecutor to make
    sure currentTime is always positive;
  • Test behavior when the new feature is enabled/disabled;
  • Add DurationUtils.ensureNonNegative(...) utility for validation;
  • Make TestExecutor(long) constructor public to test
    NormalizedTimeSourceExecutor;
  • Add SequentialPublisherSubscriberFunction.numberOfSubscribers() to
    verify how many subscribers the TestPublisher already saw;

Result:

RRLB forces SD to re-resolve addresses by cancelling the current subscription
and re-subscribing to the publisher when it detects that all hosts become
unhealthy.


This PR depends on #2513 and #2516, review only the last commit.

TODO:

@idelpivnitskiy idelpivnitskiy self-assigned this Feb 21, 2023
@idelpivnitskiy idelpivnitskiy force-pushed the rrlb-resubscribe-unhealty branch 3 times, most recently from b871499 to 755096c Compare February 21, 2023 23:12
@idelpivnitskiy idelpivnitskiy marked this pull request as ready for review February 21, 2023 23:14
idelpivnitskiy added a commit to idelpivnitskiy/servicetalk that referenced this pull request Feb 22, 2023
Motivation:

When we poll DNS in the background the cache is not important because
we always schedule the next query after cache expires. It only helps
with concurrent resolutions of the same address, which in practice
should be minimal if users properly reuse clients. Disadvantage of
having a cache is that if it gets poisoned with invalid or stale entries
there is no way to clear the cache. Cancelling the events publisher and
re-subscribing doesn't help because re-subscribe always hits the cache.
See apple#2514.

Modifications:

- Disable cache by default;
- Provide API to opt-in for caching (can be useful if users resolve per
new connection instead of polling in the background);
- Deprecate `minTTL`, remove `maxTTL` builder methods, introduce
`ttl(min, max, cache)` instead;
- Invoke `ttlCache.prepareForResolution(name)` only for scheduled queries,
keep it unchanged when cancel/re-subscribe to correctly offset ttl;
- Ignore empty list inside `MinTtlCache.get(...)`;
- Make `DefaultDnsClient` logging more consistent;
- Enhance testing;

Result:

Caching is disabled by default. Polling is driven by TTL. In case of
re-subscribe, we always send a new query. Users have API to configure
min/max polling intervals and caching.
@idelpivnitskiy idelpivnitskiy force-pushed the rrlb-resubscribe-unhealty branch from 755096c to f7fc25f Compare February 22, 2023 07:44
@@ -41,6 +43,7 @@ public Subscriber<? super T> apply(final Subscriber<? super T> subscriber) {
throw new IllegalStateException("Duplicate subscriber: " + subscriber);
}
this.subscriber = subscriber;
numberOfSubscribers.incrementAndGet();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only increment, do we ever need to decrement this counter too (i.e. on unsubscribe?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a sequential mode there is only one subscriber allowed at a time and this can be verified by isSubscribed() method. The goal of this feature is to count how many subscribers it saw during the lifetime of the publisher. Maybe it needs a better name. I will try numberOfSubscribersSeen(). LMK if you have other ideas.


toSource(eventPublisher).subscribe(
new Subscriber<Collection<? extends ServiceDiscovererEvent<ResolvedAddress>>>() {
private void subscribeToEvents(boolean resubscribe) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a possibility that this method is called multiple times concurrently (and if so, do you think it has racy conditions around cancelation?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nextResubscribeTime is used to make sure there is only one thread at a time that performs re-subscribe (unless I messed up something around lines 540-543). I will add assert nextResubscribeTime == RESUBSCRIBING at the beginning of this method.

Motivation:

It’s possible for the RRLB to turn into a state when all available
hosts become unhealthy. Possible scenarios:
- DNS returned incorrect result;
- All hosts were restarted and got different addresses.
RRL will be in a dead state until TTL expires, which may take some time.

Modifications:
- When RRLB detects that all hosts are unhealthy, it re-subscribes to the
SD events publisher to trigger a new resolution;
- Add `RoundRobinLoadBalancerFactory.Builder#healthCheckResubscribeInterval`
option to configure new feature;
- Wrap `backgroundExecutor` with `NormalizedTimeSourceExecutor` to make
sure `currentTime` is always positive;
- Test behavior when the new features is enabled/disabled;
- Add `DurationUtils.ensureNonNegative(...)` utility for validation;
- Make `TestExecutor(long)` constructor public to test
`NormalizedTimeSourceExecutor`;
- Add `SequentialPublisherSubscriberFunction.numberOfSubscribers()` to
verify how many subscribers the `TestPublisher` already saw;

Result:

RRLB forces SD to re-resolve addresses by cancelling the current subscription
and re-subscribing to the publisher when it detects that all hosts become
unhealthy.
@idelpivnitskiy idelpivnitskiy force-pushed the rrlb-resubscribe-unhealty branch from f7fc25f to 12da712 Compare February 22, 2023 17:40
idelpivnitskiy added a commit that referenced this pull request Feb 24, 2023
Motivation:

When we poll DNS in the background the cache is not important because we always schedule the next query after cache expires. It only helps with concurrent resolutions of the same address, which in practice should be minimal if users properly reuse clients. The disadvantage of having a cache is that if it gets poisoned with invalid or stale entries there is no way to clear the cache. Cancelling the events publisher and re-subscribing doesn't help because re-subscribe always hits the cache. See #2514.

Modifications:

- Disable cache by default;
- Provide API to opt-in for caching (can be useful if users resolve per new connection instead of polling in the background);
- Deprecate `minTTL`, remove `maxTTL` builder methods, introduce `ttl(min, max)` and `ttl(min, max, minCache, maxCache)` instead;
- Invoke `ttlCache.prepareForResolution(name)` only for scheduled queries, keep it unchanged when cancel/re-subscribe to correctly offset ttl;
- Ignore empty list inside `MinTtlCache.get(...)`;
- Make `DefaultDnsClient` logging more consistent;
- Include `id` in the returned `ServiceDiscoverer.toString()` from the builder;
- Enhance testing;

Result:

Caching is disabled by default. Polling is driven by TTL. In case of re-subscribe, we always send a new query. Users have API to configure min/max polling intervals and caching.
@idelpivnitskiy idelpivnitskiy merged commit b4543f4 into apple:main Feb 24, 2023
@idelpivnitskiy idelpivnitskiy deleted the rrlb-resubscribe-unhealty branch February 24, 2023 18:11
idelpivnitskiy added a commit to idelpivnitskiy/servicetalk that referenced this pull request Jul 12, 2024
Motivation:

apple#2514 added an ability for `LoadBalancer` to cancel discovery stream and
re-subscribe in attempt to trigger a new resolution when it's internal
state became `UNHEALTHY`. However, it did not account for a possibility
of new events racing with cancellation or a Publisher not stopping
updates immediately. As a result, it's possible to end up in a corrupted
state if old `EventSubscriber` receives new events after a new
`EventSubscriber` received a new "state of the world".

Modifications:

- Track the current `EventSubscriber` and process events only if it's
current.
- Enhance `resubscribeToEventsWhenAllHostsAreUnhealthy()` to simulate
described behavior.

Result:

New "state of the world" can not be corrupted by events emitted for the
old `EventSubscriber`.
idelpivnitskiy added a commit to idelpivnitskiy/servicetalk that referenced this pull request Jul 15, 2024
Motivation:

apple#2514 added an ability for `LoadBalancer` to cancel discovery stream and
re-subscribe in attempt to trigger a new resolution when it's internal
state became `UNHEALTHY`. However, it did not account for a possibility
of new events racing with cancellation or a Publisher not stopping
updates immediately. As a result, it's possible to end up in a corrupted
state if old `EventSubscriber` receives new events after a new
`EventSubscriber` received a new "state of the world".

Modifications:

- Track the current `EventSubscriber` and process events only if it's
current.
- Enhance `resubscribeToEventsWhenAllHostsAreUnhealthy()` to simulate
described behavior.

Result:

New "state of the world" can not be corrupted by events emitted for the
old `EventSubscriber`.
idelpivnitskiy added a commit that referenced this pull request Jul 15, 2024
…#3005)

Motivation:

#2514 added an ability for `LoadBalancer` to cancel discovery stream and
re-subscribe in attempt to trigger a new resolution when its internal
state became `UNHEALTHY`. However, it did not account for a possibility
of new events racing with cancellation or a Publisher not stopping
updates immediately. As a result, it's possible to end up in a corrupted
state if the old `EventSubscriber` receives new events after a new
`EventSubscriber` received a new "state of the world".

Modifications:

- Track the current `EventSubscriber` and process events only if it's
current.
- Enhance `resubscribeToEventsWhenAllHostsAreUnhealthy()` to simulate
the described behavior.

Result:

New "state of the world" can not be corrupted by events emitted for the
old `EventSubscriber`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants