`RoundRobinLoadBalancer`: re-subscribe when all hosts become unhealthy #2514

idelpivnitskiy · 2023-02-21T07:47:56Z

Motivation:

It’s possible for the RRLB to turn into a state when all available
hosts become unhealthy. Possible scenarios:

DNS returned incorrect results;
All hosts were restarted and got different addresses.
RRL will be in a dead state until TTL expires, which may take some time.

Modifications:

When RRLB detects that all hosts are unhealthy, it re-subscribes to the
SD events publisher to trigger a new resolution;
Add RoundRobinLoadBalancerFactory.Builder#healthCheckResubscribeInterval
option to configure the new feature;
Wrap backgroundExecutor with NormalizedTimeSourceExecutor to make
sure currentTime is always positive;
Test behavior when the new feature is enabled/disabled;
Add DurationUtils.ensureNonNegative(...) utility for validation;
Make TestExecutor(long) constructor public to test
NormalizedTimeSourceExecutor;
Add SequentialPublisherSubscriberFunction.numberOfSubscribers() to
verify how many subscribers the TestPublisher already saw;

Result:

RRLB forces SD to re-resolve addresses by cancelling the current subscription
and re-subscribing to the publisher when it detects that all hosts become
unhealthy.

~~This PR depends on #2513 and #2516, review only the last commit.~~

TODO:

Add tests for RRLB;
Disable cache by default for DNS ServiceDiscoverer #2518;

Motivation: When we poll DNS in the background the cache is not important because we always schedule the next query after cache expires. It only helps with concurrent resolutions of the same address, which in practice should be minimal if users properly reuse clients. Disadvantage of having a cache is that if it gets poisoned with invalid or stale entries there is no way to clear the cache. Cancelling the events publisher and re-subscribing doesn't help because re-subscribe always hits the cache. See apple#2514. Modifications: - Disable cache by default; - Provide API to opt-in for caching (can be useful if users resolve per new connection instead of polling in the background); - Deprecate `minTTL`, remove `maxTTL` builder methods, introduce `ttl(min, max, cache)` instead; - Invoke `ttlCache.prepareForResolution(name)` only for scheduled queries, keep it unchanged when cancel/re-subscribe to correctly offset ttl; - Ignore empty list inside `MinTtlCache.get(...)`; - Make `DefaultDnsClient` logging more consistent; - Enhance testing; Result: Caching is disabled by default. Polling is driven by TTL. In case of re-subscribe, we always send a new query. Users have API to configure min/max polling intervals and caching.

daschl · 2023-02-22T14:45:39Z

...c/testFixtures/java/io/servicetalk/concurrent/api/SequentialPublisherSubscriberFunction.java

@@ -41,6 +43,7 @@ public Subscriber<? super T> apply(final Subscriber<? super T> subscriber) {
            throw new IllegalStateException("Duplicate subscriber: " + subscriber);
        }
        this.subscriber = subscriber;
+        numberOfSubscribers.incrementAndGet();


we only increment, do we ever need to decrement this counter too (i.e. on unsubscribe?)

In a sequential mode there is only one subscriber allowed at a time and this can be verified by isSubscribed() method. The goal of this feature is to count how many subscribers it saw during the lifetime of the publisher. Maybe it needs a better name. I will try numberOfSubscribersSeen(). LMK if you have other ideas.

daschl · 2023-02-22T14:50:28Z

servicetalk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancer.java


-        toSource(eventPublisher).subscribe(
-                new Subscriber<Collection<? extends ServiceDiscovererEvent<ResolvedAddress>>>() {
+    private void subscribeToEvents(boolean resubscribe) {


is there a possibility that this method is called multiple times concurrently (and if so, do you think it has racy conditions around cancelation?)

nextResubscribeTime is used to make sure there is only one thread at a time that performs re-subscribe (unless I messed up something around lines 540-543). I will add assert nextResubscribeTime == RESUBSCRIBING at the beginning of this method.

...lk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerFactory.java

Motivation: It’s possible for the RRLB to turn into a state when all available hosts become unhealthy. Possible scenarios: - DNS returned incorrect result; - All hosts were restarted and got different addresses. RRL will be in a dead state until TTL expires, which may take some time. Modifications: - When RRLB detects that all hosts are unhealthy, it re-subscribes to the SD events publisher to trigger a new resolution; - Add `RoundRobinLoadBalancerFactory.Builder#healthCheckResubscribeInterval` option to configure new feature; - Wrap `backgroundExecutor` with `NormalizedTimeSourceExecutor` to make sure `currentTime` is always positive; - Test behavior when the new features is enabled/disabled; - Add `DurationUtils.ensureNonNegative(...)` utility for validation; - Make `TestExecutor(long)` constructor public to test `NormalizedTimeSourceExecutor`; - Add `SequentialPublisherSubscriberFunction.numberOfSubscribers()` to verify how many subscribers the `TestPublisher` already saw; Result: RRLB forces SD to re-resolve addresses by cancelling the current subscription and re-subscribing to the publisher when it detects that all hosts become unhealthy.

servicetalk-utils-internal/src/main/java/io/servicetalk/utils/internal/DurationUtils.java

...loadbalancer/src/test/java/io/servicetalk/loadbalancer/NormalizedTimeSourceExecutorTest.java

…healty

Motivation: When we poll DNS in the background the cache is not important because we always schedule the next query after cache expires. It only helps with concurrent resolutions of the same address, which in practice should be minimal if users properly reuse clients. The disadvantage of having a cache is that if it gets poisoned with invalid or stale entries there is no way to clear the cache. Cancelling the events publisher and re-subscribing doesn't help because re-subscribe always hits the cache. See #2514. Modifications: - Disable cache by default; - Provide API to opt-in for caching (can be useful if users resolve per new connection instead of polling in the background); - Deprecate `minTTL`, remove `maxTTL` builder methods, introduce `ttl(min, max)` and `ttl(min, max, minCache, maxCache)` instead; - Invoke `ttlCache.prepareForResolution(name)` only for scheduled queries, keep it unchanged when cancel/re-subscribe to correctly offset ttl; - Ignore empty list inside `MinTtlCache.get(...)`; - Make `DefaultDnsClient` logging more consistent; - Include `id` in the returned `ServiceDiscoverer.toString()` from the builder; - Enhance testing; Result: Caching is disabled by default. Polling is driven by TTL. In case of re-subscribe, we always send a new query. Users have API to configure min/max polling intervals and caching.

Motivation: apple#2514 added an ability for `LoadBalancer` to cancel discovery stream and re-subscribe in attempt to trigger a new resolution when it's internal state became `UNHEALTHY`. However, it did not account for a possibility of new events racing with cancellation or a Publisher not stopping updates immediately. As a result, it's possible to end up in a corrupted state if old `EventSubscriber` receives new events after a new `EventSubscriber` received a new "state of the world". Modifications: - Track the current `EventSubscriber` and process events only if it's current. - Enhance `resubscribeToEventsWhenAllHostsAreUnhealthy()` to simulate described behavior. Result: New "state of the world" can not be corrupted by events emitted for the old `EventSubscriber`.

…#3005) Motivation: #2514 added an ability for `LoadBalancer` to cancel discovery stream and re-subscribe in attempt to trigger a new resolution when its internal state became `UNHEALTHY`. However, it did not account for a possibility of new events racing with cancellation or a Publisher not stopping updates immediately. As a result, it's possible to end up in a corrupted state if the old `EventSubscriber` receives new events after a new `EventSubscriber` received a new "state of the world". Modifications: - Track the current `EventSubscriber` and process events only if it's current. - Enhance `resubscribeToEventsWhenAllHostsAreUnhealthy()` to simulate the described behavior. Result: New "state of the world" can not be corrupted by events emitted for the old `EventSubscriber`.

idelpivnitskiy requested review from daschl and Scottmitch February 21, 2023 07:47

idelpivnitskiy self-assigned this Feb 21, 2023

idelpivnitskiy force-pushed the rrlb-resubscribe-unhealty branch 3 times, most recently from b871499 to 755096c Compare February 21, 2023 23:12

idelpivnitskiy marked this pull request as ready for review February 21, 2023 23:14

idelpivnitskiy mentioned this pull request Feb 22, 2023

Disable cache by default for DNS ServiceDiscoverer #2518

Merged

idelpivnitskiy force-pushed the rrlb-resubscribe-unhealty branch from 755096c to f7fc25f Compare February 22, 2023 07:44

daschl reviewed Feb 22, 2023

View reviewed changes

idelpivnitskiy added 2 commits February 22, 2023 09:27

comments

12da712

idelpivnitskiy force-pushed the rrlb-resubscribe-unhealty branch from f7fc25f to 12da712 Compare February 22, 2023 17:40

idelpivnitskiy requested a review from daschl February 22, 2023 17:41

daschl reviewed Feb 24, 2023

View reviewed changes

servicetalk-utils-internal/src/main/java/io/servicetalk/utils/internal/DurationUtils.java Show resolved Hide resolved

daschl reviewed Feb 24, 2023

View reviewed changes

...loadbalancer/src/test/java/io/servicetalk/loadbalancer/NormalizedTimeSourceExecutorTest.java Show resolved Hide resolved

daschl approved these changes Feb 24, 2023

View reviewed changes

idelpivnitskiy added 2 commits February 24, 2023 09:22

comments

4c09a02

Merge remote-tracking branch 'upstream/main' into rrlb-resubscribe-un…

37511b7

…healty

idelpivnitskiy merged commit b4543f4 into apple:main Feb 24, 2023

idelpivnitskiy deleted the rrlb-resubscribe-unhealty branch February 24, 2023 18:11

idelpivnitskiy mentioned this pull request Jul 12, 2024

Avoid race b/w ServiceDiscoverer events after cancel and re-subscribe #3005

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`RoundRobinLoadBalancer`: re-subscribe when all hosts become unhealthy #2514

`RoundRobinLoadBalancer`: re-subscribe when all hosts become unhealthy #2514

idelpivnitskiy commented Feb 21, 2023 •

edited

Loading

daschl Feb 22, 2023

idelpivnitskiy Feb 22, 2023

daschl Feb 22, 2023

idelpivnitskiy Feb 22, 2023

RoundRobinLoadBalancer: re-subscribe when all hosts become unhealthy #2514

RoundRobinLoadBalancer: re-subscribe when all hosts become unhealthy #2514

Conversation

idelpivnitskiy commented Feb 21, 2023 • edited Loading

daschl Feb 22, 2023

Choose a reason for hiding this comment

idelpivnitskiy Feb 22, 2023

Choose a reason for hiding this comment

daschl Feb 22, 2023

Choose a reason for hiding this comment

idelpivnitskiy Feb 22, 2023

Choose a reason for hiding this comment

`RoundRobinLoadBalancer`: re-subscribe when all hosts become unhealthy #2514

`RoundRobinLoadBalancer`: re-subscribe when all hosts become unhealthy #2514

idelpivnitskiy commented Feb 21, 2023 •

edited

Loading