RoundRobinLoadBalancer: re-subscribe when all hosts become unhealthy #2514
Conversation
Force-pushed from b871499 to 755096c
Motivation:

When we poll DNS in the background, the cache is not important because we always schedule the next query after the cache expires. It only helps with concurrent resolutions of the same address, which in practice should be minimal if users properly reuse clients. The disadvantage of having a cache is that if it gets poisoned with invalid or stale entries, there is no way to clear it. Cancelling the events publisher and re-subscribing doesn't help because a re-subscribe always hits the cache. See apple#2514.

Modifications:

- Disable the cache by default;
- Provide an API to opt in to caching (can be useful if users resolve per new connection instead of polling in the background);
- Deprecate `minTTL` and remove `maxTTL` builder methods, introduce `ttl(min, max, cache)` instead;
- Invoke `ttlCache.prepareForResolution(name)` only for scheduled queries, keep it unchanged on cancel/re-subscribe to correctly offset TTL;
- Ignore an empty list inside `MinTtlCache.get(...)`;
- Make `DefaultDnsClient` logging more consistent;
- Enhance testing.

Result:

Caching is disabled by default. Polling is driven by TTL. In case of re-subscribe, we always send a new query. Users have an API to configure min/max polling intervals and caching.
Force-pushed from 755096c to f7fc25f
@@ -41,6 +43,7 @@ public Subscriber<? super T> apply(final Subscriber<? super T> subscriber) {
             throw new IllegalStateException("Duplicate subscriber: " + subscriber);
         }
         this.subscriber = subscriber;
+        numberOfSubscribers.incrementAndGet();
we only increment, do we ever need to decrement this counter too (i.e. on unsubscribe?)
In sequential mode only one subscriber is allowed at a time, and this can be verified by the `isSubscribed()` method. The goal of this feature is to count how many subscribers it saw during the lifetime of the publisher. Maybe it needs a better name; I will try `numberOfSubscribersSeen()`. LMK if you have other ideas.
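To make the distinction concrete, here is a minimal, illustrative sketch (not ServiceTalk's actual code; all names are assumptions) of a sequential subscriber holder that enforces one active subscriber at a time while counting every subscriber seen over its lifetime:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only: enforces one subscriber at a time, but counts
// every subscriber seen over the lifetime (the counter is never decremented).
final class SequentialSubscribers<T> {
    private final AtomicReference<T> current = new AtomicReference<>();
    private final AtomicInteger subscribersSeen = new AtomicInteger();

    /** Registers a new subscriber; rejects a concurrent duplicate. */
    void subscribe(T subscriber) {
        if (!current.compareAndSet(null, subscriber)) {
            throw new IllegalStateException("Duplicate subscriber: " + subscriber);
        }
        subscribersSeen.incrementAndGet(); // lifetime total, not active count
    }

    /** Called on terminal/cancel so the next subscriber is allowed in. */
    void unsubscribe(T subscriber) {
        current.compareAndSet(subscriber, null);
    }

    boolean isSubscribed() {
        return current.get() != null;
    }

    /** How many subscribers were ever seen, regardless of unsubscribes. */
    int numberOfSubscribersSeen() {
        return subscribersSeen.get();
    }
}
```

This is why the counter only increments: "how many subscribers did this publisher ever see" is a monotonically increasing quantity, while "is someone subscribed right now" is answered by `isSubscribed()`.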
toSource(eventPublisher).subscribe(
        new Subscriber<Collection<? extends ServiceDiscovererEvent<ResolvedAddress>>>() {

private void subscribeToEvents(boolean resubscribe) {
Is there a possibility that this method is called multiple times concurrently (and if so, do you think it has race conditions around cancellation)?
`nextResubscribeTime` is used to make sure there is only one thread at a time that performs the re-subscribe (unless I messed up something around lines 540-543). I will add `assert nextResubscribeTime == RESUBSCRIBING` at the beginning of this method.
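The single-thread guarantee described above can be sketched with a CAS on a sentinel value. This is a minimal illustration under assumed names (`RESUBSCRIBING`, `ResubscribeGuard`), not ServiceTalk's actual implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: only the thread that wins the CAS from a real deadline
// to the RESUBSCRIBING sentinel may perform the cancel + re-subscribe sequence.
final class ResubscribeGuard {
    private static final long RESUBSCRIBING = -1L;
    private final AtomicLong nextResubscribeTime = new AtomicLong();

    ResubscribeGuard(long initialDeadline) {
        nextResubscribeTime.set(initialDeadline);
    }

    /** Returns true for exactly one caller once the deadline has passed. */
    boolean tryStartResubscribe(long nowNanos) {
        long deadline = nextResubscribeTime.get();
        return deadline != RESUBSCRIBING
                && nowNanos >= deadline
                && nextResubscribeTime.compareAndSet(deadline, RESUBSCRIBING);
    }

    /** Invoked by the winning thread after re-subscribing completes. */
    void finishResubscribe(long nextDeadline) {
        // mirrors the assert discussed above: we must still hold the sentinel
        assert nextResubscribeTime.get() == RESUBSCRIBING;
        nextResubscribeTime.set(nextDeadline);
    }
}
```

The sentinel doubles as a lock: while the field holds `RESUBSCRIBING`, every other caller's `compareAndSet` fails, so concurrent invocations cannot race on cancellation.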
...lk-loadbalancer/src/main/java/io/servicetalk/loadbalancer/RoundRobinLoadBalancerFactory.java
Motivation:

It’s possible for the RRLB to get into a state where all available hosts become unhealthy. Possible scenarios:

- DNS returned an incorrect result;
- All hosts were restarted and got different addresses.

The RRLB will be in a dead state until the TTL expires, which may take some time.

Modifications:

- When the RRLB detects that all hosts are unhealthy, it re-subscribes to the SD events publisher to trigger a new resolution;
- Add a `RoundRobinLoadBalancerFactory.Builder#healthCheckResubscribeInterval` option to configure the new feature;
- Wrap `backgroundExecutor` with `NormalizedTimeSourceExecutor` to make sure `currentTime` is always positive;
- Test behavior when the new feature is enabled/disabled;
- Add a `DurationUtils.ensureNonNegative(...)` utility for validation;
- Make the `TestExecutor(long)` constructor public to test `NormalizedTimeSourceExecutor`;
- Add `SequentialPublisherSubscriberFunction.numberOfSubscribers()` to verify how many subscribers the `TestPublisher` already saw.

Result:

The RRLB forces the SD to re-resolve addresses by cancelling the current subscription and re-subscribing to the publisher when it detects that all hosts have become unhealthy.
Force-pushed from f7fc25f to 12da712
servicetalk-utils-internal/src/main/java/io/servicetalk/utils/internal/DurationUtils.java
...loadbalancer/src/test/java/io/servicetalk/loadbalancer/NormalizedTimeSourceExecutorTest.java
Motivation:

When we poll DNS in the background, the cache is not important because we always schedule the next query after the cache expires. It only helps with concurrent resolutions of the same address, which in practice should be minimal if users properly reuse clients. The disadvantage of having a cache is that if it gets poisoned with invalid or stale entries, there is no way to clear it. Cancelling the events publisher and re-subscribing doesn't help because a re-subscribe always hits the cache. See #2514.

Modifications:

- Disable the cache by default;
- Provide an API to opt in to caching (can be useful if users resolve per new connection instead of polling in the background);
- Deprecate `minTTL` and remove `maxTTL` builder methods, introduce `ttl(min, max)` and `ttl(min, max, minCache, maxCache)` instead;
- Invoke `ttlCache.prepareForResolution(name)` only for scheduled queries, keep it unchanged on cancel/re-subscribe to correctly offset TTL;
- Ignore an empty list inside `MinTtlCache.get(...)`;
- Make `DefaultDnsClient` logging more consistent;
- Include `id` in the `ServiceDiscoverer.toString()` returned from the builder;
- Enhance testing.

Result:

Caching is disabled by default. Polling is driven by TTL. In case of re-subscribe, we always send a new query. Users have an API to configure min/max polling intervals and caching.
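The "polling is driven by TTL" behavior amounts to clamping the record's TTL into the configured `[min, max]` window. A minimal sketch, assuming a hypothetical `TtlPolicy` helper (the class and method names are illustrative, not ServiceTalk's API):

```java
import java.time.Duration;

// Illustrative sketch: derive the next DNS polling delay from a record's TTL,
// clamped into the [minTtl, maxTtl] window configured via ttl(min, max).
final class TtlPolicy {
    private final Duration minTtl;
    private final Duration maxTtl;

    TtlPolicy(Duration minTtl, Duration maxTtl) {
        if (minTtl.isNegative() || minTtl.compareTo(maxTtl) > 0) {
            throw new IllegalArgumentException("require 0 <= minTtl <= maxTtl");
        }
        this.minTtl = minTtl;
        this.maxTtl = maxTtl;
    }

    /** Clamps the DNS record TTL into [minTtl, maxTtl]. */
    Duration nextQueryDelay(Duration recordTtl) {
        if (recordTtl.compareTo(minTtl) < 0) {
            return minTtl;
        }
        return recordTtl.compareTo(maxTtl) > 0 ? maxTtl : recordTtl;
    }
}
```

With this shape, a server handing out very short TTLs cannot force the resolver to hammer DNS (the `min` floor), and a very long TTL cannot delay detection of topology changes indefinitely (the `max` ceiling).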
Motivation:

#2514 added an ability for the `LoadBalancer` to cancel the discovery stream and re-subscribe in an attempt to trigger a new resolution when its internal state became `UNHEALTHY`. However, it did not account for the possibility of new events racing with cancellation, or of a Publisher not stopping updates immediately. As a result, it's possible to end up in a corrupted state if the old `EventSubscriber` receives new events after a new `EventSubscriber` received a new "state of the world".

Modifications:

- Track the current `EventSubscriber` and process events only if it's current.
- Enhance `resubscribeToEventsWhenAllHostsAreUnhealthy()` to simulate the described behavior.

Result:

New "state of the world" can not be corrupted by events emitted for the old `EventSubscriber`. (#3005)
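The "process events only if the subscriber is current" guard can be sketched as follows. This is an illustration under assumed names (`EventRouter`, token objects), not the actual `EventSubscriber` code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch: events delivered to a superseded subscriber after a
// re-subscribe are dropped, so they cannot corrupt the new state of the world.
final class EventRouter {
    private final AtomicReference<Object> currentSubscriber = new AtomicReference<>();
    final List<String> state = new ArrayList<>();

    /** Installs a new subscriber token, superseding any previous one. */
    Object resubscribe() {
        Object token = new Object();
        currentSubscriber.set(token);
        state.clear(); // the new subscriber delivers a fresh state of the world
        return token;
    }

    /** Applies the event only if it comes from the current subscriber. */
    boolean onEvent(Object token, String event) {
        if (currentSubscriber.get() != token) {
            return false; // stale event raced with cancellation: ignore it
        }
        state.add(event);
        return true;
    }
}
```

The key design choice is identity comparison against the tracked current subscriber: cancellation in Reactive Streams is asynchronous, so late events from the old subscription can still arrive and must be filtered out rather than applied.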
Motivation:

It’s possible for the RRLB to get into a state where all available hosts become unhealthy. Possible scenarios:

- DNS returned an incorrect result;
- All hosts were restarted and got different addresses.

The RRLB will be in a dead state until the TTL expires, which may take some time.

Modifications:

- When the RRLB detects that all hosts are unhealthy, it re-subscribes to the SD events publisher to trigger a new resolution;
- Add a `RoundRobinLoadBalancerFactory.Builder#healthCheckResubscribeInterval` option to configure the new feature;
- Wrap `backgroundExecutor` with `NormalizedTimeSourceExecutor` to make sure `currentTime` is always positive;
- Test behavior when the new feature is enabled/disabled;
- Add a `DurationUtils.ensureNonNegative(...)` utility for validation;
- Make the `TestExecutor(long)` constructor public to test `NormalizedTimeSourceExecutor`;
- Add `SequentialPublisherSubscriberFunction.numberOfSubscribers()` to verify how many subscribers the `TestPublisher` already saw.

Result:

The RRLB forces the SD to re-resolve addresses by cancelling the current subscription and re-subscribing to the publisher when it detects that all hosts have become unhealthy.

This PR depends on #2513 and #2516; review only the last commit.

TODO:
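The idea behind `NormalizedTimeSourceExecutor` ("make sure `currentTime` is always positive") can be sketched by remembering the first observed timestamp and reporting offsets from it. This is an illustrative stand-in, assuming a hypothetical `NormalizedTimeSource` class rather than the actual wrapped-Executor implementation:

```java
// Illustrative sketch: System.nanoTime() may return any long, including
// negative values, so a deadline-based scheme can misbehave if it assumes
// time starts at zero. Normalizing against the first observed timestamp
// guarantees a non-negative, monotonically increasing currentTime.
final class NormalizedTimeSource {
    private final long zeroTime;

    NormalizedTimeSource(long initialNanos) {
        this.zeroTime = initialNanos; // first observation becomes time zero
    }

    /** Non-negative for any underlying reading at or after the first one. */
    long currentTime(long underlyingNanos) {
        return underlyingNanos - zeroTime;
    }
}
```

Normalization matters here because the re-subscribe deadline logic compares `currentTime` against stored timestamps; anchoring time at the first reading keeps those comparisons valid regardless of the raw `nanoTime` origin.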