Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid race b/w ServiceDiscoverer events after cancel and re-subscribe #3005

Merged
merged 2 commits into from
Jul 15, 2024

Conversation

idelpivnitskiy
Copy link
Member

@idelpivnitskiy idelpivnitskiy commented Jul 12, 2024

Motivation:

#2514 added an ability for LoadBalancer to cancel discovery stream and
re-subscribe in attempt to trigger a new resolution when its internal
state became UNHEALTHY. However, it did not account for a possibility
of new events racing with cancellation or a Publisher not stopping
updates immediately. As a result, it's possible to end up in a corrupted
state if the old EventSubscriber receives new events after a new
EventSubscriber received a new "state of the world".

Modifications:

  • Track the current EventSubscriber and process events only if it's
    current.
  • Enhance resubscribeToEventsWhenAllHostsAreUnhealthy() to simulate
    the described behavior.

Result:

New "state of the world" can not be corrupted by events emitted for the
old EventSubscriber.


Depends on #3004, review only the 2nd commit: d48d8f5

@idelpivnitskiy idelpivnitskiy self-assigned this Jul 12, 2024
// (https://github.com/reactive-streams/reactive-streams-jvm?tab=readme-ov-file#1.8) new events will
// stop eventually but not guaranteed to stop immediately after cancellation or could race with cancel.
// Therefore, we should check that this is the current Subscriber before processing new events.
if (healthCheckConfig != null && currentSubscriber != this) {
Copy link
Member Author

@idelpivnitskiy idelpivnitskiy Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that for the RoundRobinLoadBalancer there is still possibility for a race if ServiceDiscoverer implementation doesn't follow the "use a single thread" recommendation: https://github.com/apple/servicetalk/pull/3002/files#diff-994a09bcaa716664549357bca8bfa8288ee0dd98b0280b139fd30b15cd69bd8eR30-R32

However, taking into account that:

  1. we are close to switching to DefaultLoadBalancer;
  2. the default DNS ServiceDiscoverer is single-threaded, and for other implementations it's a trivial fix;
  3. possibility for that race is significantly reduced by the current check;

I don't see a reason to overcomplicate this implementation unless someone has a simple suggestion.

Example of a race:

  1. SD Thread#1 delivers events for old EventSubscriber, crosses this check and OS doesn't execute it further.
  2. Caller Thread#2 triggers cancellation, resubscribes and flips the current subscriber.
  3. SD Thread#3 delivers "state of the world" for a new EventSubscriber.
  4. OS continues execution of Thread#1 and it corrupts the usedHosts state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with your assessment that we need not address it: this race has been here for a long time and we've survived until now so a little longer wont hurt. However, if you felt like fixing it I think it could be done with a lock protecting all EventSubscriber.onNext calls.

Copy link
Contributor

@bryce-anderson bryce-anderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one concern.

// (https://github.com/reactive-streams/reactive-streams-jvm?tab=readme-ov-file#1.8) new events will
// stop eventually but not guaranteed to stop immediately after cancellation or could race with cancel.
// Therefore, we should check that this is the current Subscriber before processing new events.
if (healthCheckConfig != null && currentSubscriber != this) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with your assessment that we need not address it: this race has been here for a long time and we've survived until now so a little longer wont hurt. However, if you felt like fixing it I think it could be done with a lock protecting all EventSubscriber.onNext calls.

Motivation:

apple#2514 added an ability for `LoadBalancer` to cancel discovery stream and
re-subscribe in attempt to trigger a new resolution when it's internal
state became `UNHEALTHY`. However, it did not account for a possibility
of new events racing with cancellation or a Publisher not stopping
updates immediately. As a result, it's possible to end up in a corrupted
state if old `EventSubscriber` receives new events after a new
`EventSubscriber` received a new "state of the world".

Modifications:

- Track the current `EventSubscriber` and process events only if it's
current.
- Enhance `resubscribeToEventsWhenAllHostsAreUnhealthy()` to simulate
described behavior.

Result:

New "state of the world" can not be corrupted by events emitted for the
old `EventSubscriber`.
@idelpivnitskiy idelpivnitskiy force-pushed the race-after-resubscribe branch from 3cf255f to 0ac49f6 Compare July 15, 2024 16:11
@idelpivnitskiy idelpivnitskiy merged commit 3897efb into apple:main Jul 15, 2024
11 checks passed
@idelpivnitskiy idelpivnitskiy deleted the race-after-resubscribe branch July 15, 2024 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants