-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clusterresolver: fix deadlock when dns resolver responds inline with update or error at build time #6563
Conversation
bedf587
to
f78bd31
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments.
ret.serializer.Schedule(func(context.Context) { | ||
r, err := newDNS(resolver.Target{URL: *u}, ret, resolver.BuildOptions{}) | ||
if err == nil { | ||
ret.dnsR = r | ||
return | ||
} | ||
|
||
if ret.logger.V(2) { | ||
ret.logger.Infof("Failed to build DNS resolver for target %q: %v", target, err) | ||
} | ||
ret.mu.Lock() | ||
ret.updateReceived = true | ||
ret.mu.Unlock() | ||
ret.topLevelResolver.onUpdate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems correct thinking about it. My musing is super minor here. Previously, when you called this function and an onUpdate() gets called, it's considered having "received" configuration for this particular discovery mechanism and will trigger fallback if DNS Resolver hasn't sent anything yet. Now, it happens async (and we wait for the dns resolver to build before to set the bool). Oh, I guess we can't consider this discovery mechanism having had a chance to received an update before the dns resolver has a chance to return results inline. I was mainly concerned now there's a new time frame between when we build this resource resolver and when this callback actually executes where previously before it would build the config and trigger fallback (in the case of no update sent so no addrs), and now it just waits until onUpdate() is called here. But this wait seems minor and ok. So seems correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this comment. I realized that I was not handling all cases correctly. Specifically, if a url parse failure happened, with my previous commit, things would still deadlock.
So, I thought about this for a while, and came to the conclusion that the correct place to handle this is in the resourceResolver
component. So, my current commit handles this:
- it makes the
onUpdate
non blocking, but pushing the call togenerateLocked
via a callback serializer - make it possible for the
cluster_resolver
LB policy to notify theresourceResolver
whether it is being actually closed or not (instop()
)
The use of the callback serializer is not absolutely required in the resourceResolver
. I could as well have pushed a signal on an unbounded buffer, and have a goroutine read from it. But using a callback serializer just made is easier. Let me know if you have any questions/concerns on this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, I see now about url parse error triggering deadlock with onUpdate(). There was a discussion about adding an nack validation in the xDS Client in the xDS Chat, but the discussion got tabled. Then there would no longer need to be this error handling. Let me think some about what layer you put the buffer of callbacks at.
if dr.dnsR != nil { | ||
dr.dnsR.Close() | ||
} | ||
dr.serializerCancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for the callback serializer makes it seem like once this context is cancelled - "It is guaranteed that no callbacks will be added once this context is canceled" it can't add callbacks (and reading the run() goroutine in that component backs it up: https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L84. Should this come after the scheduling on line 150?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change has been deleted.
The callback serializer's Schedule
method returns a value which indicates whether or not the callback was scheduled, and it guarantees that all scheduled callbacks are executed even if the provided context is cancelled. It also provides a Done()
method which the caller can block on after cancelling the context to be 100% sure that the callback serializer is done with everything that it had to execute and that it has freed up all its resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. But wouldn't this recv happen: https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L84, which would trigger the possibility of grabbing this close mu and subsequent code block (before Schedule() runs): https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L98 (setting closed to true), and then Schedule() fails here: https://github.com/easwars/grpc-go/blob/02463732635a827362bcfc44c7169d1131336e85/internal/grpcsync/callback_serializer.go#L71? Anyways, this is now no longer relevant, but I do think there was a misordering of operations here.
if closing { | ||
rr.serializerCancel() | ||
<-rr.serializer.Done() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to close the serializer without causing any queued unran callbacks to execute? I don't think there is, but should we add that method to that type? It feels like a waste to execute all the potentially operations (i.e. generateLocked()) when we know the whole cluster resolver component is closing anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is currently no way to cause the serializer to shut down without running queued callbacks. In fact, the serializer in its first version used to do that, but when I was making changes for channel idleness, I quickly realized that the more common case was the one where it is guaranteed that all scheduled callbacks are run.
It should be trivial for the user of the serializer to have some logic to turn the callback into a no-op when it knows that the context passed to the serializer has been cancelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. So you think that say if you have 5 generateLocked() calls queued, it's fine to just run them before closing, which would search through all discovery mechanisms. Eh, whatever, minor, in practice won't scale out of proportion.
@@ -280,7 +280,7 @@ func (b *clusterResolverBalancer) handleErrorFromUpdate(err error, fromParent bo | |||
// EDS resource was removed. No action needs to be taken for this, and we | |||
// should continue watching the same EDS resource. | |||
if fromParent && xdsresource.ErrType(err) == xdsresource.ErrorTypeResourceNotFound { | |||
b.resourceWatcher.stop() | |||
b.resourceWatcher.stop(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a bit of trouble following this logic (false plumbed down). Is it to support the use case described in this comment:
// Save the previous childrenMap to stop the children outside the mutex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is basically to handle the comment here:
// stop() is called when the LB policy is closed or when the underlying |
stop()
is called when:
- the CDS resource is deleted, or
- the LB policy is being stopped
And in the first case, when the CDS resource is added back, the cluster_resolver will get a config update with new mechanisms and will need to process it.
I initially thought about cancelling the serializer in stop()
unconditionally and recreate it in updateMechanisms()
, instead of in newResourceResolver
. But this was hard to do because then you need to protect access to the serializer with a mutex, and if we have to do that, then there is no way we can guarantee that onUpdate
won't block on the mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok sounds good, thanks for the explanation. It's keeping it's lifecycle coupled with this resource resolver type, which sounds fine since it's a buffer the resource resolver type uses.
xds/internal/balancer/clusterresolver/e2e_test/aggregate_cluster_test.go
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
clusterresolver LB policy currently deadlocks if the dns resolver reports an update or error inline at build time. This is because the dns resolver is built while holding a lock, and the same lock needs to be grabbed to handle an error or update from the resolver.
This was not caught in our current tests because the dns resolver was being overridden with a fake one. I switched as many tests as possible to use the real dns resolver. I also ensure that the dns resolver pushes update inline at build time because I pass it the actual host:port and not a name to be resolved.
I ran into this issue when I fixing some tests in the cds LB policy.
Fixes #6562
RELEASE NOTES: