-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid blocking child type updates on parent ack #15083
Conversation
@erichaberkorn Something that would be good to look into is having a couple Delta xDS protocol tests like in |
"typeUrl", op.TypeUrl, "dependent", xdscommon.ClusterType) | ||
break | ||
} | ||
if endpointHandler := handlers[xdscommon.EndpointType]; endpointHandler.registered && len(endpointHandler.pendingUpdates) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This endpoints check might be redundant given that Envoy waits for endpoints before ACKing clusters
@@ -650,10 +650,6 @@ func (s *ResourceGenerator) endpointsFromDiscoveryChain( | |||
primaryTargetID := node.Resolver.Target | |||
failover := node.Resolver.Failover | |||
|
|||
type targetLoadAssignmentOption struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dead code
agent/xds/delta.go
Outdated
// we MUST send new data for all its children. Envoy will NOT re-subscribe to the child data upon | ||
// receiving updates for the parent, so we need to handle this ourselves. | ||
// | ||
// Note that we do not check whether the deltaChild.childType is registered here, since we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you complete this sentence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
agent/xds/delta.go
Outdated
if op.TypeUrl == xdscommon.ListenerType || op.TypeUrl == xdscommon.RouteType { | ||
if clusterHandler := handlers[xdscommon.ClusterType]; clusterHandler.registered && len(clusterHandler.pendingUpdates) > 0 { | ||
generator.Logger.Trace("Skipping delta computation for resource because there are dependent updates pending", | ||
"typeUrl", op.TypeUrl, "dependent", xdscommon.ClusterType) | ||
break | ||
} | ||
if endpointHandler := handlers[xdscommon.EndpointType]; endpointHandler.registered && len(endpointHandler.pendingUpdates) > 0 { | ||
generator.Logger.Trace("Skipping delta computation for resource because there are dependent updates pending", | ||
"typeUrl", op.TypeUrl, "dependent", xdscommon.EndpointType) | ||
break | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this to enforce the Clusters, Endpoints, Listeners, Routes order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are the xDS updates being retriggered when we skip them here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this to enforce the Clusters, Endpoints, Listeners, Routes order?
Yes.
How are the xDS updates being retriggered when we skip them here?
The updates get unblocked by the select statement at the beginning of this big loop. Pushing xDS updates is generally triggered by Envoy sending us a message (receiving on reqCh
) or a new proxycfg snapshot due to internal updates (receiving on stateCh
). The message from Envoy could be subscribe/unsubscribe/ack/nack.
So if we block updating listeners because of pending updates for endpoints, once those endpoints are ACKd then we're unblocked again. In the NACK case we actually re-block for changes because we assume that we got bad data from proxycfg and want to avoid a hot loop of errors (has happened before).
Re-blocking on NACK should be OK since if clusters/endpoints are NACKed we still don't want to send listeners because they depend on the clusters/endpoints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
When providing updates to Envoy it is safest to follow the update order of: Clusters, Endpoints, Listeners, Routes. This order is encoded in our delta xDS response generation code. Another noteworthy attribute of these resources is that there are parent->child relationships between: Cluster->Endpoints and Listener->Routes. Envoy couples the storage of a parent and a child type such that whenever a control plane sends an update for a parent type, it MUST also send fresh data for the child type, even if Envoy previously had the latest data for the child type. A wrinkle that isn't accounted for currently is that Envoy does not actually ACK parent types until after attempting to receive data for the child types. This behavior also actually differs between Clusters and Listeners: - When a cluster is sent, Envoy will wait up to a 15s timeout for endpoints to arrive before sending the cluster ACK. - When a listener is sent, and it specifies RDS routes, Envoy will wait until those routes arrive before ACKing the listener. Though it's not clear to me what the timeout is for this, it exceeds a minute from my tests. However, a behavior encoded in our xDS update order is that we avoid sending ANY data if we are waiting on ACKs for ANY resource. Meaning that when we first send a cluster to Envoy and it requests endpoints for that cluster, the endpoints do not actually get sent for at least 15. This is because the endpoint update is paused by Consul until the cluster ACK, which Envoy pauses until it gets endpoints or times out. This commit changes the xDS order gating such that we only block listener/route updates if there are cluster/endpoin updates pending. This is to avoid routing to an invalid destination.
Description
When providing updates to Envoy it is safest to follow the update order of: Clusters, Endpoints, Listeners, Routes. This order is encoded in our delta xDS response generation code.
Another noteworthy attribute of these resources is that there are parent->child relationships between: Cluster->Endpoints and Listener->Routes. Envoy couples the storage of a parent and a child type such that whenever a control plane sends an update for a parent type, it MUST also send fresh data for the child type, even if Envoy previously had the latest data for the child type.
A wrinkle that isn't accounted for currently is that Envoy does not actually ACK parent types until after attempting to receive data for the child types. This behavior also actually differs between Clusters and Listeners:
However, a behavior encoded in our xDS update order is that we avoid sending ANY data if we are waiting on ACKs for ANY resource. Meaning that when we first send a cluster to Envoy and it requests endpoints for that cluster, the endpoints do not actually get sent for at least 15s. This is because the endpoint update is paused by Consul until the cluster ACK, which Envoy pauses until it gets endpoints or times out.
This commit changes the xDS order gating such that we only block listener/route updates if there are cluster/endpoin updates pending. This is to avoid routing to an invalid destination.
Testing & Reproduction steps
Links
Likely related to: envoyproxy/envoy#5887
PR Checklist