connect: mesh-gateways cluster configured wrong for TCP services. #6621

Closed
banks opened this issue Oct 14, 2019 · 1 comment · Fixed by #6623
Assignees
rboyer

Labels
theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies · type/bug Feature does not function as expected

Comments

banks (Member) commented Oct 14, 2019

While debugging an example someone showed me that was not working, I found this bug.

I haven't yet tried to make a completely clean reproduction from scratch, as I had access to debug in their more elaborate setup, but I suspect it would be easy to reproduce with the setup described here.

This is a simplified scenario that I think is sufficient to surface the issue; if not, we can go back to a more complex example and work from there.

Two HTTP services, service A and service B, are deployed with Envoy sidecars (registered via sidecar_service) across two datacenters, primary and secondary.
There is a mesh gateway set up in each DC with a command like:

consul connect envoy -mesh-gateway -address <LAN IP>:2000 -wan-address <Public IP>:2000 -bind-address public=0.0.0.0:2000 -register -- -l debug
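
For context, each service is registered with a sidecar roughly like this (a minimal sketch with a hypothetical port; the real deployment was more elaborate):

{
    "service": {
        "name": "service-b",
        "port": 8080,
        "connect": {
            "sidecar_service": {}
        }
    }
}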

Connect works fine to enable normal local resolution of A -> B.

Note that there are no service-defaults set up, so the proxies are being configured as TCP.

If we add a service-resolver to redirect all service B traffic in secondary over to service B in primary via the gateway, requests immediately start to fail:

$ consul config read -kind service-resolver -name service-b
{
    "ConnectTimeout": "3s",
    "Kind": "service-resolver",
    "Name": "service-b",
    "Redirect": {
        "Datacenter": "primary"
    },
    "CreateIndex": 9822,
    "ModifyIndex": 10356
}
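
(For reference, an entry like this can be created by putting the same Kind/Name/ConnectTimeout/Redirect fields shown above into a JSON file and running something like the following; the file name is just illustrative.)

$ consul config write service-b-resolver.json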

Upstream requests are reset (RST) by A's local sidecar in secondary (local traffic in primary still works fine). On inspection of the sidecar we can see that it has loaded the local gateway cluster (note that the DC in the SNI name is primary) and sees it as healthy:

banks@service-a-secondary:~$ curl localhost:19000/clusters
...
service-b.default.primary.internal.1d3432fc-6a81-1c88-e954-45bdf15b7122.consul::10.4.0.2:2000::health_flags::healthy
...
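
(The listener config below comes from the same sidecar's Envoy admin API; something like the following dumps it, from which the snippet below is presumably trimmed:)

$ curl -s localhost:19000/config_dump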

But the tcp_proxy generated in the listener is wrong:

"filters": [
         {
          "name": "envoy.tcp_proxy",
          "config": {
           "stat_prefix": "upstream_service_b_tcp",
           "cluster": "service-b.default.secondary.internal.1d3432fc-6a81-1c88-e954-45bdf15b7122.consul"
          }

Also interesting: at a different time I saw this configured with the cluster as an empty string. I assume that is a sequencing thing, since the snippet above was captured during an attempt to get the system back into the same state after it had already been configured a different way.
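
For comparison, the expected filter would presumably reference the primary cluster that CDS already loaded (the one shown as healthy above), something like:

"filters": [
         {
          "name": "envoy.tcp_proxy",
          "config": {
           "stat_prefix": "upstream_service_b_tcp",
           "cluster": "service-b.default.primary.internal.1d3432fc-6a81-1c88-e954-45bdf15b7122.consul"
          }
         }
        ]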

I was able to fix this particular demo for now because the services were in fact HTTP: if we create a service-defaults entry for service-b declaring it as http, the proxy is reconfigured with an HTTP listener and router that work fine.
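
For reference, the workaround entry is roughly the following (a sketch, applied with consul config write; only the protocol declaration matters here):

{
    "Kind": "service-defaults",
    "Name": "service-b",
    "Protocol": "http"
}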

So the issue appears to be with TCP services being accessed through a mesh-gateway, possibly only when traversing secondary -> primary (not confirmed).

CC @rboyer: does this sound like a thing you could put your finger on as described? I initially thought it was related to the code path split for whether we have a "default" chain or not, but I'm not sure that's right since there is a resolver with a redirect, which is not "default"?

banks added the type/bug (Feature does not function as expected) and theme/connect (Anything related to Consul Connect, Service Mesh, Side Car Proxies) labels on Oct 14, 2019

rboyer (Member) commented Oct 15, 2019

The bug is that when we render a TCP listener in LDS we do not configure the target cluster, because the L7 logic for plumbing up RDS is being overapplied to TCP-protocol upstreams. This should be a pretty easy fix where we special-case TCP listeners.
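
Roughly speaking, an L7 (http) listener filter hands routing off to RDS rather than naming a cluster, whereas a tcp_proxy filter has to reference its target cluster directly in LDS (as in the expected snippet in the description above). Illustratively, the L7 shape looks something like this (names and prefixes here are placeholders, not taken from the actual fix):

"filters": [
         {
          "name": "envoy.http_connection_manager",
          "config": {
           "stat_prefix": "upstream_service_b_http",
           "rds": {
            "route_config_name": "service-b",
            "config_source": { "ads": {} }
           },
           "http_filters": [ { "name": "envoy.router" } ]
          }
         }
        ]

The bug was that this RDS wiring was applied to the TCP case as well, leaving the tcp_proxy with no cluster set.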

rboyer added a commit that referenced this issue Oct 15, 2019
…ing LDS

Previously the logic for configuring RDS during LDS for L7 upstreams was
overapplied to TCP proxies resulting in a cluster name of <emptystring>
being used incorrectly.

Fixes #6621
rboyer self-assigned this Oct 15, 2019
rboyer added a commit that referenced this issue Oct 16, 2019 (…ing LDS)

rboyer added a commit that referenced this issue Oct 17, 2019 (…ing LDS)

rboyer added a commit that referenced this issue Oct 17, 2019
…ing LDS (#6623)

Previously the logic for configuring RDS during LDS for L7 upstreams was
overapplied to TCP proxies resulting in a cluster name of <emptystring>
being used incorrectly.

Fixes #6621