connect: mesh-gateways cluster configured wrong for TCP services. #6621

Closed
banks opened this issue Oct 14, 2019 · 1 comment · Fixed by #6623
Assignees
rboyer

Labels
theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies · type/bug Feature does not function as expected

Comments

banks (Member) commented Oct 14, 2019

While debugging an example someone showed me that was not working, I found this bug.

I haven't yet tried to make a completely clean reproduction from scratch, as I had access to debug in their more elaborate setup, but I suspect it would be easy to reproduce with the setup described here.

This is a simplified scenario that I think is sufficient to surface the issue; if not, we can go back to a more complex example and work from there.

Two HTTP services, service A and service B, are deployed with Envoy sidecars (registered via sidecar_service) across two datacenters, primary and secondary.
There is a mesh gateway set up in each DC with a command like:

consul connect envoy -mesh-gateway -address <LAN IP>:2000 -wan-address <Public IP>:2000 -bind-address public=0.0.0.0:2000 -register -- -l debug
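
For context, each service is registered with a sidecar roughly like this (a minimal sketch with a hypothetical port; the real deployment was more elaborate):

{
    "service": {
        "name": "service-b",
        "port": 8080,
        "connect": {
            "sidecar_service": {}
        }
    }
}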

Connect works fine to enable normal local resolution of A -> B.

Note that there are no service-defaults set up, so the proxies are being configured as TCP.

If we add a service-resolver to redirect all service B traffic in secondary over to service B in primary via the gateway, requests immediately start to fail:

$ consul config read -kind service-resolver -name service-b
{
    "ConnectTimeout": "3s",
    "Kind": "service-resolver",
    "Name": "service-b",
    "Redirect": {
        "Datacenter": "primary"
    },
    "CreateIndex": 9822,
    "ModifyIndex": 10356
}
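
(For reference, an entry like this can be created by putting the same Kind/Name/ConnectTimeout/Redirect fields shown above into a JSON file and running something like the following; the file name is just illustrative.)

$ consul config write service-b-resolver.json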

Upstream requests are reset (RST) by A's local sidecar in secondary (local traffic in primary still works fine). On inspection of the sidecar we can see that it has loaded the local gateway cluster (note that the DC in the SNI name is primary) and sees it as healthy:

banks@service-a-secondary:~$ curl localhost:19000/clusters
...
service-b.default.primary.internal.1d3432fc-6a81-1c88-e954-45bdf15b7122.consul::10.4.0.2:2000::health_flags::healthy
...
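
(The listener config below comes from the same sidecar's Envoy admin API; something like the following dumps it, from which the snippet below is presumably trimmed:)

$ curl -s localhost:19000/config_dump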

But the tcp_proxy generated in the listener is wrong:

"filters": [
         {
          "name": "envoy.tcp_proxy",
          "config": {
           "stat_prefix": "upstream_service_b_tcp",
           "cluster": "service-b.default.secondary.internal.1d3432fc-6a81-1c88-e954-45bdf15b7122.consul"
          }

Also interesting: at a different time I saw this configured with the cluster as an empty string. I assume that is a sequencing thing, since the snippet above was captured during an attempt to get the system back into the same state after it had already been configured a different way.
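
For comparison, the expected filter would presumably reference the primary cluster that CDS already loaded (the one shown as healthy above), something like:

"filters": [
         {
          "name": "envoy.tcp_proxy",
          "config": {
           "stat_prefix": "upstream_service_b_tcp",
           "cluster": "service-b.default.primary.internal.1d3432fc-6a81-1c88-e954-45bdf15b7122.consul"
          }
         }
        ]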

I was able to fix this particular demo for now because the services were in fact HTTP: if we create a service-defaults entry for service-b declaring it as http, the proxy is reconfigured with an HTTP listener and router that work fine.
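
For reference, the workaround entry is roughly the following (a sketch, applied with consul config write; only the protocol declaration matters here):

{
    "Kind": "service-defaults",
    "Name": "service-b",
    "Protocol": "http"
}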

So the issue appears to be with TCP services being accessed through a mesh-gateway, possibly only when traversing secondary -> primary (not confirmed).

CC @rboyer: does this sound like a thing you could put your finger on as described? I initially thought it was related to the code path split for whether we have a "default" chain or not, but I'm not sure that's right since there is a resolver with a redirect, which is not "default"?

banks added the type/bug (Feature does not function as expected) and theme/connect (Anything related to Consul Connect, Service Mesh, Side Car Proxies) labels on Oct 14, 2019

rboyer (Member) commented Oct 15, 2019

The bug is that when we render a TCP listener in LDS we do not configure the target cluster, because the L7 logic for plumbing up RDS is being overapplied to TCP-protocol upstreams. This should be a pretty easy fix where we special-case TCP listeners.
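
Roughly speaking, an L7 (http) listener filter hands routing off to RDS rather than naming a cluster, whereas a tcp_proxy filter has to reference its target cluster directly in LDS (as in the expected snippet in the description above). Illustratively, the L7 shape looks something like this (names and prefixes here are placeholders, not taken from the actual fix):

"filters": [
         {
          "name": "envoy.http_connection_manager",
          "config": {
           "stat_prefix": "upstream_service_b_http",
           "rds": {
            "route_config_name": "service-b",
            "config_source": { "ads": {} }
           },
           "http_filters": [ { "name": "envoy.router" } ]
          }
         }
        ]

The bug was that this RDS wiring was applied to the TCP case as well, leaving the tcp_proxy with no cluster set.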

rboyer added a commit that referenced this issue Oct 15, 2019
…ing LDS

Previously the logic for configuring RDS during LDS for L7 upstreams was
overapplied to TCP proxies resulting in a cluster name of <emptystring>
being used incorrectly.

Fixes #6621
rboyer self-assigned this Oct 15, 2019
rboyer added a commit that referenced this issue Oct 16, 2019 (…ing LDS)

rboyer added a commit that referenced this issue Oct 17, 2019 (…ing LDS)

rboyer added a commit that referenced this issue Oct 17, 2019
…ing LDS (#6623)

Previously the logic for configuring RDS during LDS for L7 upstreams was
overapplied to TCP proxies resulting in a cluster name of <emptystring>
being used incorrectly.

Fixes #6621