Cannot use consul connect deployed through nomad with envoy proxy on centos 7 #6580
Hi @dneray and thanks for reporting this! I'm going to look into seeing if I can replicate this, but seeing as how you're seeing it work on an Ubuntu vagrant environment but not CentOS, I'm wondering if there's something environment specific. Any chance you could share a (redacted) Vagrantfile and/or the Nomad client logs?
Thanks so much for looking into it @tgross. The Vagrantfile is attached in a zip; basically it's just minimal CentOS with Docker, Nomad, Consul, and the CNI plugins, with everything running in dev mode through systemd. My test was just to SSH in, exec into the dashboard container, and run "apk add curl; curl ${COUNTING_SERVICE_URL}".
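For reference, that test loop looks roughly like the following (the container name is illustrative; the dashboard image is Alpine-based, hence apk):

```sh
# on the CentOS VM: exec into the running dashboard container
docker exec -it <dashboard-container-id> /bin/sh

# inside the container: install curl and hit the upstream through the Connect sidecar
apk add curl
curl "$COUNTING_SERVICE_URL"
```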
Thanks, that helps a lot. I see the following log lines in the nomad.log file:
I suspect that you've encountered the same issue reported in #6567. Let me tag in @nickethier and see whether the logs he was emailed match up.
Thanks. Indeed, it does appear the first allocation of the second task group failed as mentioned in #6567, which looks like the same bug. However, it was rescheduled on the same node, the second attempt started successfully, and the communication is still broken.
Ok, I've spent some time digging into this and don't yet have an answer. But I thought I'd dump some of the results of my investigation for discussion here so that I can pull in some of my colleagues @nickethier @shoenig and/or @schmichael as well. My current theory is that there are three distinct issues:
First, I was able to reproduce the behavior using your Vagrantfile and job file. The resulting Nomad alloc status:
And then from inside the dashboard container:
The next thing I did was verify that this job works with our Vagrant Ubuntu environment, using 0.10.0 with Consul 1.6.1 just as you have, so this isn't a regression that we missed since the last time I looked at it a couple weeks ago. Connect works fine there... however, it turns out I can occasionally get the same initial allocation failure there as well.
Then I dug into the logs to match against the ones you provided. Here's what this looks like from the CentOS machine: Nomad logs excerpt
The exact initial error message varies slightly on each attempt, but the result ends up being the same. For example, sometimes it looks like this:
My hypothesis here is that when we call out to CNI to set up the veth pair, the veth pair is being set up correctly, but then we're getting an error for the bridge or iptables rules setup, which bubbles up as an error.
Next I wanted to verify that the two processes and their Envoy proxies are in the appropriate network namespaces, and that I can reach the Envoy proxy from the dashboard container. Unremarkable output of various networking tools
Inside the dashboard container:
ARP table:
Route table:
Next I looked into the iptables rules. I can see that there are rules tagged with the alloc ID for the failed allocation, as well as ones routing for an IP that doesn't belong to either of the two running allocations (and therefore belongs to the failed one). As long as Consul, Envoy, and the dashboard don't try to route to a missing IP (which they shouldn't), these rules should be ok to leave in place. However, in the interest of experimenting, I removed these rules as follows. But no joy. This is another case of #6385 but a red herring for this issue.
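The exact rules and commands were in the output elided above; the approach amounts to grep'ing the rules for the stale allocation ID and replaying matching rules with -D, roughly along these lines (the alloc ID is a placeholder, not from this reproduction):

```sh
# ALLOC_ID is the ID of the failed allocation -- substitute the real value
ALLOC_ID="<failed-alloc-id>"

# list NAT and filter rules that still reference it
sudo iptables -t nat -S | grep "$ALLOC_ID"
sudo iptables -S | grep "$ALLOC_ID"

# a rule printed as "-A CHAIN ..." can be removed by replaying it with -D, e.g.
# sudo iptables -t nat -D CHAIN <rest of the rule as printed>
```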
Next I checked the Envoy logs for the dashboard and can see there are connection timeouts. The IP address 10.0.2.15 is the correct address for the count-api container. Envoy logs
However, that made me think to check the Consul service catalog for our service, and this is where I found at least some noticeable difference between Ubuntu and CentOS. On CentOS: [
{
"ID": "7d9be3d7-b88c-f61d-8a43-3e542a393f01",
"Node": "test-centos",
"Address": "127.0.0.1",
"Datacenter": "dc1",
"TaggedAddresses": {
"lan": "127.0.0.1",
"wan": "127.0.0.1"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "connect-proxy",
"ServiceID": "_nomad-task-7cc62290-db29-3194-a73a-bc0f1cafa9cd-group-api-count-api-9001-sidecar-proxy",
"ServiceName": "count-api-sidecar-proxy",
"ServiceTags": [],
"ServiceAddress": "10.0.2.15",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 26527,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"DestinationServiceName": "count-api",
"DestinationServiceID": "_nomad-task-7cc62290-db29-3194-a73a-bc0f1cafa9cd-group-api-count-api-9001",
"LocalServiceAddress": "127.0.0.1",
"LocalServicePort": 9001,
"Config": {
"bind_address": "0.0.0.0",
"bind_port": 26527
},
"MeshGateway": {}
},
"ServiceConnect": {},
"CreateIndex": 37,
"ModifyIndex": 37
}
] On Ubuntu: [
{
"ID": "7bebf72b-2c06-a140-270b-feb878a5fde0",
"Node": "linux",
"Address": "10.0.2.15",
"Datacenter": "dc1",
"TaggedAddresses": {
"lan": "10.0.2.15",
"wan": "10.0.2.15"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "connect-proxy",
"ServiceID": "_nomad-task-85e5658f-ece4-6717-2705-4e9d8b57bf55-group-api-count-api-9001-sidecar-proxy",
"ServiceName": "count-api-sidecar-proxy",
"ServiceTags": [],
"ServiceAddress": "10.0.2.15",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 29496,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"DestinationServiceName": "count-api",
"DestinationServiceID": "_nomad-task-85e5658f-ece4-6717-2705-4e9d8b57bf55-group-api-count-api-9001",
"LocalServiceAddress": "127.0.0.1",
"LocalServicePort": 9001,
"Config": {
"bind_address": "0.0.0.0",
"bind_port": 29496
},
"MeshGateway": {}
},
"ServiceConnect": {},
"CreateIndex": 364,
"ModifyIndex": 364
}
] The key difference here is that in the Ubuntu case, the node Address and TaggedAddresses are the VM's IP (10.0.2.15), whereas on CentOS they are 127.0.0.1.
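For reference, those entries came from the standard catalog endpoint; something like this (assuming Consul's HTTP API on its default local address, with jq only for pretty-printing):

```sh
curl -s http://127.0.0.1:8500/v1/catalog/service/count-api-sidecar-proxy | jq .
```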
I checked the Consul catalog API and I think the address vs. service address difference is a red herring as well.
I just took a pass at building
Yup, I can see the same! Looks like we're getting closer. Ubuntu:
CentOS:
Ah, I deleted my reply to update it because I had my tcpdumps mixed up, but you beat me. Indeed, the public IP seems to be replaced by the time the packets reach the upstream proxy on CentOS. Looking back at the iptables traces I posted originally, those also look quite different: there is no mention of PHYSIN and PHYSOUT (referring to the virtual interface) in the CentOS traces. I'll see if I can figure out what that means, because I'm not too familiar with iptables. Edit: The missing fields are likely just differences in log output between iptables versions, given the rules look the same, but I'm wondering if the bug itself could be a difference in behaviour between iptables versions. CentOS is using 1.4.21 and Ubuntu is using 1.6.0.
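For anyone following along, captures like the ones referenced above can be reproduced with something like the following on each VM (the port is the count-api sidecar's dynamic bind_port from the catalog output, 26527 in the CentOS example above):

```sh
# watch traffic destined for the count-api sidecar proxy; compare the source
# address seen here against the dashboard container's address
sudo tcpdump -nn -i any port 26527
```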
It seems to be the same issue as this one: https://bugzilla.redhat.com/show_bug.cgi?id=1703261#c18. I can confirm that their fix works: 'echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables' resolves the issue. Thanks a lot for helping investigate @tgross. Maybe a pre-deployment check would make sense for this?
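A minimal sketch of making that toggle survive a reboot on CentOS 7 (assuming the br_netfilter module provides the sysctl, as it does on a stock kernel):

```sh
# load br_netfilter now and on every boot
sudo modprobe br_netfilter
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf

# persist the bridge-nf-call-iptables setting
cat <<'EOF' | sudo tee /etc/sysctl.d/99-bridge-nf-call-iptables.conf
net.bridge.bridge-nf-call-iptables = 1
EOF
sudo sysctl --system
```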
I've verified that toggle fixes the connectivity problem. Thanks @dneray! After a quick test, I tore down the CentOS VM I was testing on, started over from scratch, and flipped the toggle. I do still see the problem where the API container initially fails with the error reported in #6567. Next steps:
Thanks again @dneray for sticking it out and helping through the investigation!
Aside, now that I know what to look for, I found this in
And this is part of the CentOS sysctl configuration:
Nomad version
Nomad v0.10.0 (25ee121)
Operating system and Environment details
CentOS 7 running inside a Vagrant environment in VirtualBox with 2 network interfaces attached. I have a 3-node cluster deployed, but I set 2 of the nodes to draining so that I could debug this issue on a single node.
Edit: I also tested stock CentOS 7 with a single interface in a single-node cluster, with the latest Consul and Nomad running in dev / dev-connect modes, and had the same result. I tested both CNI plugin builds 0.8.1 and 0.8.2 and the latest Docker from get.docker.com.
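For that single-node test, the agents were run in dev mode, roughly as follows (the systemd units wrap commands equivalent to these; flags per the Consul and Nomad docs):

```sh
# Consul in dev mode
consul agent -dev -client=0.0.0.0 &

# Nomad in dev mode with Connect networking enabled (needs root for bridge/iptables setup)
sudo nomad agent -dev-connect &
```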
Issue
My sidecar containers are unable to communicate with their upstreams using the external IP of the VM. The connection attempt times out and is never received by the upstream proxy. I can verify with 'curl' that requests are not getting through to the upstream Envoy proxy.
I can make requests directly to the upstream proxy's internal IP, and I can make requests to the upstream proxy from the host machine, but it does not work from inside the container.
I have used the same job on the stock Ubuntu Xenial Vagrant VM and the same test works there. I suspect the issue lies in the routing or in iptables, but I am not sure exactly where. I have attached iptables trace logs of my requests that are being lost.
not-working-trace.log
ips_and_firewall_rules.txt
working-trace.log
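The tests behind those traces were roughly the following (all addresses and ports below are placeholders, not values from the attachments):

```sh
# 1) from inside the dashboard container, via the VM's external IP and the
#    count-api sidecar's dynamic port: this is what times out on CentOS
curl -v http://<vm-external-ip>:<api-sidecar-port>

# 2) the same request issued from the host itself: the connection is accepted
curl -v http://<vm-external-ip>:<api-sidecar-port>

# 3) from inside the container, aimed at the upstream proxy's internal
#    (bridge network) address: also accepted
curl -v http://<api-container-internal-ip>:<api-sidecar-port>
```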
Reproduction steps
I am essentially deploying the example job from https://www.nomadproject.io/guides/integrations/consul-connect/index.html with modifications and then making a request to the "dashboard proxy" which should be routed to the "api service".
Job file (if appropriate)
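The attached job file isn't reproduced here; it is essentially the countdash example from the guide linked above. For context, the dashboard side of that example looks roughly like this (image, ports, and the upstream binding are from the guide, not the exact attachment):

```hcl
group "dashboard" {
  network {
    mode = "bridge"
    port "http" {
      static = 9002
      to     = 9002
    }
  }

  service {
    name = "count-dashboard"
    port = "9002"

    connect {
      sidecar_service {
        proxy {
          upstreams {
            destination_name = "count-api"
            local_bind_port  = 8080
          }
        }
      }
    }
  }

  task "dashboard" {
    driver = "docker"

    env {
      COUNTING_SERVICE_URL = "http://${NOMAD_UPSTREAM_ADDR_count_api}"
    }

    config {
      image = "hashicorpnomad/counter-dashboard:v1"
    }
  }
}
```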