
0.10.0 pre-run hook "network" failed #6567

Closed
djenriquez opened this issue Oct 25, 2019 · 10 comments · Fixed by #6921
Labels: theme/dependencies, theme/networking, type/bug


djenriquez commented Oct 25, 2019

Nomad version

$ nomad --version
Nomad v0.10.0 (25ee121d951939504376c70bf8d7950c1ddb6a82)

Operating system and Environment details

Amazon Linux 2

Issue

An allocation being scheduled failed with:

failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure bridge network: container veth name provided (eth0) already exists

[Screenshot of the failed allocation, 2019-10-25]

Reproduction steps

Nothing special: an update to a job was submitted to increase its count and change some meta to force allocation recreation. The rest of the allocations scheduled fine; this was the only one that failed. It rescheduled just fine on the same host, but I figured this should be logged.

Nomad Client logs (if appropriate)

$ cat /var/log/messages | grep nomad | grep 917cdc42-b4ba-2004-876a-0aafdbdabf98
Oct 25 20:08:18 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 err="unable to create chain CNI-HOSTPORT-SETMARK: running [/usr/sbin/iptables -t nat -N CNI-HOSTPORT-SETMARK --wait]: exit status 1: iptables: Chain already exists.#012" attempt=1
Oct 25 20:08:20 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 err="container veth name provided (eth0) already exists" attempt=2
Oct 25 20:08:21 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner.runner_hook: failed to configure bridge network: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 err="container veth name provided (eth0) already exists" attempt=3
Oct 25 20:08:21 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner: prerun failed: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 error="pre-run hook "network" failed: failed to configure networking for alloc: failed to configure bridge network: container veth name provided (eth0) already exists"
Oct 25 20:08:21 ip-10-178-43-106.compute.internal nomad[6003]: client.gc: marking allocation for GC: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98
Oct 25 20:08:21 ip-10-178-43-106.compute.internal nomad[6003]: client.alloc_runner.runner_hook: failed to cleanup network for allocation, resources may have leaked: alloc_id=917cdc42-b4ba-2004-876a-0aafdbdabf98 alloc=917cdc42-b4ba-2004-876a-0aafdbdabf98 error="missing prevResult from earlier plugin"

More logs sent via email.


tgross commented Oct 30, 2019

Hi @djenriquez and thank you for reporting this! Just wanted to let you know that this looks very similar to #6580 and I'm investigating this currently.

djenriquez (Author) commented:

Thanks @tgross. I saw it a few more times, but again, not really an issue since it seems to resolve right away on reschedule. Definitely seems to be some kind of race condition.


tgross commented Nov 1, 2019

Ok, so as described in #6580 (comment), it turns out this is separate from the connectivity problem described there.

My theory from #6580, which I'll investigate here:

My hypothesis here is that when we call out to CNI to set up the veth pair, the veth pair is being set up correctly but then we're getting an error for bridge or iptables rules setup, which bubbles up as an error to networking_bridge_linux.go#L163-L166. In the second and third attempts, we find that the veth pair is already created, which causes the allocation to fail and get rescheduled. Maybe we're handling what the CNI plugin is giving us incorrectly or maybe there's a bug in the CNI plugin, but it's definitely a bug either way.
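For context on where the attempt=1/2/3 lines in the client log come from: Nomad wraps the CNI setup call in a small retry loop before failing the "network" pre-run hook. Here is a rough sketch of that behavior in Go (names like setupWithRetry and maxAttempts are illustrative, not the actual Nomad source):

package network

import (
    "context"
    "fmt"

    hclog "github.com/hashicorp/go-hclog"
)

// setupWithRetry sketches how the client retries bridge network setup a few
// times before failing the allocation. The setup callback stands in for the
// call into go-cni that runs the bridge/firewall/portmap plugin chain.
func setupWithRetry(ctx context.Context, logger hclog.Logger, setup func(context.Context) error) error {
    const maxAttempts = 3 // illustrative; mirrors the attempt=1..3 log lines above

    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if err = setup(ctx); err == nil {
            return nil
        }
        // Produces log lines like:
        //   failed to configure bridge network: ... attempt=N
        logger.Warn("failed to configure bridge network", "err", err, "attempt", attempt)
    }
    // The final error surfaces as:
    //   pre-run hook "network" failed: failed to configure networking for alloc: ...
    return fmt.Errorf("failed to configure bridge network: %v", err)
}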


tgross commented Nov 4, 2019

tl;dr

There are two things to do here:

  • Land a fix upstream in the CNI plugins so that their iptables chain creation is safe under concurrent invocation.
  • Once a release with that fix ships, update the documentation for our Connect integration to recommend that version of the CNI plugins.

Nomad's Connect integration uses CNI plugins to create the network namespaces, the bridge network, container veth pairs, and iptables rules. In the happy-path case, we configure CNI using the go-cni library and pass it a chain of CNI plugin configurations (see the configuration sketch after the list below). When Nomad gets a Connect-enabled job, it hands off control to the go-cni library, which invokes the three plugins as external binaries. These external binaries in turn shell out to tools like iptables on the host. When things go well:

  • CNI bridge plugin creates the nomad bridge and the veth pair for the container, placing one end inside the container as eth0.
  • CNI firewall plugin creates some iptables chains and associated rules.
  • CNI portmap plugin creates some iptables chains and associated rules.
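For reference, the chain of plugin configurations handed to go-cni has roughly the following shape. This is an illustrative sketch rather than Nomad's exact embedded template; the bridge name, subnet, and cniVersion below are example values:

package network

// exampleConflist sketches a chained CNI configuration of the kind Nomad
// builds for bridge networking: bridge, then firewall, then portmap.
// The concrete values here are illustrative examples only.
const exampleConflist = `{
  "cniVersion": "0.4.0",
  "name": "nomad",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "nomad",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "172.26.64.0/20"
      }
    },
    {
      "type": "firewall"
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}`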

The problem we're seeing is a race condition in creating iptables chains when multiple Connect-enabled task groups are starting on the same client node, which is of course the common case for folks who are trying out our Connect integration in a dev environment.

In order to reproduce this, you need to flush all firewall chains/rules (it's easiest to simply restart the node) and then start multiple Connect-enabled task groups on the client at the same time. One of the allocations will fail. Nomad will then retry the allocation a few seconds later; at that point the iptables chains already exist, so everything proceeds as planned, and we'll never hit the race condition again on the same client node.

The error messages we get back from CNI are unhelpful, but I was able to determine what was happening by using execsnoop to trace CNI's invocations of iptables.

Here are relevant excerpts from a log where I ran execsnoop -tx | tee execsnoop.log:

# in these logs, PID 2886 is Nomad itself

# this is the portmap execution run for task group 1
$ head -1 execsnoop.log && grep 3332 execsnoop.log
TIME(s) PCOMM            PID    PPID   RET ARGS
2.082   portmap          3332   2886     0 "/opt/cni/bin/portmap"
2.087   iptables         3342   3332     0 "/usr/sbin/iptables --version"
2.089   iptables         3343   3332     0 "/usr/sbin/iptables -t nat -S --wait"
2.090   iptables         3345   3332     0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-SETMARK --wait"
2.091   iptables         3346   3332     0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.101   iptables         3353   3332     0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.103   iptables         3356   3332     0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-MASQ --wait"
2.105   iptables         3358   3332     0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE --wait"
2.107   iptables         3359   3332     0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE --wait"
...

# this is the portmap execution run for task group 2
$ head -1 execsnoop.log && grep 3334 execsnoop.log
2.082   portmap          3334   2886     0 "/opt/cni/bin/portmap"
2.089   iptables         3341   3334     0 "/usr/sbin/iptables --version"
2.091   iptables         3344   3334     0 "/usr/sbin/iptables -t nat -S --wait"
2.092   iptables         3347   3334     0 "/usr/sbin/iptables -t nat -C CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.100   iptables         3352   3334     0 "/usr/sbin/iptables -t nat -A CNI-HOSTPORT-SETMARK -m comment --comment CNI portfwd masquerade mark -j MARK --set-xmark 0x2000/0x2000 --wait"
2.102   iptables         3355   3334     0 "/usr/sbin/iptables -t nat -S --wait"
2.104   iptables         3357   3334     0 "/usr/sbin/iptables -t nat -N CNI-HOSTPORT-MASQ --wait"
# this portmap invocation stops here!

So in this situation, we're seeing a timeline like the following:

  • [group 1] CNI bridge plugin creates the nomad bridge and the veth pair for the container, placing one end inside the container as eth0.
  • [group 2] CNI bridge plugin tries to create the nomad bridge.
  • [group 1] CNI firewall plugin creates some iptables rules (eliding details here as we missed the race in this run).
  • [group 2] CNI firewall plugin creates some iptables rules.
  • [group 1] CNI portmap plugin checks for the existence of the CNI-HOSTPORT-MASQ chain and doesn't find it.
  • [group 2] Concurrently, CNI portmap plugin checks for the existence of the CNI-HOSTPORT-MASQ chain and doesn't find it.
  • [group 1] CNI portmap plugin creates the CNI-HOSTPORT-MASQ chain.
  • [group 2] Concurrently, CNI portmap plugin tries to create the CNI-HOSTPORT-MASQ chain, but it already exists, so it fails!
    • At this point, Nomad gets the error back from go-cni and starts over at the top.
    • [group 2] CNI bridge plugin tries to create the nomad bridge and safely no-ops.
    • [group 2] CNI bridge plugin tries to create the veth pair and finds it already exists, so it fails!
    • [group 2] Start over at the top again, get the same error. This is attempt 3, so we fail the allocation.

It turns out (through a bunch of experimenting) that there are two opportunities to hit the race: one in the firewall plugin and one in the portmap plugin.

Unfortunately, because we're handing off control to the chain of CNI plugins at this point, there's not really a good way for us to handle this condition inside Nomad itself. We don't have enough information at the call site to determine what's happened. Any error message that bubbles up from the plugin comes from the iptables call and those aren't really part of the CNI spec API contract. And once we get into this state, no matter how many times we retry we'll get the same result; the only way Nomad can handle this is by giving up on the allocation and starting over.

I dug into the code for the CNI plugins and found that portmap claims to create its iptables chain idempotently but has a TOCTOU race condition. The firewall plugin doesn't even make this claim. I've verified my theory above by running CNI plugins built from the following patch, which is, uh, "not production ready" 😀 With this patch, the jobs run every time.

diff --git a/plugins/meta/firewall/iptables.go b/plugins/meta/firewall/iptables.go
index faae35c..ded69ce 100644
--- a/plugins/meta/firewall/iptables.go
+++ b/plugins/meta/firewall/iptables.go
@@ -43,7 +43,9 @@ func ensureChain(ipt *iptables.IPTables, table, chain string) error {
                }
        }

-       return ipt.NewChain(table, chain)
+       // DEBUG(tgross): ignore errors
+       ipt.NewChain(table, chain)
+       return nil
 }

 func generateFilterRule(privChainName string) []string {
diff --git a/plugins/meta/portmap/chain.go b/plugins/meta/portmap/chain.go
index bca8214..129e1c0 100644
--- a/plugins/meta/portmap/chain.go
+++ b/plugins/meta/portmap/chain.go
@@ -41,9 +41,8 @@ func (c *chain) setup(ipt *iptables.IPTables) error {
                return err
        }
        if !exists {
-               if err := ipt.NewChain(c.table, c.name); err != nil {
-                       return err
-               }
+               // DEBUG(tgross): ignore errors
+               ipt.NewChain(c.table, c.name)
        }

        // Add the rules to the chain

I'm going to try to work this up into a production-ready patch for submission to the CNI project.
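One shape a production-ready version could take (a sketch only, assuming the github.com/coreos/go-iptables API that the plugins already use) is to attempt the create unconditionally and, if it errors, re-check whether the chain now exists before treating the error as fatal:

package cni

import (
    "github.com/coreos/go-iptables/iptables"
)

// ensureChainIdempotent sketches an idempotent "create chain if missing" that
// tolerates a concurrent creator. Rather than check-then-create (the TOCTOU
// window described above), it creates first and only fails if the chain is
// still missing afterwards.
func ensureChainIdempotent(ipt *iptables.IPTables, table, chain string) error {
    err := ipt.NewChain(table, chain)
    if err == nil {
        return nil
    }
    // Another process may have won the race; if the chain is present, we're done.
    chains, listErr := ipt.ListChains(table)
    if listErr == nil {
        for _, c := range chains {
            if c == chain {
                return nil
            }
        }
    }
    return err
}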

Also, while the iptables rules for each task group are specific to an allocation, the chains are a client-global resource. So from the Nomad side we could create the nomad bridge and all expected CNI iptables chains on startup, rather than risking concurrent creation by CNI and playing whack-a-mole with their implementation (sketched below). There's probably some design discussion to be had about whether that's overreaching in terms of changing the client's state without being explicitly asked, but I'd lean towards doing it. (See #6618 for further discussion.)
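A sketch of what that client-startup pre-creation could look like, reusing the idempotent helper sketched above; the chain names here are just the ones observed in the execsnoop traces, not the full set the plugins expect:

package cni

import "github.com/coreos/go-iptables/iptables"

// preCreateCNIChains sketches creating the client-global iptables chains once
// at Nomad client startup so that concurrent CNI plugin invocations never race
// to create them. A real implementation would also create the bridge and cover
// every chain the firewall and portmap plugins manage.
func preCreateCNIChains() error {
    ipt, err := iptables.New()
    if err != nil {
        return err
    }
    for _, tc := range []struct{ table, chain string }{
        {"nat", "CNI-HOSTPORT-SETMARK"},
        {"nat", "CNI-HOSTPORT-MASQ"},
    } {
        if err := ensureChainIdempotent(ipt, tc.table, tc.chain); err != nil {
            return err
        }
    }
    return nil
}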

cc @nickethier @shoenig and @schmichael as a heads up and for discussion


tgross commented Nov 4, 2019

I've opened containernetworking/plugins#408 for the upstream fix.


tgross commented Nov 15, 2019

containernetworking/plugins#408 was merged but just missed 0.8.3, so we'll want to look out for their 0.8.4 release and then update the documentation for our Connect integration.

I'll keep this issue open until that's been released and we have our docs updated.

tgross added this to the near-term milestone Nov 15, 2019

djenriquez commented Nov 15, 2019 via email


tgross commented Nov 15, 2019

Right. No changes to Nomad.

tgross added the theme/dependencies label Nov 18, 2019

tgross commented Jan 9, 2020

https://github.com/containernetworking/plugins/releases/tag/v0.8.4 was just released. I'm going to get that into testing and then we'll update the docs to recommend that version.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2022