This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Fix incorrect xfrm policy being deleted by ipsec.Destroy(). #3669

Merged
merged 1 commit into weaveworks:master
Sep 2, 2019

Conversation

hpdvanwyk

Added a check to make sure the xfrm policy has not yet been updated to
a new spi. The ipsec mutex is held at this point so nothing else should
be modifying this policy.

Fixes #3666
Fix was tested for 12 hours by constantly connecting and disconnecting network interfaces. The "xfrm not my policy" debug log was hit several times.

Contributor

@murali-reddy murali-reddy left a comment


@hpdvanwyk thanks for the PR. I left a comment; please check.

if err != nil {
ipsec.log.Warnf("ipsec: xfrm policy get (%s, %s, 0x%x) failed: %s", localIP, remoteIP, outSPIInfo.spi, err)
} else {
if len(policy.Tmpls) == 1 {
Contributor


I think it would be right to loop through the policy templates and pick the matching one?

Author


I was unsure about that. It looks like weave only creates policies with one template, but a range would look a little neater.

Contributor

@murali-reddy murali-reddy Jul 30, 2019


Please see comment #3666 (comment). From my understanding of the logs shared in #3666, it ended up creating two connections and two corresponding policies. Eventually one connection is deleted (connection shutting down due to error: Multiple connections to 56:ff:d7:91:ac:16(192.168.128.10) added to 02:df:bc:77:e3:17(192.168.128.11)), which triggered deleting its xfrm policy and, as a side effect, deleted the unintended policy as well.

Author


As far as I can see, weave does a policy update instead of creating a new policy, so there are never two policies for a given src/dst pair. Trying to create two such policies from the terminal fails for me:

[rancher@hostyname0 ~]$ sudo ip xfrm policy list
src 192.168.128.10/32 dst 192.168.128.11/32 proto udp 
        dir out priority 0 ptype main 
        mark 0x20000/0x20000
        tmpl src 192.168.128.10 dst 192.168.128.11
                proto esp spi 0xd3e754be reqid 0 mode transport
[rancher@hostyname0 ~]$ sudo ip xfrm policy add src 192.168.128.10/32 dst 192.168.128.11/32 proto udp dir out mark 0x20000         ptype main         tmpl src 192.168.128.10 dst 192.168.128.11         proto esp spi 0x1234 reqid 0 mode transport
RTNETLINK answers: File exists
[rancher@hostyname0 ~]$ sudo ip xfrm policy update src 192.168.128.10/32 dst 192.168.128.11/32 proto udp dir out mark 0x20000         ptype main         tmpl src 192.168.128.10 dst 192.168.128.11         proto esp spi 0x1234 reqid 0 mode transport
[rancher@hostyname0 ~]$ sudo ip xfrm policy list
src 192.168.128.10/32 dst 192.168.128.11/32 proto udp 
        dir out priority 0 ptype main 
        mark 0x20000/0xffffffff
        tmpl src 192.168.128.10 dst 192.168.128.11
                proto esp spi 0x00001234 reqid 0 mode transport

As far as I can tell XfrmPolicyDel and XfrmPolicyGet can only delete/get a single policy at a time.

Author


I added some debug statements to make sure of this.

	// Destroy outbound
	policies, err := netlink.XfrmPolicyList(syscall.AF_INET)
	if err != nil {
		return errors.Wrap(err, "xfrm policy list")
	}
	for _, p := range policies {
		ipsec.log.Warnf("ipsec: %#v", p)
	}

	if outSPIInfo, ok := ipsec.spiInfo[outSPIID]; ok {
		ipsec.log.Infof("ipsec: destroy: out %s -> %s 0x%x", localIP, remoteIP, outSPIInfo.spi)
		policy, err := netlink.XfrmPolicyGet(xfrmPolicy(localIP, remoteIP, outSPIInfo.spi))
		if err != nil {
			ipsec.log.Warnf("ipsec: xfrm policy get (%s, %s, 0x%x) failed: %s", localIP, remoteIP, outSPIInfo.spi, err)
		} else {
			if len(policy.Tmpls) == 1 {
				if policy.Tmpls[0].Spi == int(outSPIInfo.spi) {
					if err := netlink.XfrmPolicyDel(xfrmPolicy(localIP, remoteIP, outSPIInfo.spi)); err != nil {
						ipsec.log.Warnf("ipsec: xfrm policy del (%s, %s, 0x%x) failed: %s", localIP, remoteIP, outSPIInfo.spi, err)
					}
				} else {
					ipsec.log.Debugf("ipsec: xfrm not my policy (%s, %s, 0x%x) got 0x%x ", localIP, remoteIP, outSPIInfo.spi, policy.Tmpls[0].Spi)
				}
			}
		}
	}

INFO: 2019/07/30 15:36:25.657140 Setting up IPsec between 02:df:bc:77:e3:17(192.168.128.11) and 56:ff:d7:91:ac:16(192.168.128.10)
INFO: 2019/07/30 15:36:25.657207 ipsec: InitSALocal: 192.168.128.10 -> 192.168.128.11 :6784 0x50c4c7a3
[192.168.128.10:6783|56:ff:d7:91:ac:16(192.168.128.10)]: connection shutting down due to error: Multiple connections to 56:ff:d7:91:ac:16(192.168.128.10) added to 02:df:bc:77:e3:17(192.168.128.11)
INFO: 2019/07/30 15:36:25.668534 Destroying IPsec between 02:df:bc:77:e3:17(192.168.128.11) and 56:ff:d7:91:ac:16(192.168.128.10)
INFO: 2019/07/30 15:36:25.684302 fastdp ->[192.168.128.10:6784|56:ff:d7:91:ac:16(192.168.128.10)]: IPSec init SA remote
INFO: 2019/07/30 15:36:25.684318 ipsec: InitSARemote: 192.168.128.11 -> 192.168.128.10 :6784 0x2bc7201
INFO: 2019/07/30 15:36:25.684446 ipsec: destroy: in 192.168.128.10 -> 192.168.128.11 0x189a56b9
WARN: 2019/07/30 15:36:25.693006 ipsec: netlink.XfrmPolicy{Dst:(*net.IPNet)(0xc4217e5e90), Src:(*net.IPNet)(0xc4217e5ec0), Proto:0x11, DstPort:0, SrcPort:0, Dir:0x1, Priority:0, Index:14825, Mark:(*netlink.XfrmMark)(0xc4219e9878), Tmpls:[]netlink.XfrmPolicyTmpl{netlink.XfrmPolicyTmpl{Dst:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc0, 0xa8, 0x80, 0xa}, Src:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc0, 0xa8, 0x80, 0xb}, Proto:0x32, Mode:0x0, Spi:45904385, Reqid:0}}}
INFO: 2019/07/30 15:36:25.693033 ipsec: destroy: out 192.168.128.11 -> 192.168.128.10 0x4a6e24ae
DEBU: 2019/07/30 15:36:25.693074 ipsec: xfrm not my policy (192.168.128.11, 192.168.128.10, 0x4a6e24ae) got 0x2bc7201 

Contributor


As far as I can see weave does a policy update instead of creating a new policy so there are never two policies for a given src/dst pair.

Ok, I checked the code. You are right; it's just an update.

From what I see, the root cause is connection shutting down due to error: Multiple connections to 56:ff:d7:91:ac:16(192.168.128.10) added to 02:df:bc:77:e3:17(192.168.128.11): two connections get established between the peers, resulting in two xfrm policies being created/updated. Eventually one version of the policy (corresponding to one connection) overwrites the other. When the redundant connection is shut down, the corresponding xfrm policy gets destroyed.

It seems to me it's still possible we may end up with no xfrm policy. This can happen if the xfrm policy in place corresponds to the connection that is being shut down, in which case the policy gets deleted due to the matching SPI.

I will check which connection (or in what order) is dropped in the case of multiple connections.

Author


It looks like XfrmPolicyUpdate calls xfrmPolicyAddOrUpdate internally, so I doubt we could end up without a policy. I'm pretty sure the case where the first connection is completely shut down when the second connection is created was the working case before my changes.

@bboreham
Contributor

I rebased and CI ran successfully at https://circleci.com/workflow-run/6907a8e2-c8dc-4fcf-bd6c-303932e5c2ed

Is this ready to merge?

@hpdvanwyk
Author

As far as we are concerned it should be ready to merge. We have been running it in our environment without issues since I made the pull request.

@bboreham bboreham added this to the 2.6 milestone Sep 2, 2019
@bboreham bboreham merged commit 05ab113 into weaveworks:master Sep 2, 2019
@bboreham
Contributor

bboreham commented Sep 2, 2019

Thanks!


Successfully merging this pull request may close these issues.

Weave reconnect occasionally fails because of missing xfrm policies.
3 participants