-
Notifications
You must be signed in to change notification settings - Fork 673
Fix incorrect xfrm policy being deleted by ipsec.Destroy(). #3669
Conversation
Added a check to make sure the xfrm policy has not yet been updated to a new spi. The ipsec mutex is held at this point so nothing else should be modifying this policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hpdvanwyk thanks for the PR. Left a comment please check.
if err != nil { | ||
ipsec.log.Warnf("ipsec: xfrm policy get (%s, %s, 0x%x) failed: %s", localIP, remoteIP, outSPIInfo.spi, err) | ||
} else { | ||
if len(policy.Tmpls) == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be right to loop through the policy templates and pick the matching one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was unsure about that. It looks like weave only creates policies with one template but a range
would look a little neater.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see comment #3666 (comment) From my understanding of the logs shared in #3666 it ended up creating two connections and creating corresponding two policies. Eventually one connections is deleted (connection shutting down due to error: Multiple connections to 56:ff:d7:91:ac:16(192.168.128.10) added to 02:df:bc:77:e3:17(192.168.128.11)
) which triggered deleting xfrm policy which as side effect was deleting the unintended policy as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can see weave does a policy update instead of creating a new policy so there are never two policies for a given src/dst pair. Trying to create two such policies on the terminal fails for me:
[rancher@hostyname0 ~]$ sudo ip xfrm policy list
src 192.168.128.10/32 dst 192.168.128.11/32 proto udp
dir out priority 0 ptype main
mark 0x20000/0x20000
tmpl src 192.168.128.10 dst 192.168.128.11
proto esp spi 0xd3e754be reqid 0 mode transport
[rancher@hostyname0 ~]$ sudo ip xfrm policy add src 192.168.128.10/32 dst 192.168.128.11/32 proto udp dir out mark 0x20000 ptype main tmpl src 192.168.128.10 dst 192.168.128.11 proto esp spi 0x1234 reqid 0 mode transport
RTNETLINK answers: File exists
[rancher@hostyname0 ~]$ sudo ip xfrm policy update src 192.168.128.10/32 dst 192.168.128.11/32 proto udp dir out mark 0x20000 ptype main tmpl src 192.168.128.10 dst 192.168.128.11 proto esp spi 0x1234 reqid 0 mode transport
[rancher@hostyname0 ~]$ sudo ip xfrm policy list
src 192.168.128.10/32 dst 192.168.128.11/32 proto udp
dir out priority 0 ptype main
mark 0x20000/0xffffffff
tmpl src 192.168.128.10 dst 192.168.128.11
proto esp spi 0x00001234 reqid 0 mode transport
As far as I can tell XfrmPolicyDel
and XfrmPolicyGet
can only delete/get a single policy at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some debug statements to make sure of this.
// Destroy outbound
policies, err := netlink.XfrmPolicyList(syscall.AF_INET)
if err != nil {
return errors.Wrap(err, "xfrm policy list")
}
for _, p := range policies {
ipsec.log.Warnf("ipsec: %#v", p)
}
if outSPIInfo, ok := ipsec.spiInfo[outSPIID]; ok {
ipsec.log.Infof("ipsec: destroy: out %s -> %s 0x%x", localIP, remoteIP, outSPIInfo.spi)
policy, err := netlink.XfrmPolicyGet(xfrmPolicy(localIP, remoteIP, outSPIInfo.spi))
if err != nil {
ipsec.log.Warnf("ipsec: xfrm policy get (%s, %s, 0x%x) failed: %s", localIP, remoteIP, outSPIInfo.spi, err)
} else {
if len(policy.Tmpls) == 1 {
if policy.Tmpls[0].Spi == int(outSPIInfo.spi) {
if err := netlink.XfrmPolicyDel(xfrmPolicy(localIP, remoteIP, outSPIInfo.spi)); err != nil {
ipsec.log.Warnf("ipsec: xfrm policy del (%s, %s, 0x%x) failed: %s", localIP, remoteIP, outSPIInfo.spi, err)
}
} else {
ipsec.log.Debugf("ipsec: xfrm not my policy (%s, %s, 0x%x) got 0x%x ", localIP, remoteIP, outSPIInfo.spi, policy.Tmpls[0].Spi)
}
}
}
INFO: 2019/07/30 15:36:25.657140 Setting up IPsec between 02:df:bc:77:e3:17(192.168.128.11) and 56:ff:d7:91:ac:16(192.168.128.10)
INFO: 2019/07/30 15:36:25.657207 ipsec: InitSALocal: 192.168.128.10 -> 192.168.128.11 :6784 0x50c4c7a3
[192.168.128.10:6783|56:ff:d7:91:ac:16(192.168.128.10)]: connection shutting down due to error: Multiple connections to 56:ff:d7:91:ac:16(192.168.128.10) added to 02:df:bc:77:e3:17(192.168.128.11)
INFO: 2019/07/30 15:36:25.668534 Destroying IPsec between 02:df:bc:77:e3:17(192.168.128.11) and 56:ff:d7:91:ac:16(192.168.128.10)
INFO: 2019/07/30 15:36:25.684302 fastdp ->[192.168.128.10:6784|56:ff:d7:91:ac:16(192.168.128.10)]: IPSec init SA remote
INFO: 2019/07/30 15:36:25.684318 ipsec: InitSARemote: 192.168.128.11 -> 192.168.128.10 :6784 0x2bc7201
INFO: 2019/07/30 15:36:25.684446 ipsec: destroy: in 192.168.128.10 -> 192.168.128.11 0x189a56b9
WARN: 2019/07/30 15:36:25.693006 ipsec: netlink.XfrmPolicy{Dst:(*net.IPNet)(0xc4217e5e90), Src:(*net.IPNet)(0xc4217e5ec0), Proto:0x11, DstPort:0, SrcPort:0, Dir:0x1, Priority:0, Index:14825, Mark:(*netlink.XfrmMark)(0xc4219e9878), Tmpls:[]netlink.XfrmPolicyTmpl{netlink.XfrmPolicyTmpl{Dst:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc0, 0xa8, 0x80, 0xa}, Src:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc0, 0xa8, 0x80, 0xb}, Proto:0x32, Mode:0x0, Spi:45904385, Reqid:0}}}
INFO: 2019/07/30 15:36:25.693033 ipsec: destroy: out 192.168.128.11 -> 192.168.128.10 0x4a6e24ae
DEBU: 2019/07/30 15:36:25.693074 ipsec: xfrm not my policy (192.168.128.11, 192.168.128.10, 0x4a6e24ae) got 0x2bc7201
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can see weave does a policy update instead of creating a new policy so there are never two policies for a given src/dst pair.
Ok. I checked the code. You are right. Its just update.
From what I see root cause is connection shutting down due to error: Multiple connections to 56:ff:d7:91:ac:16(192.168.128.10) added to 02:df:bc:77:e3:17(192.168.128.11)
so there are two connections gets established between peers. Resulting in two xfrm policies attempted to created/updated. Eventually one version of policy (corresponding to a connection) overwrites the other. When redundant connection is shutdown corresponding xfrm policy is getting destroyed.
It seems to me its still possible we may end up with no xfrm policy. This can happen if we have xfrm policy in place corresponds to connection that is getting shutdown, in which case policy gets deleted due to matching spi.
Will check in case of multiple connection which of the connection (or order) is dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like XfrmPolicyUpdate
calls xfrmPolicyAddOrUpdate
internally so I doubt we could end up without a policy. I'm pretty sure the case where the first connection is completely shut down when the second connection is created was the working case before my changes.
I rebased and CI ran successfully at https://circleci.com/workflow-run/6907a8e2-c8dc-4fcf-bd6c-303932e5c2ed Is this ready to merge? |
As far as I we concerned it should be ready to merge. We have been running it in our environment without issues since I made the pull request. |
Thanks! |
Added a check to make sure the xfrm policy has not yet been updated to
a new spi. The ipsec mutex is held at this point so nothing else should
be modifying this policy.
Fixes #3666
Fix was tested for 12 hours by constantly connecting and disconnecting network interfaces. The "xfrm not my policy" debug log was hit several times.