Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calico unconditionally drops VXLAN packets that are not related to Calico #6752

Open
yoheiueda opened this issue Sep 22, 2022 · 17 comments · May be fixed by #7606
Open

Calico unconditionally drops VXLAN packets that are not related to Calico #6752

yoheiueda opened this issue Sep 22, 2022 · 17 comments · May be fixed by #7606

Comments

@yoheiueda
Copy link

I set up a Kubernetes cluster using Calico with the following configuration.

  • VXLAN mode: Always
  • VXLAN port: 4789 (default)
  • VXLAN ID (VNI): 4096 (default)

I also configured a VXLAN tunnel with the same port number but a different VXLAN ID.

Then, Calico drops all packets for the second VXLAN tunnel, even though I am using a different VXLAN ID for it.

Expected Behavior

Calico does not interfere packets for the second VXLAN tunnel.

Current Behavior

Calico drops all packets for the second VXLAN tunnel, since Calico adds the following iptables rule.

-A cali-INPUT -p udp -m comment --comment "cali:c3JpWekSTKWcdGH0" -m comment --comment "Drop IPv4 VXLAN packets from non-whitelisted hosts" -m multiport --dports 4789 -m addrtype --dst-type LOCAL -j DROP

This rule is generated here.

Rule{
Match: Match().ProtocolNum(ProtoUDP).
DestPorts(uint16(r.Config.VXLANPort)).
DestAddrType(AddrTypeLocal),
Action: DropAction{},
Comment: []string{"Drop IPv4 VXLAN packets from non-whitelisted hosts"},
},

Possible Solution

Revert cce5446 if possible. The commit was introduced to fix errors that occur when the u32 extension is used with iptables-nft.

According to this comment moby/moby#43382 (comment), iptables-nft also supports the u32 extension, when the xt_u32 kernel module is loaded. This kernel module is available by Default in Ubuntu, and available in the kernel-modules-extra package of RHEL.

If revering the commit is not possible, providing it as an optional feature is another possible solution.

Steps to Reproduce (for bugs)

  1. Set up Kubernetes with Calico as described above
  2. Establish a VXLAN tunnel using the ip command at a worker node
  3. Send PING packets from the other end of the VXLAN tunnel

Context

Your Environment

  • Calico version: v3.24.1
  • Orchestrator version: Kubernetes v1.25.1
  • Operating System and version: Ubuntu 20.04.3 LTS
@mgleung
Copy link
Contributor

mgleung commented Oct 18, 2022

Hey @yoheiueda , I'm trying to understand your use case a bit better here, but why do you need a second VXLAN tunnel? I don't think it's a use case we normally see so trying to get a better grasp on your issue.

@yoheiueda
Copy link
Author

@mgleung

I am working on the development of a new mechanism called Peer Pod VMs. With this mechanism, a new VM instance is created per a pod, and a VXLAN tunnel is established between the worker node VM and the new VM instance for the pod.

Please see this architecture diagram.
https://github.com/confidential-containers/cloud-api-adaptor#architecture

@lwr20
Copy link
Member

lwr20 commented Nov 15, 2022

This behaviour was a security fix - one way you might try to inject packets into a (vxlan) pod network would be to send vxlan packets direct to a node, so I don't think we can back it out completely.

However, I guess it might be possible to make the rule more specific - only dropping vxlan packets that match the VNI that Calico's using?

A possible option without a code change - as the rule says, it drops packets from host IPs it doesn't know about. It may be possible to add your IPs to the list? Using the externalNodesCIDRList in felixconfig

@lwr20
Copy link
Member

lwr20 commented Nov 15, 2022

It wasn't clear to me from your diagram if the VXLAN packets were coming from a pod or not. If so, you might use allowVXLANPacketsFromWorkloads (in FelixConfig) to permit that

@yoheiueda
Copy link
Author

@lwr20 Thank you very much for the info. I'll try externalNodesCIDRList as a work around of this issue.

This behaviour was a security fix - one way you might try to inject packets into a (vxlan) pod network would be to send vxlan packets direct to a node, so I don't think we can back it out completely.

Yes, I understand the purpose of the feature. I am saying that the commit cce5446 I mentioned previously was not a security fix or a security enhancement.

Each VXLAN tunnel has a VXLAN ID (VNI), so we can distinguish multiple VXLAN tunnels in a single system. Calico uses 4096 for VXLAN ID, so VXLAN tunnels other than 4096 are not relevant to Calico.

The original Calico implementation respected VXLAN ID, and did not interfere VXLAN packets that are not relevant to Calico. The commit cce5446 has removed the condition that checks VXLAN IDs, and unconditionally drops VXLAN packets that are not relevant to Calico.

I think this behavior is overkill, and should be fixed not to interfere VXLAN packets that are not relevant to Calico.

It wasn't clear to me from your diagram if the VXLAN packets were coming from a pod or not. If so, you might use allowVXLANPacketsFromWorkloads (in FelixConfig) to permit that

From the viewpoint from Calico, VXLAN packets are not coming from a pod. VXLAN tunnels used in my project is not relevant to Calico network, so Calico should not interfere them.

@lwr20
Copy link
Member

lwr20 commented Nov 16, 2022

Calico uses 4096 for VXLAN ID

Doesn't change any of your points, but just to note that this is configurable.

@caseydavenport
Copy link
Member

caseydavenport commented Dec 5, 2022

Yes, I think we should do better here. The linked commit wasn't a security fix, although the dropping of VXLAN packets from unknown hosts is intended for security reasons. The linked commit was to fix an issue where we were using an iptables match criteria that was causing crashes on a number of systems.

We should reinstate it behind feature auto-detection and/or a configuration option in order to enable use-cases like this one with multiple VXLAN networks.

I believe Felix already has a featuregate / auto-detection mechanism we might be able to leverage for this.

@corhere
Copy link

corhere commented Mar 8, 2023

I've been working on a fix for moby/moby#43382 and noticed the mention link back to here. RedHat decided to deprecate the xt_u32 kernel module in RHEL 8 and remove it in RHEL 9 as a way to nudge users towards migrating to native nftables. This is an arbitrary limitation imposed by RedHat; iptables-nft is compatible with all legacy (read: xtables) packet matches which are available for the running kernel. (Though to be honest, I can't entirely fault them. The u32 expression language is not the most pleasant to work with and is lacking a facility to locate data relative to the start of the L4 header on IPv6 packets containing extension headers, whereas nftables supports it natively using @th raw payload expressions.) Notably, RedHat has not deprecated the xt_bpf match as far as I can tell. So I wrote a drop-in replacement for Moby's VXLAN VNI match expression using a BPF filter program. Since the u32 match expression Calico used to use is effectively identical to Moby's, you may be able to leverage the work I've done to resolve this issue.

@lukasstockner
Copy link

I'd like to suggest that this is reclassified as a bug, not an enhancement - I just spent several hours tracking down why our VXLAN setup (separate from Calico) is broken in one deployment, and it turns out that that deployment has a newer Calico version which now drops all the packets.

To me, this seems like a clear regression, even though it might not affect most typical setups.

@caseydavenport
Copy link
Member

Yep. I think that is fair. Reclassified.

@caseydavenport
Copy link
Member

If anyone wants to take a stab at reverting that commit. and adding a feature toggle for whether or not the offending match is used, I'd be very happy to review.

@cyclinder
Copy link
Contributor

I'm interested in this. In a summary, I think these are two ways:

  • Using xt_u32 when xt_u32 module is exists in the host
  • Refer to @corhere 's comment, using xt_bpf

Which one is better? I prefer to option 2, HDYT?

@yoheiueda
Copy link
Author

Just an FYI, @corhere's fix has been merged moby/moby@105b983, and it looks like xt_u32 is tried first, and then xt_bpf is used as a fallback.

@corhere
Copy link

corhere commented Apr 5, 2023

@yoheiueda that's true, but only to avoid switching up required dependencies on unaffected systems in a security patch release. I am currently cooking up a moby/moby PR which drops the xt_u32 version and improves upon the xt_bpf match program. I'm hoping to also add support for IPv6.

@corhere corhere linked a pull request Apr 27, 2023 that will close this issue
3 tasks
@fasaxc
Copy link
Member

fasaxc commented Dec 9, 2024

Looks like nftables has a native vxlan match: https://manpages.debian.org/unstable/nftables/nftables.8.en.html#VXLAN_HEADER_EXPRESSION

@tomastigera
Copy link
Contributor

The problem with Rh is that they are also in a process of dropping iptables and fully switchin gto nftables, which do not support the bpf programs. So it would be a short lived fixed for RH :-(

@caseydavenport
Copy link
Member

Sounds like using the native nftables match and enabling that is the right path forward for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants