wireguard not populating with connections to other nodes #3287

Closed
iameli opened this issue May 7, 2021 · 3 comments

iameli commented May 7, 2021

Environmental Info:
K3s Version:

k3s version v1.20.6+k3s1 (8d043282)
go version go1.15.10

Node(s) CPU architecture, OS, and Version:

Linux dp2811 5.4.0-72-generic #80-Ubuntu SMP Mon Apr 12 17:35:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
Four servers at the moment I'm hitting this problem (I'm in the process of spinning up the servers). The config looks like:

ExecStart=/usr/local/bin/k3s server \
  --disable=traefik \
  --kube-apiserver-arg=feature-gates='ServiceTopology=true,EndpointSlice=true,EndpointSliceProxying=true' \
  --cluster-cidr="10.42.0.0/16" \
  --service-cidr="10.43.0.0/16" \
  --tls-san=mdw-admin.livepeer.engineering \
  --default-local-storage-path=/home/data \
  --token="[redacted]" \
  --disable servicelb \
  --flannel-backend=wireguard \
  --node-external-ip=143.244.61.205

Describe the bug:
I had a single-server k3s cluster with a wireguard backend running last night. I recently tried adding additional servers to it. The additional servers came online, and they have wireguard connections to each other. Here's server 3.3.3.3 successfully connecting to 2.2.2.2 and 4.4.4.4:

> wg show
interface: flannel.1
  public key: BKyc3q6MpDFaVqLYrPU3NX7kmr9RahhgQ7JvYV0XFSg=
  private key: (hidden)
  listening port: 51820

peer: XjnFZcY1o/sbom6/8Z6SSuWbq0cbdMa/w4DOWC5q8Do=
  endpoint: 4.4.4.4:51820
  allowed ips: 10.42.1.0/24
  latest handshake: 25 seconds ago
  transfer: 6.11 KiB received, 4.30 KiB sent
  persistent keepalive: every 25 seconds

peer: pWv5a3iIfdURa/wlPK5wivy9KleCgeWL//ZJ2eAFbyY=
  endpoint: 2.2.2.2:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 50 seconds ago
  transfer: 6.11 KiB received, 4.30 KiB sent
  persistent keepalive: every 25 seconds

The original server supposedly has all of this configuration, but none of the connections are open:

> wg show
interface: flannel.1
  public key: OeFOZblQVwEkYBEwRGx0cefR+ChN+KNYM1vJjYs70w0=
  private key: (hidden)
  listening port: 51820

peer: MwE3kq82mJGbBc55suKWQNq1+Tn5/DjHCFp05BrmalI=
  endpoint: 4.4.4.4:51820
  allowed ips: (none)
  transfer: 0 B received, 78.91 KiB sent
  persistent keepalive: every 25 seconds

peer: BECrn2WJkNGay50K405E7B7OT3TKoRevg9xZw5xupwc=
  endpoint: 2.2.2.2:51820
  allowed ips: (none)
  transfer: 0 B received, 78.48 KiB sent
  persistent keepalive: every 25 seconds

peer: BKyc3q6MpDFaVqLYrPU3NX7kmr9RahhgQ7JvYV0XFSg=
  endpoint: 3.3.3.3:51820
  allowed ips: 10.42.3.0/24
  transfer: 0 B received, 78.05 KiB sent
  persistent keepalive: every 25 seconds

I tried to fix this with a systemctl restart k3s, and now nothing shows up at all on the 1.1.1.1 server:

> wg show
interface: flannel.1
  public key: xOsXaDV5NZx0ZBbpuj9TGvwrFBYHJKs3f3bf0p5i534=
  private key: (hidden)
  listening port: 51820
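
For anyone debugging the same thing: flannel is embedded in the k3s process rather than running as a separate service, so my guess is that any peer-registration errors would land in the k3s journal, along these lines (I haven't captured that output here):

> journalctl -u k3s --since "10 minutes ago" | grep -iE 'flannel|wireguard'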

Steps To Reproduce:
Not sure yet.

Expected behavior:
Established WireGuard connections between all nodes; or, presumably, k3s should fix up the local networking environment on its own.

Actual behavior:
No connections, all traffic to other nodes gets blackholed.

EDIT: Trying to diagnose further... it looks like the other servers in the cluster can't contact each other either, even though the wireguard connections between them are up. Flannel is clearly having problems here; I'm not sure why yet.

k3s check-config says I'm okay, but it does come up with this... aren't those routes the ones that k3s itself created?

System:
- /usr/sbin iptables v1.8.4 (legacy): ok
- swap: should be disabled
- routes: default CIDRs 10.42.0.0/16 or 10.43.0.0/16 already routed
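
To sanity-check whether those are k3s's own routes, something like this should show which routes exist for those CIDRs and which interfaces they point at; my assumption is the checker is just seeing the flannel/cni routes that k3s itself added:

> ip route show | grep -E '10\.4[23]\.'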

EDIT 2: Seemingly resolved with a rolling systemctl restart k3s across all affected servers... the routes came back one at a time. Interesting.
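
For anyone wanting to try the same fix, the rolling restart amounts to roughly this, one server at a time (hypothetical hostnames, and the 30-second pause is just a guess at how long flannel needs to re-register its peers):

for host in server1 server2 server3 server4; do
  ssh "$host" 'sudo systemctl restart k3s'
  sleep 30
  ssh "$host" 'sudo wg show flannel.1 | grep "allowed ips"'   # peers should list real /24s again, not "(none)"
done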


ieugen commented May 7, 2021

We have also encountered issues when adding new nodes with wireguard via ansible.
We believe that changing the wireguard keys (the ansible role does that on update) breaks the k3s connections somehow.

Restarting the servers fixed the issue, but we are planning to migrate away from wireguard to a local private network.
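
If key rotation is indeed the cause, one way to confirm it (a sketch with hypothetical node names, not something we've verified) would be to compare a node's current interface key against the peer entries the other nodes still have configured; after a rotation the stale key would still be listed as a peer:

ssh node1 'sudo wg show flannel.1 public-key'
ssh node2 'sudo wg show flannel.1 peers'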


iameli commented May 11, 2021

@ieugen Interesting. I'm also using Ansible, but with my own playbooks, not https://github.com/k3s-io/k3s-ansible or anything like that. I wonder if it has to do with restarting k3s at the wrong time? Our playbook restarts the k3s service two or three times over the course of a run; that could be the issue.
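
If the repeated restarts are the problem, waiting for the node to report Ready again before the playbook moves on might avoid the race (untested sketch, using the dp2811 hostname from above):

sudo systemctl restart k3s
kubectl wait --for=condition=Ready node/dp2811 --timeout=120s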


stale bot commented Nov 7, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

stale bot added the status/stale label Nov 7, 2021
stale bot closed this as completed Nov 21, 2021