
embedded etcd should not accept connections from other nodes while resetting or restoring #5693

Closed
brandond opened this issue Jun 14, 2022 · 1 comment

@brandond
Member

We should configure etcd to only listen on loopback while resetting/restoring to avoid having other nodes or the apiserver connect to it while it's in the process of being reconfigured.
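For illustration only: the behaviour being asked for is roughly what a standalone etcd does when started with loopback-only listen URLs. The flags below are standard etcd flags with etcd's default ports, shown as a sketch of the intent rather than k3s's actual internal configuration:

# sketch: an etcd member that only accepts connections from the local host
etcd \
  --listen-client-urls http://127.0.0.1:2379 \
  --advertise-client-urls http://127.0.0.1:2379 \
  --listen-peer-urls http://127.0.0.1:2380

While the listeners are bound this way, other cluster members and the apiserver cannot reach the member, which is the desired state while its data directory is being reset or restored.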

@mdrahman-suse

Validated on v1.24.2-rc1+k3s1

Environment Details

Infrastructure

  • Cloud (AWS)
  • Hosted

Node(s) CPU architecture, OS, and Version:

"Ubuntu 20.04 LTS"
Linux 5.4.0-1009-aws #9-Ubuntu SMP Sun Apr 12 19:46:01 UTC 2020 x86_64 GNU/Linux

Cluster Configuration:

3 servers, 1 agent

Config.yaml:

# server 1
cluster-init: true
token: faketoken
# servers 2 & 3 and agents
server: https://$SERVER1_IP:6443
token: faketoken

Additional files

# np.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-nodeport-deployment
spec:
  selector:
    matchLabels:
      app: nginx-app-node
  replicas: 4
  template:
    metadata:
      labels:
        app: nginx-app-node
    spec:
      containers:
      - name: nginx
        image: ranchertest/mytestcontainer:unprivileged
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx-app-node
  name: nginx-nodeport-svc
  namespace: default
spec:
  type: NodePort
  ports:
    - port: 8080
      nodePort: 30096
      name: http
  selector:
    app: nginx-app-node

Testing Steps

  1. Install k3s on initial server
  2. Join 2 servers and 1 agent to the cluster
  3. On steps 4-8 below, use server1 only:
  4. Deploy some workloads:
kubectl apply -f np.yaml
kubectl run nginx --image=nginx
  5. Take an etcd snapshot: sudo k3s etcd-snapshot save --snapshot-compress
  6. Deploy another workload that won't be present in the snapshot: kubectl run other --image=nginx
  7. Stop k3s: sudo systemctl stop k3s
  8. Restore the snapshot that was taken (optionally spot-check the embedded etcd's listen addresses during this step, as sketched after this list): sudo k3s server --cluster-reset --cluster-reset-restore-path="/var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-172-31-1-20-1656105010.zip"
  9. (on servers 2 and 3) Stop the k3s process and remove the DB dirs, as directed by the output at the end of the cluster-reset command:
sudo k3s-killall.sh
sudo rm -rf /var/lib/rancher/k3s/server/db/
  10. (on agents) Stop the k3s process on agent 1: sudo k3s-killall.sh
  11. (on server 1) Restart the k3s process: sudo systemctl start k3s
  12. (on servers 2 and 3) Restart the k3s process: sudo systemctl enable k3s --now
  13. (on agents) Restart the k3s process: sudo systemctl enable k3s-agent --now
  14. Deploy another pod: kubectl run new --image=nginx
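While step 8 is running on server 1, the listen addresses of the embedded etcd can be spot-checked directly. This is an optional check, not part of the documented steps, and it assumes etcd is using its default client/peer ports (2379/2380):

# optional: run on server 1 while the cluster-reset (step 8) is in progress
sudo ss -tlnp | grep -E ':2379|:2380'
# with the fix, these sockets should be bound to 127.0.0.1 only while the reset is running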

Replication Results:
Replicated using v1.24.1+k3s1

  • On step 8, during the cluster-reset, the command itself emits repeated warnings (log spam) about peer servers:
{"level":"warn","ts":"2022-06-24T21:55:53.620Z","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"803bdcc5209cca5","remote-peer-id-stream-handler":"803bdcc5209cca5","remote-peer-id-from":"53710951b5b53282","cluster-id":"a6b7ec2dfd49b8d1"}
  • Looking at the k3s logs (journalctl -u k3s -f) on another server while the initial server is performing the cluster-reset (step 8) also shows log spam about peer servers:
Jun 24 21:55:33 ip-172-31-1-170 k3s[5184]: {"level":"info","ts":"2022-06-24T21:55:33.465Z","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"803bdcc5209cca5"}
Jun 24 21:55:33 ip-172-31-1-170 k3s[5184]: {"level":"warn","ts":"2022-06-24T21:55:33.563Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"803bdcc5209cca5","error":"failed to dial 803bdcc5209cca5 on stream Message (peer 803bdcc5209cca5 failed to find local node 74fca54f4d84b784)"}
  • After performing all of the steps through step 13, all of the nodes are back in the Ready state and rejoined to the cluster, and the workloads from np.yaml and the nginx pod are running. The other pod is absent, as expected, since it was created after the snapshot was taken.
  • After performing step 14, the new pod shows up as Running (along with everything else) in kubectl get pod on any of the three server nodes.

Validation Results:
Validated using v1.24.2-rc1+k3s1

The two observations below indicate that the issue is resolved:

  • On step 8, during the cluster-reset, the command no longer emits the peer-server log spam.
  • Looking at the k3s logs on another server while the initial server is performing the cluster-reset (step 8) likewise no longer shows peer-server log spam (an equivalent peer-port probe is sketched below).
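An equivalent check that does not rely on the logs is to probe server 1's etcd peer port from one of the other servers while the reset is running. This is a hypothetical spot-check assuming the default peer port 2380 and the $SERVER1_IP variable from the config.yaml above:

# run from server 2 or 3 while server 1 is mid-reset (step 8)
nc -vz $SERVER1_IP 2380
# with the fix, the connection should be refused or time out, since etcd is listening on loopback only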

The two observations below indicate that cluster-reset continues to function properly through the full flow:

  • After performing all of the steps through step 13, all of the nodes are back in the Ready state and rejoined to the cluster, and the workloads from np.yaml and the nginx pod are running. The other pod is absent, as expected, since it was created after the snapshot was taken.
  • After performing step 14, the new pod shows up as Running (along with everything else) in kubectl get pod on any of the three server nodes.
