
embedded etcd should not accept connections from other nodes while resetting or restoring #5693

Closed
brandond opened this issue Jun 14, 2022 · 1 comment

@brandond
Member

We should configure etcd to only listen on loopback while resetting/restoring to avoid having other nodes or the apiserver connect to it while it's in the process of being reconfigured.
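For illustration only: the behaviour being asked for is roughly what a standalone etcd does when started with loopback-only listen URLs. The flags below are standard etcd flags with etcd's default ports, shown as a sketch of the intent rather than k3s's actual internal configuration:

# sketch: an etcd member that only accepts connections from the local host
etcd \
  --listen-client-urls http://127.0.0.1:2379 \
  --advertise-client-urls http://127.0.0.1:2379 \
  --listen-peer-urls http://127.0.0.1:2380

While the listeners are bound this way, other cluster members and the apiserver cannot reach the member, which is the desired state while its data directory is being reset or restored.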

@mdrahman-suse

Validated on v1.24.2-rc1+k3s1

Environment Details

Infrastructure

  • Cloud (AWS)
  • Hosted

Node(s) CPU architecture, OS, and Version:

"Ubuntu 20.04 LTS"
Linux 5.4.0-1009-aws #9-Ubuntu SMP Sun Apr 12 19:46:01 UTC 2020 x86_64 GNU/Linux

Cluster Configuration:

3 servers, 1 agent

Config.yaml:

# server 1
cluster-init: true
token: faketoken
# servers 2 & 3 and agents
server: https://$SERVER1_IP:6443
token: faketoken

Additional files

# np.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-nodeport-deployment
spec:
  selector:
    matchLabels:
      app: nginx-app-node
  replicas: 4
  template:
    metadata:
      labels:
        app: nginx-app-node
    spec:
      containers:
      - name: nginx
        image: ranchertest/mytestcontainer:unprivileged
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx-app-node
  name: nginx-nodeport-svc
  namespace: default
spec:
  type: NodePort
  ports:
    - port: 8080
      nodePort: 30096
      name: http
  selector:
    app: nginx-app-node

Testing Steps

  1. Install k3s on initial server
  2. Join 2 servers and 1 agent to the cluster
  3. On steps 4-8 below, use server1 only:
  4. Deploy some workloads:
kubectl apply -f np.yaml
kubectl run nginx --image=nginx
  5. Take an etcd snapshot: sudo k3s etcd-snapshot save --snapshot-compress
  6. Deploy another workload that won't be present in the snapshot: kubectl run other --image=nginx
  7. Stop k3s: sudo systemctl stop k3s
  8. Restore the snapshot that was taken (optionally spot-check the embedded etcd's listen addresses during this step, as sketched after this list): sudo k3s server --cluster-reset --cluster-reset-restore-path="/var/lib/rancher/k3s/server/db/snapshots/on-demand-ip-172-31-1-20-1656105010.zip"
  9. (on servers 2 and 3) Stop the k3s process and remove the DB dirs, as directed by the output at the end of the cluster-reset command:
sudo k3s-killall.sh
sudo rm -rf /var/lib/rancher/k3s/server/db/
  10. (on agents) Stop the k3s process on agent 1: sudo k3s-killall.sh
  11. (on server 1) Restart the k3s process: sudo systemctl start k3s
  12. (on servers 2 and 3) Restart the k3s process: sudo systemctl enable k3s --now
  13. (on agents) Restart the k3s process: sudo systemctl enable k3s-agent --now
  14. Deploy another pod: kubectl run new --image=nginx
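While step 8 is running on server 1, the listen addresses of the embedded etcd can be spot-checked directly. This is an optional check, not part of the documented steps, and it assumes etcd is using its default client/peer ports (2379/2380):

# optional: run on server 1 while the cluster-reset (step 8) is in progress
sudo ss -tlnp | grep -E ':2379|:2380'
# with the fix, these sockets should be bound to 127.0.0.1 only while the reset is running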

Replication Results:
Replicated using v1.24.1+k3s1

  • On step 8, during the cluster-reset, the command itself emits repeated warnings (log spam) about peer servers:
{"level":"warn","ts":"2022-06-24T21:55:53.620Z","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"803bdcc5209cca5","remote-peer-id-stream-handler":"803bdcc5209cca5","remote-peer-id-from":"53710951b5b53282","cluster-id":"a6b7ec2dfd49b8d1"}
  • Looking at the k3s logs (journalctl -u k3s -f) on another server while the initial server is performing the cluster-reset (step 8) also shows log spam about peer servers:
Jun 24 21:55:33 ip-172-31-1-170 k3s[5184]: {"level":"info","ts":"2022-06-24T21:55:33.465Z","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"803bdcc5209cca5"}
Jun 24 21:55:33 ip-172-31-1-170 k3s[5184]: {"level":"warn","ts":"2022-06-24T21:55:33.563Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"803bdcc5209cca5","error":"failed to dial 803bdcc5209cca5 on stream Message (peer 803bdcc5209cca5 failed to find local node 74fca54f4d84b784)"}
  • After performing all of the steps through step 13, all of the nodes are back in the Ready state and rejoined to the cluster, and the workloads from np.yaml and the nginx pod are running. The other pod is absent, as expected, since it was created after the snapshot was taken.
  • After performing step 14, the new pod shows up as Running (along with everything else) in kubectl get pod on any of the three server nodes.

Validation Results:
Validated using v1.24.2-rc1+k3s1

The two observations below indicate that the issue is resolved:

  • On step 8, during the cluster-reset, the command no longer emits the peer-server log spam.
  • Looking at the k3s logs on another server while the initial server is performing the cluster-reset (step 8) likewise no longer shows peer-server log spam (an equivalent peer-port probe is sketched below).
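An equivalent check that does not rely on the logs is to probe server 1's etcd peer port from one of the other servers while the reset is running. This is a hypothetical spot-check assuming the default peer port 2380 and the $SERVER1_IP variable from the config.yaml above:

# run from server 2 or 3 while server 1 is mid-reset (step 8)
nc -vz $SERVER1_IP 2380
# with the fix, the connection should be refused or time out, since etcd is listening on loopback only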

The two observations below indicate that cluster-reset continues to function properly through the full flow:

  • After performing all of the steps through step 13, all of the nodes are back in the Ready state and rejoined to the cluster, and the workloads from np.yaml and the nginx pod are running. The other pod is absent, as expected, since it was created after the snapshot was taken.
  • After performing step 14, the new pod shows up as Running (along with everything else) in kubectl get pod on any of the three server nodes.
