
After restore, fleet agent does not communicate back to fleet controller and shows an error in Continuous Delivery in Rancher 2.5.9 #164

Closed
Martin-Weiss opened this issue Oct 5, 2021 · 6 comments

@Martin-Weiss

Martin-Weiss commented Oct 5, 2021

SURE-3497
SURE-3502

We are using Fleet with Rancher 2.5.9 and have created several Git repos that we deploy via Fleet to the local and downstream clusters.

Then we created a rancher-backup with the backup operator, destroyed the cluster, redeployed it, and followed the recovery procedure described at https://rancher.com/docs/rancher/v2.5/en/backups/migrating-rancher/

After this recovery we realized that the gitrepo and helmchartrepo secrets were not restored (see issue #163), and we re-created these secrets manually with kubectl apply -f .yaml.

Then we realized that the fleet agent does not work: it does not communicate back and complains about a missing secret, fleet-system/fleet-agent-bootstrap.

We tried to create this bootstrap secret and were somewhat successful, but even with the agent partially working, we see the cluster as "Cluster: local" with a red "Modified" status, telling us that it does not trust the CA of the Kubernetes cluster. That may well be the correct message, since the cluster was redeployed and the underlying RKE2 cluster really does have a new CA.

Long story short: how can we restore Rancher properly, including all the fleet agents and the required secrets, so that the agents are happy?
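For reference, a minimal sketch of the "re-create the secret with kubectl apply" step described above. The manifest below is only a skeleton: I am assuming the real fleet-agent-bootstrap secret carries registration credentials issued by the Fleet controller, so the actual data has to come from a working installation or from re-registering the agent, not be hand-invented.

```shell
# Skeleton manifest for the secret the agent complains about.
# The data section is intentionally empty -- the real payload is
# issued by the Fleet controller and cannot be made up by hand.
cat > fleet-agent-bootstrap.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: fleet-agent-bootstrap
  namespace: fleet-system
type: Opaque
data: {}
EOF

# Then, against the affected cluster (requires a working kubeconfig):
# kubectl apply -f fleet-agent-bootstrap.yaml
```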

@Martin-Weiss
Author

FYI - it seems that the root cause is the missing fleet-agent secret after the restore and the missing service accounts in the namespace cluster-fleet-local-local-1a3d67d0a899. I could get the fleet-agent for the local cluster running again by creating the missing secret manually, but for the downstream clusters this seems to be more complicated, as the service accounts appear to be missing..?
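A quick read-only diagnostic sketch for the symptoms above, assuming kubectl access to the restored cluster; the namespace name comes from this comment and the secret name from the agent's error message.

```shell
# Check what the restore should have brought back. Degrades
# gracefully when kubectl (or the cluster) is unavailable.
check_fleet_restore() {
  ns="cluster-fleet-local-local-1a3d67d0a899"
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found; run this against the restored cluster"
    return 0
  fi
  # Service accounts the fleet agents register through
  kubectl get serviceaccounts -n "$ns" 2>/dev/null \
    || echo "service accounts/namespace $ns missing"
  # The secret the local fleet-agent complains about
  kubectl get secret fleet-agent-bootstrap -n fleet-system 2>/dev/null \
    || echo "secret fleet-system/fleet-agent-bootstrap missing"
}
check_fleet_restore
```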

@SheilaghM

This issue should be backported to 2.5: rancher/rancher#33954

PRs to backport:

#148
rancher/charts#1431
#138

@StrongMonkey
Contributor

Available to test with backup-operator v1.2.1-rc1. The expected result is that after backup and restore, fleet clusters stay connected to the Rancher server and all clusters are active. Users should be able to use fleet after the restore.

@StrongMonkey
Contributor

@sowmyav27 Assigning to you for now. It depends on whom you want to delegate to, either from red team QA or QA who has done backup testing before (@anupama2501).

@sgapanovich

sgapanovich commented Nov 19, 2021

Was able to reproduce the issues with the fleet agents and controller.
I have a 2.5.11 Rancher with 2 downstream clusters:

  1. created 3 Git repos using:
    • public repo
    • private repo
    • helm chart which needs helmsecret
  2. bundles were created and everything deployed as expected
  3. created a backup using the 1.2.0 backup operator
  4. migrated to a new cluster
    • "Unauthorized" errors logged in fleet agents in downstream clusters
    • level=error msg="Failed to register agent: looking up secret fleet-system/fleet-agent-bootstrap: looking up secret fleet-system/fleet-agent-bootstrap: secrets \"fleet-agent-bootstrap\" not found" logged in the fleet agent on my local cluster
    • level=error msg="error syncing 'fleet-local/helm': handler gitjobs: failed to look up helmSecretName, error: secrets \"helm-secret\" not found, requeuing" and level=error msg="error syncing 'fleet-default/helm': handler gitjobs: failed to look up helmSecretName, error: secrets \"helm-secret\" not found, requeuing" logged in the fleet controller on my local cluster
  5. tried to push some code to one of the repos used by fleet (the public one), and the Git Repo with the Bundle went into "Wait Applied" status
    (screenshots attached: repos, fleet downstream, fleet local, fleet controller)
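The helm-secret errors in step 4 can presumably be cleared by re-creating the secret in both namespaces named in the controller's error messages. A hedged sketch: the basic-auth keys below are an assumption about what the chart repo expects, and the placeholder values must be replaced with the real credentials.

```shell
# Recreate the secret referenced by helmSecretName in both namespaces
# mentioned in the fleet-controller errors (fleet-local, fleet-default).
for ns in fleet-local fleet-default; do
  cat > "helm-secret-${ns}.yaml" <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: helm-secret
  namespace: ${ns}
type: kubernetes.io/basic-auth
stringData:
  username: PLACEHOLDER_USER  # replace with the real repo credentials
  password: PLACEHOLDER_PASS
EOF
done

# kubectl apply -f helm-secret-fleet-local.yaml -f helm-secret-fleet-default.yaml
```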

@sgapanovich

Test Environment:

Rancher version: v2.5-a3b524e9d00408bf8da0e46fe5f9f127d7fddd20-head
Rancher cluster type: HA
Docker version: 20.10
Backup operator: backup-restore-operator:v1.2.1-rc1
Fleet agent: fleet-agent:v0.3.5

Downstream cluster type:

  • cluster 1: 3 nodes, EC2 (RKE1)
  • cluster 2: 3 nodes, DO (RKE1)

Downstream K8s version: v1.20.12


Testing:

  • Gitrepos used:
    • public git repo A with a deployment which does not require a helm secret
      • added to the local and downstream clusters (fleet-local and fleet-default namespaces)
    • private git repo B with a deployment which does not require a helm secret
      • added to the local and downstream clusters (fleet-local and fleet-default namespaces)
    • public git repo C with a deployment which requires a helm secret
      • added to the local and downstream clusters (fleet-local and fleet-default namespaces)
  1. After adding git repos (see above)
    • all resources created as expected
  2. Using backup-restore-operator:v1.2.1-rc1 created a backup
  3. Migrated to a new cluster (installed backup-restore-operator:v1.2.1-rc1 first and then rancher)
  4. When logged in to Rancher:
  5. Fleet:
    • fleet-agents on downstream and local clusters are working as expected with no errors logged
    • fleet-controller on local cluster is working as expected with no errors logged (it was complaining about secrets first)
  6. without modifying/updating anything
    • pushing a commit to the repo A got the deployment updated as expected on local and downstream clusters
    • adding a new git repo to the local and downstream clusters worked as expected with deployments created
  7. after adding missing secrets for repos B and C
    • pushing a commit to repos B and C got the deployments updated as expected on local and downstream clusters
