
After restore, fleet agent does not communicate back to fleet controller and shows an error in Continuous Delivery in Rancher 2.5.9 #164

Closed
Martin-Weiss opened this issue Oct 5, 2021 · 6 comments

@Martin-Weiss

Martin-Weiss commented Oct 5, 2021

SURE-3497
SURE-3502

We are using Fleet with Rancher 2.5.9 and have created several Git repos that we deploy via Fleet to the local and downstream clusters.

Then we created a rancher-backup with the backup operator, destroyed the cluster, redeployed it, and followed the recovery procedure described at https://rancher.com/docs/rancher/v2.5/en/backups/migrating-rancher/

After this recovery we realized that the gitrepo and helmchartrepo secrets were not restored (see issue #163), and we re-created these secrets manually with kubectl apply -f .yaml.

Then we realized that the fleet agent does not work: it does not communicate back and complains about a missing secret, fleet-system/fleet-agent-bootstrap.

We tried to create this bootstrap secret and were somewhat successful, but even with the agent partially working, we see the cluster as "Cluster: local" with a red "Modified" status, telling us that it does not trust the CA of the Kubernetes cluster. That may well be the correct message, since the cluster was redeployed and the underlying RKE2 cluster really does have a new CA.

Long story short: how can we restore Rancher properly, including all the fleet agents and the required secrets, so that the agents are happy?
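For reference, a minimal sketch of the "re-create the secret with kubectl apply" step described above. The manifest below is only a skeleton: I am assuming the real fleet-agent-bootstrap secret carries registration credentials issued by the Fleet controller, so the actual data has to come from a working installation or from re-registering the agent, not be hand-invented.

```shell
# Skeleton manifest for the secret the agent complains about.
# The data section is intentionally empty -- the real payload is
# issued by the Fleet controller and cannot be made up by hand.
cat > fleet-agent-bootstrap.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: fleet-agent-bootstrap
  namespace: fleet-system
type: Opaque
data: {}
EOF

# Then, against the affected cluster (requires a working kubeconfig):
# kubectl apply -f fleet-agent-bootstrap.yaml
```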

@Martin-Weiss
Author

FYI - it seems that the root cause is the missing fleet-agent secret after the restore and the missing service accounts in the namespace cluster-fleet-local-local-1a3d67d0a899. I could get the fleet-agent for the local cluster running again by creating the missing secret manually, but for the downstream clusters this seems to be more complicated, as the service accounts appear to be missing..?
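A quick read-only diagnostic sketch for the symptoms above, assuming kubectl access to the restored cluster; the namespace name comes from this comment and the secret name from the agent's error message.

```shell
# Check what the restore should have brought back. Degrades
# gracefully when kubectl (or the cluster) is unavailable.
check_fleet_restore() {
  ns="cluster-fleet-local-local-1a3d67d0a899"
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "kubectl not found; run this against the restored cluster"
    return 0
  fi
  # Service accounts the fleet agents register through
  kubectl get serviceaccounts -n "$ns" 2>/dev/null \
    || echo "service accounts/namespace $ns missing"
  # The secret the local fleet-agent complains about
  kubectl get secret fleet-agent-bootstrap -n fleet-system 2>/dev/null \
    || echo "secret fleet-system/fleet-agent-bootstrap missing"
}
check_fleet_restore
```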

@SheilaghM

This issue should be backported to 2.5: rancher/rancher#33954

PRs to backport:

#148
rancher/charts#1431
#138

@StrongMonkey
Contributor

Available to test with backup-operator v1.2.1-rc1. The expected result is that after backup and restore, fleet clusters stay connected to the Rancher server and all clusters are active. Users should be able to use fleet after the restore.

@StrongMonkey
Contributor

@sowmyav27 Assigning to you for now. It depends on whom you want to delegate to, either from red team QA or QA who has done backup testing before (@anupama2501).

@sgapanovich

sgapanovich commented Nov 19, 2021

Was able to reproduce the issues with the fleet agents and controller.
I have a 2.5.11 Rancher with 2 downstream clusters:

  1. created 3 Git repos using:
    • public repo
    • private repo
    • helm chart which needs helmsecret
  2. bundles were created and everything deployed as expected
  3. created a backup using the 1.2.0 backup operator
  4. migrated to a new cluster
    • "Unauthorized" errors logged in fleet agents in downstream clusters
    • level=error msg="Failed to register agent: looking up secret fleet-system/fleet-agent-bootstrap: looking up secret fleet-system/fleet-agent-bootstrap: secrets \"fleet-agent-bootstrap\" not found" logged in the fleet agent on my local cluster
    • level=error msg="error syncing 'fleet-local/helm': handler gitjobs: failed to look up helmSecretName, error: secrets \"helm-secret\" not found, requeuing" and level=error msg="error syncing 'fleet-default/helm': handler gitjobs: failed to look up helmSecretName, error: secrets \"helm-secret\" not found, requeuing" logged in the fleet controller on my local cluster
  5. tried to push some code to one of the repos used by fleet (the public one), and the Git Repo with the Bundle went into "Wait Applied" status
    (screenshots attached: repos, fleet downstream, fleet local, fleet controller)
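The helm-secret errors in step 4 can presumably be cleared by re-creating the secret in both namespaces named in the controller's error messages. A hedged sketch: the basic-auth keys below are an assumption about what the chart repo expects, and the placeholder values must be replaced with the real credentials.

```shell
# Recreate the secret referenced by helmSecretName in both namespaces
# mentioned in the fleet-controller errors (fleet-local, fleet-default).
for ns in fleet-local fleet-default; do
  cat > "helm-secret-${ns}.yaml" <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: helm-secret
  namespace: ${ns}
type: kubernetes.io/basic-auth
stringData:
  username: PLACEHOLDER_USER  # replace with the real repo credentials
  password: PLACEHOLDER_PASS
EOF
done

# kubectl apply -f helm-secret-fleet-local.yaml -f helm-secret-fleet-default.yaml
```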

@sgapanovich

Test Environment:

Rancher version: v2.5-a3b524e9d00408bf8da0e46fe5f9f127d7fddd20-head
Rancher cluster type: HA
Docker version: 20.10
Backup operator: backup-restore-operator:v1.2.1-rc1
Fleet agent: fleet-agent:v0.3.5

Downstream cluster type:

  • cluster 1: 3 nodes, EC2 (RKE1)
  • cluster 2: 3 nodes, DO (RKE1)

Downstream K8s version: v1.20.12


Testing:

  • Gitrepos used:
    • public git repo A with a deployment which does not require a helm secret
      • added to the local and downstream clusters (fleet-local and fleet-default namespaces)
    • private git repo B with a deployment which does not require a helm secret
      • added to the local and downstream clusters (fleet-local and fleet-default namespaces)
    • public git repo C with a deployment which requires a helm secret
      • added to the local and downstream clusters (fleet-local and fleet-default namespaces)
  1. After adding git repos (see above)
    • all resources created as expected
  2. Using backup-restore-operator:v1.2.1-rc1 created a backup
  3. Migrated to a new cluster (installed backup-restore-operator:v1.2.1-rc1 first and then rancher)
  4. When logged in to Rancher:
  5. Fleet:
    • fleet-agents on downstream and local clusters are working as expected with no errors logged
    • fleet-controller on local cluster is working as expected with no errors logged (it was complaining about secrets first)
  6. without modifying/updating anything
    • pushing a commit to the repo A got the deployment updated as expected on local and downstream clusters
    • adding a new git repo to the local and downstream clusters worked as expected with deployments created
  7. after adding missing secrets for repos B and C
    • pushing a commit to repos B and C got the deployments updated as expected on local and downstream clusters
