Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Hard crash and reboot of the hubagent if ClusterResourceSnapshot failed to list due to timeout #717

Open
d4rkhunt33r opened this issue Mar 8, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@d4rkhunt33r
Copy link

d4rkhunt33r commented Mar 8, 2024

Describe the bug

Hard crash and reboot of the hubagent if ClusterResourceSnapshot failed to list due to timeout

Environment

Please provide the following:

  • Hub cluster details
    The hub cluster is an aws eks cluster

The hubagent was installed with the following values

replicaCount: 1
logVerbosity: 1
enableWebhook: false
webhookServiceName: fleetwebhook
enableGuardRail: false
webhookClientConnectionType: service
enableV1Alpha1APIs: false
enableV1Beta1APIs: true
resources:
  requests:
    cpu: 2
    memory: 8Gi
  limits:
    cpu: 4
    memory: 16Gi

To Reproduce

Steps to reproduce the behavior:

  • Install hubagent in the cluster
  • Create a lot of objetcs in order to make the request to kubectl get clusterresourcesnapshot take more than 30 seconds

You should see the hubagent container to reboot showing the following errores in the logs

I0308 16:31:41.079738       1 controller/controller.go:190] "Starting controller" controller="cluster-resource-placement-controller-v1beta1"
I0308 16:31:41.147758       1 informer/informermanager.go:152] "Disabled an informer for a disappeared resource" res={"GroupVersionKind":{"Group":"","Version":"v1","Kind":"Event"},"GroupVersionResource":{"Group":"","Version":"v1","Resource":"events"},"IsClusterScoped":false}
W0308 16:32:21.369841       1 cache/reflector.go:535] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1beta1.ClusterResourceSnapshot: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterresourcesnapshots.placement.kubernetes-fleet.io)
I0308 16:32:21.369941       1 trace/trace.go:236] Trace[1547232315]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229 (08-Mar-2024 16:31:21.299) (total time: 60070ms):
Trace[1547232315]: ---"Objects listed" error:the server was unable to return a response in the time allotted, but may still be processing the request (get clusterresourcesnapshots.placement.kubernetes-fleet.io) 60069ms (16:32:21.369)
Trace[1547232315]: [1m0.070028406s] [1m0.070028406s] END
E0308 16:32:21.369964       1 cache/reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1beta1.ClusterResourceSnapshot: failed to list *v1beta1.ClusterResourceSnapshot: the server was unable to return a response in the time allotted, but may still be processing the request (get clusterresourcesnapshots.placement.kubernetes-fleet.io)```

### **Expected behavior**
The pod should not be rebooted

### **Screenshots**
If applicable, add screenshots to help explain your problem.

### **Additional context**
I think this could be solved if there is an option to increase the timeout for the kubernetes client.

Sorry for my english :S

@d4rkhunt33r d4rkhunt33r added the bug Something isn't working label Mar 8, 2024
@d4rkhunt33r d4rkhunt33r changed the title [BUG] [BUG] Hard crash and reboot of the hubagent if ClusterResourceSnapshot failed to list due to timeout Mar 8, 2024
@Haladinoq
Copy link

at the moment I have the same issue in my implementation

@ryanzhang-oss
Copy link
Contributor

@d4rkhunt33r Thank you for reporting this. Can you share a bit more on the environment setup? Did you use our helm chart? How much cpu/memory did the hubagent use when you create "a lot of objects"? What is the limit of the cpu/memory on the EKS api-server?

@Haladinoq I wonder what do you mean by your "in your implementation"?

@ryanzhang-oss
Copy link
Contributor

@d4rkhunt33r You can increase the hubagent performance by adding the following parameters in your helm chart with some value like below.

--set concurrentClusterPlacementSyncs=20
--set hubAPIQPS=10
--set hubAPIBurst=100 \

@d4rkhunt33r
Copy link
Author

d4rkhunt33r commented Mar 12, 2024

@ryanzhang-oss thank you so much for you assitance. i will try to increase those values.
but i think the issue is more related to the lack of an option to specify a timeout for the queries to the kubernetes api server

you can check in the message from the logs that the error is related to the ClusterResourceSnapshot was unable to return a reponse

[controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
Detected at:
	>  goroutine 88 [running]:
	>  runtime/debug.Stack()
	>  	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
	>  sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
	>  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/log.go:60 +0xcd
	>  sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Error(0xc0000b2140, {0x1c0bb20, 0xc04ae8dec0}, {0x19bcdd2, 0x20}, {0x0, 0x0, 0x0})
	>  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/deleg.go:139 +0x68
	>  github.com/go-logr/logr.Logger.Error({{0x1c27280?, 0xc0000b2140?}, 0x444091?}, {0x1c0bb20, 0xc04ae8dec0}, {0x19bcdd2, 0x20}, {0x0, 0x0, 0x0})
	>  	/go/pkg/mod/github.com/go-logr/[email protected]/logr.go:305 +0xda
	>  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1({0x1c24878?, 0xc00059eaa0?}, 0xc00037e960, {0x1c16440, 0xc0004f7230})
	>  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:203 +0x186
	>  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2(0xc00037e960, {0x1c24878?, 0xc00059eaa0}, 0xc00062e640)
	>  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208 +0x418
	>  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start(0xc00037e960, {0x1c24878, 0xc00059eaa0})
	>  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 +0x165
	>  sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc0000328e0)
	>  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:223 +0xdb
	>  created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile
	>  	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:207 +0x1ad
I0311 22:03:09.868763       1 controller/controller.go:218] "Shutdown signal received, waiting for all workers to finish" controller="cluster-resource-placement-controller-v1beta1"

@d4rkhunt33r
Copy link
Author

d4rkhunt33r commented Mar 12, 2024

This is another message that i think can be usefull

E0312 15:09:47.679656       1 workgenerator/controller.go:115] "Failed to sync all the works" err="unexpected behavior which cannot be handled by the controller: Timeout: failed waiting for *v1beta1.ClusterResourceSnapshot Informer to sync" resourceBinding="namespaces-resource-placem-istio-replica-2bfd58b0"```

@ryanzhang-oss
Copy link
Contributor

Thanks. I wonder if you can share the entire log and cluster settings? 30 seconds is a very long time so I am not sure we should relax that. It seems that the API-server on the hub cluster is very over-loaded? Can you share its usage?

@d4rkhunt33r
Copy link
Author

Hi i am sorry for the delay but I was finishing some task for the Q.
Currently I am not able to share the entire log because my employer have concerns that some names could give away the name of my employer.
What I can say Is that we have a cluster with hundreds of thousands of objects mainly Istio CRDS.
What we want to achieve is be able to have a backup of all those objects in a secondary cluster.

The reason the Api takes so long to retrieve the ClusterResourceSnapshot is the amount of objects and the load on the server (Very busy one).

I think it will be very helpful to have a parameter to control the timeout before the controller fails due to not having a response from the server.

Next week I will try to look at the code and see if I can find the place where this could be implemented and study a little more the project to see if this is already posible and I am missing something

Sorry for my English is not my native language

@d4rkhunt33r
Copy link
Author

@ryanzhang-oss Good morning I hope you are having a great day.

I am still looking at the issue and I was not able to found a way to increase the Kubernetes client timeout neither the CacheSyncTimeout

I was hopping this could be added to the project in case someone have an already large cluster and the CacheSync could take more than two minutos

// CacheSyncTimeout refers to the time limit set to wait for syncing caches. // Defaults to 2 minutes if not set. // +optional CacheSyncTimeout *time.Durationjson:"cacheSyncTimeout,omitempty"``

@d4rkhunt33r
Copy link
Author

i was able to change the CacheSysncTimeout by adding.
mgr, err := ctrl.NewManager(config, ctrl.Options{ Scheme: scheme, Controller: ctrloption.Controller{ CacheSyncTimeout: 300 * time.Second, },

but I am still seeing the error related to the Timeout listing resources.

E0418 20:02:25.727800 1 controller/controller.go:75] "Error returned by the API server" err="the server was unable to return a response in the time allotted, but may still be processing the request (get clusterresourcesnapshots.placement.kubernetes-fleet.io)" fromCache=false reason="Timeout".

is there a place in the code where I can change the 60 seconds timeout for requests to the Kubernetes API server ?

@d4rkhunt33r
Copy link
Author

d4rkhunt33r commented Apr 24, 2024

Hi i was able to find the issue.

If you create a ClusterResourcePlacement selecting a lot of namespaces to propagate, the resulting ClusterResourceSnapshots are too large for my kuberntes cluster to parse.

That means the timeout was server side as you can see in the following log from my Kubernetes cluster.

{
    "encodeGV": "placement.kubernetes-fleet.io/v1beta1",
    "encoder": "{\"name\":\"json\",\"pretty\":\"false\",\"strict\":\"false\",\"yaml\":\"false\"}",
    "name": "versioning"
}
 (23-Apr-2024 15:48:53.028) (total time: 60095ms):
Trace[372133574]: ---"About to start writing response" size:2273981422 58933ms (15:49:51.962)
Trace[372133574]: ---"Write call failed" writer:*gzip.Writer,size:2273981422,firstWrite:true,err:http: Handler timeout 1161ms (15:49:53.123)
Trace[372133574]: [1m0.095898857s] [1m0.095898857s] END

It is basically a timeout parsing the json output for the request to cluster resourcesnapshot

To fix this, I changed the pagination from 500 ClusterResourceSnapshots to 100 but then I found another issue because the ClusterResourcePlacement was sooo large that updating the object failed due to the limit of bytes to the post operation

I still think it could be a good idea to change the CacheSyncTimeout for the manager in order to give it more time to SyncCaches

My code is this

   mgr, err := ctrl.NewManager(config, ctrl.Options{
   	Scheme: scheme,
   	Controller: ctrloption.Controller{
   		CacheSyncTimeout: 300 * time.Second,
   	},
   	Cache: cache.Options{
   		SyncPeriod: &opts.ResyncPeriod.Duration,
   		HTTPClient: cacheHttpClient,
   	},
   	LeaderElection:             opts.LeaderElection.LeaderElect,
   	LeaderElectionID:           opts.LeaderElection.ResourceName,
   	LeaderElectionNamespace:    opts.LeaderElection.ResourceNamespace,
   	LeaderElectionResourceLock: opts.LeaderElection.ResourceLock,
   	HealthProbeBindAddress:     opts.HealthProbeAddress,
   	Metrics: metricsserver.Options{
   		BindAddress: opts.MetricsBindAddress,
   	},
   	WebhookServer: ctrlwebhook.NewServer(ctrlwebhook.Options{
   		Port:    FleetWebhookPort,
   		CertDir: FleetWebhookCertDir,
   	}),
   })

adding
CacheSyncTimeout: 300 * time.Second,

Allowed the controllers to have more time to finish syncing caches and no more restart happened

@d4rkhunt33r
Copy link
Author

So my next step is to create ClusterResourcePlacement for each namespace in the hopes to not crash the server due to the size of the Snapshots

@zhiying-lin
Copy link
Contributor

@d4rkhunt33r thank you for the updates. Just curious how you're using CRP. Are you selecting multiple applications using one CRP?

It's better to put one application configurations into one CRP, so that the size of selecting resources is under the control.

I still think it could be a good idea to change the CacheSyncTimeout for the manager in order to give it more time to SyncCaches

i think it's a fair request.

@ryanzhang-oss
Copy link
Contributor

CRP is not really designed for this type of scenario unless you are placing the resource to a different cluster each time. I wonder if you might have looked at etcd backup?

@ryanzhang-oss
Copy link
Contributor

So my next step is to create ClusterResourcePlacement for each namespace in the hopes to not crash the server due to the size of the Snapshots

Hi, @d4rkhunt33r, I wonder if you will place each namespace on a different cluster or just copy all things on cluster A to cluster B?

@d4rkhunt33r
Copy link
Author

@ryanzhang-oss yes my objective is to maintain two clusters with pretty much all namespaces and its objects synchronized in order to have an Active/Active backup of my Istio Control Plane.

I was able to create the 9000 ClusterResourcePlacement and the process to sync the objects between clusters took 5 hours for the first run

Pretty much all objects were synchronized correctly and it takes about a minute te reflect the changes in a cluster to the replica after a change

Now I am curious about the observability of the project.

Not sure if there is way to know using metrics if both clusters are in sync. or what objets were not synced

@d4rkhunt33r
Copy link
Author

@d4rkhunt33r thank you for the updates. Just curious how you're using CRP. Are you selecting multiple applications using one CRP?

It's better to put one application configurations into one CRP, so that the size of selecting resources is under the control.

I still think it could be a good idea to change the CacheSyncTimeout for the manager in order to give it more time to SyncCaches

i think it's a fair request.

yes initially I was selecting 9000 namespaces and all its objects in 1 ClusterResourcePlacement and it broke the app :).

Now I am creating 1 ClusterResourcePlacement for each Namespace and the Aplication was able to handle the process.

about etcd backups, I am using another open source project called velero. the thing is that a backup gets old very quickly and Fleet is something I am hoping can give me something more real time o semi real time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants