Automatic GKE node pool update fills controller logs with error #1363

Closed
salasberryfin opened this issue Nov 12, 2024 · 1 comment · Fixed by #1364
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@salasberryfin (Contributor) commented Nov 12, 2024

/kind bug

Description

After a new GKE cluster becomes ready, GCP may automatically trigger a node pool update (this doesn't always happen, but I've seen it occur consistently, except when Autopilot is enabled). While that operation runs, the controller complains that an operation is already in progress. The same happens when deleting a cluster, and it likely applies to any other event in which GCP performs a managed update.

The following is the error log from the controller:

reconcile.go:204] "Deleting node pool resources" controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202"
gcpmanagedmachinepool_controller.go:383] "Reconcile error" err=<
	rpc error: code = FailedPrecondition desc = Cluster is running incompatible operation operation-1731137433803-29e34938-9333-469b-b14f-5bdf10b143d2.
	error details: name = ErrorInfo reason = CLUSTER_ALREADY_HAS_OPERATION domain = container.googleapis.com metadata = map[]
	error details: name = RequestInfo id = 0xe7ae075979bd6d29 data =
 > controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202" controller="gcpmanagedmachinepool" action="delete" reconciler="nodepools"

At some point in the reconciliation loop, the controller initiates a node pool update while the automatic update is still running, causing this error. Once the automatic update completes, the controller resumes normal operation, and the cluster either becomes ready or is deleted successfully.

As a user, I would expect the CAPG controller to handle this error gracefully, avoiding node pool updates for clusters that already have an operation in progress. Initial investigation suggests that this issue may be caused by the controller not being able to unwrap an error during reconciliation.
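Below is a minimal sketch of how such handling might look, assuming the error ultimately wraps a gRPC status error; the package name, helper name, and requeue behaviour are hypothetical and not the actual CAPG implementation:

```go
package gke

import (
	"errors"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isClusterOperationInProgress is a hypothetical helper (not existing CAPG
// code): it reports whether err is a FailedPrecondition caused by a GKE
// operation that is already running (CLUSTER_ALREADY_HAS_OPERATION), even if
// the error has been wrapped by the GCP client.
func isClusterOperationInProgress(err error) bool {
	// Walk the wrapped error chain; the underlying gRPC status error may be
	// wrapped, which could be why the controller fails to unwrap it today.
	for e := err; e != nil; e = errors.Unwrap(e) {
		st, ok := status.FromError(e)
		if !ok || st.Code() != codes.FailedPrecondition {
			continue
		}
		// The ErrorInfo detail carries the CLUSTER_ALREADY_HAS_OPERATION
		// reason shown in the log above.
		for _, d := range st.Details() {
			if info, ok := d.(*errdetails.ErrorInfo); ok &&
				info.GetReason() == "CLUSTER_ALREADY_HAS_OPERATION" {
				return true
			}
		}
	}
	return false
}
```

With a check like this, the reconciler could treat the condition as transient and requeue (for example, return `ctrl.Result{RequeueAfter: 30 * time.Second}, nil`) instead of surfacing it as a reconcile error.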

@salasberryfin (Contributor, Author) commented:

/kind bug
