Automatic GKE node pool update fills controller logs with error #1363

Closed
salasberryfin opened this issue Nov 12, 2024 · 1 comment · Fixed by #1364
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@salasberryfin (Contributor) commented Nov 12, 2024

/kind bug

Description

After a new GKE cluster becomes ready, GCP may automatically trigger a node pool update (this doesn't always happen, but I've seen it occur consistently, except when Autopilot is enabled). While that operation runs, the controller complains that an operation is already in progress. The same happens when deleting a cluster, and it likely applies to any other event in which GCP performs a managed update.

The following is the error log from the controller:

reconcile.go:204] "Deleting node pool resources" controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202"
gcpmanagedmachinepool_controller.go:383] "Reconcile error" err=<
	rpc error: code = FailedPrecondition desc = Cluster is running incompatible operation operation-1731137433803-29e34938-9333-469b-b14f-5bdf10b143d2.
	error details: name = ErrorInfo reason = CLUSTER_ALREADY_HAS_OPERATION domain = container.googleapis.com metadata = map[]
	error details: name = RequestInfo id = 0xe7ae075979bd6d29 data =
 > controller="gcpmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="GCPManagedMachinePool" GCPManagedMachinePool="default/capg-gke-mp-0" namespace="default" name="capg-gke-mp-0" reconcileID="3f75d094-eb40-41ef-84b6-b250d4027202" controller="gcpmanagedmachinepool" action="delete" reconciler="nodepools"

At some point in the reconciliation loop, the controller initiates a node pool update while the automatic update is still running, causing this error. Once the automatic update completes, the controller resumes normal operation, and the cluster either becomes ready or is deleted successfully.

As a user, I would expect the CAPG controller to handle this error gracefully, avoiding node pool updates for clusters that already have an operation in progress. Initial investigation suggests that this issue may be caused by the controller not being able to unwrap an error during reconciliation.
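Below is a minimal sketch of how such handling might look, assuming the error ultimately wraps a gRPC status error; the package name, helper name, and requeue behaviour are hypothetical and not the actual CAPG implementation:

```go
package gke

import (
	"errors"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isClusterOperationInProgress is a hypothetical helper (not existing CAPG
// code): it reports whether err is a FailedPrecondition caused by a GKE
// operation that is already running (CLUSTER_ALREADY_HAS_OPERATION), even if
// the error has been wrapped by the GCP client.
func isClusterOperationInProgress(err error) bool {
	// Walk the wrapped error chain; the underlying gRPC status error may be
	// wrapped, which could be why the controller fails to unwrap it today.
	for e := err; e != nil; e = errors.Unwrap(e) {
		st, ok := status.FromError(e)
		if !ok || st.Code() != codes.FailedPrecondition {
			continue
		}
		// The ErrorInfo detail carries the CLUSTER_ALREADY_HAS_OPERATION
		// reason shown in the log above.
		for _, d := range st.Details() {
			if info, ok := d.(*errdetails.ErrorInfo); ok &&
				info.GetReason() == "CLUSTER_ALREADY_HAS_OPERATION" {
				return true
			}
		}
	}
	return false
}
```

With a check like this, the reconciler could treat the condition as transient and requeue (for example, return `ctrl.Result{RequeueAfter: 30 * time.Second}, nil`) instead of surfacing it as a reconcile error.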

@salasberryfin (Contributor, Author) commented:

/kind bug
