Ensure that DeleteMachine call is made even in case of a NotFound error from Driver.GetMachineStatus in the deletion flow #936

Closed
rishabh-11 opened this issue Aug 20, 2024 · 2 comments
Assignees
Labels
area/control-plane Control plane related area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension priority/2 Priority (lower number equals higher priority) status/closed Issue is closed (either delivered or triaged)

Comments

@rishabh-11
Contributor

rishabh-11 commented Aug 20, 2024

How to categorize this issue?

/area control-plane
/area robustness
/kind enhancement
/priority 2

What would you like to be added:
Remove getVMStatus from triggerDeletionFlow. As a consequence, no later step in the deletion flow is skipped and the call to Driver.DeleteMachine is always made. This ensures that no orphan resources are left behind and that we do not have to rely on the orphan collection logic of MCM.

Why is this needed:

In Azure, the VM and the NIC are created separately (they cannot be created together because the cloud provider does not offer this functionality). In the case below, the NIC was created but the VM was not:

2024-08-08 05:45:16	
{"log":"Successfully created NIC: [ResourceGroup: shoot--hc-ap21--340-83-hdl, NIC: [Name: shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h-nic, ID: /subscriptions/a14b094d-362f-4924-b343-f3a65dd9424a/resourceGroups/shoot--hc-ap21--340-83-hdl/providers/Microsoft.Network/networkInterfaces/shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h-nic]]","pid":"1","severity":"INFO","source":"driver.go:391"}

Failed to create VirtualMachine: [ResourceGroup: shoot--hc-ap21--340-83-hdl, Name: shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h], Err: PUT https://management.azure.com/subscriptions/a14b094d-362f-4924-b343-f3a65dd9424a/resourceGroups/shoot--hc-ap21--340-83-hdl/providers/Microsoft.Compute/virtualMachines/shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h\\n--------------------------------------------------------------------------------\\nRESPONSE 409: 409 Conflict\\nERROR CODE: OperationNotAllowed\\n--------------------------------------------------------------------------------\\n{\\n  \\\"error\\\": {\\n    \\\"code\\\": \\\"OperationNotAllowed\\\",\\n    \\\"message\\\": \\\"The specified disk size 20 GB is smaller than the size of the corresponding disk in the VM image: 30 GB. This is not allowed. Please choose equal or greater size or do not specify an explicit size.\\\",\\n    \\\"target\\\": \\\"osDisk.diskSizeGB\\\"\\n  }\\n}\\n--------------------------------------------------------------------------------\\n]\"","pid":"1","severity":"INFO","source":"machine.go:116"}

After 20 minutes, MCM marks this machine as Failed:

2024-08-08 06:06:17	
{"log":"Machine \"shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h\" , providerID \"\" and backing node \"\" couldn't join in creation timeout of 20m0s. Changing phase to Failed.","pid":"1","severity":"INFO","source":"machine_util.go:573"}

In the deletion flow, we make a GetMachineStatus call:

statusResp, err := c.driver.GetMachineStatus(ctx, getMachineStatusRequest)

In the case of Azure, the GetMachineStatus implementation does not look for NICs; it only checks whether the VM is present. So in this case it returns a NotFound error, and hence the DeleteMachine call is never made. The deletion flow continues and the machine object is deleted, but the NIC remains.
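The failure mode described above can be illustrated with a minimal sketch. Everything in it is hypothetical and is not MCM or Azure provider code: fakeCloud, deleteFlow, newCloudWithOrphanNIC and errNotFound are made-up names, and the provider is an in-memory map. It only assumes what the issue states: the status check looks at the VM alone, and DeleteMachine cleans up the NIC as well.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the NotFound error returned by a driver (hypothetical).
var errNotFound = errors.New("NotFound")

// fakeCloud is a hypothetical in-memory provider that tracks VMs and NICs separately.
type fakeCloud struct {
	vms  map[string]bool
	nics map[string]bool
}

// GetMachineStatus only looks for the VM, as described for Azure in the issue.
func (c *fakeCloud) GetMachineStatus(name string) error {
	if !c.vms[name] {
		return errNotFound
	}
	return nil
}

// DeleteMachine removes the VM and its NIC; resources that do not exist are ignored.
func (c *fakeCloud) DeleteMachine(name string) {
	delete(c.vms, name)
	delete(c.nics, name+"-nic")
}

// deleteFlow models the deletion flow. When skipOnNotFound is true, a NotFound
// status ends the flow before DeleteMachine is called.
func deleteFlow(c *fakeCloud, name string, skipOnNotFound bool) {
	err := c.GetMachineStatus(name)
	if errors.Is(err, errNotFound) && skipOnNotFound {
		return // the machine object goes away, provider resources are not touched
	}
	c.DeleteMachine(name)
}

// newCloudWithOrphanNIC models the state after NIC creation succeeded and VM creation failed.
func newCloudWithOrphanNIC(name string) *fakeCloud {
	return &fakeCloud{
		vms:  map[string]bool{},
		nics: map[string]bool{name + "-nic": true},
	}
}

func main() {
	a := newCloudWithOrphanNIC("m1")
	deleteFlow(a, "m1", true)
	fmt.Println("NIC left behind when DeleteMachine is skipped:", a.nics["m1-nic"])

	b := newCloudWithOrphanNIC("m1")
	deleteFlow(b, "m1", false)
	fmt.Println("NIC left behind when DeleteMachine is always called:", b.nics["m1-nic"])
}
```

In this sketch the NIC survives only in the first run, which is the orphan described in this issue.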
When that happens, the orphan collection logic of MCM comes into the picture and tries to remove the NIC:

2024-08-08 06:06:29	
{"log":"    \"message\": \"Nic(s) in request is reserved for another Virtual Machine for 180 seconds. Please provide another nic(s) or retry after 180 seconds. Reserved VM: /subscriptions/a14b094d-362f-4924-b343-f3a65dd9424a/resourceGroups/shoot--hc-ap21--340-83-hdl/providers/Microsoft.Compute/virtualMachines/shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h\","}
2024-08-08 06:06:29	
{"log":"SafetyController: Error while trying to DELETE VM on CP - machine codes error: code = [Internal] message = [Errors during deletion of NIC/Disks associated to VM: [ResourceGroup: shoot--hc-ap21--340-83-hdl, Name: shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h], Err: \u003cnil\u003e]. Shall retry in next safety controller sync.","pid":"1","severity":"ERR","source":"machine_safety.go:288"}
2024-08-08 06:06:29	
{"log":"Failed to trigger delete of NIC [ResourceGroup: shoot--hc-ap21--340-83-hdl, Name: shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h-nic] : Azure API Response-Headers: map[x-ms-correlation-request-id:3f341f2c-991a-4c88-a47b-28c4ae18f4f5 x-ms-request-id:0d9bbdfd-9486-497b-b613-0bc48b937a82] Err: DELETE https://management.azure.com/subscriptions/a14b094d-362f-4924-b343-f3a65dd9424a/resourceGroups/shoot--hc-ap21--340-83-hdl/providers/Microsoft.Network/networkInterfaces/shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h-nic","pid":"1","severity":"ERR","source":"errors.go:63"}
2024-08-08 06:06:27	
{"log":"Successfully deleted Disk: shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h-os-disk, for ResourceGroup: shoot--hc-ap21--340-83-hdl","pid":"1","severity":"INFO","source":"disk.go:38"}
2024-08-08 06:06:26	
{"log":"Attempting to delete nic: [ResourceGroup: shoot--hc-ap21--340-83-hdl, NicName: shoot--hc-ap21--340-83-hdl-edge-b-z2-75c67-hqw5h-nic] if it exists","pid":"1","severity":"INFO","source":"driver.go:327"}

MCM was terminated at 2024-08-08 06:08:41, so the orphan collector had only about 2 minutes, which was insufficient.
Hence we should remove the getVMStatus check in the deletion flow and always call the Driver.DeleteMachine method.

@rishabh-11 rishabh-11 added the kind/enhancement Enhancement, improvement, extension label Aug 20, 2024
@gardener-robot gardener-robot added area/control-plane Control plane related area/robustness Robustness, reliability, resilience related priority/2 Priority (lower number equals higher priority) labels Aug 20, 2024
@rishabh-11 rishabh-11 changed the title from "Remove use of Driver.GetMachineStatus from the deletion flow for machine object" to "Ensure that DeleteMachine call is made even in case of a NotFound error from Driver.GetMachineStatus in the deletion flow" Sep 5, 2024
@rishabh-11
Contributor Author

After a discussion with the team, it was decided that we must keep the getVMStatus call, because updating the node labels is required for a drain to happen. Instead, we modify the getVMStatus function so that it does not jump directly to node deletion when the VM is not found on the provider (e.g. for Azure, resources like the NIC can still be present even if the VM is not found).

@rishabh-11
Contributor Author

/close as #940 is merged

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Sep 13, 2024