Modify Machine Creation flow to make sure node label is updated before initialization of VM. Modify Deletion flow to call DeleteMachine even if VM is not found. #940
@thiyyakat Thank you for your contribution.
Thank you @thiyyakat for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.
@thiyyakat You need to rebase this pull request onto the latest master branch. Please check.
1970a2a to 99cf01d
		return retryPeriod, err
	}
	// Return error even when machine object is updated
	err = fmt.Errorf("Machine creation in process. Machine UPDATE successful")
is this error msg correct? If the initializeMachine call has succeeded and there is no error then this error is created with a msg that perhaps is not very clear. What is intended here?
If we look at the call stack, this error message is propagated to reconcileClusterMachineKey, which calls enqueueMachineAfter, where this message is used to print a log specifying the reason for the enqueue operation.
This is not something new that has been introduced. We are preserving the original way. Changing the message to "Machine creation in process. Machine initialization (if required) and label update successful".
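The pattern described in this reply, where a non-nil error is returned after a successful update purely so that the caller re-enqueues the object and logs the reason, can be sketched in isolation. This is a minimal illustration under stated assumptions: `updateMachine`, `enqueueReason` and `shortRetry` are hypothetical stand-ins and not the actual machine-controller-manager functions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// shortRetry is a hypothetical stand-in for a retry period constant.
const shortRetry = 5 * time.Second

// updateMachine simulates an update that succeeded but still returns an
// error, so that the caller re-enqueues the object and can log why.
func updateMachine() (time.Duration, error) {
	return shortRetry, errors.New("machine creation in process, machine initialization (if required) and label update successful")
}

// enqueueReason builds the log line a caller could print when re-enqueueing.
func enqueueReason(key string, after time.Duration, reason error) string {
	return fmt.Sprintf("adding machine %q to queue after %s, reason: %v", key, after, reason)
}

func main() {
	retry, err := updateMachine()
	if err != nil {
		// The error is informational here: it explains the re-enqueue.
		fmt.Println(enqueueReason("shoot/machine-0", retry, err))
	}
}
```

The trade-off is that callers can no longer tell a real failure from a "please come back later" signal without inspecting the message, which is what prompted the question in this thread.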
		}
	}
	if uninitializedMachine {
		retryPeriod, err := c.initializeMachine(ctx, createMachineRequest.Machine, createMachineRequest.MachineClass, createMachineRequest.Secret)
Before calling initializeMachine we now update the labels. This is as we discussed. If there is an update done then the generation of the resource changes, right? Then later you take the machine from the request object (which is now old) and pass it into initializeMachine, which also attempts to update the status by making a clone of the now old object in case of errors. This could result in a conflict. Have you tested this?
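The concern about writing from a stale copy can be reproduced with a toy store. In Kubernetes, a write based on an outdated copy is rejected through optimistic concurrency on the object's resourceVersion; the sketch below imitates that with an integer counter. The `object` and `store` types are hypothetical and exist only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict mimics the conflict reported for a write based on a stale copy.
var errConflict = errors.New("conflict: object has been modified")

// object is a hypothetical, minimal stand-in for a Machine resource.
type object struct {
	resourceVersion int
	labels          map[string]string
	phase           string
}

// store is a toy in-memory API server with optimistic concurrency.
type store struct{ current object }

// update accepts the write only if it is based on the latest version and
// returns the stored object with its bumped resourceVersion.
func (s *store) update(o object) (object, error) {
	if o.resourceVersion != s.current.resourceVersion {
		return object{}, errConflict
	}
	o.resourceVersion++
	s.current = o
	return o, nil
}

func main() {
	s := &store{current: object{resourceVersion: 1}}
	stale := s.current // the copy carried in the request object

	withLabel := stale
	withLabel.labels = map[string]string{"node": "node-1"}
	fresh, err := s.update(withLabel)
	fmt.Println("label update:", err)

	stale.phase = "Failed"
	_, err = s.update(stale) // based on the old version
	fmt.Println("status update from stale copy:", err)

	fresh.phase = "Failed"
	_, err = s.update(fresh) // based on the object returned by the update
	fmt.Println("status update from fresh copy:", err)
}
```

Passing the object returned by the label update into the next step, rather than the one from the request, avoids the conflict in this sketch.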
	nodes, err = c.nodeLister.List(labels.Everything())
	if err == nil {
		for _, node := range nodes {
			if node.Labels["node.gardener.cloud/machine-name"] == getMachineStatusRequest.Machine.Name {
in Oliver's PR is there a constant created for this label?
	// Node label is required for drain of node, therefore we try to update machine object before proceeding to drain.
	isNodeLabelUpdated := false
	// check if node name label is already present in machine object
	nodeName := getMachineStatusRequest.Machine.Labels[v1alpha1.NodeLabelKey]
declare this as a variable instead of short assign.
	// get all nodes and check if any node has the machine name as label
	var nodes []*v1.Node
	nodes, err = c.nodeLister.List(labels.Everything())
	if err == nil {
if there is an error it is totally ignored. I would suggest that you at least log the error for better debuggability.
	} else {
		// Figure out node label either by checking all nodes for label matching machine name or retrieving it using GetMachineStatus
		// get all nodes and check if any node has the machine name as label
		var nodes []*v1.Node
extract the logic to look for nodeName matching the machine name into a separate function. That function should just find the matching nodeName and the update should happen in the caller.
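A minimal sketch of the suggested extraction could look like the following. The `node` type is a hypothetical, trimmed-down stand-in for a Kubernetes Node, and `findNodeNameForMachine` is an illustrative name rather than a function from the PR; only the label key is taken from the diff above.

```go
package main

import "fmt"

// machineNameLabel is the label key quoted in the diff under review.
const machineNameLabel = "node.gardener.cloud/machine-name"

// node is a hypothetical, trimmed-down stand-in for a Kubernetes Node.
type node struct {
	Name   string
	Labels map[string]string
}

// findNodeNameForMachine returns the name of the first node labelled with the
// given machine name. It only searches; it does not update anything.
func findNodeNameForMachine(nodes []node, machineName string) (string, bool) {
	for _, n := range nodes {
		if n.Labels[machineNameLabel] == machineName {
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	nodes := []node{
		{Name: "node-a", Labels: map[string]string{machineNameLabel: "machine-0"}},
		{Name: "node-b", Labels: map[string]string{machineNameLabel: "machine-1"}},
	}
	// The caller decides what to do with the result, e.g. update the label.
	if name, ok := findNodeNameForMachine(nodes, "machine-1"); ok {
		fmt.Println("would set node label to", name)
	}
}
```

Keeping the lookup free of side effects makes it straightforward to unit test and leaves the update, and any error logging, to the caller.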
@@ -944,77 +944,94 @@ func (c *controller) setMachineTerminationStatus(ctx context.Context, deleteMach
 	return machineutils.ShortRetry, err
 }

-// getVMStatus tries to retrive VM status backed by machine
+// getVMStatus tries to retrieve VM status backed by machine
 func (c *controller) getVMStatus(ctx context.Context, getMachineStatusRequest *driver.GetMachineStatusRequest) (machineutils.RetryPeriod, error) {
the function is named getVMStatus but it does not return any status. This is a bit weird. Shouldn't it be called updateMachineStatusAndNodeLabel instead?
	isNodeLabelUpdated := false
	// check if node name label is already present in machine object
	nodeName := getMachineStatusRequest.Machine.Labels[v1alpha1.NodeLabelKey]
	if isValidNodeName(nodeName) {
this function is actually quite useless. It only checks if the node Name is empty. I would just remove this function.
		}
	}

	if !isNodeLabelUpdated {
It would be better to rewrite this function.
An idea:
func (c *controller) updateMachineNodeLabelAndStatus(ctx context.Context, machine *v1alpha1.Machine) {
	// calls getNodeName
	// if there is no error then updates the node label
	// if there is an error either in calling getNodeName or updating the node label then updates the status accordingly
}
func (c *controller) getNodeName(ctx context.Context, machine *v1alpha1.Machine) (string, error) {
	nodeName := machine.Labels[v1alpha1.NodeLabelKey]
	if len(strings.TrimSpace(nodeName)) > 0 {
		return nodeName, nil
	}
	matchingNode, err := c.fetchMatchingNode(ctx, machine.Name)
	if err == nil && matchingNode != nil {
		return matchingNode.Name, nil
	}
	if err != nil {
		klog.Errorf("Error trying to get node matching machine %s: %v. Will try to get the node name by calling driver.GetMachineStatus instead.", machine.Name, err)
	}
	// call GetMachineStatus to get the node name
}
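The fallback order outlined in this idea (existing label, then node lookup, then the driver) can be sketched as a self-contained function. This is an illustration under assumptions, not the PR's code: `resolveNodeName` is a hypothetical name, and the two function parameters stand in for the node lister lookup and the driver's GetMachineStatus call.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// resolveNodeName tries three sources in order: the label value already on
// the machine, a node lookup, and finally a driver status query.
// lookupNode and queryDriver are hypothetical stand-ins for real calls.
func resolveNodeName(labelValue string, lookupNode, queryDriver func() (string, error)) (string, error) {
	if name := strings.TrimSpace(labelValue); name != "" {
		return name, nil
	}
	name, err := lookupNode()
	if err == nil && name != "" {
		return name, nil
	}
	if err != nil {
		// Log instead of silently dropping the error, then fall back.
		fmt.Printf("node lookup failed: %v, falling back to driver status\n", err)
	}
	return queryDriver()
}

func main() {
	failingLookup := func() (string, error) { return "", errors.New("lister unavailable") }
	driver := func() (string, error) { return "node-from-driver", nil }

	name, err := resolveNodeName("", failingLookup, driver)
	fmt.Println(name, err)
}
```

Passing the lookups as functions keeps the ordering logic testable without a cluster or a provider driver.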
	return machineutils.LongRetry, nil
}

func (c *controller) updateLabelsAndInitializeMachine(ctx context.Context, createMachineRequest *driver.CreateMachineRequest, nodeName, providerID string, shouldInitializeMachine bool) (retryPeriod machineutils.RetryPeriod, err error) {
	_, machineNodeLabelPresent := createMachineRequest.Machine.Labels[v1alpha1.NodeLabelKey]
you could have used:
metav1.HasAnnotation(createMachineRequest.Machine.ObjectMeta, v1alpha1.NodeLabelKey)
		}
	}
	if shouldInitializeMachine {
		retryPeriod, err = c.initializeMachine(ctx, clone, createMachineRequest.MachineClass, createMachineRequest.Secret)
Why call initializeMachine from here? I would have just renamed this method to updateMachineLabels since we already have initializeMachine. Then just call these functions from triggerCreationFlow. This also removes the need to pass shouldInitializeMachine as a method argument.
		return retryPeriod, err
	}
	// Return error even when machine object is updated
	err = fmt.Errorf("Machine creation in process. Machine initialization (if required) is successful.")
Messages given to fmt.Errorf should not start with a capital letter and should not end with a dot. You will see warnings shown in the IDE. Please correct it. There are other places as well where I see this issue. Can you at least correct it in the functions that you are touching?
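The convention referred to here exists because error strings are usually embedded in longer messages, so a leading capital or trailing period reads badly once wrapped. A small sketch, using hypothetical names (`errNotInitialized`, `initialize`) that are not part of the PR:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotInitialized is a hypothetical sentinel error used for illustration.
var errNotInitialized = errors.New("machine not initialized")

// initialize returns a wrapped error whose text is lower-case and has no
// trailing punctuation, so it composes cleanly with the caller's context.
func initialize(name string) error {
	return fmt.Errorf("initializing machine %q: %w", name, errNotInitialized)
}

func main() {
	err := initialize("machine-0")
	// The error text slots into a longer message without stray capitals or dots.
	fmt.Printf("reconcile failed: %v\n", err)
	// Wrapping with %w keeps the sentinel discoverable via errors.Is.
	fmt.Println(errors.Is(err, errNotInitialized))
}
```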
/lgtm
@@ -497,6 +492,8 @@ func (c *controller) triggerCreationFlow(ctx context.Context, createMachineReque
	case codes.Uninitialized:
		uninitializedMachine = true
		klog.Infof("VM instance associated with machine %s was created but not initialized.", machine.Name)
		nodeName = getMachineStatusResponse.NodeName
We discussed that we will go with this even though this is quite dirty. We add a comment to improve this further.
The easier way to do this would be to add a bool pointer indicating the initialization status. If it is nil, then no initialization needs to be done (ignored). If it is false/true then we act upon it. Essentially, do not return an error, as initialization becomes part of the VM status.
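The tri-state idea proposed here can be sketched with a `*bool` field. The `statusResponse` type and its `Initialized` field are hypothetical: they illustrate the reviewer's proposal and are not an existing field of the driver's response type.

```go
package main

import "fmt"

// statusResponse is a hypothetical response type. The Initialized field is
// the proposed addition, not an existing driver API field.
type statusResponse struct {
	NodeName    string
	Initialized *bool // nil: not applicable, false: pending, true: done
}

// needsInitialization reports whether the caller should run initialization.
// A nil pointer means the provider does not report initialization at all.
func needsInitialization(r statusResponse) bool {
	return r.Initialized != nil && !*r.Initialized
}

func main() {
	yes, no := true, false
	for _, r := range []statusResponse{{}, {Initialized: &no}, {Initialized: &yes}} {
		fmt.Println(needsInitialization(r))
	}
}
```

With this shape the status call can succeed and still convey "created but not initialized", instead of signalling it through an error code.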
What this PR does / why we need it:
The PR changes triggerCreationFlow to update machine labels before proceeding to initialization. This was done to make sure that the labels are always updated even in the case of initialization failures.
It also changes getVMStatus to update the labels in the following way:
node-agent-authorizer webhook to authenticate gardener-node-agents in shoots gardener#10014 are released)
GetMachineStatus call and populate the node name.
In addition, getVMStatus will always redirect to initiateDrain.
A change was also made to remove error logs when InitializeMachine is not implemented by a provider.
Which issue(s) this PR fixes:
Fixes #934 #936
Fixes part of #933
Special notes for your reviewer:
The changes in the PR were tested by doing the following:
Manually returned an error from machine-controller-manager-provider-azure's CreateMachine code after the NIC is created, such that the VM is not created. The machine was then marked for deletion. The logs showed that the NIC was deleted successfully.
Manually returned the codes.Uninitialized error code from InitializeMachine for AWS. On triggering creation of a machine, found that initialization is retried in a loop after shortRetry.
Manually returned codes.Uninitialized from initializeMachine after the call to driver.initializeMachine. The logs show that the machine state update is successful. In the next reconciliation, since the machine is found to be initialized, the machine goes to running.
Release note: