-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CAPT machine power management with Rufio BMCJobs #182
Conversation
}, | ||
Spec: rufiov1.BMCJobSpec{ | ||
BaseboardManagementRef: rufiov1.BaseboardManagementRef{ | ||
Name: hardware.Spec.BMCRef.Name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check if a BMC object with the ref exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually there’s a check for this on Rufio controller and job would not go to Running
condition unless the Ref was found.
We could add one here as well but currently I have limited the scope of CAPT to only deal with only BMCJob objects and Rufio would deal with the rest.
// createPowerOffJob creates a BMCJob object with the required tasks for hardware power off. | ||
func (bmrc *baseMachineReconcileContext) createPowerOffJob(hardware *tinkv1.Hardware) error { | ||
powerOffAction := rufiov1.HardPowerOff | ||
bmcJob := &rufiov1.BMCJob{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we de-duplicate some of this construction and perhaps have a separate function set or builder.
func (bmrc *baseMachinereconcileContext) createPowerOffJob(hardware ...) error {
job := newBMCJobBuilder(bmrc.tinkerbellMachine, hardware)
job.WithTask(rufiov1.Task{
PowerAction: *powerOffAction)
})
if err := bmrc.client.Create(bmrc.ctx, job.Build()); err != nil {
...
}
}
This lets us re-use that construction logic elsewhere in the program like in ensureHardware
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff.
It seems like reconciliation does nothing if a BMCJob
exists. How does CAPT know a job is complete and does it need to do anything? I would think we would want to communicate via state or events as we move forwards in the provisioning process.
Could you elaborate on the behavior you're expecting?
Signed-off-by: Aravind Ramalingam <[email protected]>
Signed-off-by: Aravind Ramalingam <[email protected]>
Signed-off-by: Aravind Ramalingam <[email protected]>
Signed-off-by: Aravind Ramalingam <[email protected]>
Signed-off-by: Aravind Ramalingam <[email protected]>
…kerbellMachine for delete Signed-off-by: Aravind Ramalingam <[email protected]>
Signed-off-by: Aravind Ramalingam <[email protected]>
af35926
to
d50b060
Compare
Signed-off-by: Aravind Ramalingam <[email protected]>
controller := true | ||
bmcJob := &rufiov1.BMCJob{ | ||
ObjectMeta: metav1.ObjectMeta{ | ||
Name: fmt.Sprintf("%s-poweroff", bmrc.tinkerbellMachine.Name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having a consistent name seems good so you don't get two running jobs conflicting, but this is never cleaned up. I don't think you need to do that necessarily in this PR, but can you at least open a GH issue stating that it needs to be thought through?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Will do, thanks.
// Uses the BMCRef from the hardware to create a BMCJob. | ||
// The BMCJob is responsible for getting the machine to desired state for provisioning. | ||
// If a BMCJob is already found and has failed, we error. | ||
func (mrc *machineReconcileContext) ensureHardwareProvisionJob(hardware *tinkv1.Hardware) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious why the use of HardwareProvisionJob
instead of BMCJob
? This requires a translation that feels like it might be unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intended it to be like "Job that gets hardware ready for provisioning" since it does multiple actions.
I tried to explain it a bit on the function comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The challenge i see is that a HardwareProvisionJob
is just a pointer to a BMCJob. The definition you give in the comments points us to a BMCJob. So to understand the result or behavior of this function we need to go BMCJob regardless. I know naming can be very subjective and difficult but I'm having trouble seeing the value in creating this new construct of a HardwareProvisionJob
.
} | ||
|
||
// createHardwareProvisionJob creates a BMCJob object with the required tasks for hardware provisioning. | ||
func (mrc *machineReconcileContext) createHardwareProvisionJob(hardware *tinkv1.Hardware, name string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is already very large. This functionality feels like a good opportunity to create a new file in this package just for bmc/rufio functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially going to do that. But i thought we could benefit from a wider refactoring where we move and abstract some of the logic for creating Templates, Workflows and Jobs. And even address reducing the duplication of some of these methods.
Is it alright to work on it on followup PRs so we can move forward with the functionality? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing, just wanted to be sure the trade off of growing the tech debt and needed refactor work was brought up.
controllers/base.go
Outdated
if hardware.Spec.BMCRef == nil { | ||
bmrc.log.Info("Hardware BMC reference not present; skipping hardware power off", | ||
"BMCRef", hardware.Spec.BMCRef, "Hardware", hardware.Name) | ||
|
||
return bmrc.removeFinalizer() | ||
} | ||
|
||
// Fetch a poweroff BMCJob for the machine. | ||
// If Job not found, we remove dependencies and create job. | ||
bmcJob := &rufiov1.BMCJob{} | ||
jobName := fmt.Sprintf("%s-poweroff", bmrc.tinkerbellMachine.Name) | ||
|
||
err := bmrc.getBMCJob(jobName, bmcJob) | ||
if err != nil { | ||
if apierrors.IsNotFound(err) { | ||
return bmrc.createPowerOffJob(hardware) | ||
} | ||
|
||
return fmt.Errorf("get bmc job for machine: %w", err) | ||
} | ||
|
||
// Check the Job conditions to ensure the power off job is complete. | ||
if bmcJob.HasCondition(rufiov1.JobCompleted, rufiov1.ConditionTrue) { | ||
return bmrc.removeFinalizer() | ||
} | ||
|
||
if bmcJob.HasCondition(rufiov1.JobFailed, rufiov1.ConditionTrue) { | ||
return fmt.Errorf("bmc job %s/%s failed", bmcJob.Namespace, bmcJob.Name) // nolint:goerr113 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels like something that could benefit from being abstracted to its own method. this could improve maintainability; make it easier to remove, modify, etc.
Signed-off-by: Aravind Ramalingam <[email protected]>
0c58b05
to
ebb93c3
Compare
Signed-off-by: Aravind Ramalingam <[email protected]>
Signed-off-by: Aravind Ramalingam <[email protected]>
Signed-off-by: Aravind Ramalingam <[email protected]>
controllerutil.RemoveFinalizer(bmrc.tinkerbellMachine, infrastructurev1.MachineFinalizer) | ||
|
||
bmrc.log.Info("Patching Machine object to remove finalizer") | ||
|
||
return bmrc.patch() | ||
} | ||
|
||
// ensureBMCJobCompletionForDelete ensures the machine power off BMCJob is completed. | ||
// Removes the machint finalizer to let machine delete. | ||
func (bmrc *baseMachineReconcileContext) ensureBMCJobCompletionForDelete(hardware *tinkv1.Hardware) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 'ForDelete` part of this name gives me pause. With this in the name i should be able to infer that this method has some behavior around the deletion of a machine (tinkerbellMachine?). I dont see anything in the code that touches machine deletion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh actually the return bmrc.removeFinalizer()
is actually what allows the tinkerbellmachine objects to get deleted.
That's why i had it on the name.
We check job completion and then remove the finalizer.
My team has been actively working on this project. My approver role will help expedite the PR review/approve process and help unblock any open issues or feature requests. Org request issue: tinkerbell/org#24 Requirements: * I have reviewed the [community membership guidelines](https://github.com/tinkerbell/proposals/blob/main/proposals/0024/GOVERNANCE.md) * I have [enabled 2FA on my GitHub account](https://github.com/settings/security) * I have subscribed to the [tinkerbell-contributors e-mail list](https://groups.google.com/g/tinkerbell-contributors) * I am actively contributing to 1 or more Tinkerbell subprojects PRs contributions on CAPT: - #130 - #176 PRs reviewed on CAPT: - #160 - #184 - #182 Sponsors: @micahhausler @chrisdoherty4 @jacobweinstock
Description
The PR enables automated power on/off of nodes that are made part of the cluster using Rufio BMCJobs.
BMCJob
is created to get it to a state ready for provisioning with Tinkerbell.BMCJob
is created that would power off the hardware once its been released (owner labels removed).Why is this needed
Tries to address the issues listed on #146.
The issue mentions PBNJ but Rufio is a k8s controller and fits well with CAPT.
How Has This Been Tested?
KinD
cluster and installedRufio
andCAPT
.How are existing users impacted? What migration steps/scripts do we need?
No impact on existing users. The
BMCJob
creates are skipped if the Tinkerbell hardware object on the cluster does not have a.Spec.BMCRef
Checklist:
I have: