CAPT machine power management with Rufio BMCJobs #182

Merged
13 commits from capt-rufio-bmc-jobs into tinkerbell:main, Jun 15, 2022

Conversation

@pokearu (Contributor) commented Jun 1, 2022

Description

The PR enables automated power on/off of nodes that are made part of the cluster, using Rufio BMCJobs.

  1. When hardware is selected to be part of a cluster, a BMCJob is created to get it into a state ready for provisioning with Tinkerbell.
  2. When a cluster is deleted, a BMCJob is created that powers off the hardware once it has been released (owner labels removed); a rough construction sketch follows this list.
  3. The Rufio controller looks for these jobs and executes the listed Tasks.
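
For concreteness, here is a rough sketch of how the power-off job (item 2) is put together on the CAPT side, assembled from the diff excerpts quoted later in this conversation and assuming the bmrc reconcile context and the fmt, metav1, and rufiov1 imports from the PR. The Tasks field name, the pointer shape of PowerAction, and the namespace handling are assumptions for illustration, not necessarily the final implementation.

powerOffAction := rufiov1.HardPowerOff // value used by the PR's createPowerOffJob

bmcJob := &rufiov1.BMCJob{
	ObjectMeta: metav1.ObjectMeta{
		Name:      fmt.Sprintf("%s-poweroff", bmrc.tinkerbellMachine.Name),
		Namespace: bmrc.tinkerbellMachine.Namespace, // assumption: job created alongside the machine
	},
	Spec: rufiov1.BMCJobSpec{
		BaseboardManagementRef: rufiov1.BaseboardManagementRef{
			Name: hardware.Spec.BMCRef.Name,
		},
		Tasks: []rufiov1.Task{ // assumption: field name for the job's task list
			{PowerAction: &powerOffAction}, // assumption: pointer field in the rufio API
		},
	},
}

if err := bmrc.client.Create(bmrc.ctx, bmcJob); err != nil {
	return fmt.Errorf("creating power off BMCJob: %w", err)
}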

Why is this needed

This aims to address the issues listed in #146.
The issue mentions PBNJ, but Rufio is a Kubernetes controller and fits well with CAPT.

How Has This Been Tested?

  1. Created a KinD cluster and installed Rufio and CAPT.
  2. Applied a cluster manifest as described in QUICK-START.md.
  3. Monitored the nodes to check power on and PXE booting.
  4. kubectl delete cluster
  5. Monitored the nodes to check power off.

How are existing users impacted? What migration steps/scripts do we need?

No impact on existing users. BMCJob creation is skipped if the Tinkerbell Hardware object on the cluster does not have a .Spec.BMCRef.

Checklist:

I have:

  • updated the documentation and/or roadmap (if required)
  • added unit or e2e tests
  • provided instructions on how to upgrade

	},
	Spec: rufiov1.BMCJobSpec{
		BaseboardManagementRef: rufiov1.BaseboardManagementRef{
			Name: hardware.Spec.BMCRef.Name,
Contributor

Should we check if a BMC object with the ref exists?

Contributor Author

Actually, there's a check for this in the Rufio controller, and the job would not go to the Running condition unless the ref was found.

We could add one here as well, but for now I have limited the scope of CAPT to dealing only with BMCJob objects; Rufio handles the rest.

// createPowerOffJob creates a BMCJob object with the required tasks for hardware power off.
func (bmrc *baseMachineReconcileContext) createPowerOffJob(hardware *tinkv1.Hardware) error {
	powerOffAction := rufiov1.HardPowerOff
	bmcJob := &rufiov1.BMCJob{
@chrisdoherty4 (Member) commented Jun 2, 2022

Can we de-duplicate some of this construction and perhaps have a separate function set or builder?

func (bmrc *baseMachineReconcileContext) createPowerOffJob(hardware ...) error {
  job := newBMCJobBuilder(bmrc.tinkerbellMachine, hardware)
  job.WithTask(rufiov1.Task{
    PowerAction: *powerOffAction,
  })

  if err := bmrc.client.Create(bmrc.ctx, job.Build()); err != nil {
    ...
  }
}

This lets us re-use that construction logic elsewhere in the program like in ensureHardware.

@chrisdoherty4 (Member) left a comment

Good stuff.

It seems like reconciliation does nothing if a BMCJob exists. How does CAPT know a job is complete, and does it need to do anything? I would think we would want to communicate via state or events as we move forward in the provisioning process.

Could you elaborate on the behavior you're expecting?

@pokearu force-pushed the capt-rufio-bmc-jobs branch from af35926 to d50b060 on June 15, 2022 at 18:24.
controller := true
bmcJob := &rufiov1.BMCJob{
	ObjectMeta: metav1.ObjectMeta{
		Name: fmt.Sprintf("%s-poweroff", bmrc.tinkerbellMachine.Name),
Contributor

Having a consistent name seems good so you don't get two running jobs conflicting, but this is never cleaned up. I don't think you necessarily need to do that in this PR, but can you at least open a GH issue stating that it needs to be thought through?

@pokearu (Contributor, Author) commented Jun 15, 2022

Sure! Will do, thanks.

// Uses the BMCRef from the hardware to create a BMCJob.
// The BMCJob is responsible for getting the machine to the desired state for provisioning.
// If a BMCJob is already found and has failed, we error.
func (mrc *machineReconcileContext) ensureHardwareProvisionJob(hardware *tinkv1.Hardware) error {
Member

Curious why the use of HardwareProvisionJob instead of BMCJob? This requires a translation that feels like it might be unnecessary.

Contributor Author

I intended it to be like "a job that gets hardware ready for provisioning", since it does multiple actions.
I tried to explain it a bit in the function comment.

Member

The challenge I see is that a HardwareProvisionJob is just a pointer to a BMCJob. The definition you give in the comments points us to a BMCJob. So to understand the result or behavior of this function we need to go to the BMCJob regardless. I know naming can be very subjective and difficult, but I'm having trouble seeing the value in creating this new construct of a HardwareProvisionJob.

}

// createHardwareProvisionJob creates a BMCJob object with the required tasks for hardware provisioning.
func (mrc *machineReconcileContext) createHardwareProvisionJob(hardware *tinkv1.Hardware, name string) error {
Member

This file is already very large. This functionality feels like a good opportunity to create a new file in this package just for bmc/rufio functionality.

Contributor Author

I was initially going to do that, but I thought we could benefit from a wider refactoring where we move and abstract some of the logic for creating Templates, Workflows, and Jobs, and even address reducing the duplication of some of these methods.

Is it alright to work on that in follow-up PRs so we can move forward with the functionality? :)

Member

Sure thing, just wanted to be sure the trade-off of growing the tech debt and the needed refactor work was brought up.

Comment on lines 248 to 276
if hardware.Spec.BMCRef == nil {
	bmrc.log.Info("Hardware BMC reference not present; skipping hardware power off",
		"BMCRef", hardware.Spec.BMCRef, "Hardware", hardware.Name)

	return bmrc.removeFinalizer()
}

// Fetch a poweroff BMCJob for the machine.
// If Job not found, we remove dependencies and create job.
bmcJob := &rufiov1.BMCJob{}
jobName := fmt.Sprintf("%s-poweroff", bmrc.tinkerbellMachine.Name)

err := bmrc.getBMCJob(jobName, bmcJob)
if err != nil {
	if apierrors.IsNotFound(err) {
		return bmrc.createPowerOffJob(hardware)
	}

	return fmt.Errorf("get bmc job for machine: %w", err)
}

// Check the Job conditions to ensure the power off job is complete.
if bmcJob.HasCondition(rufiov1.JobCompleted, rufiov1.ConditionTrue) {
	return bmrc.removeFinalizer()
}

if bmcJob.HasCondition(rufiov1.JobFailed, rufiov1.ConditionTrue) {
	return fmt.Errorf("bmc job %s/%s failed", bmcJob.Namespace, bmcJob.Name) // nolint:goerr113
}
Member

This feels like something that could benefit from being abstracted into its own method; that could improve maintainability and make it easier to remove, modify, etc.
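
Something like the following, with a hypothetical method name and reusing only the calls that appear in the excerpt above:

// ensurePowerOffJob is a hypothetical extraction of the block above: fetch the
// machine's power-off BMCJob, create it if missing, remove the finalizer once
// the job completes, and surface a failure otherwise.
func (bmrc *baseMachineReconcileContext) ensurePowerOffJob(hardware *tinkv1.Hardware) error {
	bmcJob := &rufiov1.BMCJob{}
	jobName := fmt.Sprintf("%s-poweroff", bmrc.tinkerbellMachine.Name)

	if err := bmrc.getBMCJob(jobName, bmcJob); err != nil {
		if apierrors.IsNotFound(err) {
			return bmrc.createPowerOffJob(hardware)
		}

		return fmt.Errorf("get bmc job for machine: %w", err)
	}

	if bmcJob.HasCondition(rufiov1.JobCompleted, rufiov1.ConditionTrue) {
		return bmrc.removeFinalizer()
	}

	if bmcJob.HasCondition(rufiov1.JobFailed, rufiov1.ConditionTrue) {
		return fmt.Errorf("bmc job %s/%s failed", bmcJob.Namespace, bmcJob.Name) // nolint:goerr113
	}

	return nil
}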

@pokearu force-pushed the capt-rufio-bmc-jobs branch from 0c58b05 to ebb93c3 on June 15, 2022 at 20:55.
	controllerutil.RemoveFinalizer(bmrc.tinkerbellMachine, infrastructurev1.MachineFinalizer)

	bmrc.log.Info("Patching Machine object to remove finalizer")

	return bmrc.patch()
}

// ensureBMCJobCompletionForDelete ensures the machine power off BMCJob is completed.
// Removes the machine finalizer to let the machine delete.
func (bmrc *baseMachineReconcileContext) ensureBMCJobCompletionForDelete(hardware *tinkv1.Hardware) error {
Member

The `ForDelete` part of this name gives me pause. With this in the name I should be able to infer that this method has some behavior around the deletion of a machine (tinkerbellMachine?). I don't see anything in the code that touches machine deletion.

Contributor Author

Oh, the return bmrc.removeFinalizer() is actually what allows the TinkerbellMachine objects to get deleted.
That's why I had it in the name.
We check job completion and then remove the finalizer.

@jacobweinstock added the ready-to-merge label on Jun 15, 2022.
The mergify bot merged commit 9e9c2a3 into tinkerbell:main on Jun 15, 2022.
@displague added this to the v0.2 milestone on Aug 24, 2022.