Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

template: fix panic in change_mode=script on client restart #24057

Merged
merged 1 commit into from
Sep 25, 2024

Conversation

tgross
Copy link
Member

@tgross tgross commented Sep 24, 2024

When we introduced change_mode=script to templates, we passed the driver handle down into the template manager so we could call its Exec method directly. But the lifecycle of the driver handle is managed by the taskrunner and isn't available when the template manager is first created. This has led to a series of patches trying to fixup the behavior (#15915, #15192, #23663, #23917). Part of the challenge in getting this right is using an interface to avoid the circular import of the driver handle.

But the taskrunner already has a way to deal with this problem using a "lazy handle". The other template change modes already use this indirectly through the Lifecycle interface. Change the driver handle Exec call in the template manager to a new Lifecycle.Exec call that reuses the existing behavior. This eliminates the need for the template manager to know anything at all about the handle state.

Fixes: #24051

@@ -1220,29 +1209,6 @@ OUTER:
t.Fatal(t, "should have received an event")
}
}

// remove the reference to the task handle and update the template contents
Copy link
Member Author

@tgross tgross Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: this is the test we added in #23917 that we knew even at the time didn't really test anything useful. Now that we've fixed the problem by fixing the architecture, this doesn't even have anything left to test so I'm removing it. There's no way to "remove the task handle" anymore, as the template manager doesn't have a reference to it at all. The code for this PR is covered in task_runner_hook_test.go

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to remove this, but I don't see a test file called task_runner_hook_test.go?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that's a typo, it's this test: https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/template_hook_test.go#L277 Note that only an integration test that spins up a real task and can restart a client can actually exercise this bug. So that falls under the general pile of "things we'd really like to do when we have solid E2E upgrade/restart testing".

@@ -80,6 +83,9 @@ func (h *DriverHandle) ExecStreaming(ctx context.Context,
command []string,
tty bool,
stream drivers.ExecTaskStream) error {
if h == nil {
return ErrTaskNotRunning
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes aren't really necessary now, but they'll provide belt-and-suspenders against future bugs.

When we introduced change_mode=script to templates, we passed the driver handle
down into the template manager so we could call its `Exec` method directly. But
the lifecycle of the driver handle is managed by the taskrunner and isn't
available when the template manager is first created. This has led to a series
of patches trying to fixup the behavior (#15915, #15192, #23663, #23917). Part
of the challenge in getting this right is using an interface to avoid the
circular import of the driver handle.

But the taskrunner already has a way to deal with this problem using a "lazy
handle". The other template change modes already use this indirectly through the
`Lifecycle` interface. Change the driver handle `Exec` call in the template
manager to a new `Lifecycle.Exec` call that reuses the existing behavior. This
eliminates the need for the template manager to know anything at all about the
handle state.

Fixes: #24051
Copy link
Member

@gulducat gulducat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

return
}
_, exitCode, err := tm.handle.Exec(script.Timeout, script.Command, script.Args)
_, exitCode, err := tm.config.Lifecycle.Exec(script.Timeout, script.Command, script.Args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels so much nicer 👏

@@ -1220,29 +1209,6 @@ OUTER:
t.Fatal(t, "should have received an event")
}
}

// remove the reference to the task handle and update the template contents
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to remove this, but I don't see a test file called task_runner_hook_test.go?

@tgross tgross merged commit cc9227b into main Sep 25, 2024
26 checks passed
@tgross tgross deleted the b-panic-template-change-script branch September 25, 2024 12:59
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 24, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/1.8.x backport to 1.8.x release line theme/crash theme/template type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nil pointer crash due to template change_mode script processing (1.8.3+)
2 participants