Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

template: lock task handle before trying script check #23917

Merged
merged 1 commit into from
Sep 12, 2024

Conversation

tgross
Copy link
Member

@tgross tgross commented Sep 4, 2024

In #23663 we fixed the template hook so that change_mode="script" didn't lose track of the task handle during restores. But this revealed a second bug which is that access to the handle is not locked while in use, which can allow it to be removed concurrently.

Fixes: #23875

Note for reviewers: we're still waiting on we can get some reproduction hints from the reporter, but the missing lock is clearly a bug anyways.

@tgross tgross added type/bug theme/crash theme/template backport/ent/1.6.x+ent Changes are backported to 1.6.x+ent backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/1.8.x backport to 1.8.x release line labels Sep 4, 2024
@tgross tgross added this to the 1.8.4 milestone Sep 4, 2024
In #23663 we fixed the template hook so that `change_mode="script"` didn't lose
track of the task handle during restores. But this revealed a second bug which
is that access to the handle is not locked while in use, which can allow it to
be removed concurrently.

Fixes: #23875
Comment on lines +1224 to +1245
// remove the reference to the task handle and update the template contents
// again
harness.manager.SetDriverHandle(nil)
harness.consul.SetKV(t, key1, []byte(content1_3))
timeout = time.After(time.Duration(5*testutil.TestMultiplier()) * time.Second)

OUTER2:
for {
select {
case <-harness.mockHooks.RestartCh:
t.Fatal(t, "restart not expected")
case ev := <-harness.mockHooks.EmitEventCh:
if strings.Contains(
ev.DisplayMessage, "task driver handle is not available") {
break OUTER2
}
case <-harness.mockHooks.SignalCh:
t.Fatal(t, "signal not expected")
case <-timeout:
t.Fatal(t, "should have received an event that task driver handle is unavailable")
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for reviewers: unfortunately this test doesn't really exercise the bug that must have happened in #23875, because we'd have to concurrently make the change between the nil check in processScript and the subsequent Exec call, which is a very short race. Locally this test doesn't add anything meaningful to the time it takes the test to complete (always around ~3.2s on my machine), but I simultaneously hate to have a test that doesn't really test anything and a bugfix without a new test. I could go either way on removing it.

@tgross tgross marked this pull request as ready for review September 11, 2024 13:35
Copy link
Contributor

@pkazmierczak pkazmierczak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the missing lock for sure. I say leave the test, but I'm not a 100% sure I understand how this bug happens.

@tgross tgross merged commit 07aca67 into main Sep 12, 2024
19 checks passed
@tgross tgross deleted the template-handle-lock branch September 12, 2024 12:41
tgross added a commit that referenced this pull request Sep 24, 2024
When we introduced change_mode=script to templates, we passed the driver handle
down into the template manager so we could call its `Exec` method directly. But
the lifecycle of the driver handle is managed by the taskrunner and isn't
available when the template manager is first created. This has led to a series
of patches trying to fixup the behavior (#15915, #15192, #23663, #23917). Part
of the challenge in getting this right is using an interface to avoid the
circular import of the driver handle.

But the taskrunner already has a way to deal with this problem using a "lazy
handle". The other template change modes already use this indirectly through the
`Lifecycle` interface. Change the driver handle `Exec` call in the template
manager to a new `Lifecycle.Exec` call that reuses the existing behavior. This
eliminates the need for the template manager to know anything at all about the
handle state.

Fixes: #24051
tgross added a commit that referenced this pull request Sep 25, 2024
When we introduced change_mode=script to templates, we passed the driver handle
down into the template manager so we could call its `Exec` method directly. But
the lifecycle of the driver handle is managed by the taskrunner and isn't
available when the template manager is first created. This has led to a series
of patches trying to fixup the behavior (#15915, #15192, #23663, #23917). Part
of the challenge in getting this right is using an interface to avoid the
circular import of the driver handle.

But the taskrunner already has a way to deal with this problem using a "lazy
handle". The other template change modes already use this indirectly through the
`Lifecycle` interface. Change the driver handle `Exec` call in the template
manager to a new `Lifecycle.Exec` call that reuses the existing behavior. This
eliminates the need for the template manager to know anything at all about the
handle state.

Fixes: #24051
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 11, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/ent/1.6.x+ent Changes are backported to 1.6.x+ent backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/1.8.x backport to 1.8.x release line theme/crash theme/template type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

crash when making script check after client restart
2 participants