-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix server shutdown not waiting for worker run completion #19560
Merged
shoenig
merged 8 commits into
hashicorp:main
from
marvinchin:fix-server-shutdown-not-waiting-for-worker-run-completion
Jan 5, 2024
Merged
Changes from 7 commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
46992a9
Move group into a separate helper module for reuse
marvinchin 60cd074
Add shutdownCh to worker
marvinchin 6953216
Make server shutdown block on workers' shutdownCh
marvinchin 87aedac
Fix waiting for eval broker state change blocking indefinitely
marvinchin 81dcbc2
Bound the amount of time server shutdown waits on worker completion
marvinchin 3c20f0c
Fix lostcancel linter error
marvinchin 4dca4ce
Fix worker test using unexpected worker constructor
marvinchin 356ea03
Add changelog
marvinchin File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
// Copyright (c) HashiCorp, Inc. | ||
// SPDX-License-Identifier: MPL-2.0 | ||
|
||
package group | ||
|
||
import ( | ||
"context" | ||
"sync" | ||
) | ||
|
||
// group wraps a func() in a goroutine and provides a way to block until it | ||
// exits. Inspired by https://godoc.org/golang.org/x/sync/errgroup | ||
type Group struct { | ||
wg sync.WaitGroup | ||
} | ||
|
||
// Go starts f in a goroutine and must be called before Wait. | ||
func (g *Group) Go(f func()) { | ||
g.wg.Add(1) | ||
go func() { | ||
defer g.wg.Done() | ||
f() | ||
}() | ||
} | ||
|
||
func (g *Group) AddCh(ch <-chan struct{}) { | ||
g.Go(func() { | ||
<-ch | ||
}) | ||
} | ||
|
||
// Wait for all goroutines to exit. Must be called after all calls to Go | ||
// complete. | ||
func (g *Group) Wait() { | ||
g.wg.Wait() | ||
} | ||
|
||
// Wait for all goroutines to exit, or for the context to finish. | ||
// Must be called after all calls to Go complete. | ||
func (g *Group) WaitWithContext(ctx context.Context) { | ||
doneCh := make(chan struct{}) | ||
go func() { | ||
defer close(doneCh) | ||
g.Wait() | ||
}() | ||
select { | ||
case <-doneCh: | ||
case <-ctx.Done(): | ||
} | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a question for the general Nomad maintainers, but my prior is that the Notifier is designed to be context-free and focus on channels. The shutdown channel being passed into
Run
was inline with that design decision and I think we could continue with it here. Is there a deeper reason for passing the context in that I'm missing?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem I ran into with that was that I needed a way to coordinate between
WaitForChange
andRun
.In particular, the race condition that I encountered was:
WaitForChange
tries to write to theunsubscribeCh
and blocks because it is fullRun
becauseshutdownCh
was closed and the function returnsRun
has terminated nothing reads fromunsubscribeCh
and causesWaitForChange
to block indefinitelySo, I think we need a way to have
WaitForChange
to unblock when it detects that the notifier has shut down. I suppose we could also represent this with anothershutdownCh
for the notifier itself, but I'm not sure which one is more idiomatic. Hopefully maintainers can chime in on which approach they prefer (or if there is a better way to do this!)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that I understand why using
ctx
here resolves the race which you describe. Doesn't it still occur based onsubscribeCh
in the same way, except on a different channel:shutdownCtx
was closedWaitForChange
tries to write tosubscribeCh
and blocks because it is full (line 90)I think the scenario in thish the context does help as you imply is one in which we may miss the notification on the
shutdownCh
and thus hang awaiting that event. That is the scenario of:WaitForChange
times outshutdownCh
is signalledRun
quitsWaitForChange
is entered againIn this scenario 4 misses the notification on 2 because it wasn't actually awaiting a signal (my limited understanding of channels is such that they do not buffer events but rather notify the subscribers at the time of emission). This is where the context allows you to carry a signal in a more persistent way which ensures that this other race cannot happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, I missed the other write to
subscribeCh
on L90. I'll push a fix.shutdownCh
is closed when shutdown starts (rather than signaled). So, I don't think that's possible.The race I was thinking is more long the lines of the producer not realizing that the consumer has stopped, and thus blocks indefinitely waiting for the consumer to do work.
I think in the scenario you described, the events I'm worried about is:
WaitForChange
blocks waiting for timeoutshutdownCh
is closedRun
quitsWaitForChange
hits the timeout and unblocks, then as part of the deferred function it tries to to write tounsubscribeCh
and gets blocked because it is fullRun
has quit, nothing reads fromunsubscribeCh
and soWaitForChange
blocks foreverSo, the shared context acts as a way for the producer to detect that the consumer has (or is soon going to) shut down and to not block waiting for it to do work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Right, my bad. Makes sense :)
Yup +1. I think this is what I had in mind, but I failed to describe it (you've captured it better). Agreed on the problem, up to the Hashi team to decide which way they would rather go :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the channel-heavy pattern is an unfortunate result of Nomad predating the context package by about a year. It would be nice to go back and clean things up but that was never a priority, and now it's channels all the way down.