Fix hang boot podman #22985
Conversation
Force-pushed a81a4b9 to 2d23994
Force-pushed 769801d to e9dd3f0
@mheon PTAL
Apparently Go does not panic when reading from or writing to a nil channel; it just blocks. And given that we increment the counter, as you said, but then can never complete any job, the counter will never reach zero, so it just hangs on exit there, as you correctly noticed.
However, fixing it like this is dangerous. We initialize the channel with a size of 10, which means that after 10 sends it starts blocking. The blocking part is not an issue currently because, for some reason, we do the write to the channel in a separate goroutine:
go func() {
r.workerChannel <- f
}()
Thus blocking there would not cause an issue, and your fix works correctly.
However, I am unsure whether this will come back to bite us in the future. @mheon WDYT?
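For context, a minimal standalone sketch of the failure mode described above (hypothetical names, not the actual libpod code): a send on a nil channel blocks forever rather than panicking, so a job queued before the channel exists bumps the WaitGroup counter but can never complete, and the final Wait() hangs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy runtime mirroring the shape of the bug: workerChannel is nil
// because the worker queue was never set up.
type runtime struct {
	workerChannel chan func()
	workerGroup   sync.WaitGroup
}

// queue increments the counter, but the send blocks forever because
// workerChannel is still nil (no panic, the goroutine just stalls).
func (r *runtime) queue(f func()) {
	r.workerGroup.Add(1)
	go func() {
		r.workerChannel <- f // send on a nil channel: blocks forever
	}()
}

func main() {
	r := &runtime{} // queue never initialized
	r.queue(func() { fmt.Println("never runs") })

	done := make(chan struct{})
	go func() {
		r.workerGroup.Wait() // never returns: the queued job can never complete
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("shutdown finished")
	case <-time.After(2 * time.Second):
		fmt.Println("hang: counter never reaches zero, job is stuck on a nil channel")
	}
}
```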
I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up. Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?
I think this was caused by us adding removal code to the refresh logic: 9983e87. My best guess is that we remove some container in a pod, which then causes the container/pod stop/removal logic to add jobs to the queue.
Well, it depends what jobs are submitted; I think it makes sense to wait at least for the […]. In general, it is really not well defined how the queue is supposed to work.
At first I did just move the setup-worker function up, but podman crashed in the jobs, which is why I moved the queue setup into a separate function.
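For illustration, a rough sketch of that split (hypothetical names, not the actual libpod code): allocate the job channel very early in a small setup function and start the consumer goroutine later in init. Jobs queued in between simply buffer in the channel instead of blocking on a nil one.

```go
package main

import (
	"fmt"
	"sync"
)

type runtime struct {
	workerChannel chan func()
	workerGroup   sync.WaitGroup
}

// setupWorkerQueue only allocates the channel; safe to call very early,
// before anything like the refresh phase can queue jobs.
func (r *runtime) setupWorkerQueue() {
	r.workerChannel = make(chan func(), 10)
}

// startWorker consumes jobs; it can come up later in runtime init.
func (r *runtime) startWorker() {
	go func() {
		for f := range r.workerChannel {
			f()
			r.workerGroup.Done()
		}
	}()
}

func (r *runtime) queue(f func()) {
	r.workerGroup.Add(1)
	r.workerChannel <- f // buffers (up to cap) even before startWorker runs
}

func main() {
	r := &runtime{}
	r.setupWorkerQueue()                           // early: before refresh/removal work
	r.queue(func() { fmt.Println("refresh job") }) // queued before the worker exists
	r.startWorker()                                // later in init
	r.workerGroup.Wait()                           // no hang: the job is executed
}
```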
@lambinoo Can you rebase and please include the PR description in the commit message as well? I think the code is fine.
Will do!
Force-pushed 61b6aea to cfa5813
That should do it @Luap99, let me know if there's anything else I should do.
It seems that if some background tasks are queued in libpod's Runtime before the worker's channel is set up (e.g. in the refresh phase), they are not executed later on, but the workerGroup's counter is still ticked up. This leads podman to hang when the imageEngine is shut down, since it waits for the workerGroup to be done. Fixes containers#22984. Signed-off-by: Farya Maerten <[email protected]>
Force-pushed cfa5813 to c819c7a
LGTM
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: lambinoo, Luap99.
/lgtm
This is an attempt at fixing #22984.
It seems that if some background tasks are queued in libpod's Runtime before the worker's channel is set up (e.g. in the refresh phase), they are not executed later on, but the workerGroup's counter is still ticked up. This leads podman to hang when the imageEngine is shut down, since it waits for the workerGroup to be done. I'm not really sure why the function to queue a job works, as the queue object shouldn't exist, but I think that might be caused by my very sparse Go knowledge.
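On the "why does queuing even work" question: in Go the zero value of a channel field is nil, and a send on a nil channel blocks forever instead of panicking, so the queuing call appears to succeed while the job silently never runs. A tiny standalone illustration (not podman code):

```go
package main

import "fmt"

func main() {
	var ch chan func() // zero value of a channel is nil
	select {
	case ch <- func() {}: // a send on a nil channel can never proceed
		fmt.Println("sent")
	default:
		fmt.Println("a send on a nil channel would block forever (it does not panic)")
	}
}
```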
I would probably need some guidance to write a test for that one, as my test workflow has been:
reboot -f
journalctl list-jobs
to monitor the boot process. And I'm honestly a bit confused about how to do the reboot part in the tests.