
Fix hang boot podman #22985

Merged
1 commit merged into containers:main on Jul 9, 2024

Conversation

lambinoo
Contributor

@lambinoo lambinoo commented Jun 12, 2024

This is an attempt at fixing #22984.

It seems that if background tasks are queued in libpod's Runtime before the worker channel is set up (e.g. during the refresh phase), they are never executed, but the workerGroup's counter is still incremented. This leads podman to hang when the imageEngine is shut down, since the shutdown waits for the workerGroup to finish.

I'm not entirely sure why the function to queue a job even works, since the queue object shouldn't exist yet; that may just be down to my very sparse Go knowledge.

I would probably need some guidance to write a test for that one, as my test workflow has been:

  1. Make a change to podman
  2. reboot -f
  3. Login and run journalctl list-jobs to monitor the boot process

And I'm honestly a bit confused about how to do the reboot part in the tests.

Fixed a bug that could cause the first podman command after boot to hang when using transient mode.
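
For anyone unfamiliar with the failure mode, here is a minimal, self-contained Go sketch of it. Only the workerChannel and workerGroup names follow this PR's discussion; the rest of the struct, the function names, and the shutdown timeout are purely illustrative and are not podman's actual code.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Stripped-down stand-in for libpod's Runtime (illustrative only).
type demoRuntime struct {
	workerChannel chan func()
	workerGroup   sync.WaitGroup
}

// queueJob mirrors the pattern quoted later in this conversation: bump the
// counter right away, then send from a goroutine. If workerChannel is still
// nil, that send blocks forever and Done() is never called for this job.
func (r *demoRuntime) queueJob(f func()) {
	r.workerGroup.Add(1)
	go func() {
		r.workerChannel <- f // a send on a nil channel blocks forever
	}()
}

// shutdown waits for all queued jobs, with a timeout so the demo exits.
func (r *demoRuntime) shutdown() {
	done := make(chan struct{})
	go func() {
		r.workerGroup.Wait() // never returns: the counter cannot reach zero
		close(done)
	}()
	select {
	case <-done:
		fmt.Println("clean shutdown")
	case <-time.After(2 * time.Second):
		fmt.Println("hang: a job was queued before the worker channel existed")
	}
}

func main() {
	r := &demoRuntime{} // worker channel not set up yet (e.g. during refresh)
	r.queueJob(func() {})
	r.shutdown()
}

Running this prints the "hang" branch after the timeout, because the send parks forever and the WaitGroup counter never returns to zero.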

@openshift-ci openshift-ci bot added the do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None label Jun 12, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2024
@github-actions github-actions bot added machine kind/api-change Change to remote API; merits scrutiny labels Jun 12, 2024
@lambinoo lambinoo force-pushed the fix-hang-boot-podman branch from a81a4b9 to 2d23994 Compare June 12, 2024 15:22
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2024
@lambinoo lambinoo force-pushed the fix-hang-boot-podman branch 2 times, most recently from 769801d to e9dd3f0 Compare June 12, 2024 15:27
@baude
Member

baude commented Jun 14, 2024

@mheon ptal

Member

@Luap99 Luap99 left a comment

Apparently Go does not panic when reading from or writing to a nil channel; it just blocks. And since we increment the counter, as you said, but can then never complete any job, the counter will never go back to zero, so it just hangs on exit there, as you correctly noticed.

However, fixing it like this is dangerous. We initialize the channel with a buffer size of 10, which means that after 10 sends it starts blocking. Now, the blocking part is not an issue currently, because we do the write to the channel in a separate goroutine for some reason?

go func() {
	r.workerChannel <- f
}()

Thus blocking there would not cause an issue, and your fix works correctly.
However, I am unsure whether this won't start biting us in the future. @mheon WDYT?
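
As a side note on the channel semantics mentioned above, the following standalone sketch (not podman code) shows that a send on a nil channel blocks rather than panicking, and that a channel with a buffer of 10 only starts blocking on the 11th send:

package main

import (
	"fmt"
	"time"
)

func main() {
	// A send on a nil channel does not panic; it blocks forever.
	var nilCh chan int
	go func() { nilCh <- 1 }() // this goroutine parks and never resumes

	// A channel with a buffer of 10 accepts 10 sends without a receiver;
	// the 11th send blocks until something receives.
	buf := make(chan int, 10)
	for i := 0; i < 10; i++ {
		buf <- i // all ten succeed immediately
	}
	go func() { buf <- 10 }() // blocks until a slot frees up

	time.Sleep(100 * time.Millisecond)
	fmt.Println("buffered:", len(buf)) // 10
	<-buf                              // frees one slot; the blocked send completes
	time.Sleep(100 * time.Millisecond)
	fmt.Println("after one receive:", len(buf)) // 10 again
}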

@mheon
Member

mheon commented Jun 18, 2024

I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up.

Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?

@Luap99
Member

Luap99 commented Jun 18, 2024

I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up.

I think this was caused by us adding removal code to the refresh logic: 9983e87

My best guess is that we remove some container in a pod that then causes the container/pod stop/removal logic to add jobs to the queue.

Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?

Well, it depends on what jobs are submitted. I think it makes sense to wait at least for the runtime.valid = true step. But one could also say that nobody should submit jobs that depend on the runtime unless they already have a valid runtime...

In general it is really not well defined how the queue is supposed to work.

@lambinoo
Contributor Author

I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up.

Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?

At first I did just move the worker setup function up, but podman then crashed inside the jobs, which is why I moved the queue setup into a separate function.
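
For readers following along, a minimal sketch of the two-step approach described here (channel creation split from worker start) might look like the following; the struct and function names are hypothetical and only approximate the idea, not the PR's actual diff:

package main

import (
	"fmt"
	"sync"
)

// Same stand-in runtime as in the earlier sketch; only the names
// workerChannel and workerGroup come from the PR discussion.
type demoRuntime struct {
	workerChannel chan func()
	workerGroup   sync.WaitGroup
}

// setupWorkerQueue creates the channel early, so jobs queued during the
// refresh phase land in the buffer instead of being lost on a nil channel.
func (r *demoRuntime) setupWorkerQueue() {
	r.workerChannel = make(chan func(), 10)
}

// startWorker begins draining the queue once the runtime is ready.
func (r *demoRuntime) startWorker() {
	go func() {
		for f := range r.workerChannel {
			f()
			r.workerGroup.Done()
		}
	}()
}

func (r *demoRuntime) queueJob(f func()) {
	r.workerGroup.Add(1)
	go func() { r.workerChannel <- f }()
}

func main() {
	r := &demoRuntime{}
	r.setupWorkerQueue() // early: before any refresh work can queue jobs
	r.queueJob(func() { fmt.Println("refresh-time job ran") })
	r.startWorker()      // later: once the runtime is considered valid
	r.workerGroup.Wait() // returns instead of hanging
}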

@Luap99 Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Jul 8, 2024
@Luap99
Member

Luap99 commented Jul 8, 2024

@lambinoo Can you rebase, and please include the PR description in the commit message as well?

I think the code is fine.

@lambinoo
Contributor Author

lambinoo commented Jul 8, 2024

Will do!

@lambinoo lambinoo force-pushed the fix-hang-boot-podman branch 2 times, most recently from 61b6aea to cfa5813 Compare July 9, 2024 09:14
@lambinoo
Contributor Author

lambinoo commented Jul 9, 2024

That should do it, @Luap99. Let me know if there's anything else I should do.

It seems that if background tasks are queued in libpod's Runtime before the worker channel is set up (e.g. during the refresh phase), they are never executed, but the workerGroup's counter is still incremented. This leads podman to hang when the imageEngine is shut down, since the shutdown waits for the workerGroup to finish.

fixes containers#22984

Signed-off-by: Farya Maerten <[email protected]>
@lambinoo lambinoo force-pushed the fix-hang-boot-podman branch from cfa5813 to c819c7a Compare July 9, 2024 09:15
Member

@Luap99 Luap99 left a comment

LGTM

Contributor

openshift-ci bot commented Jul 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lambinoo, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note and removed do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Jul 9, 2024
@mheon
Member

mheon commented Jul 9, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 9, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 6221e63 into containers:main Jul 9, 2024
82 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Oct 8, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Oct 8, 2024