
Fix hang boot podman #22985

Merged
1 commit merged into containers:main on Jul 9, 2024

Conversation

lambinoo
Contributor

@lambinoo lambinoo commented Jun 12, 2024

This is an attempt at fixing #22984.

It seems that if background tasks are queued in libpod's Runtime before the worker channel is set up (e.g. during the refresh phase), they are never executed, but the workerGroup's counter is still incremented. This leads podman to hang when the imageEngine is shut down, since the shutdown waits for the workerGroup to finish.

I'm not entirely sure why the function to queue a job even works, since the queue object shouldn't exist yet; that may just be down to my very sparse Go knowledge.

I would probably need some guidance to write a test for that one, as my test workflow has been:

  1. Make a change to podman
  2. reboot -f
  3. Login and run journalctl list-jobs to monitor the boot process

And I'm honestly a bit confused about how to do the reboot part in the tests.

Fixed a bug that could cause the first podman command after boot to hang when using transient mode.
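
For anyone unfamiliar with the failure mode, here is a minimal, self-contained Go sketch of it. Only the workerChannel and workerGroup names follow this PR's discussion; the rest of the struct, the function names, and the shutdown timeout are purely illustrative and are not podman's actual code.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Stripped-down stand-in for libpod's Runtime (illustrative only).
type demoRuntime struct {
	workerChannel chan func()
	workerGroup   sync.WaitGroup
}

// queueJob mirrors the pattern quoted later in this conversation: bump the
// counter right away, then send from a goroutine. If workerChannel is still
// nil, that send blocks forever and Done() is never called for this job.
func (r *demoRuntime) queueJob(f func()) {
	r.workerGroup.Add(1)
	go func() {
		r.workerChannel <- f // a send on a nil channel blocks forever
	}()
}

// shutdown waits for all queued jobs, with a timeout so the demo exits.
func (r *demoRuntime) shutdown() {
	done := make(chan struct{})
	go func() {
		r.workerGroup.Wait() // never returns: the counter cannot reach zero
		close(done)
	}()
	select {
	case <-done:
		fmt.Println("clean shutdown")
	case <-time.After(2 * time.Second):
		fmt.Println("hang: a job was queued before the worker channel existed")
	}
}

func main() {
	r := &demoRuntime{} // worker channel not set up yet (e.g. during refresh)
	r.queueJob(func() {})
	r.shutdown()
}

Running this prints the "hang" branch after the timeout, because the send parks forever and the WaitGroup counter never returns to zero.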

@openshift-ci openshift-ci bot added the do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None label Jun 12, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2024
@github-actions github-actions bot added machine kind/api-change Change to remote API; merits scrutiny labels Jun 12, 2024
@lambinoo lambinoo force-pushed the fix-hang-boot-podman branch from a81a4b9 to 2d23994 Compare June 12, 2024 15:22
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2024
@lambinoo lambinoo force-pushed the fix-hang-boot-podman branch 2 times, most recently from 769801d to e9dd3f0 Compare June 12, 2024 15:27
@baude
Member

baude commented Jun 14, 2024

@mheon ptal

Member

@Luap99 Luap99 left a comment

Apparently Go does not panic when reading from or writing to a nil channel; it just blocks. And since we increment the counter, as you said, but can then never complete any job, the counter will never go back to zero, so it just hangs on exit there, as you correctly noticed.

However, fixing it like this is dangerous. We initialize the channel with a buffer size of 10, which means that after 10 sends it starts blocking. Now, the blocking part is not an issue currently, because we do the write to the channel in a separate goroutine for some reason?

go func() {
	r.workerChannel <- f
}()

Thus blocking there would not cause an issue, and your fix works correctly.
However, I am unsure whether this won't start biting us in the future. @mheon WDYT?
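
As a side note on the channel semantics mentioned above, the following standalone sketch (not podman code) shows that a send on a nil channel blocks rather than panicking, and that a channel with a buffer of 10 only starts blocking on the 11th send:

package main

import (
	"fmt"
	"time"
)

func main() {
	// A send on a nil channel does not panic; it blocks forever.
	var nilCh chan int
	go func() { nilCh <- 1 }() // this goroutine parks and never resumes

	// A channel with a buffer of 10 accepts 10 sends without a receiver;
	// the 11th send blocks until something receives.
	buf := make(chan int, 10)
	for i := 0; i < 10; i++ {
		buf <- i // all ten succeed immediately
	}
	go func() { buf <- 10 }() // blocks until a slot frees up

	time.Sleep(100 * time.Millisecond)
	fmt.Println("buffered:", len(buf)) // 10
	<-buf                              // frees one slot; the blocked send completes
	time.Sleep(100 * time.Millisecond)
	fmt.Println("after one receive:", len(buf)) // 10 again
}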

@mheon
Member

mheon commented Jun 18, 2024

I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up.

Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?

@Luap99
Member

Luap99 commented Jun 18, 2024

I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up.

I think this was caused by us adding removal code to the refresh logic: 9983e87

My best guess is that we remove some container in a pod that then causes the container/pod stop/removal logic to add jobs to the queue.

Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?

Well, it depends on what jobs are submitted. I think it makes sense to wait at least for the runtime.valid = true step. But one could also say that nobody should submit jobs that depend on the runtime unless they already have a valid runtime...

In general it is really not well defined how the queue is supposed to work.

@lambinoo
Contributor Author

I don't actually know why we're adding jobs to the channel in a goroutine, but it does make things convenient for this case. I imagine it ought to be all right - we're effectively queuing jobs for completion once the parallel executor comes up.

Though perhaps we should just make the parallel executor come up sooner in runtime init? I don't think it has any dependencies on things like c/storage - is there a reason not to initialize it as the very first thing a runtime does?

At first I did just move the worker setup function up, but podman then crashed inside the jobs, which is why I moved the queue setup into a separate function.
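
For readers following along, a minimal sketch of the two-step approach described here (channel creation split from worker start) might look like the following; the struct and function names are hypothetical and only approximate the idea, not the PR's actual diff:

package main

import (
	"fmt"
	"sync"
)

// Same stand-in runtime as in the earlier sketch; only the names
// workerChannel and workerGroup come from the PR discussion.
type demoRuntime struct {
	workerChannel chan func()
	workerGroup   sync.WaitGroup
}

// setupWorkerQueue creates the channel early, so jobs queued during the
// refresh phase land in the buffer instead of being lost on a nil channel.
func (r *demoRuntime) setupWorkerQueue() {
	r.workerChannel = make(chan func(), 10)
}

// startWorker begins draining the queue once the runtime is ready.
func (r *demoRuntime) startWorker() {
	go func() {
		for f := range r.workerChannel {
			f()
			r.workerGroup.Done()
		}
	}()
}

func (r *demoRuntime) queueJob(f func()) {
	r.workerGroup.Add(1)
	go func() { r.workerChannel <- f }()
}

func main() {
	r := &demoRuntime{}
	r.setupWorkerQueue() // early: before any refresh work can queue jobs
	r.queueJob(func() { fmt.Println("refresh-time job ran") })
	r.startWorker()      // later: once the runtime is considered valid
	r.workerGroup.Wait() // returns instead of hanging
}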

@Luap99 Luap99 added the No New Tests Allow PR to proceed without adding regression tests label Jul 8, 2024
@Luap99
Member

Luap99 commented Jul 8, 2024

@lambinoo Can you rebase, and please include the PR description in the commit message as well?

I think the code is fine.

@lambinoo
Contributor Author

lambinoo commented Jul 8, 2024

Will do!

@lambinoo lambinoo force-pushed the fix-hang-boot-podman branch 2 times, most recently from 61b6aea to cfa5813 Compare July 9, 2024 09:14
@lambinoo
Contributor Author

lambinoo commented Jul 9, 2024

That should do it, @Luap99. Let me know if there's anything else I should do.

It seems that if background tasks are queued in libpod's Runtime before the worker channel is set up (e.g. during the refresh phase), they are never executed, but the workerGroup's counter is still incremented. This leads podman to hang when the imageEngine is shut down, since the shutdown waits for the workerGroup to finish.

fixes containers#22984

Signed-off-by: Farya Maerten <[email protected]>
@lambinoo lambinoo force-pushed the fix-hang-boot-podman branch from cfa5813 to c819c7a Compare July 9, 2024 09:15
Member

@Luap99 Luap99 left a comment

LGTM

Contributor

openshift-ci bot commented Jul 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lambinoo, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note and removed do-not-merge/release-note-label-needed Enforce release-note requirement, even if just None labels Jul 9, 2024
@mheon
Member

mheon commented Jul 9, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 9, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 6221e63 into containers:main Jul 9, 2024
82 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Oct 8, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Oct 8, 2024