Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thaw a paused container in cgroup v1 when it is forcely deleted. #1204

Merged
merged 2 commits into from
Oct 31, 2022

Conversation

cyyzero
Copy link
Contributor

@cyyzero cyyzero commented Sep 21, 2022

If force is not given, and cgroups is v1 and container is frozen, display error saying could not be killed as it was forzen. If force option is given, and cgroups is v1 and container is frozen, thaw it and send the kill signal. If cgroups is v2 , nothing special needs to be done.

Fix: #1129

Signed-off-by: Chen Yiyang [email protected]

If force is not given, and cgroups is v1 and container is frozen, display
error saying could not be killed as it was forzen. If force option is given,
and cgroups is v1 and container is frozen, thaw it and send the kill signal.
If cgroups is v2 , nothing special needs to be done.

Fix: youki-dev#1129

Signed-off-by: Chen Yiyang <[email protected]>
@cyyzero cyyzero marked this pull request as draft September 21, 2022 10:04
@codecov-commenter
Copy link

codecov-commenter commented Sep 21, 2022

Codecov Report

Merging #1204 (2a711e9) into main (281c0a9) will decrease coverage by 0.52%.
The diff coverage is 0.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1204      +/-   ##
==========================================
- Coverage   69.29%   68.77%   -0.53%     
==========================================
  Files         118      119       +1     
  Lines       12474    12595     +121     
==========================================
+ Hits         8644     8662      +18     
- Misses       3830     3933     +103     

@YJDoc2
Copy link
Collaborator

YJDoc2 commented Sep 29, 2022

Hey @cyyzero , I was going through implementation of runc of this, and one thing I noticed is that for single or multiple (https://github.com/opencontainers/runc/blob/main/libcontainer/container_linux.go#L370) , they first signal the processes, in the frozen state ( in fact they seem to freeze the cgroup before sending kill signal in case of all https://github.com/opencontainers/runc/blob/1c3b8dbaf440d16653d834b612258bbb28268730/libcontainer/init_linux.go#L530) then send kill signal, and then thaw the cgroup

In the PR you seem to thaw the cgroup before sending the kill signal. One of the issues with this is that between the time you get all processes from the cgroup and the time you send signal, there might be additional processes spawned as cgroup is not frozen. The runc's way seems to avoid it.

Also once you are done with the initial implementation that is in the current state, can I ask you to add --all option in the cmd interface? It is a pretty small change, and the value of it will be given as all argument to the kill. That will help us to pass one of the containerd test.

Thank you!

@cyyzero
Copy link
Contributor Author

cyyzero commented Oct 5, 2022

I might know why integration test TestContainerKillInitPidHost fails. I output command lines of each execution of youki:

runc --root /run/containerd/runc/testing --log /run/containerd-test/io.containerd.runtime.v2.task/testing/TestContainerKillInitPidHost/log.json --log-format json create --bundle /run/containerd-test/io.containerd.runtime.v2.task/testing/TestContainerKillInitPidHost --pid-file /run/containerd-test/io.containerd.runtime.v2.task/testing/TestContainerKillInitPidHost/init.pid TestContainerKillInitPidHost
runc --root /run/containerd/runc/testing --log /run/containerd-test/io.containerd.runtime.v2.task/testing/TestContainerKillInitPidHost/log.json --log-format json start TestContainerKillInitPidHost
runc --root /run/containerd/runc/testing --log /run/containerd-test/io.containerd.runtime.v2.task/testing/TestContainerKillInitPidHost/log.json --log-format json kill TestContainerKillInitPidHost 9
runc --root /run/containerd/runc/testing --log /run/containerd-test/io.containerd.runtime.v2.task/testing/TestContainerKillInitPidHost/log.json --log-format json kill --all TestContainerKillInitPidHost 9
runc --root /run/containerd/runc/testing --log /run/containerd-test/io.containerd.runtime.v2.task/testing/TestContainerKillInitPidHost/log.json --log-format json kill --all TestContainerKillInitPidHost 9
runc --root /run/containerd/runc/testing --log /run/containerd-test/io.containerd.runtime.v2.task/testing/TestContainerKillInitPidHost/log.json --log-format json delete TestContainerKillInitPidHost
runc --root /run/containerd/runc/testing --log /run/containerd-test/io.containerd.runtime.v2.task/testing/TestContainerKillInitPidHost/log.json --log-format json delete --force TestContainerKillInitPidHost

TestContainerKillInitPidHost runs the container without creating pid namespace, so the child process will not exit automatically after killing the init process, it is necessary to use kill -all.

https://github.com/containerd/containerd/blob/31f9d13f0cc3a7d60e009cb8e76ccb188d4a76d7/integration/client/container_linux_test.go#L1120

I haven't read the containerd or shim-v2 source code , so I'm not quite clear why the kill command is called multiple times. It seems the first kill has no options. In the previous implementation, the container status becomes stopped after kill, and subsequent kill --alls can't pass the can_kill check, so the child process could not be killed correctly.

In fact, in the runc source code, it doesn't do status check for --all option.
https://github.com/opencontainers/runc/blob/1102f3fc9d5208961f19d70618c3be251466ee3f/libcontainer/container_linux.go#L361-L371

@cyyzero
Copy link
Contributor Author

cyyzero commented Oct 5, 2022

The latest commit seems to be able to pass TestContainerKillInitPidHost and TestContainerKillInitKillsChildWhenNotHostPid.

cc @YJDoc2

@YJDoc2
Copy link
Collaborator

YJDoc2 commented Oct 6, 2022

Hey @cyyzero Thanks a lot for taking time and updating the PR! It is great that more integration tests are passing now!
I have been running into same issues with runc impl for making some other tests pass, where it does not exactly align with the "standard spec". However, as runc is quite famous and is used a lot, it has become a sort of "standard" itself 😅

I'll take a look at the code around weekend, and get back here. Thanks :)

Copy link
Collaborator

@YJDoc2 YJDoc2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, great work! I have left some comments, please take a look.

}
}
self.set_status(ContainerStatus::Stopped).save()?;
std::process::exit(0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we had exit here in the original implementation... libcontainers is a library and should not call exit unless extreme conditions. Ideally we should let users of the library decide what to do after calling this function. It would be a better idea to return Ok(()) from here. @Furisto if you get time, can you check and let know if there was any particular reasoning behind exiting from here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. return Ok(()) is better than exiting here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make that change, where we return the result instead of exiting? Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YJDoc2 Sure😄

@YJDoc2
Copy link
Collaborator

YJDoc2 commented Oct 25, 2022

@cyyzero Apologies for the delay in my response. I have updated the comments, and if you can change the process exit to return result and add the --all option to the cli part (liboci-cli and the youki actually invoking the command), this PR is fine. Can you make that change and un-draft this PR, so we can approve and merge it? Thanks a lot!

@utam0k , as this does change the behavior of kill slightly, and adds an --all flag, should we bump the version of youki, liboci-cli and libcontainers?

Again, apologies to both for the delay on my side 😅

@utam0k
Copy link
Member

utam0k commented Oct 25, 2022

@utam0k , as this does change the behavior of kill slightly, and adds an --all flag, should we bump the version of youki, liboci-cli and libcontainers?

I'm planning to release v0.0.4 soon if the contained test is passed with youki after merging this PR, so there is no problem. Thanks for your concern ;)

To pass containerd integration test TestContainerKillInitPidHost, we
need to allow a container to be killed --all again when its status is
stopped.

Fix: youki-dev#1225

Signed-off-by: Chen Yiyang <[email protected]>
@cyyzero
Copy link
Contributor Author

cyyzero commented Oct 29, 2022

https://github.com/containers/youki/blob/6d05dd2f60198aaecb7c93d08c08d7db4fbc6600/crates/liboci-cli/src/kill.rs#L5-L11
@YJDoc2 Thanks for your review! It seems that the --all option has been added to the liboci-cli before. And I have changed some exit(0) to return Ok(()).

@cyyzero cyyzero marked this pull request as ready for review October 29, 2022 11:03
@utam0k utam0k requested a review from YJDoc2 October 29, 2022 12:50
Copy link
Collaborator

@YJDoc2 YJDoc2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

@YJDoc2
Copy link
Collaborator

YJDoc2 commented Oct 31, 2022

Hey @cyyzero Thanks for your contribution and efforts ! :)

@YJDoc2 YJDoc2 merged commit 0b90f8f into youki-dev:main Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't kill a paused container in cgroup v1 environment.
4 participants