cannot open /proc/sys/kernel/ns_last_pid #1199

Closed
ashwani29 opened this issue Sep 4, 2020 · 28 comments

Comments

@ashwani29

I'm experimenting with a restored container, which includes reading and changing the ns_last_pid value. The container is paused and I'm writing a new value to ns_last_pid, but I get a "Read-only file system" error. How is CRIU able to do this for its PID dance?

I followed this tutorial exactly: https://criu.org/Pid_restore (that is, I open and write the file the same way).
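
For reference, the approach discussed here (lock the file, then write the wanted PID minus one) can be sketched as follows. This is a minimal Python sketch, not CRIU's actual code: the name `set_last_pid` and its `path` parameter are made up for illustration, and the parameter exists only so the function can be tried on an ordinary file.

```python
import fcntl
import os

NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"


def set_last_pid(wanted_pid: int, path: str = NS_LAST_PID) -> None:
    """Write wanted_pid - 1 so the next process created should get wanted_pid.

    Hypothetical helper: a real restorer would keep the lock held until
    after the fork, while this sketch drops it when the file is closed.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # Advisory lock so cooperating writers do not interleave.
        fcntl.flock(fd, fcntl.LOCK_EX)
        os.write(fd, str(wanted_pid - 1).encode("ascii"))
    finally:
        # Closing the descriptor also releases the lock.
        os.close(fd)
```

On the real file this needs CAP_SYS_ADMIN and a writable /proc/sys, which is exactly what the rest of this thread is about.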

@ashwani29
Author

I just tested this with a different program running on the host; that program changes the value of ns_last_pid on the host kernel.
But when a program joins the restored container's namespaces and tries to set ns_last_pid, it gives the above error.
How can I do this for a different namespace, then?

@ashwani29
Author

@adrianreber ? help pls.

@ashwani29
Author

Can you please tell me how CRIU does it?

@adrianreber
Member

This is unrelated to CRIU if the file is read-only. Your container is probably mounting it as read-only. See readonlyPaths of config.json.

@ashwani29
Author

Wait, don't close it. I did go through the readonlyPaths section in config.json and removed the /proc/sys line from it, but I still get the "Read-only file system" error.

@ashwani29
Author

And if the file is read-only, how is CRIU able to write to it to restore the same PIDs?

@cyphar

cyphar commented Sep 5, 2020

(Sorry, I forgot to press "send".)

The reason this isn't working is that /proc/sys is mounted read-only inside containers as a general security hardening measure (almost no containers need to be able to write to it, and in the past some sysctls were writeable from inside a container when they shouldn't have been). In addition, the default AppArmor profile for containers also blocks this.

If you're using Docker you can disable these things by doing --security-opt systempaths=unconfined and --security-opt apparmor=unconfined. Note that your joining process needs CAP_SYS_ADMIN rights to write to ns_last_pid (if you use user namespaces you only need CAP_SYS_ADMIN in the user namespace that owns the PID namespace, rather than globally).

% docker run --security-opt systempaths=unconfined --security-opt apparmor=unconfined -it ubuntu bash
# echo 999 > /proc/sys/kernel/ns_last_pid # this works now
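
Whether /proc/sys really is mounted read-only can be checked from inside the container by reading /proc/self/mountinfo. The following is a minimal sketch, assuming the standard mountinfo layout (mount point in the fifth field, per-mount options in the sixth). The helper names are hypothetical, and escaped characters in mount points are not handled.

```python
from typing import List, Optional


def mount_options_for(path: str, mountinfo: str) -> Optional[List[str]]:
    """Return the options of the mount covering `path`, or None if none does."""
    best_point = ""
    best_opts = None
    for line in mountinfo.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # not a well-formed mountinfo line
        mount_point = fields[4]
        options = fields[5].split(",")
        covers = (
            mount_point == "/"
            or path == mount_point
            or path.startswith(mount_point + "/")
        )
        # Prefer the longest mount point; on a tie the later line wins.
        if covers and len(mount_point) >= len(best_point):
            best_point, best_opts = mount_point, options
    return best_opts


def is_readonly(path: str, mountinfo: str) -> bool:
    """True if the mount covering `path` carries the 'ro' option."""
    opts = mount_options_for(path, mountinfo)
    return opts is not None and "ro" in opts
```

A possible use inside the container: `is_readonly("/proc/sys/kernel/ns_last_pid", open("/proc/self/mountinfo").read())`.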

@ashwani29
Author

I'm using runc because I have to work with CRIU, and runc seems better suited for that. So what can I do in runc to solve this? As I said, I removed the /proc/sys line from the readonlyPaths section of config.json, but it doesn't work.

@cyphar

cyphar commented Sep 6, 2020

Add CAP_SYS_ADMIN to the set of capabilities (add to all of the arrays), remove readonly: true from the root section.
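
The edit suggested above can be applied to a parsed config.json along these lines. This is a sketch only: the function name `allow_ns_last_pid_write` is hypothetical, and dropping /proc/sys from readonlyPaths is included because it was discussed earlier in the thread, not because the comment above asks for it.

```python
import copy

# The five capability arrays of an OCI runtime config.
CAP_SETS = ("bounding", "effective", "inheritable", "permitted", "ambient")


def allow_ns_last_pid_write(config: dict) -> dict:
    """Return a modified copy of an OCI config; the input is left untouched."""
    cfg = copy.deepcopy(config)

    # Add CAP_SYS_ADMIN to all of the capability arrays.
    caps = cfg.setdefault("process", {}).setdefault("capabilities", {})
    for name in CAP_SETS:
        cap_list = caps.setdefault(name, [])
        if "CAP_SYS_ADMIN" not in cap_list:
            cap_list.append("CAP_SYS_ADMIN")

    # Remove "readonly" from the root section, if present.
    cfg.get("root", {}).pop("readonly", None)

    # Drop /proc/sys from readonlyPaths so it is not remounted read-only.
    linux = cfg.get("linux", {})
    if "readonlyPaths" in linux:
        linux["readonlyPaths"] = [
            p for p in linux["readonlyPaths"] if p != "/proc/sys"
        ]
    return cfg
```

The result can be written back with `json.dump(cfg, f, indent=2)` before the container is created again.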

@ashwani29
Author

ashwani29 commented Sep 6, 2020

Still the same error. Is it because the process joins every namespace of the container except the user namespace, for which it shows "Invalid argument"?

@ashwani29
Author

@cyphar ??

@ashwani29
Author

I also tested with Docker, using the same commands as above:

% docker run --security-opt systempaths=unconfined --security-opt apparmor=unconfined -it ubuntu bash
# echo 999 > /proc/sys/kernel/ns_last_pid

This gives the error below:

user@user-HP-Pavilion-Notebook:~$ docker run --security-opt systempaths=unconfined --security-opt apparmor=unconfined -it ubuntu bash
root@5bee85f3e838:/# echo 999 > /proc/sys/kernel/ns_last_pid
bash: echo: write error: Operation not permitted
root@5bee85f3e838:/# 

@avagin
Member

avagin commented Sep 9, 2020

@ashwani29 docker run --cap-add=SYS_ADMIN --rm -it ubuntu bash -c 'echo 999 > /proc/sys/kernel/ns_last_pid'

@ashwani29
Author

usr@usr:~/Desktop$ docker run --cap-add=SYS_ADMIN --rm -it ubuntu bash -c 'echo 999 > /proc/sys/kernel/ns_last_pid'
bash: /proc/sys/kernel/ns_last_pid: Read-only file system

@avagin still same error
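
Note that two different errors have appeared so far: "Operation not permitted" (EPERM) with the mount restrictions lifted, and "Read-only file system" (EROFS) with only the capability added. A small sketch that tells them apart; the function names and the hint strings are made up for illustration and only reflect what has been suggested in this thread.

```python
import errno


def explain_write_error(err: OSError) -> str:
    """Map the errno of a failed write to a hint about the likely cause."""
    if err.errno == errno.EROFS:
        return "mount is read-only: check readonlyPaths / systempaths"
    if err.errno in (errno.EPERM, errno.EACCES):
        return "not permitted: check CAP_SYS_ADMIN and security profiles"
    name = errno.errorcode.get(err.errno, str(err.errno))
    return "unexpected error: " + name


def try_write(path: str, value: int) -> str:
    """Try to write `value` to `path`; return "ok" or a hint on failure."""
    try:
        with open(path, "w") as f:
            f.write(str(value))
    except OSError as err:
        return explain_write_error(err)
    return "ok"
```

For example, `try_write("/proc/sys/kernel/ns_last_pid", 999)` run inside the container reports which of the two conditions is still in the way.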

@rst0git
Member

rst0git commented Sep 16, 2020

@ashwani29 you may need to add --privileged as well:

docker run --privileged --cap-add=SYS_ADMIN --rm -it ubuntu bash -c 'echo 999 > /proc/sys/kernel/ns_last_pid'

@ashwani29
Author

@rst0git It worked, thanks.
Can you tell me how to do this for a runc container? There it fails with "Read-only file system".
The solution suggested above is not working for runc, and I can't rely on docker for long as I've no idea how I can checkpoint and restore docker with criu and play with it.

@adrianreber
Member

@ashwani29 Does your container have the necessary capabilities to write to ns_last_pid? You need CAP_SYS_ADMIN for that.

Please show the capabilities you defined in config.json.

What do you mean with this:

I can't rely on docker for long as i've no idea how can i checkpoint and restore docker with criu and play with it.
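
Besides inspecting config.json, the capability question above can be checked on the running process by decoding the CapEff line of /proc/<pid>/status. A minimal sketch with hypothetical helper names; CAP_SYS_ADMIN is capability number 21 in linux/capability.h.

```python
CAP_SYS_ADMIN = 21  # bit number from linux/capability.h


def effective_caps(status_text: str) -> int:
    """Return the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The mask is printed as a hexadecimal number.
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(status_text: str, cap_bit: int = CAP_SYS_ADMIN) -> bool:
    """True if the given capability bit is set in the effective set."""
    return bool((effective_caps(status_text) >> cap_bit) & 1)
```

Inside the container, `has_cap(open("/proc/self/status").read())` shows whether the process that tries to write ns_last_pid actually holds CAP_SYS_ADMIN.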

@ashwani29
Author

Yes @adrianreber
Here is the config.json file:

{
	"ociVersion": "1.0.2-dev",
	"process": {
		"terminal": false,
		"user": {
			"uid": 0,
			"gid": 0
		},
		"args": [
			"sh",
			"./script.sh"
		],
		"env": [
			"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
			"TERM=xterm"
		],
		"cwd": "/",
		"capabilities": {
			"bounding": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_ADMIN"				
			],
			"effective": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_ADMIN"				
			],
			"inheritable": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_ADMIN"				
			],
			"permitted": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_ADMIN"
			],
			"ambient": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE",
				"CAP_SYS_ADMIN"
			]
		},
		"rlimits": [
			{
				"type": "RLIMIT_NOFILE",
				"hard": 1024,
				"soft": 1024
			}
		],
		"noNewPrivileges": true
	},
	"root": {
		"path": "rootfs",
		"readonly": false
	},
	"hostname": "runc",
	"mounts": [
		{
			"destination": "/proc",
			"type": "proc",
			"source": "proc"			
		},
		{
			"destination": "/dev",
			"type": "tmpfs",
			"source": "tmpfs",
			"options": [
				"nosuid",
				"strictatime",
				"mode=755",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/pts",
			"type": "devpts",
			"source": "devpts",
			"options": [
				"nosuid",
				"noexec",
				"newinstance",
				"ptmxmode=0666",
				"mode=0620",
				"gid=5",
				"rw"
			]
		},
		{
			"destination": "/dev/shm",
			"type": "tmpfs",
			"source": "shm",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"mode=1777",
				"size=65536k",
				"rw"
			]
		},
		{
			"destination": "/dev/mqueue",
			"type": "mqueue",
			"source": "mqueue",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"rw"
			]
		},
		{
			"destination": "/sys",
			"type": "sysfs",
			"source": "sysfs",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"rw"
			]
		},
		{
			"destination": "/sys/fs/cgroup",
			"type": "cgroup",
			"source": "cgroup",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"relatime",
				"rw"
			]
		}
	],
	"linux": {
		"resources": {
			"devices": [
				{
					"allow": true,
					"access": "rwm"
				}
			]
		},	
		"namespaces": [
			{
				"type": "pid"
			},
			{
				"type": "network"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			},
			{
				"type": "mount"
			}
		],
		"maskedPaths": [
			"/proc/acpi",
			"/proc/asound",
			"/proc/kcore",
			"/proc/keys",
			"/proc/latency_stats",
			"/proc/timer_list",
			"/proc/timer_stats",
			"/proc/sched_debug",
			"/sys/firmware",
			"/proc/scsi"
		],
		"readonlyPaths": [
			"/proc/bus",
			"/proc/irq",
			"/proc/sysrq-trigger"
		]
	}
}

I removed options like ro and added rw instead, and removed most /proc paths from readonlyPaths. Even then it's not working.

I mean that I don't know Docker that well and have never checkpointed/restored it with CRIU. runc is more understandable to me, as I've been using it for months.

What do you mean with this:

I can't rely on docker for long as I've no idea how can I checkpoint and restore docker with criu and play with it.

@ashwani29
Author

this is what I'm trying to do.

I just tested this for a different program running on the host, that program changes the value of ns_last_pid on the host kernel.
but when a program joins the restored container namespace and tries to set the ns_last_pid field, it is giving the above error.

@adrianreber
Member

Works for me:

# runc exec test-container bash
echo 1 > /proc/sys/kernel/ns_last_pid
cat /proc/sys/kernel/ns_last_pid
2

Not sure why it does not work for you.

@ashwani29
Author

This is what I'm getting now:

user@user:~/ctr2$ sudo runc exec ctr2 bash
[sudo] password for user: 
FATA[0000] join_namespaces:561 nsenter: failed to setns to /proc/9894/ns/user: Invalid argument 
ERRO[0000] exec failed: container_linux.go:353: starting container process caused: process_linux.go:99: executing setns process caused: exit status 1 

Can you share the filesystem you are using for the container? I want to know more about the environments runc creates for running a container.

@adrianreber
Member

Why is it mentioning user namespace there? There is no user namespace mentioned in config.json. My container root file system is running on NFS.

@ashwani29
Author

Oh sorry, in the meantime I made some more changes to config.json. Let me revert them and try again.

@ashwani29
Author

user@user:~/ctr2$ sudo runc exec ctr2 bash
ERRO[0000] exec failed: container_linux.go:353: starting container process caused: exec: "bash": executable file not found in $PATH 

And what is that, a BusyBox or an Ubuntu image?

My container root file system is running on NFS.

@adrianreber
Member

Then you probably have another shell in your container. You need to know what is in your container. It is not as if there are thousands of different shells. The content of the container is not really relevant for this exercise, but my container is running RHEL 7.

@ashwani29
Author

I found that there is an sh binary in the bin directory, so the command I used is:
sudo runc exec ctr2 sh
But then it looks like it hangs: the cursor just blinks and nothing proceeds.

@adrianreber
Member

It does not hang. Just type something.

@ashwani29
Author

@adrianreber A process that joins the ipc, net, pid, mnt and cgroup namespaces of the container is unable to write the ns_last_pid file. It doesn't throw any error, but the value written (a high value, 5000) is not reflected when reading it back, whereas with runc exec the write works and the value is reflected on reading.
I don't understand why this is happening.
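
The thread ends without an answer to this. One possible explanation, not confirmed by anyone above: according to setns(2), joining a PID namespace does not move the calling process itself; only children created afterwards are placed in it. If the write applies to the writer's own PID namespace, a process that only called setns() would be setting the value for the namespace it started in. The sketch below, with hypothetical helper names, compares the namespace links under /proc to check which PID namespace a process is really in. The `proc_root` parameter is an assumption added so the helpers can be exercised against a fake directory tree.

```python
import os


def ns_link(pid, ns: str, proc_root: str = "/proc") -> str:
    """Return the namespace identifier of a process, e.g. 'pid:[4026531836]'."""
    return os.readlink(os.path.join(proc_root, str(pid), "ns", ns))


def same_namespace(pid_a, pid_b, ns: str, proc_root: str = "/proc") -> bool:
    """True if both processes refer to the same namespace of type `ns`."""
    return ns_link(pid_a, ns, proc_root) == ns_link(pid_b, ns, proc_root)
```

In the joining process, after setns(), comparing `same_namespace("self", init_pid, "pid")` with `same_namespace("self", init_pid, "pid_for_children")` (where `init_pid` is the host PID of the container's init process) would show whether only future children, and not the process itself, ended up in the container's PID namespace.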


5 participants