limit of file descriptors inside a container always is 1024 #2532

Closed
unixlab opened this issue Nov 11, 2021 · 6 comments · Fixed by #2565
Labels: kind/bug, kind/regression, priority/critical-urgent

@unixlab

unixlab commented Nov 11, 2021

What happened:
Since #2321 was merged (via #2465), the file descriptor limit inside a container is always 1024.

What you expected to happen:
The same behavior as before the merge: the limit is inherited from the containerd process.

How to reproduce it (as minimally and precisely as possible):
kind build from branch v0.11.1

15:06:07 /tmp/kind [(HEAD detached at v0.11.1)] $ bin/kind version
kind v0.11.1 go1.16.4 linux/amd64
15:06:12 /tmp/kind [(HEAD detached at v0.11.1)] $ bin/kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.21.1) 🖼 
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋
15:08:04 /tmp/kind [(HEAD detached at v0.11.1)] $ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
1073741816
pod "test" deleted

kind build from current main

15:11:16 /tmp/kind [main] $ bin/kind version
kind v0.12.0-alpha+40cca930158358 go1.17.2 linux/amd64
15:11:18 /tmp/kind [main] $ bin/kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.22.2) 🖼
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋
15:11:41 /tmp/kind [main] $ kubectl run -it --restart=Never --rm test --image=alpine -- ash -c 'ulimit -n'
1024
pod "test" deleted
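
For images that ship a Python interpreter, the same check can be made from inside the process rather than through the shell builtin. This is a minimal sketch using the standard library `resource` module (Unix only); the helper names `nofile_limits` and `describe` are made up for illustration and are not part of kind or containerd.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE pair of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(value):
    """Render a limit, mapping RLIM_INFINITY to the word 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # A plain `ulimit -n` in a shell typically reports the soft limit.
    print(f"soft={describe(soft)} hard={describe(hard)}")
```

Printing both values helps tell apart a low soft limit that the process could still raise from a low hard limit that it cannot.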

Anything else we need to know?:
I assume this is because #2321 introduced a base spec.
The base spec is configured here[0] and generated during the docker image build here[1].
ctr oci spec generates a default spec, which for unix is defined here[2] and has RLIMIT_NOFILE set to 1024.

Containerd honors the base spec if one is set[3] and clears the settings if none is set.
So if there is no base spec, WithoutDefaultSecuritySettings is called, which clears the limit[4].
If the limit is cleared, containerd falls back to the limit of the containerd process[5].
Due to this systemd setting[6], containerd itself fell back to the host system's value.
This was the case before the merge.

But now that there is a base spec, we never reach this code path, which means that the value from the base spec (1024) is always used.
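
The decision described above can be modeled in a few lines. This is only a simplified sketch of the behavior as described in this issue, not containerd's actual code; the function name `effective_nofile` and its parameters are hypothetical, and the case of a base spec without an `rlimits` entry is an assumption (it is treated the same as a cleared limit).

```python
from typing import Optional


def effective_nofile(base_spec: Optional[dict], daemon_limit: int) -> int:
    """Model which RLIMIT_NOFILE soft limit a container ends up with.

    base_spec    -- parsed base spec JSON, or None when no base spec is configured
    daemon_limit -- the limit of the containerd process itself
    """
    if base_spec is not None:
        rlimits = (base_spec.get("process") or {}).get("rlimits") or []
        for entry in rlimits:
            if entry.get("type") == "RLIMIT_NOFILE":
                # A base spec that carries the limit wins.
                return entry["soft"]
    # No base spec, or no RLIMIT_NOFILE entry in it (assumption):
    # the container inherits the limit of the containerd process.
    return daemon_limit
```

With the values from the reproduction, `effective_nofile(None, 1073741816)` gives the old behavior, while passing a spec whose `rlimits` holds `RLIMIT_NOFILE` with a soft limit of 1024 gives 1024.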

I have at least one use case where I need a value higher than 1024.
My current workaround is to manipulate the base spec in the running containers:

docker exec -ti <container id of worker node> bash -c 'sed -i "s/1024/65536/g" /etc/containerd/cri-base.json && systemctl restart containerd'
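
Note that the `sed` expression replaces every literal `1024` in the file, not only the rlimit. A more targeted variant could parse the JSON and touch only the `RLIMIT_NOFILE` entry. The following is a sketch under that assumption; `set_nofile_limit` and `rewrite_spec_file` are hypothetical helper names, the file path would be the `/etc/containerd/cri-base.json` from the command above, and containerd still has to be restarted afterwards.

```python
import json


def set_nofile_limit(spec: dict, limit: int) -> dict:
    """Set soft and hard RLIMIT_NOFILE in a parsed OCI spec, in place."""
    process = spec.setdefault("process", {})
    rlimits = process.get("rlimits") or []
    for entry in rlimits:
        if entry.get("type") == "RLIMIT_NOFILE":
            entry["soft"] = limit
            entry["hard"] = limit
            break
    else:
        # No RLIMIT_NOFILE entry yet: add one.
        rlimits.append({"type": "RLIMIT_NOFILE", "hard": limit, "soft": limit})
    process["rlimits"] = rlimits
    return spec


def rewrite_spec_file(path: str, limit: int) -> None:
    """Load the spec at `path`, update the limit, and write it back."""
    with open(path) as fh:
        spec = json.load(fh)
    set_nofile_limit(spec, limit)
    with open(path, "w") as fh:
        json.dump(spec, fh, indent=4)
        fh.write("\n")
```

For example, `rewrite_spec_file("/etc/containerd/cri-base.json", 65536)` would mirror the `sed` call while leaving unrelated values such as mount sizes untouched.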

I currently don't know what the best way to fix this would be, but I'm happy to help if someone has an idea.

Environment:

  • kind version: kind v0.12.0-alpha+40cca930158358 go1.17.2 linux/amd64
  • Kubernetes version: 1.22.2
  • Docker version: 20.10.9
  • OS: Manjaro Linux with Kernel 5.13.19-2-MANJARO

[0] https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/etc/containerd/config.toml#L21
[1] https://github.com/kubernetes-sigs/kind/blob/main/images/base/Dockerfile#L152
[2] https://github.com/containerd/containerd/blob/main/oci/spec.go#L132
[3] https://github.com/containerd/containerd/blob/main/pkg/cri/server/container_create_linux.go#L127
[4] https://github.com/containerd/containerd/blob/main/pkg/cri/opts/spec_linux.go#L89
[5] containerd/cri#515 (comment)
[6] https://github.com/kubernetes-sigs/kind/blob/main/images/base/files/etc/systemd/system/containerd.service#L23

@unixlab unixlab added the kind/bug Categorizes issue or PR as related to a bug. label Nov 11, 2021
@aojea
Contributor

aojea commented Nov 11, 2021

great catch

cc @BenTheElder

@aojea aojea added kind/regression Categorizes issue or PR as related to a regression from a prior release. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Nov 11, 2021
@BenTheElder
Member

This is pretty frustrating, is there a way to get the actual default base spec?

@BenTheElder BenTheElder added this to the v0.12.0 milestone Nov 15, 2021
@aojea
Contributor

aojea commented Nov 16, 2021

@AkihiroSuda can you advise here?

@BenTheElder
Member

I filed containerd/containerd#6262 to discuss being able to obtain the CRI base spec upstream. In the meantime, we should probably carefully inspect where CRI deviates from the OCI base spec and duplicate this behavior, I guess ...

@BenTheElder
Member

So currently ctr oci spec is:

{
    "ociVersion": "1.0.2-dev",
    "process": {
        "user": {
            "uid": 0,
            "gid": 0
        },
        "cwd": "/",
        "capabilities": {
            "bounding": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "effective": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "inheritable": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ],
            "permitted": [
                "CAP_CHOWN",
                "CAP_DAC_OVERRIDE",
                "CAP_FSETID",
                "CAP_FOWNER",
                "CAP_MKNOD",
                "CAP_NET_RAW",
                "CAP_SETGID",
                "CAP_SETUID",
                "CAP_SETFCAP",
                "CAP_SETPCAP",
                "CAP_NET_BIND_SERVICE",
                "CAP_SYS_CHROOT",
                "CAP_KILL",
                "CAP_AUDIT_WRITE"
            ]
        },
        "rlimits": [
            {
                "type": "RLIMIT_NOFILE",
                "hard": 1024,
                "soft": 1024
            }
        ],
        "noNewPrivileges": true
    },
    "root": {
        "path": "rootfs"
    },
    "mounts": [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc",
            "options": [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": [
                "nosuid",
                "strictatime",
                "mode=755",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/pts",
            "type": "devpts",
            "source": "devpts",
            "options": [
                "nosuid",
                "noexec",
                "newinstance",
                "ptmxmode=0666",
                "mode=0620",
                "gid=5"
            ]
        },
        {
            "destination": "/dev/shm",
            "type": "tmpfs",
            "source": "shm",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "mode=1777",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/mqueue",
            "type": "mqueue",
            "source": "mqueue",
            "options": [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "ro"
            ]
        },
        {
            "destination": "/run",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": [
                "nosuid",
                "strictatime",
                "mode=755",
                "size=65536k"
            ]
        }
    ],
    "linux": {
        "resources": {
            "devices": [
                {
                    "allow": false,
                    "access": "rwm"
                }
            ]
        },
        "cgroupsPath": "/default",
        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            },
            {
                "type": "network"
            }
        ],
        "maskedPaths": [
            "/proc/acpi",
            "/proc/asound",
            "/proc/kcore",
            "/proc/keys",
            "/proc/latency_stats",
            "/proc/timer_list",
            "/proc/timer_stats",
            "/proc/sched_debug",
            "/sys/firmware",
            "/proc/scsi"
        ],
        "readonlyPaths": [
            "/proc/bus",
            "/proc/fs",
            "/proc/irq",
            "/proc/sys",
            "/proc/sysrq-trigger"
        ]
    }
}

This doesn't contain SELinux or AppArmor profiles, so we can just delete the rlimits for now and continue discussing long-term options in containerd/containerd#6262.
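
Deleting the rlimits from a generated spec could look roughly like the sketch below. It is an illustration only and not necessarily how the fix referenced in this issue was implemented; `drop_rlimits` and `drop_rlimits_json` are hypothetical names, and the input is assumed to be the JSON text printed by `ctr oci spec`.

```python
import copy
import json


def drop_rlimits(spec: dict) -> dict:
    """Return a copy of a parsed OCI spec without the process.rlimits key."""
    stripped = copy.deepcopy(spec)
    process = stripped.get("process")
    if isinstance(process, dict):
        # Removing the key entirely leaves no limit for the runtime to apply.
        process.pop("rlimits", None)
    return stripped


def drop_rlimits_json(text: str) -> str:
    """Same as drop_rlimits, but from JSON text to JSON text."""
    return json.dumps(drop_rlimits(json.loads(text)), indent=4) + "\n"
```

Everything else in the spec, such as capabilities, mounts, and masked paths, passes through unchanged, so the result stays as close as possible to the generated default.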

@BenTheElder
Member

should be fixed in the latest image listed in v0.11.1 and in kind @ HEAD
