very high memory usage due to kernfs_node_cache slabs #1927
I was able to reproduce this in qemu. However, dropping caches with …
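For reference, the usual interface for dropping reclaimable caches (which is where kernfs_node_cache objects live) is /proc/sys/vm/drop_caches; the exact command used above is truncated, so this is only an illustrative sketch:

```sh
# Flush dirty data first so dropping caches reclaims as much as possible.
sync
# 2 = free reclaimable slab objects (dentries, inodes, kernfs nodes, ...);
# 3 = also drop the page cache. Run as root.
echo 2 > /proc/sys/vm/drop_caches

# Check how much slab memory remains in use afterwards.
grep Slab: /proc/meminfo
```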
This is also an issue for us; see below for details of the setup. We are running on AWS on m4.xlarge, and free memory keeps getting lower and lower. All we run on the machine is etcd 3.1.8 and the Datadog agent. top/ps show resident memory of around 100 MB for etcd, and it's the biggest memory user; everything else is much smaller. However, kernfs_node_cache was also really high. When we did …
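The command at the end of that comment is truncated; a typical way to see how much memory kernfs_node_cache is holding (assuming standard procps tooling on the host) would be something like:

```sh
# Top slab caches sorted by cache size (run as root);
# kernfs_node_cache should be near the top.
slabtop --once --sort=c | head -n 20

# Or grep the raw counters directly.
grep -E '^# name|^kernfs_node_cache' /proc/slabinfo
```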
@chobomuffin, is this issue affecting the reliability or performance of the system, or does it appear to be cosmetic only?
Well, on our dev environment the etcd process ended up getting killed. Our staging machines have 16 GiB, so it is taking a while for them to get as low as the dev boxes got.
Could you post the kernel logs from that event?
I can't yet; those machines have all been rebooted since. I will wait for it to happen again (it took around 5 days of uptime) and post the logs.
If the AWS instances still exist, you can get the kernel logs for the previous boot with …
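The command itself is truncated above; on a systemd-based system such as Container Linux, previous-boot kernel messages can usually be pulled with journalctl (sketch, assuming the journal is persisted across reboots):

```sh
# -k: kernel messages only; -b -1: the previous boot.
journalctl -k -b -1 --no-pager
```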
Motivation is to avoid a serious memory leak on etcd nodes, as in coreos/bugs#1927
JFYI, that is what we did to mitigate the issue in the kube-aws context: kubernetes-retired/kube-aws#705
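Without claiming this is what kube-aws#705 actually does, a common stop-gap for unbounded reclaimable-slab growth is to drop the slab caches periodically once they cross a threshold, e.g. from a cron job or systemd timer. The threshold below is a made-up example value:

```sh
#!/bin/bash
# Hypothetical workaround, not the actual kube-aws change. Run as root.
THRESHOLD_KB=$((2 * 1024 * 1024))                  # 2 GiB, arbitrary example
slab_kb=$(awk '/^Slab:/ {print $2}' /proc/meminfo) # current slab usage in kB
if [ "$slab_kb" -gt "$THRESHOLD_KB" ]; then
    sync
    echo 2 > /proc/sys/vm/drop_caches              # free reclaimable slab objects
fi
```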
Closing due to inactivity.
It's unfortunate that this issue was not researched further. I have the same issue (albeit not using rkt but an in-house container system): kernfs_node_cache grows until it puts too much pressure on the page table and application caches, strongly degrading performance. If @ytsarev says that switching to Docker solves this problem, then Docker must be doing something to prevent this kernfs_node_cache growth that rkt does not.
After some review I believe the underlying issue is sshd socket activation: systemd/systemd#6567
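That would fit the symptom: kernfs backs pseudo-filesystems such as sysfs and cgroupfs, so per-connection socket activation that creates many transient units (and thus many cgroup directories) shows up as kernfs_node_cache growth. A rough way to check for that, assuming the cgroup hierarchy is mounted under /sys/fs/cgroup:

```sh
# Count cgroup directories; a number that keeps climbing suggests leaked
# transient units from socket activation or container churn.
find /sys/fs/cgroup -type d | wc -l
```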
This requires a kernel fix to avoid unbounded cache growth. |
@riking: are you referring to a particular kernel fix that has already been merged in a recent version?
Issue Report
Bug
Container Linux Version
Environment
AWS
Expected Behavior
Slab allocations stay at a steady level
Actual Behavior
We found high memory usage on all our etcd servers, which periodically run a healthcheck in rkt containers. rkt run with the default stage1 seems to leak kernfs_node_cache allocations.
Reproduction Steps
Run the healthcheck container repeatedly with rkt run (default stage1), then observe Slab: in /proc/meminfo (grep Slab: /proc/meminfo) steadily grow.
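The exact healthcheck invocation is not preserved here, so the following is only an illustrative reproduction loop; the image name and interval are placeholders:

```sh
# Hypothetical repro: repeatedly start a short-lived container with the
# default stage1 and watch slab usage climb.
while true; do
    rkt run --insecure-options=image docker://busybox --exec=/bin/true
    grep Slab: /proc/meminfo
    sleep 30
done
```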
Other Information
On a server running for just over a day, with the healthcheck executing every 20-30 seconds:
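The slab statistics that followed are not preserved here; the per-cache numbers were presumably collected from /proc/slabinfo, e.g. with something along these lines:

```sh
# Memory held by kernfs_node_cache, in MiB (run as root).
# /proc/slabinfo columns: name active_objs num_objs objsize objperslab pagesperslab ...
awk '$1 == "kernfs_node_cache" {printf "%.1f MiB\n", $3 * $4 / 1048576}' /proc/slabinfo
```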