feat: group memory.stats sock metric #3642

Open
wants to merge 1 commit into
base: master


@cafkafk cafkafk commented Jan 6, 2025


Notice: I've been unable to figure out how to regenerate the snapshot tests.
I've opened issue #3632 for this, but have yet to receive any replies.

I'm hoping that opening this PR will bring more attention to this change, so it
can receive feedback.

This adds the `sock` counter from the cgroup `memory.stat` file as a
metric in cAdvisor.

The motivation is that we've seen numerous examples at DBC Digital of
application developers creating applications that exhaust socket memory,
e.g. by accidentally creating too many TCP connections and not closing
them, or keeping around a few large allocations, or many other such
issues.

Because cAdvisor currently doesn't report socket memory usage, this has
been hard to monitor, and is typically noticed only once the OOM killer
is invoked.

By adding this metric, it will be possible to proactively handle socket
memory exhaustion (which is really kernel memory exhaustion), before it
becomes a potential incident, and to create alerting and enhance
observability of this failure mode.

Signed-off-by: Christina Sørensen <[email protected]>