
[disk] Blacklist certain partitions by default #2492

Closed
ofek opened this issue Oct 31, 2018 · 6 comments

Comments

@ofek
Contributor

ofek commented Oct 31, 2018

Continuation of DataDog/datadog-agent#1961
Advanced filtering logic introduced in #2483

Adding new things to blacklist by default will be a breaking change and will likely require a major Agent release. As such, let's take the time to compile everything that should be excluded by default.

So far we have:

It should also be noted that since blacklists take precedence over whitelists, users would need to update both to re-enable something. Therefore, reducing how often that is necessary should be a goal. (A better way was later implemented in #7648.)
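
For reference, a minimal sketch of what the #2483 filtering options look like in a disk.yaml instance (option names as documented for the disk check at the time; the values are purely illustrative):

```yaml
init_config:

instances:
  - use_mount: false
    # Each option is a list of regular expressions.
    # Blacklist entries take precedence over whitelist entries when both match.
    file_system_whitelist:
      - ext[234]$
      - xfs$
    file_system_blacklist:
      - tmpfs$
    device_blacklist:
      - /dev/loop\d+
    mount_point_blacklist:
      - /proc/sys/fs/binfmt_misc
```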

cc @DataDog/agent-integrations @DataDog/agent-core @DataDog/container-integrations

cc original participants @techdragon @amineo @coreypobrien @sudermanjr @steinnes @j-vizcaino @nerdinand

@j-vizcaino
Contributor

j-vizcaino commented Oct 31, 2018

As far as filesystem metrics are concerned, using a whitelist should be the easiest and safest way to achieve what a user wants. I suspect the list of FS to consider this way would be pretty small (ext*, xfs, zfs, btrfs).
Given the myriad of pseudo filesystems currently available and mounted on Linux, building an exhaustive blacklist would be a daunting task. Even worse, since the family of pseudo filesystems is in constant evolution, that list would require maintenance over time.

It's also worth noting that the mount_point_blacklist default only addresses the non-containerized agent deployment. Again, FS whitelisting should be the answer here, automagically skipping the binfmt_misc pseudo FS.
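
A sketch of what that whitelist-only approach could look like in disk.yaml, assuming the file_system_whitelist option from #2483 (the regexes are illustrative):

```yaml
init_config:

instances:
  - use_mount: false
    # Only collect metrics for "real" file systems; everything else,
    # including pseudo file systems such as binfmt_misc, is skipped.
    file_system_whitelist:
      - ext[234]$
      - xfs$
      - zfs$
      - btrfs$
```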

> It should also be noted that since blacklists take precedence over whitelists, users would need to update both to re-enable something.

As a user, this behaviour seems a bit awkward: could we make it blacklist first, then whitelist entries after that? That would allow us to blacklist every FS by default, then "punch holes" by whitelisting a small set of FS that provide meaningful stats.

@zippolyte
Contributor

> As a user, this behaviour seems a bit awkward: could we make it blacklist first, then whitelist entries after that? That would allow us to blacklist every FS by default, then "punch holes" by whitelisting a small set of FS that provide meaningful stats.

Well, if you just want a specific set of FS, you only need to whitelist them, and only the whitelisted ones will be considered; there is no need to explicitly blacklist the ones you don't want.
The precedence of the blacklist over the whitelist only applies when there is an intersection between whitelisted elements and blacklisted ones. In that case, the elements that match both the whitelist and the blacklist will be blacklisted. Does that make sense?

This is the way we have implemented blacklist/whitelist in other integrations as well.
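
To illustrate that precedence rule with a concrete (assumed) configuration: below, ext4 and tmpfs both match the whitelist, but tmpfs also matches the blacklist, so only ext4 file systems end up being collected:

```yaml
instances:
  - use_mount: false
    file_system_whitelist:
      - ext4$
      - tmpfs$
    file_system_blacklist:
      - tmpfs$
```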

@sandstrom

sandstrom commented Apr 4, 2019

I was bitten by this recently. Mine looked something like this:

```
udev             1977284       0   1977284   0% /dev
tmpfs             397864     744    397120   1% /run
/dev/nvme0n1p1  24329532 7438244  16874904  31% /
tmpfs            1989304       0   1989304   0% /dev/shm
tmpfs               5120       0      5120   0% /run/lock
tmpfs            1989304       0   1989304   0% /sys/fs/cgroup
/dev/loop5         16896   16896         0 100% /snap/amazon-ssm-agent/784
/dev/loop2         18432   18432         0 100% /snap/amazon-ssm-agent/930
…
```

And the end result in Datadog was that the value of system.disk.in_use was the average over several devices (some are already filtered out by the Datadog agent by default, but not all of them). Since some were at 0%, they deflated the actual value, so our trigger at 80% usage wouldn't fire even though the main disk was at ~99% and the system crashed.
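
For illustration, counting only the eight mounts shown above and assuming the root volume later reached ~99%, an unfiltered average would be roughly (0 + 1 + 99 + 0 + 0 + 0 + 100 + 100) / 8 ≈ 37.5%, still well under an 80% threshold even though the only disk that matters is nearly full.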

For a tool such as DataDog, I don't think I'm alone in assuming that a monitor with something like warn if disk.in_use > 80 would "do the right thing" out of the box.

Just wanted to share this example.

@grv231

grv231 commented May 5, 2020

Faced similar issues (related to certain partitions) with our agent (version 7.17). Had to go with the workaround of adding a disk.yaml ConfigMap to our Kube DD agent DaemonSet manifests. I would really like this feature to be there by default so that the Kube manifest files are much cleaner. In addition, the documentation on DD does mention the issue, but never really says what has to be done (the actual steps). It could have easily saved a lot of time yesterday if at least the GitHub issue links were mentioned (just my 2 cents)
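
For anyone looking for the same workaround, a rough sketch of such a ConfigMap (the name and the mount_point_blacklist value are illustrative; the file then needs to be mounted into the Agent's conf.d directory via a volume in the DaemonSet):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-disk-check
data:
  disk.yaml: |
    init_config:
    instances:
      - use_mount: false
        mount_point_blacklist:
          - /host/proc/sys/fs/binfmt_misc
```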

@KIVagant

KIVagant commented May 27, 2020

Does anyone have a ready-to-use example of Helm values that one can use to ignore /host/proc/sys/fs/binfmt_misc?
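
A minimal sketch of what such values could look like with the official chart, assuming its datadog.confd mechanism for inline check configs and the disk check's mount_point_blacklist option (both taken from the docs of the time, not confirmed in this thread):

```yaml
datadog:
  confd:
    disk.yaml: |-
      init_config:
      instances:
        - use_mount: false
          mount_point_blacklist:
            - /host/proc/sys/fs/binfmt_misc
```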

@mx-psi
Member

mx-psi commented Sep 24, 2020

#7378 adds an option to ignore non-physical file systems, which is relevant for this goal. We still track all devices by default to avoid breaking changes.
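
A sketch of how opting in might look in disk.yaml, assuming the option added by #7378 is the include_all_devices flag (name assumed here, not quoted from the PR; it defaults to true so existing behaviour is preserved):

```yaml
init_config:

instances:
  - use_mount: false
    # Assumed flag: when set to false, only physical devices are collected,
    # skipping pseudo, memory, duplicate or inaccessible file systems.
    include_all_devices: false
```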

hithwen closed this as completed May 6, 2021
DataDog locked and limited conversation to collaborators May 6, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
