Improve check performance by filtering its input before parsing #1875
This PR is a port of #1872 to `master`.
What does this PR do?
The kubelet's `/metrics/cadvisor` payload contains statistics on all cgroups, including system slices that are of no use to the `kubelet` check. The `kubelet` check currently filters these samples by looking for a non-empty `container_name` label, but this happens after we have already incurred the parsing, conversion and lookup costs.

On systems running a lot of system slices, this makes the `kubelet` check run for more than 15 seconds and use a lot of memory. One "pathological" host goes up to 40 seconds and more than 1 GB of memory used; 99.5% of its kubelet payload is system slices.

This PR injects a simple text-filtering component before the `prometheus_client` parsing logic, to remove these lines before incurring the parsing / conversion / lookup costs. For simplicity and performance, it is implemented as a list of blacklisted substrings instead of regexps. If no blacklist is set up, the filtering logic is bypassed completely.

The average `kubelet` check run on this test payload goes from 47784 ms down to 842 ms. There is some CPU overhead to the filtering (with a pre-filtered payload, the check run time is ~500 ms), but it is amortised significantly with even a few system slices. More info in this test notebook.

On a "regular" host with 15 containers and a dozen system slices, the patch lowers CPU usage while keeping memory usage constant.
We still need to optimise the processing pipeline to handle a large number of containers; this PR does not address that.
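The pre-parse filtering described above can be sketched as a substring blacklist applied per payload line. This is a minimal illustration, not the PR's actual code: the function name, the blacklist contents, and the sample metric lines are all assumptions.

```python
def filter_payload(text, blacklist):
    """Drop metric lines containing any blacklisted substring.

    With an empty blacklist, filtering is bypassed entirely and the
    payload is returned unchanged, so there is no per-line cost.
    (Illustrative sketch; not the actual PR implementation.)
    """
    if not blacklist:
        return text
    return "\n".join(
        line for line in text.split("\n")
        if not any(pattern in line for pattern in blacklist)
    )

# Example: dropping system-slice samples before the prometheus_client
# parser ever sees them (the "system.slice" pattern is an assumption).
payload = (
    'container_cpu_usage_seconds_total{id="/system.slice/sshd.service"} 1.2\n'
    'container_cpu_usage_seconds_total{container_name="web",id="/docker/abc"} 3.4'
)
filtered = filter_payload(payload, ["system.slice"])
```

Plain substring matching keeps the per-line cost to a cheap `in` check, which is why a blacklist of strings was preferred over regexps here.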
This fix is backported on top of `6.3.2` on the `datadog/agent-dev:xvello-kubelet-input-filter` image (with its jmx variant too).

Motivation
Make the `kubelet` check usable on hosts with lots of system slices.

Review checklist
- [ ] `no-changelog` label attached
- [ ] If PR impacts documentation, docs team has been notified or an issue has been opened on the documentation repo