fix(bpf): Fix overhead when sampling #1685
Conversation
Update the register counters one step before metrics sample is taken instead of updating registers every time which is increasing overhead Signed-off-by: Vimal Kumar <[email protected]>
🤖 SeineSailor Here's a concise summary of the pull request changes:
Summary: This pull request reduces the overhead of the BPF program when sampling. The key change is updating the register counters one step before a metrics sample is taken, rather than on every event.
Impact: This change improves performance by reducing overhead, as demonstrated by the average nanoseconds per count.
Observations: The fix resolves issue #1607 and is a valuable optimization for the BPF program. It's essential to ensure that the updated …
@vimalk78 Per the comment - and further discussion in #1607 (comment)
Each time the sched_switch probe is hit we must:
- (Assuming we've already recorded the process going off cpu)
- Read the hardware counters for this CPU (Instructions/Cycles/Cache Miss)
- Calculate the delta between the on-cpu and off-cpu readings and record that in the `processes` map
- Register the process going on-cpu
The reason that sampling was removed was due to the following case:
- PID 1234 gets registered as going on CPU 1
- We skip some samples... during this period PID 1234 gets migrated to CPU 4, and PID 5678 goes on CPU1
- When our probe next triggers, we don't record any CPU activity for the 2 processes since as far as we know, PID 1234 is still on CPU1 and PID 5678 never went on a CPU.
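The migration scenario above can be sketched directly (hypothetical PIDs and counter values): once sampling skips intermediate switches, the observed "previous task" no longer matches the registered task, and the elapsed CPU time is silently dropped.

```go
package main

import "fmt"

// taskState records the task we believe is running on a CPU and the
// cycle-counter reading taken when it went on-CPU.
type taskState struct {
	pid    int
	cycles uint64
}

var (
	perCPU    = map[int]taskState{} // CPU -> task we believe is running
	processes = map[int]uint64{}    // PID -> attributed cycle deltas
)

// observedSwitch models a sched_switch event that the sampler observed.
func observedSwitch(cpu, prevPID, nextPID int, cyclesNow uint64) {
	if st, ok := perCPU[cpu]; ok && st.pid == prevPID {
		processes[prevPID] += cyclesNow - st.cycles
	} // mismatch: the CPU time since the last observation is lost
	perCPU[cpu] = taskState{nextPID, cyclesNow}
}

func main() {
	observedSwitch(1, 0, 1234, 1000) // PID 1234 registered on CPU 1
	// Sampling skips the switches where 1234 migrates to CPU 4 and
	// 5678 goes on CPU 1, so we never observe them...
	observedSwitch(1, 5678, 42, 5000) // prev is 5678, but we expected 1234
	// The 4000 cycles since the last observation are attributed to no one.
	fmt.Println(processes[1234], processes[5678]) // 0 0
}
```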
In other words, the method we're using for calculating cpu cycles, instructions, cache misses and clock time relies on us knowing:
- Exactly when the process went on CPU
- Exactly when the process went off CPU
From looking at your code, the assumption seems to be as follows:
- If our sample rate is `100`, we're discarding 99 samples
- If we measure at `$sample_rate - 1` and again at `$sample_rate`, then surely we can calculate meaningful deltas
Unfortunately that assumption doesn't hold true given the nature of the events we're sampling - task switches.
The sample at `$sample_rate - 1` and `$sample_rate` will more likely resemble:
- CPU 1: Prev Task: PID 1234, Next Task: PID 0
- CPU 7: Prev Task: PID 0, Next Task: PID 5678
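A tiny sketch with made-up counter values makes the problem concrete: a delta between those two observed samples subtracts readings taken on different CPUs, bracketing different tasks, so the result is meaningless.

```go
package main

import "fmt"

func main() {
	// Hypothetical cycle-counter readings at the two observed samples:
	cpu1AtSampleA := uint64(10_000) // $sample_rate-1 on CPU 1: prev=1234, next=idle
	cpu7AtSampleB := uint64(42_000) // $sample_rate on CPU 7: prev=idle, next=5678

	// A naive delta mixes counters from two different CPUs and says
	// nothing about how long PID 1234 or PID 5678 actually ran.
	fmt.Println(cpu7AtSampleB - cpu1AtSampleA) // 32000
}
```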
Aside:
Can you check the benchmarks please? It seems that the mean execution time is pretty much the same for both cases.
• [10.003 seconds]
BPF Exporter efficiently collects hardware counter metrics for sched_switch events [perf_event]
/home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:278
Report Entries >>
sched_switch tracepoint - /home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:280 @ 08/09/24 09:32:47.29
sched_switch tracepoint
Name | N | Min | Median | Mean | StdDev | Max
======================================================================================================
sampled sched_switch tracepoint [duration] | 764924 | 2.885µs | 7.383µs | 6.491µs | 2.374µs | 272.97µs
<< Report Entries
------------------------------
• [10.004 seconds]
BPF Exporter uses sample rate to reduce CPU time [perf_event]
/home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:320
Report Entries >>
sampled sched_switch tracepoint - /home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:322 @ 08/09/24 09:32:57.358
sampled sched_switch tracepoint
Name | N | Min | Median | Mean | StdDev | Max
=======================================================================================================
sampled sched_switch tracepoint [duration] | 763483 | 1.673µs | 6.312µs | 6.229µs | 2.345µs | 197.489µs
<< Report Entries
------------------------------
The average in both cases is exactly the same 😢
Looking at the branches in the code I'd think the story told in the benchmarks is indeed accurate.
I am getting some drop in the benchmark test.
The purpose of this PR is not to fix sampling, but only to avoid the cost of computing deltas that are then discarded. We can close the PR if it is not adding any value.
Sampling doesn't record the same underlying details as probing. Its experimental nature needs to be studied via the validation process that is currently done in the metal CI. I suggest we keep the true behavior of sampling and study its accuracy in CI. This is also suggested as a research item by @mcalman
@vimalk78 when ready, can you add a scenario in the validation CI to compare the results between sampling and probing? cc @sthaha @KaiyiLiu1234 @vprashar2929
As discussed offline, we'll keep this option in experimental status until we have more test results from @vimalk78
Update the register counters one step before the metrics sample is taken, instead of updating registers every time, which increases overhead.
EXPERIMENTAL_BPF_SAMPLE_RATE = 0
EXPERIMENTAL_BPF_SAMPLE_RATE = 1000
Fixes: #1607
Can you please try it, @rootfs @dave-tucker?