fix(bpf): Fix overhead when sampling #1685

Merged: 1 commit merged into sustainable-computing-io:main on Aug 14, 2024

Conversation

@vimalk78 (Collaborator) commented Aug 9, 2024

Update the register counters one step before the metrics sample is taken, instead of updating the registers on every switch, which increases overhead. (A sketch of the idea follows the benchmark output below.)

  • with EXPERIMENTAL_BPF_SAMPLE_RATE = 0
[root@vimalkum-thinkpadp1gen4i ~]# bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } '
rt_ns: 301913885 count:  95144 avg:  3173.23

[root@vimalkum-thinkpadp1gen4i ~]# bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } '
rt_ns: 212656178 count:  68417 avg:  3108.24
  • with EXPERIMENTAL_BPF_SAMPLE_RATE = 1000
[root@vimalkum-thinkpadp1gen4i ~]# bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } '
rt_ns: 33681094 count:  76518 avg:  440.172

[root@vimalkum-thinkpadp1gen4i ~]# bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } '
rt_ns: 45678100 count:  105344 avg:  433.609
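
For context, a minimal sketch of the gating described above, written in BPF-style C. The map, helper, and constant names (sample_counter, refresh_hw_counters, record_deltas, SAMPLE_RATE) are illustrative stand-ins, not Kepler's actual identifiers:

```c
// Sketch only: names below are illustrative stand-ins, not Kepler's actual symbols.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define SAMPLE_RATE 1000 /* stand-in for EXPERIMENTAL_BPF_SAMPLE_RATE */

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u32);
} sample_counter SEC(".maps");

/* Placeholder helpers: the real code reads perf counters and fills the process map. */
static void refresh_hw_counters(void) { /* read cycles/instructions/cache-miss registers */ }
static void record_deltas(struct task_struct *prev, struct task_struct *next)
{ /* compute counter deltas and store them per process */ }

SEC("tp_btf/sched_switch")
int sampled_sched_switch(u64 *ctx)
{
	struct task_struct *prev = (struct task_struct *)ctx[1];
	struct task_struct *next = (struct task_struct *)ctx[2];
	u32 zero = 0;
	u32 *count = bpf_map_lookup_elem(&sample_counter, &zero);

	if (!count)
		return 0;

	if (SAMPLE_RATE > 0) {
		*count += 1;
		if (*count == SAMPLE_RATE) {
			/* One step before the sample: refresh the counter registers so
			 * the next hit has a fresh baseline, instead of refreshing on
			 * every switch (the overhead this PR removes). */
			refresh_hw_counters();
			return 0;
		}
		if (*count < SAMPLE_RATE)
			return 0; /* discard this switch entirely */
		*count = 0; /* count > SAMPLE_RATE: take the sample */
	}
	record_deltas(prev, next);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

With this gate, the expensive work (reading the hardware counter registers and computing deltas) happens only on the last two hits of each sampling window; every other sched_switch just bumps the per-CPU counter.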

Fixes: #1607

Can you please try it, @rootfs @dave-tucker?

  Update the register counters one step before metrics sample is taken
  instead of updating registers every time which is increasing overhead

Signed-off-by: Vimal Kumar <[email protected]>
github-actions bot (Contributor) commented Aug 9, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: This pull request optimizes the BPF program by reducing overhead when sampling. Key changes include updating register counters one step before taking metrics samples, resulting in significant overhead reduction. The do_kepler_sched_switch_trace function is also modified to update hardware counters before collecting metrics.

Impact: This change improves performance by reducing overhead, as demonstrated by the average nanoseconds per count with EXPERIMENTAL_BPF_SAMPLE_RATE set to 0 and 1000. The external interface and behavior of the code remain unchanged.

Observations: The fix resolves issue #1607 and is a valuable optimization for the BPF program. It's essential to ensure that the updated do_kepler_sched_switch_trace function does not introduce any unintended side effects or affect the accuracy of metrics collection.

@dave-tucker (Collaborator) left a comment

@vimalk78 Per the comment - and further discussion in #1607 (comment)

Each time the sched_switch probe is hit we must (see the sketch after this list):

  1. (Assuming we've already recorded the process going off cpu)
  2. Read the hardware counters for this CPU (Instructions/Cycles/Cache Miss)
  3. Calculate the delta between the on-cpu and off-cpu readings and record that in the processes map
  4. Register the process going on-cpu
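
A rough C sketch of that bookkeeping, with hypothetical struct and helper names (the real code keeps this state in BPF maps and reads the counters via perf events):

```c
#include <stdint.h>

struct oncpu_state { uint32_t pid; uint64_t cycles, instr, cache_miss; };
struct process_metrics { uint64_t cycles, instr, cache_miss; };

/* Stand-ins for per-CPU hardware counter reads (perf events in the real program). */
static uint64_t read_cycles(int cpu)       { (void)cpu; return 0; }
static uint64_t read_instructions(int cpu) { (void)cpu; return 0; }
static uint64_t read_cache_misses(int cpu) { (void)cpu; return 0; }

/* Stand-in for the per-process metrics map keyed by PID. */
static struct process_metrics metrics_table[1024];
static struct process_metrics *metrics_for(uint32_t pid) { return &metrics_table[pid % 1024]; }

void on_sched_switch(int cpu, struct oncpu_state *state,
                     uint32_t prev_pid, uint32_t next_pid)
{
    /* 2. Read the hardware counters for this CPU. */
    uint64_t cyc  = read_cycles(cpu);
    uint64_t ins  = read_instructions(cpu);
    uint64_t miss = read_cache_misses(cpu);

    /* 3. Delta between the on-cpu reading (saved when prev_pid was switched in)
     *    and this off-cpu reading, credited to prev_pid.  Only valid if we
     *    actually observed prev_pid going on-cpu earlier. */
    if (state->pid == prev_pid) {
        struct process_metrics *m = metrics_for(prev_pid);
        m->cycles     += cyc  - state->cycles;
        m->instr      += ins  - state->instr;
        m->cache_miss += miss - state->cache_miss;
    }

    /* 4. Register the next process as going on-cpu, with the current readings
     *    as its baseline. */
    state->pid        = next_pid;
    state->cycles     = cyc;
    state->instr      = ins;
    state->cache_miss = miss;
}
```

If intermediate switches are skipped, the baseline saved in step 4 may refer to a task that has since migrated, so step 3's delta can't be credited to any process, which is the case described next.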

The reason that sampling was removed was due to the following case:

  1. PID 1234 gets registered as going on CPU 1
  2. We skip some samples... during this period PID 1234 gets migrated to CPU 4, and PID 5678 goes on CPU 1
  3. When our probe next triggers, we don't record any CPU activity for the two processes since, as far as we know, PID 1234 is still on CPU 1 and PID 5678 never went on a CPU.

In other words, the method we're using for calculating cpu cycles, instructions, cache misses and clock time relies on us knowing:

  • Exactly when the process went on CPU
  • Exactly when the process went off CPU

From looking at your code, the assumption seems to be as follows:

  • If our sample rate is 100, we're discarding 99 samples
  • If we measure at $sample_rate - 1 and again at $sample_rate then surely we can calculate meaningful deltas

Unfortunately that assumption doesn't hold true given the nature of the events we're sampling - task switches.
The samples at $sample_rate - 1 and $sample_rate will more likely resemble:

  • CPU 1: Prev Task: PID 1234, Next Task: PID 0
  • CPU 7: Prev Task: PID 0, Next Task: PID 5678

Aside:

Can you check the benchmarks please? It seems that the mean execution time is pretty much the same for both cases.

• [10.003 seconds]
BPF Exporter efficiently collects hardware counter metrics for sched_switch events [perf_event]
/home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:278

  Report Entries >>
  sched_switch tracepoint - /home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:280 @ 08/09/24 09:32:47.29
    sched_switch tracepoint
    Name                                       | N      | Min     | Median  | Mean    | StdDev  | Max     
    ======================================================================================================
    sampled sched_switch tracepoint [duration] | 764924 | 2.885µs | 7.383µs | 6.491µs | 2.374µs | 272.97µs
  << Report Entries
------------------------------
• [10.004 seconds]
BPF Exporter uses sample rate to reduce CPU time [perf_event]
/home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:320

  Report Entries >>
  sampled sched_switch tracepoint - /home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:322 @ 08/09/24 09:32:57.358
    sampled sched_switch tracepoint
    Name                                       | N      | Min     | Median  | Mean    | StdDev  | Max      
    =======================================================================================================
    sampled sched_switch tracepoint [duration] | 763483 | 1.673µs | 6.312µs | 6.229µs | 2.345µs | 197.489µs
  << Report Entries
------------------------------

The average in both cases is exactly the same 😢

Looking at the branches in the code I'd think the story told in the benchmarks is indeed accurate.

@vimalk78 (Collaborator, Author) commented Aug 9, 2024

I am seeing some drop in the benchmark test.

• [4.558 seconds]
BPF Exporter efficiently collects hardware counter metrics for sched_switch events [perf_event]
/home/vimalkum/src/powermon/kepler/pkg/bpftest/bpf_suite_test.go:278

  Report Entries >>
  sched_switch tracepoint - /home/vimalkum/src/powermon/kepler/pkg/bpftest/bpf_suite_test.go:280 @ 08/09/24 19:56:57.457
    sched_switch tracepoint
    Name                                       | N       | Min   | Median  | Mean    | StdDev | Max      
    =====================================================================================================
    sampled sched_switch tracepoint [duration] | 1000000 | 916ns | 2.018µs | 2.044µs | 693ns  | 158.839µs
  << Report Entries
------------------------------
• [3.750 seconds]
BPF Exporter uses sample rate to reduce CPU time [perf_event]
/home/vimalkum/src/powermon/kepler/pkg/bpftest/bpf_suite_test.go:320

  Report Entries >>
  sampled sched_switch tracepoint - /home/vimalkum/src/powermon/kepler/pkg/bpftest/bpf_suite_test.go:322 @ 08/09/24 19:57:02.083
    sampled sched_switch tracepoint
    Name                                       | N       | Min   | Median  | Mean    | StdDev | Max      
    =====================================================================================================
    sampled sched_switch tracepoint [duration] | 1000000 | 610ns | 1.399µs | 1.368µs | 621ns  | 163.194µs
  << Report Entries
------------------------------

The purpose of this PR is not to fix sampling, but only to avoid the cost of computing deltas that are then discarded.

We can close the PR if it is not adding any value.

@rootfs (Contributor) commented Aug 13, 2024

Sampling doesn't record the same underlying details as probing. Its experimental nature needs to be studied through the validation process currently done in the metal CI. I suggest we keep sampling's behavior as-is and study its accuracy in the CI.

This is also suggested as a research item by @mcalman

@rootfs (Contributor) left a comment

@vimalk78 when ready, can you add a scenario in the validation CI to compare the results between sampling and probing? cc @sthaha @KaiyiLiu1234 @vprashar2929

@rootfs (Contributor) commented Aug 14, 2024

The research idea is actually from here #836 cc @eklee15

@rootfs (Contributor) commented Aug 14, 2024

As discussed offline, we'll keep this option in experimental status until we have more test results from @vimalk78.

@rootfs rootfs merged commit eb5a72a into sustainable-computing-io:main Aug 14, 2024
20 of 21 checks passed