fix(bpf): Fix overhead when sampling #1685

Merged: 1 commit merged into sustainable-computing-io:main on Aug 14, 2024

Conversation

@vimalk78 (Collaborator) commented Aug 9, 2024

Update the register counters one step before the metrics sample is taken, instead of updating the registers on every switch, which increases overhead. (A sketch of the idea follows the benchmark output below.)

  • with EXPERIMENTAL_BPF_SAMPLE_RATE = 0
[root@vimalkum-thinkpadp1gen4i ~]# bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } '
rt_ns: 301913885 count:  95144 avg:  3173.23

[root@vimalkum-thinkpadp1gen4i ~]# bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } '
rt_ns: 212656178 count:  68417 avg:  3108.24
  • with EXPERIMENTAL_BPF_SAMPLE_RATE = 1000
[root@vimalkum-thinkpadp1gen4i ~]# bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } '
rt_ns: 33681094 count:  76518 avg:  440.172

[root@vimalkum-thinkpadp1gen4i ~]# bpftool prog show name kepler_sched_switch_trace | head -n 1 | awk '{print "rt_ns:", $(NF-2), "count: ", $NF, "avg: ", $(NF-2)/$NF } '
rt_ns: 45678100 count:  105344 avg:  433.609
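
For context, a minimal sketch of the gating described above, written in BPF-style C. The map, helper, and constant names (sample_counter, refresh_hw_counters, record_deltas, SAMPLE_RATE) are illustrative stand-ins, not Kepler's actual identifiers:

```c
// Sketch only: names below are illustrative stand-ins, not Kepler's actual symbols.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define SAMPLE_RATE 1000 /* stand-in for EXPERIMENTAL_BPF_SAMPLE_RATE */

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, u32);
	__type(value, u32);
} sample_counter SEC(".maps");

/* Placeholder helpers: the real code reads perf counters and fills the process map. */
static void refresh_hw_counters(void) { /* read cycles/instructions/cache-miss registers */ }
static void record_deltas(struct task_struct *prev, struct task_struct *next)
{ /* compute counter deltas and store them per process */ }

SEC("tp_btf/sched_switch")
int sampled_sched_switch(u64 *ctx)
{
	struct task_struct *prev = (struct task_struct *)ctx[1];
	struct task_struct *next = (struct task_struct *)ctx[2];
	u32 zero = 0;
	u32 *count = bpf_map_lookup_elem(&sample_counter, &zero);

	if (!count)
		return 0;

	if (SAMPLE_RATE > 0) {
		*count += 1;
		if (*count == SAMPLE_RATE) {
			/* One step before the sample: refresh the counter registers so
			 * the next hit has a fresh baseline, instead of refreshing on
			 * every switch (the overhead this PR removes). */
			refresh_hw_counters();
			return 0;
		}
		if (*count < SAMPLE_RATE)
			return 0; /* discard this switch entirely */
		*count = 0; /* count > SAMPLE_RATE: take the sample */
	}
	record_deltas(prev, next);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

With this gate, the expensive work (reading the hardware counter registers and computing deltas) happens only on the last two hits of each sampling window; every other sched_switch just bumps the per-CPU counter.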

Fixes: #1607

Can you please try it, @rootfs @dave-tucker?

  Update the register counters one step before metrics sample is taken
  instead of updating registers every time which is increasing overhead

Signed-off-by: Vimal Kumar <[email protected]>
github-actions bot (Contributor) commented Aug 9, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: This pull request optimizes the BPF program by reducing overhead when sampling. Key changes include updating register counters one step before taking metrics samples, resulting in significant overhead reduction. The do_kepler_sched_switch_trace function is also modified to update hardware counters before collecting metrics.

Impact: This change improves performance by reducing overhead, as demonstrated by the average nanoseconds per count with EXPERIMENTAL_BPF_SAMPLE_RATE set to 0 and 1000. The external interface and behavior of the code remain unchanged.

Observations: The fix resolves issue #1607 and is a valuable optimization for the BPF program. It's essential to ensure that the updated do_kepler_sched_switch_trace function does not introduce any unintended side effects or affect the accuracy of metrics collection.

@dave-tucker (Collaborator) left a comment

@vimalk78 Per the comment - and further discussion in #1607 (comment)

Each time the sched_switch probe is hit we must (see the sketch after this list):

  1. (Assuming we've already recorded the process going off cpu)
  2. Read the hardware counters for this CPU (Instructions/Cycles/Cache Miss)
  3. Calculate the delta between the on-cpu and off-cpu readings and record that in the processes map
  4. Register the process going on-cpu
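
A rough C sketch of that bookkeeping, with hypothetical struct and helper names (the real code keeps this state in BPF maps and reads the counters via perf events):

```c
#include <stdint.h>

struct oncpu_state { uint32_t pid; uint64_t cycles, instr, cache_miss; };
struct process_metrics { uint64_t cycles, instr, cache_miss; };

/* Stand-ins for per-CPU hardware counter reads (perf events in the real program). */
static uint64_t read_cycles(int cpu)       { (void)cpu; return 0; }
static uint64_t read_instructions(int cpu) { (void)cpu; return 0; }
static uint64_t read_cache_misses(int cpu) { (void)cpu; return 0; }

/* Stand-in for the per-process metrics map keyed by PID. */
static struct process_metrics metrics_table[1024];
static struct process_metrics *metrics_for(uint32_t pid) { return &metrics_table[pid % 1024]; }

void on_sched_switch(int cpu, struct oncpu_state *state,
                     uint32_t prev_pid, uint32_t next_pid)
{
    /* 2. Read the hardware counters for this CPU. */
    uint64_t cyc  = read_cycles(cpu);
    uint64_t ins  = read_instructions(cpu);
    uint64_t miss = read_cache_misses(cpu);

    /* 3. Delta between the on-cpu reading (saved when prev_pid was switched in)
     *    and this off-cpu reading, credited to prev_pid.  Only valid if we
     *    actually observed prev_pid going on-cpu earlier. */
    if (state->pid == prev_pid) {
        struct process_metrics *m = metrics_for(prev_pid);
        m->cycles     += cyc  - state->cycles;
        m->instr      += ins  - state->instr;
        m->cache_miss += miss - state->cache_miss;
    }

    /* 4. Register the next process as going on-cpu, with the current readings
     *    as its baseline. */
    state->pid        = next_pid;
    state->cycles     = cyc;
    state->instr      = ins;
    state->cache_miss = miss;
}
```

If intermediate switches are skipped, the baseline saved in step 4 may refer to a task that has since migrated, so step 3's delta can't be credited to any process, which is the case described next.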

The reason that sampling was removed was due to the following case:

  1. PID 1234 gets registered as going on CPU 1
  2. We skip some samples... during this period PID 1234 gets migrated to CPU 4, and PID 5678 goes on CPU 1
  3. When our probe next triggers, we don't record any CPU activity for the two processes since, as far as we know, PID 1234 is still on CPU 1 and PID 5678 never went on a CPU.

In other words, the method we're using for calculating cpu cycles, instructions, cache misses and clock time relies on us knowing:

  • Exactly when the process went on CPU
  • Exactly when the process went off CPU

From looking at your code, the assumption seems to be as follows:

  • If our sample rate is 100, we're discarding 99 samples
  • If we measure at $sample_rate - 1 and again at $sample_rate then surely we can calculate meaningful deltas

Unfortunately that assumption doesn't hold true given the nature of the events we're sampling - task switches.
The samples at $sample_rate - 1 and $sample_rate will more likely resemble:

  • CPU 1: Prev Task: PID 1234, Next Task: PID 0
  • CPU 7: Prev Task: PID 0, Next Task: PID 5678

Aside:

Can you check the benchmarks please? It seems that the mean execution time is pretty much the same for both cases.

• [10.003 seconds]
BPF Exporter efficiently collects hardware counter metrics for sched_switch events [perf_event]
/home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:278

  Report Entries >>
  sched_switch tracepoint - /home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:280 @ 08/09/24 09:32:47.29
    sched_switch tracepoint
    Name                                       | N      | Min     | Median  | Mean    | StdDev  | Max     
    ======================================================================================================
    sampled sched_switch tracepoint [duration] | 764924 | 2.885µs | 7.383µs | 6.491µs | 2.374µs | 272.97µs
  << Report Entries
------------------------------
• [10.004 seconds]
BPF Exporter uses sample rate to reduce CPU time [perf_event]
/home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:320

  Report Entries >>
  sampled sched_switch tracepoint - /home/runner/work/kepler/kepler/pkg/bpftest/bpf_suite_test.go:322 @ 08/09/24 09:32:57.358
    sampled sched_switch tracepoint
    Name                                       | N      | Min     | Median  | Mean    | StdDev  | Max      
    =======================================================================================================
    sampled sched_switch tracepoint [duration] | 763483 | 1.673µs | 6.312µs | 6.229µs | 2.345µs | 197.489µs
  << Report Entries
------------------------------

The average in both cases is exactly the same 😢

Looking at the branches in the code I'd think the story told in the benchmarks is indeed accurate.

@vimalk78 (Collaborator, Author) commented Aug 9, 2024

I am seeing some drop in the benchmark test.

• [4.558 seconds]
BPF Exporter efficiently collects hardware counter metrics for sched_switch events [perf_event]
/home/vimalkum/src/powermon/kepler/pkg/bpftest/bpf_suite_test.go:278

  Report Entries >>
  sched_switch tracepoint - /home/vimalkum/src/powermon/kepler/pkg/bpftest/bpf_suite_test.go:280 @ 08/09/24 19:56:57.457
    sched_switch tracepoint
    Name                                       | N       | Min   | Median  | Mean    | StdDev | Max      
    =====================================================================================================
    sampled sched_switch tracepoint [duration] | 1000000 | 916ns | 2.018µs | 2.044µs | 693ns  | 158.839µs
  << Report Entries
------------------------------
• [3.750 seconds]
BPF Exporter uses sample rate to reduce CPU time [perf_event]
/home/vimalkum/src/powermon/kepler/pkg/bpftest/bpf_suite_test.go:320

  Report Entries >>
  sampled sched_switch tracepoint - /home/vimalkum/src/powermon/kepler/pkg/bpftest/bpf_suite_test.go:322 @ 08/09/24 19:57:02.083
    sampled sched_switch tracepoint
    Name                                       | N       | Min   | Median  | Mean    | StdDev | Max      
    =====================================================================================================
    sampled sched_switch tracepoint [duration] | 1000000 | 610ns | 1.399µs | 1.368µs | 621ns  | 163.194µs
  << Report Entries
------------------------------

The purpose of this PR is not to fix sampling, but only to avoid the cost of computing deltas that are then discarded.

We can close the PR if it is not adding any value.

@rootfs (Contributor) commented Aug 13, 2024

Sampling doesn't record the same underlying details as probing. Its experimental nature needs to be studied through the validation process currently done in the metal CI. I suggest we keep sampling's behavior as-is and study its accuracy in the CI.

This is also suggested as a research item by @mcalman

@rootfs (Contributor) left a comment

@vimalk78 when ready, can you add a scenario in the validation CI to compare the results between sampling and probing? cc @sthaha @KaiyiLiu1234 @vprashar2929

@rootfs (Contributor) commented Aug 14, 2024

The research idea is actually from here #836 cc @eklee15

@rootfs (Contributor) commented Aug 14, 2024

As discussed offline, we'll keep this option in experimental status until we have more test results from @vimalk78.

@rootfs rootfs merged commit eb5a72a into sustainable-computing-io:main Aug 14, 2024
20 of 21 checks passed