chore(bpf): account all running state processes #1546

Merged: 1 commit merged into sustainable-computing-io:main on Jul 3, 2024

Conversation

@rootfs (Contributor) commented Jun 17, 2024

The process state can be one of multiple states; running is only one of them, yet most user-space processes spend their CPU time in the interruptible state.

A process can voluntarily change its state from running to interruptible, so by the time the sched_switch tracepoint fires, the state may no longer be running.
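For context, the behavioural change amounts to roughly the following. This is a paraphrase, not the literal patch: the helper name collect_metrics_and_reset_counters is taken from the review comments below, and the argument names are placeholders.

/* Paraphrased sketch of the change, not the literal diff. */

/* Before: only account the outgoing task if it was still runnable. */
if (prev_state == TASK_RUNNING) {
	collect_metrics_and_reset_counters(prev_tgid, prev_pid, curr_ts);
}

/* After: account every outgoing task, whatever state it leaves in. */
collect_metrics_and_reset_counters(prev_tgid, prev_pid, curr_ts);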

github-actions bot commented Jun 17, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: The pull request refactors the Kepler eBPF scheduler's state-tracking logic to accurately account for processes in running, interruptible, and uninterruptible states. Key changes include:

  • Introducing counter_sched_switch to manage sampling rates for scheduling switch events
  • Adding conditional logic to skip certain samples and updating variable comparisons
  • Defining new states TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE for granular process state insights
  • Removing TASK_RUNNING constant definition and condition, enabling metrics collection for all process states
  • Simplifying code by removing get_task_state function and directly checking task state in kepler_sched_switch_trace

Impact: These changes improve the accuracy and granularity of process state tracking, enabling more effective performance optimization and debugging. The removal of TASK_RUNNING condition allows for more comprehensive metrics collection. The code simplification enhances readability without affecting the external interface or behavior.
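As a rough illustration of the sampling logic mentioned above, here is a minimal sketch. It assumes a libbpf-style build with a generated vmlinux.h; SAMPLE_RATE and the exact structure are assumptions, not the verbatim Kepler source.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

volatile const int SAMPLE_RATE = 5;	/* assumed user-configurable knob */
static int counter_sched_switch = 0;	/* switches left to skip before the next sample */

SEC("tp_btf/sched_switch")
int kepler_sched_switch_trace(u64 *ctx)
{
	/* Skip some samples to minimize overhead */
	if (SAMPLE_RATE > 0) {
		if (counter_sched_switch > 0) {
			counter_sched_switch--;
			return 0;
		}
		counter_sched_switch = SAMPLE_RATE;
	}

	/* ... collect hardware counters and CPU time for the outgoing task ... */
	return 0;
}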

Observations/Suggestions:

  • The introduction of new states and conditional logic may require additional testing to ensure correct behavior in various scenarios.
  • Consider adding comments or documentation to explain the reasoning behind the changes and the expected behavior of the new logic.
  • The removal of get_task_state function may have implications for other parts of the codebase that rely on it; ensure that these dependencies are addressed.

@marceloamaral (Collaborator) left a comment

Unfortunately, these changes will introduce accuracy problems in the counters.

@@ -201,6 +201,18 @@ struct task_struct {
SEC("tp_btf/sched_switch")
int kepler_sched_switch_trace(u64 *ctx)
{
// Skip some samples to minimize overhead
Collaborator review comment:

We should not skip resetting the counters on a CPU during the schedule switch. Let’s consider an example to show why this is crucial:

Suppose task1 is running on CPU1, and we decide to skip the metric collection that resets the counters. Then, task2 starts running on CPU1, and we collect the counters at this point. The collected data will now include the hardware counters and CPU time for both task1 and task2. This will lead to inaccurate metrics, as the counters will reflect cumulative data for multiple tasks.

Therefore, we must reset the counters before skipping the schedule switch to ensure that each task's metrics are correctly isolated. Moving the counter reset to before collect_metrics_and_reset_counters would invalidate this approach, leading to inaccurate and misleading data.

}

-	if (prev_state == TASK_RUNNING) {
+	if (prev_state | TASK_RUNNING) {
Collaborator review comment:

We must only collect metrics if the task was running on the CPU. We want to gather CPU metrics (hardware counters and CPU time) exclusively when the task was actively running on the CPU.

This is important because if task1 goes from idle to running (in sched_switch) and we collect metrics from CPU1 while task3 was actually running on the CPU, we will mistakenly attribute the metrics of task3 to task1. To ensure accuracy, metrics collection should occur only for tasks that were running on the CPU.
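To make the attribution concern concrete, here is a toy user-space C program (not Kepler code) that mimics a per-CPU counter: if the collect-and-reset step is skipped when task1 is switched out, the next collection charges task1's cycles to task2.

#include <stdio.h>

static unsigned long long cpu_counter;	/* cumulative per-CPU counter */
static unsigned long long last_reset;	/* counter value at the last reset */

static unsigned long long collect_and_reset(void)
{
	unsigned long long delta = cpu_counter - last_reset;
	last_reset = cpu_counter;
	return delta;
}

int main(void)
{
	cpu_counter += 1000;	/* task1 runs and consumes 1000 cycles */
	/* sample skipped: no collect_and_reset() when task1 is switched out */
	cpu_counter += 200;	/* task2 runs and consumes 200 cycles */
	printf("charged to task2: %llu cycles (should be 200)\n", collect_and_reset());
	return 0;
}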

@rootfs (Contributor, Author) commented Jun 17, 2024

with this change:

kepler_process_bpf_cpu_time_ms_total{command="pmdaxfs",container_id="system_processes",pid="1872",source="bpf",vm_id=""} 0.475
kepler_process_bpf_cpu_time_ms_total{command="PanGPA",container_id="system_processes",pid="10220",source="bpf",vm_id=""} 1.205
kepler_process_bpf_cpu_time_ms_total{command="sssd_kcm",container_id="system_processes",pid="307528",source="bpf",vm_id=""} 1.211
kepler_process_bpf_cpu_time_ms_total{command="avahi-daemon",container_id="system_processes",pid="1098",source="bpf",vm_id=""} 1.441
kepler_process_bpf_cpu_time_ms_total{command="PanGPS",container_id="system_processes",pid="1133",source="bpf",vm_id=""} 1.464
kepler_process_bpf_cpu_time_ms_total{command="pmlogger",container_id="system_processes",pid="3429",source="bpf",vm_id=""} 2.026
kepler_process_bpf_cpu_time_ms_total{command="pmcd",container_id="system_processes",pid="1775",source="bpf",vm_id=""} 2.032
kepler_process_bpf_cpu_time_ms_total{command="pool-gsd-smartc",container_id="system_processes",pid="10514",source="bpf",vm_id=""} 2.22
kepler_process_bpf_cpu_time_ms_total{command="in:imjournal",container_id="system_processes",pid="1356",source="bpf",vm_id=""} 2.274
kepler_process_bpf_cpu_time_ms_total{command="goa-identity-se",container_id="system_processes",pid="10439",source="bpf",vm_id=""} 2.784
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4076450",source="bpf",vm_id=""} 3.326
kepler_process_bpf_cpu_time_ms_total{command="PanGPA",container_id="system_processes",pid="2301192",source="bpf",vm_id=""} 5.521
kepler_process_bpf_cpu_time_ms_total{command="rhsm-service",container_id="system_processes",pid="10614",source="bpf",vm_id=""} 8.082
kepler_process_bpf_cpu_time_ms_total{command="NetworkManager",container_id="system_processes",pid="1097",source="bpf",vm_id=""} 8.272
kepler_process_bpf_cpu_time_ms_total{command="gnome-shell",container_id="system_processes",pid="10267",source="bpf",vm_id=""} 10.949
kepler_process_bpf_cpu_time_ms_total{command="PanGPUI",container_id="system_processes",pid="10518",source="bpf",vm_id=""} 13.37
kepler_process_bpf_cpu_time_ms_total{command="bash",container_id="system_processes",pid="4080906",source="bpf",vm_id=""} 14.201
kepler_process_bpf_cpu_time_ms_total{command="nv-hostengine",container_id="system_processes",pid="1106",source="bpf",vm_id=""} 15.403
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4076006",source="bpf",vm_id=""} 15.545
kepler_process_bpf_cpu_time_ms_total{command="pmdalinux",container_id="system_processes",pid="1958",source="bpf",vm_id=""} 29.806
kepler_process_bpf_cpu_time_ms_total{command="cuda-EvtHandlr",container_id="system_processes",pid="3785672",source="bpf",vm_id=""} 34.385
kepler_process_bpf_cpu_time_ms_total{command="cuda-EvtHandlr",container_id="system_processes",pid="3760629",source="bpf",vm_id=""} 34.508
kepler_process_bpf_cpu_time_ms_total{command="pmproxy",container_id="system_processes",pid="2651",source="bpf",vm_id=""} 48.283
kepler_process_bpf_cpu_time_ms_total{command="bash",container_id="system_processes",pid="4079648",source="bpf",vm_id=""} 54.951
kepler_process_bpf_cpu_time_ms_total{command="kepler",container_id="system_processes",pid="4112987",source="bpf",vm_id=""} 55.331
kepler_process_bpf_cpu_time_ms_total{command="redis-server",container_id="system_processes",pid="1354",source="bpf",vm_id=""} 115.842
kepler_process_bpf_cpu_time_ms_total{command="tokio-runtime-w",container_id="system_processes",pid="4075904",source="bpf",vm_id=""} 158.379
kepler_process_bpf_cpu_time_ms_total{command="sshd",container_id="system_processes",pid="4075833",source="bpf",vm_id=""} 177.245
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4076019",source="bpf",vm_id=""} 339.208
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4075941",source="bpf",vm_id=""} 364.404
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4075991",source="bpf",vm_id=""} 399.046
kepler_process_bpf_cpu_time_ms_total{command="ksoftirqd/11",container_id="kernel_processes",pid="85",source="bpf",vm_id=""} 40349.952

Without this change

# curl localhost:8888/metrics |grep cpu_time |grep kepler_process |sort -k 2 -g
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70505    0 70505    0     0  9836k      0 --:--:-- --:--:-- --:--:-- 9836k
# HELP kepler_process_bpf_cpu_time_ms_total Aggregated value in bpf_cpu_time_ms value from bpf
# TYPE kepler_process_bpf_cpu_time_ms_total counter
kepler_process_bpf_cpu_time_ms_total{command="bash",container_id="system_processes",pid="4097293",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="PanGPA",container_id="system_processes",pid="10220",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmdalinux",container_id="system_processes",pid="1958",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="ps",container_id="system_processes",pid="4097278",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="sh",container_id="system_processes",pid="4097275",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="sh",container_id="system_processes",pid="4097283",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="cuda-EvtHandlr",container_id="system_processes",pid="3785672",source="bpf",vm_id=""} 0.046
kepler_process_bpf_cpu_time_ms_total{command="PanGPS",container_id="system_processes",pid="1133",source="bpf",vm_id=""} 0.058
kepler_process_bpf_cpu_time_ms_total{command="tokio-runtime-w",container_id="system_processes",pid="4075904",source="bpf",vm_id=""} 0.163
kepler_process_bpf_cpu_time_ms_total{command="cuda-EvtHandlr",container_id="system_processes",pid="3760629",source="bpf",vm_id=""} 0.185
kepler_process_bpf_cpu_time_ms_total{command="sshd",container_id="system_processes",pid="4075833",source="bpf",vm_id=""} 0.243
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4075991",source="bpf",vm_id=""} 1.532
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4076019",source="bpf",vm_id=""} 1.738
kepler_process_bpf_cpu_time_ms_total{command="redis-server",container_id="system_processes",pid="1354",source="bpf",vm_id=""} 1.883
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4075941",source="bpf",vm_id=""} 3.701
kepler_process_bpf_cpu_time_ms_total{command="ps",container_id="system_processes",pid="4097286",source="bpf",vm_id=""} 4.021
kepler_process_bpf_cpu_time_ms_total{command="pmproxy",container_id="system_processes",pid="2651",source="bpf",vm_id=""} 9.215
kepler_process_bpf_cpu_time_ms_total{command="top",container_id="system_processes",pid="4080966",source="bpf",vm_id=""} 10.923
kepler_process_bpf_cpu_time_ms_total{command="pmdaproc",container_id="system_processes",pid="1869",source="bpf",vm_id=""} 19.466
kepler_process_bpf_cpu_time_ms_total{command="swapper/10",container_id="kernel_processes",pid="0",source="bpf",vm_id=""} 50115.935

This is from the top output:
[screenshot of top output]

@dave-tucker (Collaborator) commented:
@rootfs I'm with @marceloamaral on this one.
The BPF probes are now counting the time spent oncpu for each task. Time waiting for CPU shouldn't be attributed as it's not consuming power.
It would be interesting to know more about how top is calculating CPU utilization and TIME+ from the stats in /proc/$pid/stat. There are some details in man top.

@rootfs (Contributor, Author) commented Jun 27, 2024

@dave-tucker @marceloamaral @sunya-ch First, we should treat eBPF changes with care. The current power models built by the model server don't include a process state check; if we change the way eBPF counts CPU time, then all the models must be re-trained to use the same CPU accounting method. Second, I am not convinced that checking the last process state is necessary, since processes can change states at runtime and we have no way to tell whether the last state is the only state the prev task was in before being scheduled out.

@rootfs (Contributor, Author) commented Jun 27, 2024

@rootfs I'm with @marceloamaral on this one. The BPF probes are now counting the time spent oncpu for each task. Time waiting for CPU shouldn't be attributed as it's not consuming power. It would be interesting to know more about how top is calculating CPU utilization and TIME+ from the stats in /proc/$pid/stat. There are some details in man top.

@dave-tucker I agree we only count oncpu tasks. But oncpu doesn't mean task state is TASK_RUNNING. In fact, according to some references (e.g. the Linux Kernel Development book):

TASK_RUNNING: The process is runnable; it is either currently running or on a runqueue waiting to run. This is the only possible state for a process executing in user-space; it can also apply to a process in kernel-space that is actively running.

Even TASK_RUNNING doesn't guarantee the task is on CPU.

@rootfs (Contributor, Author) commented Jun 27, 2024

@marceloamaral @dave-tucker For reference, this is how on-/off-CPU distribution is counted in bcc. It is consistent with what we did before: the process state is never used in CPU time accounting.
https://github.com/iovisor/bcc/blob/master/tools/cpudist.py
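For readers unfamiliar with cpudist, the on-CPU timing pattern it relies on looks roughly like this. This is a sketch of the general technique, not the bcc source; it records a timestamp at switch-in and computes the delta at switch-out, without ever looking at the task state.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);
	__type(value, u64);
} start SEC(".maps");

SEC("tp_btf/sched_switch")
int BPF_PROG(oncpu_time, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	u64 now = bpf_ktime_get_ns();
	u32 prev_pid = prev->pid, next_pid = next->pid;
	u64 *tsp = bpf_map_lookup_elem(&start, &prev_pid);

	/* the outgoing task ends its on-CPU span here */
	if (tsp)
		bpf_printk("pid %u on-cpu for %llu ns", prev_pid, now - *tsp);

	/* the incoming task starts its on-CPU span now */
	bpf_map_update_elem(&start, &next_pid, &now, BPF_ANY);
	return 0;
}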

@dave-tucker (Collaborator) commented Jun 27, 2024

@marceloamaral @dave-tucker For reference, this is how on-/off-CPU distribution is counted in bcc. It is consistent with what we did before: the process state is never used in CPU time accounting. https://github.com/iovisor/bcc/blob/master/tools/cpudist.py

I counter you with the same tool, but ported to libbpf... in the same repo at the same revision 🤯
https://github.com/iovisor/bcc/blob/master/libbpf-tools/cpudist.bpf.c#L100

@dave-tucker I agree we only count oncpu tasks. But oncpu doesn't mean task state is TASK_RUNNING. In fact, according to some references (e.g. the Linux Kernel Development book)

If you are looking at tasks in isolation this is true. However, when the sched_switch tracepoint is hit, we're given prev_task (the task that was previously scheduled) and the next_task that's being switched in. If we check prev_task->state and it's TASK_RUNNING, I think it's pretty safe to assume that it was indeed executing on the CPU beforehand. It could be, though, that the previous task on the CPU was sleeping (i.e. in state TASK_INTERRUPTIBLE), in which case it wouldn't make sense to account its cycles there.
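To see exactly what the tracepoint hands us, here is a minimal stand-alone probe. It is a sketch assuming a libbpf build with a generated vmlinux.h and a v5.14+ kernel where the field is named __state; it only logs which state the outgoing task left in and is not the Kepler probe itself.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define TASK_RUNNING 0x0	/* kernel macro, not exported via BTF */

char LICENSE[] SEC("license") = "Dual BSD/GPL";

SEC("tp_btf/sched_switch")
int BPF_PROG(log_sched_switch, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	if (prev->__state == TASK_RUNNING)
		bpf_printk("out pid=%d was still runnable (preempted)", prev->pid);
	else
		bpf_printk("out pid=%d left in state 0x%x", prev->pid, prev->__state);
	return 0;
}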

@rootfs (Contributor, Author) commented Jun 27, 2024

@dave-tucker what about this transition:

stateDiagram
    direction LR
    Fork() --> TASK_RUNNING
    TASK_RUNNING --> TASK_INTERRUPTIBLE
    TASK_INTERRUPTIBLE --> tp/sched_switch()

By the time tp/sched_switch() checks prev->state, it sees only TASK_INTERRUPTIBLE and therefore ignores all the CPU time spent while the task was in TASK_RUNNING.

@dave-tucker (Collaborator) left a comment

Lines 201:215 in kepler.bpf.c can also be deleted:

/* CO-RE shim for the task_struct state field rename (state became __state in v5.14) */
struct task_struct___o {
	unsigned int state;
};
struct task_struct___x {
	unsigned int __state;
};
static unsigned int get_task_state(void *task)
{
	struct task_struct___x *t = task;
	/* use __state on newer kernels, fall back to state on older ones */
	if (bpf_core_field_exists(t->__state))
		return t->__state;
	return ((struct task_struct___o *)task)->state;
}

@dave-tucker (Collaborator) commented:

By the time tp/sched_switch() checks prev->state, it sees only TASK_INTERRUPTIBLE and therefore ignores all the CPU time spent while the task was in TASK_RUNNING.

For those following along, here's a bpftrace script you can run:

BEGIN {
    @states[0x0] = "TASK_RUNNING";
    @states[0x1] = "TASK_INTERRUPTIBLE";
    @states[0x2] = "TASK_UNINTERRUPTIBLE";
    @states[0x4] = "TASK_STOPPED";
    @states[0x8] = "TASK_TRACED";
    @states[0x10] = "EXIT_DEAD";
    @states[0x20] = "EXIT_ZOMBIE";
    @states[0x40] = "TASK_PARKED";
    @states[0x80] = "TASK_DEAD";
    @states[0x100] = "TASK_WAKEKILL";
    @states[0x200] = "TASK_WAKING"; 
    @states[0x400] = "TASK_NOLOAD";
    @states[0x800] = "TASK_NEW";
    @states[0x1000] = "TASK_RTLOCK_WAIT";
    @states[0x2000] = "TASK_FREEZABLE";
    @states[0x8000] = "TASK_FROZEN";
}

rawtracepoint:sched_switch {
    $prev = (struct task_struct *)arg1;
    $next = (struct task_struct *)arg2;
    printf("%s\nSwitch on CPU %d\nOut: %d %s\nIn: %d %s\n\n", strftime("%H:%M:%S.%f", nsecs), cpu,$prev->pid, @states[$prev->__state], $next->pid, @states[$next->__state]);
}

This shows the following:

15:56:25.396583
Switch on CPU 1
Out: 0 TASK_RUNNING
In: 2317812 TASK_RUNNING

15:56:25.396605
Switch on CPU 1
Out: 2317812 TASK_INTERRUPTIBLE
In: 0 TASK_RUNNING

15:56:25.396612
Switch on CPU 1
Out: 0 TASK_RUNNING
In: 2317812 TASK_RUNNING

15:56:25.396632
Switch on CPU 1
Out: 2317812 TASK_INTERRUPTIBLE
In: 0 TASK_RUNNING

... some events omitted, and then finally

15:56:25.404417
Switch on CPU 1
Out: 2317812 TASK_RUNNING
In: 25 TASK_RUNNING

The specific case that @rootfs mentioned can be seen outlined here ☝️

What I'm observing on my system is that when a task gets marked as sleeping, the CPU is then idle (TGID 0 with TASK_RUNNING is used to represent the CPU being idle). That sleeping process then wakes up on the same CPU a little while later, but (as seen above) goes straight back to sleep, until eventually it gets swapped out for another task.

Without understanding the intricacies of the Linux scheduler more deeply, I'm not really sure what the "right" answer is.

But the difference is:

  • The current code won't account any CPU time/cycles/instructions to that process until it goes off-CPU. When a task gets switched out in TASK_INTERRUPTIBLE state it's ignored; in other words, we still count it as running.
  • This PR updates CPU time/cycles/instructions on every task switch. When a task gets switched out in TASK_INTERRUPTIBLE state, we consider it no longer running, and if the CPU is going idle, that time gets accounted to PID 0.

🤔

@rootfs (Contributor, Author) commented Jul 2, 2024

Per the 07/02 community meeting: task state transitions are not tracked across a context switch, so tracking CPU states in sched_switch is not feasible. A future enhancement will add IRQ features to the CPU models.

@rootfs rootfs force-pushed the bpf-cpu-time branch 3 times, most recently from 847704d to b851c10 on July 2, 2024 at 15:17
@dave-tucker (Collaborator) commented:

@rootfs this needs rebasing on top of #1554; then if you can squash your commits, I think this is ready to merge 🎉

@rootfs rootfs merged commit 9c80387 into sustainable-computing-io:main Jul 3, 2024
23 checks passed