chore(bpf): account all running state processes #1546

Merged: 1 commit merged into sustainable-computing-io:main on Jul 3, 2024

Conversation

@rootfs (Contributor) commented Jun 17, 2024

The process state can be one of multiple states; running is only one of them, yet most user-space processes spend their CPU time in the interruptible state.

A process can voluntarily change its state from running to interruptible, so by the time the sched_switch tracepoint fires, the state may no longer be running.
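For context, the behavioural change amounts to roughly the following. This is a paraphrase, not the literal patch: the helper name collect_metrics_and_reset_counters is taken from the review comments below, and the argument names are placeholders.

/* Paraphrased sketch of the change, not the literal diff. */

/* Before: only account the outgoing task if it was still runnable. */
if (prev_state == TASK_RUNNING) {
	collect_metrics_and_reset_counters(prev_tgid, prev_pid, curr_ts);
}

/* After: account every outgoing task, whatever state it leaves in. */
collect_metrics_and_reset_counters(prev_tgid, prev_pid, curr_ts);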

github-actions bot commented Jun 17, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: The pull request refactors the Kepler eBPF scheduler's state-tracking logic to accurately account for processes in running, interruptible, and uninterruptible states. Key changes include:

  • Introducing counter_sched_switch to manage sampling rates for scheduling switch events
  • Adding conditional logic to skip certain samples and updating variable comparisons
  • Defining new states TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE for granular process state insights
  • Removing TASK_RUNNING constant definition and condition, enabling metrics collection for all process states
  • Simplifying code by removing get_task_state function and directly checking task state in kepler_sched_switch_trace

Impact: These changes improve the accuracy and granularity of process state tracking, enabling more effective performance optimization and debugging. The removal of TASK_RUNNING condition allows for more comprehensive metrics collection. The code simplification enhances readability without affecting the external interface or behavior.
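As a rough illustration of the sampling logic mentioned above, here is a minimal sketch. It assumes a libbpf-style build with a generated vmlinux.h; SAMPLE_RATE and the exact structure are assumptions, not the verbatim Kepler source.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

volatile const int SAMPLE_RATE = 5;	/* assumed user-configurable knob */
static int counter_sched_switch = 0;	/* switches left to skip before the next sample */

SEC("tp_btf/sched_switch")
int kepler_sched_switch_trace(u64 *ctx)
{
	/* Skip some samples to minimize overhead */
	if (SAMPLE_RATE > 0) {
		if (counter_sched_switch > 0) {
			counter_sched_switch--;
			return 0;
		}
		counter_sched_switch = SAMPLE_RATE;
	}

	/* ... collect hardware counters and CPU time for the outgoing task ... */
	return 0;
}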

Observations/Suggestions:

  • The introduction of new states and conditional logic may require additional testing to ensure correct behavior in various scenarios.
  • Consider adding comments or documentation to explain the reasoning behind the changes and the expected behavior of the new logic.
  • The removal of get_task_state function may have implications for other parts of the codebase that rely on it; ensure that these dependencies are addressed.

@marceloamaral (Collaborator) left a comment

Unfortunately, these changes will introduce accuracy problems in the counters.

@@ -201,6 +201,18 @@ struct task_struct {
SEC("tp_btf/sched_switch")
int kepler_sched_switch_trace(u64 *ctx)
{
// Skip some samples to minimize overhead
Collaborator review comment:

We should not skip resetting the counters on a CPU during the schedule switch. Let’s consider an example to show why this is crucial:

Suppose task1 is running on CPU1, and we decide to skip the metric collection that resets the counters. Then, task2 starts running on CPU1, and we collect the counters at this point. The collected data will now include the hardware counters and CPU time for both task1 and task2. This will lead to inaccurate metrics, as the counters will reflect cumulative data for multiple tasks.

Therefore, we must reset the counters before skipping the schedule switch to ensure that each task's metrics are correctly isolated. Moving the counter reset to before collect_metrics_and_reset_counters would invalidate this approach, leading to inaccurate and misleading data.

}

-	if (prev_state == TASK_RUNNING) {
+	if (prev_state | TASK_RUNNING) {
Collaborator review comment:

We must only collect metrics if the task was running on the CPU. We want to gather CPU metrics (hardware counters and CPU time) exclusively when the task was actively running on the CPU.

This is important because if task1 goes from idle to running (in sched_switch) and we collect metrics from CPU1 while task3 was actually running on the CPU, we will mistakenly attribute the metrics of task3 to task1. To ensure accuracy, metrics collection should occur only for tasks that were running on the CPU.
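To make the attribution concern concrete, here is a toy user-space C program (not Kepler code) that mimics a per-CPU counter: if the collect-and-reset step is skipped when task1 is switched out, the next collection charges task1's cycles to task2.

#include <stdio.h>

static unsigned long long cpu_counter;	/* cumulative per-CPU counter */
static unsigned long long last_reset;	/* counter value at the last reset */

static unsigned long long collect_and_reset(void)
{
	unsigned long long delta = cpu_counter - last_reset;
	last_reset = cpu_counter;
	return delta;
}

int main(void)
{
	cpu_counter += 1000;	/* task1 runs and consumes 1000 cycles */
	/* sample skipped: no collect_and_reset() when task1 is switched out */
	cpu_counter += 200;	/* task2 runs and consumes 200 cycles */
	printf("charged to task2: %llu cycles (should be 200)\n", collect_and_reset());
	return 0;
}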

@rootfs (Contributor, Author) commented Jun 17, 2024

with this change:

kepler_process_bpf_cpu_time_ms_total{command="pmdaxfs",container_id="system_processes",pid="1872",source="bpf",vm_id=""} 0.475
kepler_process_bpf_cpu_time_ms_total{command="PanGPA",container_id="system_processes",pid="10220",source="bpf",vm_id=""} 1.205
kepler_process_bpf_cpu_time_ms_total{command="sssd_kcm",container_id="system_processes",pid="307528",source="bpf",vm_id=""} 1.211
kepler_process_bpf_cpu_time_ms_total{command="avahi-daemon",container_id="system_processes",pid="1098",source="bpf",vm_id=""} 1.441
kepler_process_bpf_cpu_time_ms_total{command="PanGPS",container_id="system_processes",pid="1133",source="bpf",vm_id=""} 1.464
kepler_process_bpf_cpu_time_ms_total{command="pmlogger",container_id="system_processes",pid="3429",source="bpf",vm_id=""} 2.026
kepler_process_bpf_cpu_time_ms_total{command="pmcd",container_id="system_processes",pid="1775",source="bpf",vm_id=""} 2.032
kepler_process_bpf_cpu_time_ms_total{command="pool-gsd-smartc",container_id="system_processes",pid="10514",source="bpf",vm_id=""} 2.22
kepler_process_bpf_cpu_time_ms_total{command="in:imjournal",container_id="system_processes",pid="1356",source="bpf",vm_id=""} 2.274
kepler_process_bpf_cpu_time_ms_total{command="goa-identity-se",container_id="system_processes",pid="10439",source="bpf",vm_id=""} 2.784
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4076450",source="bpf",vm_id=""} 3.326
kepler_process_bpf_cpu_time_ms_total{command="PanGPA",container_id="system_processes",pid="2301192",source="bpf",vm_id=""} 5.521
kepler_process_bpf_cpu_time_ms_total{command="rhsm-service",container_id="system_processes",pid="10614",source="bpf",vm_id=""} 8.082
kepler_process_bpf_cpu_time_ms_total{command="NetworkManager",container_id="system_processes",pid="1097",source="bpf",vm_id=""} 8.272
kepler_process_bpf_cpu_time_ms_total{command="gnome-shell",container_id="system_processes",pid="10267",source="bpf",vm_id=""} 10.949
kepler_process_bpf_cpu_time_ms_total{command="PanGPUI",container_id="system_processes",pid="10518",source="bpf",vm_id=""} 13.37
kepler_process_bpf_cpu_time_ms_total{command="bash",container_id="system_processes",pid="4080906",source="bpf",vm_id=""} 14.201
kepler_process_bpf_cpu_time_ms_total{command="nv-hostengine",container_id="system_processes",pid="1106",source="bpf",vm_id=""} 15.403
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4076006",source="bpf",vm_id=""} 15.545
kepler_process_bpf_cpu_time_ms_total{command="pmdalinux",container_id="system_processes",pid="1958",source="bpf",vm_id=""} 29.806
kepler_process_bpf_cpu_time_ms_total{command="cuda-EvtHandlr",container_id="system_processes",pid="3785672",source="bpf",vm_id=""} 34.385
kepler_process_bpf_cpu_time_ms_total{command="cuda-EvtHandlr",container_id="system_processes",pid="3760629",source="bpf",vm_id=""} 34.508
kepler_process_bpf_cpu_time_ms_total{command="pmproxy",container_id="system_processes",pid="2651",source="bpf",vm_id=""} 48.283
kepler_process_bpf_cpu_time_ms_total{command="bash",container_id="system_processes",pid="4079648",source="bpf",vm_id=""} 54.951
kepler_process_bpf_cpu_time_ms_total{command="kepler",container_id="system_processes",pid="4112987",source="bpf",vm_id=""} 55.331
kepler_process_bpf_cpu_time_ms_total{command="redis-server",container_id="system_processes",pid="1354",source="bpf",vm_id=""} 115.842
kepler_process_bpf_cpu_time_ms_total{command="tokio-runtime-w",container_id="system_processes",pid="4075904",source="bpf",vm_id=""} 158.379
kepler_process_bpf_cpu_time_ms_total{command="sshd",container_id="system_processes",pid="4075833",source="bpf",vm_id=""} 177.245
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4076019",source="bpf",vm_id=""} 339.208
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4075941",source="bpf",vm_id=""} 364.404
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4075991",source="bpf",vm_id=""} 399.046
kepler_process_bpf_cpu_time_ms_total{command="ksoftirqd/11",container_id="kernel_processes",pid="85",source="bpf",vm_id=""} 40349.952

Without this change

# curl localhost:8888/metrics |grep cpu_time |grep kepler_process |sort -k 2 -g
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70505    0 70505    0     0  9836k      0 --:--:-- --:--:-- --:--:-- 9836k
# HELP kepler_process_bpf_cpu_time_ms_total Aggregated value in bpf_cpu_time_ms value from bpf
# TYPE kepler_process_bpf_cpu_time_ms_total counter
kepler_process_bpf_cpu_time_ms_total{command="bash",container_id="system_processes",pid="4097293",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="PanGPA",container_id="system_processes",pid="10220",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="pmdalinux",container_id="system_processes",pid="1958",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="ps",container_id="system_processes",pid="4097278",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="sh",container_id="system_processes",pid="4097275",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="sh",container_id="system_processes",pid="4097283",source="bpf",vm_id=""} 0
kepler_process_bpf_cpu_time_ms_total{command="cuda-EvtHandlr",container_id="system_processes",pid="3785672",source="bpf",vm_id=""} 0.046
kepler_process_bpf_cpu_time_ms_total{command="PanGPS",container_id="system_processes",pid="1133",source="bpf",vm_id=""} 0.058
kepler_process_bpf_cpu_time_ms_total{command="tokio-runtime-w",container_id="system_processes",pid="4075904",source="bpf",vm_id=""} 0.163
kepler_process_bpf_cpu_time_ms_total{command="cuda-EvtHandlr",container_id="system_processes",pid="3760629",source="bpf",vm_id=""} 0.185
kepler_process_bpf_cpu_time_ms_total{command="sshd",container_id="system_processes",pid="4075833",source="bpf",vm_id=""} 0.243
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4075991",source="bpf",vm_id=""} 1.532
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4076019",source="bpf",vm_id=""} 1.738
kepler_process_bpf_cpu_time_ms_total{command="redis-server",container_id="system_processes",pid="1354",source="bpf",vm_id=""} 1.883
kepler_process_bpf_cpu_time_ms_total{command="node",container_id="system_processes",pid="4075941",source="bpf",vm_id=""} 3.701
kepler_process_bpf_cpu_time_ms_total{command="ps",container_id="system_processes",pid="4097286",source="bpf",vm_id=""} 4.021
kepler_process_bpf_cpu_time_ms_total{command="pmproxy",container_id="system_processes",pid="2651",source="bpf",vm_id=""} 9.215
kepler_process_bpf_cpu_time_ms_total{command="top",container_id="system_processes",pid="4080966",source="bpf",vm_id=""} 10.923
kepler_process_bpf_cpu_time_ms_total{command="pmdaproc",container_id="system_processes",pid="1869",source="bpf",vm_id=""} 19.466
kepler_process_bpf_cpu_time_ms_total{command="swapper/10",container_id="kernel_processes",pid="0",source="bpf",vm_id=""} 50115.935

This is from the top output:
[screenshot of top output]

@dave-tucker (Collaborator) commented:
@rootfs I'm with @marceloamaral on this one.
The BPF probes are now counting the time spent oncpu for each task. Time waiting for CPU shouldn't be attributed as it's not consuming power.
It would be interesting to know more about how top is calculating CPU utilization and TIME+ from the stats in /proc/$pid/stat. There are some details in man top.

@rootfs (Contributor, Author) commented Jun 27, 2024

@dave-tucker @marceloamaral @sunya-ch First, we should treat eBPF changes with care. The current power models built by the model server don't include a process state check; if we change the way eBPF counts CPU time, then all the models must be re-trained to use the same CPU accounting method. Second, I am not convinced that checking the last process state is necessary, since processes can change states at runtime and we have no way to tell whether the last state is the only state the prev task was in before being scheduled out.

@rootfs (Contributor, Author) commented Jun 27, 2024

@rootfs I'm with @marceloamaral on this one. The BPF probes are now counting the time spent oncpu for each task. Time waiting for CPU shouldn't be attributed as it's not consuming power. It would be interesting to know more about how top is calculating CPU utilization and TIME+ from the stats in /proc/$pid/stat. There are some details in man top.

@dave-tucker I agree we only count oncpu tasks. But oncpu doesn't mean task state is TASK_RUNNING. In fact, according to some references (e.g. the Linux Kernel Development book):

TASK_RUNNING: The process is runnable; it is either currently running or on a runqueue waiting to run. This is the only possible state for a process executing in user-space; it can also apply to a process in kernel-space that is actively running.

Even TASK_RUNNING doesn't guarantee the task is on CPU.

@rootfs (Contributor, Author) commented Jun 27, 2024

@marceloamaral @dave-tucker For reference, this is how on-/off-CPU distribution is counted in bcc. It is consistent with what we did before: the process state is never used in CPU time accounting.
https://github.com/iovisor/bcc/blob/master/tools/cpudist.py
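For readers unfamiliar with cpudist, the on-CPU timing pattern it relies on looks roughly like this. This is a sketch of the general technique, not the bcc source; it records a timestamp at switch-in and computes the delta at switch-out, without ever looking at the task state.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, u32);
	__type(value, u64);
} start SEC(".maps");

SEC("tp_btf/sched_switch")
int BPF_PROG(oncpu_time, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	u64 now = bpf_ktime_get_ns();
	u32 prev_pid = prev->pid, next_pid = next->pid;
	u64 *tsp = bpf_map_lookup_elem(&start, &prev_pid);

	/* the outgoing task ends its on-CPU span here */
	if (tsp)
		bpf_printk("pid %u on-cpu for %llu ns", prev_pid, now - *tsp);

	/* the incoming task starts its on-CPU span now */
	bpf_map_update_elem(&start, &next_pid, &now, BPF_ANY);
	return 0;
}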

@dave-tucker (Collaborator) commented Jun 27, 2024

@marceloamaral @dave-tucker For reference, this is how on-/off-CPU distribution is counted in bcc. It is consistent with what we did before: the process state is never used in CPU time accounting. https://github.com/iovisor/bcc/blob/master/tools/cpudist.py

I counter you with the same tool, but ported to libbpf... in the same repo at the same revision 🤯
https://github.com/iovisor/bcc/blob/master/libbpf-tools/cpudist.bpf.c#L100

@dave-tucker I agree we only count oncpu tasks. But oncpu doesn't mean task state is TASK_RUNNING. In fact, according to some references (e.g. the Linux Kernel Development book)

If you are looking at tasks in isolation this is true. However, when the sched_switch tracepoint is hit, we're given prev_task (the task that was previously scheduled) and the next_task that's being switched in. If we check prev_task->state and it's TASK_RUNNING, I think it's pretty safe to assume that it was indeed executing on the CPU beforehand. It could be, though, that the previous task on the CPU was sleeping (i.e. in state TASK_INTERRUPTIBLE), in which case it wouldn't make sense to account its cycles there.
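To see exactly what the tracepoint hands us, here is a minimal stand-alone probe. It is a sketch assuming a libbpf build with a generated vmlinux.h and a v5.14+ kernel where the field is named __state; it only logs which state the outgoing task left in and is not the Kepler probe itself.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define TASK_RUNNING 0x0	/* kernel macro, not exported via BTF */

char LICENSE[] SEC("license") = "Dual BSD/GPL";

SEC("tp_btf/sched_switch")
int BPF_PROG(log_sched_switch, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	if (prev->__state == TASK_RUNNING)
		bpf_printk("out pid=%d was still runnable (preempted)", prev->pid);
	else
		bpf_printk("out pid=%d left in state 0x%x", prev->pid, prev->__state);
	return 0;
}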

@rootfs (Contributor, Author) commented Jun 27, 2024

@dave-tucker what about this transition:

stateDiagram
    direction LR
    Fork() --> TASK_RUNNING
    TASK_RUNNING --> TASK_INTERRUPTIBLE
    TASK_INTERRUPTIBLE --> tp/sched_switch()

By the time tp/sched_switch() checks prev->state, it sees only TASK_INTERRUPTIBLE and therefore ignores all the CPU time spent while the task was in TASK_RUNNING.

@dave-tucker (Collaborator) left a comment

Lines 201:215 in kepler.bpf.c can also be deleted:

/* CO-RE shim for the task_struct state field rename (state became __state in v5.14) */
struct task_struct___o {
	unsigned int state;
};
struct task_struct___x {
	unsigned int __state;
};
static unsigned int get_task_state(void *task)
{
	struct task_struct___x *t = task;
	/* use __state on newer kernels, fall back to state on older ones */
	if (bpf_core_field_exists(t->__state))
		return t->__state;
	return ((struct task_struct___o *)task)->state;
}

@dave-tucker (Collaborator) commented:

By the time tp/sched_switch() checks prev->state, it sees only TASK_INTERRUPTIBLE and therefore ignores all the CPU time spent while the task was in TASK_RUNNING.

For those following along, here's a bpftrace script you can run:

BEGIN {
    @states[0x0] = "TASK_RUNNING";
    @states[0x1] = "TASK_INTERRUPTIBLE";
    @states[0x2] = "TASK_UNINTERRUPTIBLE";
    @states[0x4] = "TASK_STOPPED";
    @states[0x8] = "TASK_TRACED";
    @states[0x10] = "EXIT_DEAD";
    @states[0x20] = "EXIT_ZOMBIE";
    @states[0x40] = "TASK_PARKED";
    @states[0x80] = "TASK_DEAD";
    @states[0x100] = "TASK_WAKEKILL";
    @states[0x200] = "TASK_WAKING"; 
    @states[0x400] = "TASK_NOLOAD";
    @states[0x800] = "TASK_NEW";
    @states[0x1000] = "TASK_RTLOCK_WAIT";
    @states[0x2000] = "TASK_FREEZABLE";
    @states[0x8000] = "TASK_FROZEN";
}

rawtracepoint:sched_switch {
    $prev = (struct task_struct *)arg1;
    $next = (struct task_struct *)arg2;
    printf("%s\nSwitch on CPU %d\nOut: %d %s\nIn: %d %s\n\n", strftime("%H:%M:%S.%f", nsecs), cpu,$prev->pid, @states[$prev->__state], $next->pid, @states[$next->__state]);
}

This shows the following:

15:56:25.396583
Switch on CPU 1
Out: 0 TASK_RUNNING
In: 2317812 TASK_RUNNING

15:56:25.396605
Switch on CPU 1
Out: 2317812 TASK_INTERRUPTIBLE
In: 0 TASK_RUNNING

15:56:25.396612
Switch on CPU 1
Out: 0 TASK_RUNNING
In: 2317812 TASK_RUNNING

15:56:25.396632
Switch on CPU 1
Out: 2317812 TASK_INTERRUPTIBLE
In: 0 TASK_RUNNING

... some events omitted, and then finally

15:56:25.404417
Switch on CPU 1
Out: 2317812 TASK_RUNNING
In: 25 TASK_RUNNING

The specific case that @rootfs mentioned can be seen outlined here ☝️

What I'm observing on my system is that when a task gets marked as sleeping, the CPU is then idle (TGID 0 with TASK_RUNNING is used to represent the CPU being idle). That sleeping process then wakes up on the same CPU a little while later, but (as seen above) goes straight back to sleep, until eventually it gets swapped out for another task.

Without understanding the intricacies of the Linux scheduler more deeply, I'm not really sure what the "right" answer is.

But the difference is:

  • The current code won't account any CPU time/cycles/instructions to that process until it goes off-CPU. When a task gets switched out in TASK_INTERRUPTIBLE state it's ignored; in other words, we still count it as running.
  • This PR updates CPU time/cycles/instructions on every task switch. When a task gets switched out in TASK_INTERRUPTIBLE state, we consider it no longer running, and if the CPU is going idle, that time gets accounted to PID 0.

🤔

@rootfs (Contributor, Author) commented Jul 2, 2024

Per the 07/02 community meeting: task state transitions are not tracked across a context switch, so tracking CPU states in sched_switch is not feasible. A future enhancement will add IRQ features to the CPU models.

@rootfs rootfs force-pushed the bpf-cpu-time branch 3 times, most recently from 847704d to b851c10 on July 2, 2024 at 15:17
@dave-tucker (Collaborator) commented:

@rootfs this needs rebasing on top of #1554; then if you can squash your commits, I think this is ready to merge 🎉

@rootfs rootfs merged commit 9c80387 into sustainable-computing-io:main Jul 3, 2024
23 checks passed