chore(bpf): account all running state processes #1546
Conversation
🤖 SeineSailor Here's a concise summary of the pull request changes:
Summary: The pull request refactors the state-tracking logic of Kepler's eBPF probe to accurately account for processes in the running, interruptible, and uninterruptible states.
Impact: These changes improve the accuracy and granularity of process state tracking, enabling more effective performance optimization and debugging.
Unfortunately, these changes will introduce accuracy problems in the counters.
bpfassets/libbpf/src/kepler.bpf.c
Outdated
```diff
@@ -201,6 +201,18 @@ struct task_struct {
 SEC("tp_btf/sched_switch")
 int kepler_sched_switch_trace(u64 *ctx)
 {
+	// Skip some samples to minimize overhead
```
We should not skip resetting the counters on a CPU during the schedule switch. Let’s consider an example to show why this is crucial:
Suppose task1 is running on CPU1, and we decide to skip the metric collection that resets the counters. Then, task2 starts running on CPU1, and we collect the counters at this point. The collected data will now include the hardware counters and CPU time for both task1 and task2. This will lead to inaccurate metrics, as the counters will reflect cumulative data for multiple tasks.
Therefore, we must reset the counters before skipping the schedule switch to ensure that each task's metrics are correctly isolated. Moving the counter reset to before collect_metrics_and_reset_counters would invalidate this approach, leading to inaccurate and misleading data.
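A minimal sketch of this point, assuming a sampling counter along the lines the diff hints at (`SAMPLE_RATE`, `counter_sched_switch`, and `reset_counters` are illustrative names, not necessarily Kepler's; `collect_metrics_and_reset_counters` is the helper mentioned above):

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Illustrative names, not Kepler's actual globals: */
const volatile int SAMPLE_RATE = 5;
int counter_sched_switch;

static void reset_counters(u32 cpu_id) { /* zero the per-CPU deltas */ }
static void collect_metrics_and_reset_counters(u32 cpu_id) { /* collect, then zero */ }

SEC("tp_btf/sched_switch")
int kepler_sched_switch_trace(u64 *ctx)
{
	u32 cpu_id = bpf_get_smp_processor_id();

	/* Sampling: skip the expensive collection on most switches... */
	if (SAMPLE_RATE > 0 && counter_sched_switch > 0) {
		counter_sched_switch--;
		/* ...but still reset the counters, otherwise the next
		 * task inherits this task's hardware-counter and
		 * CPU-time deltas. */
		reset_counters(cpu_id);
		return 0;
	}
	counter_sched_switch = SAMPLE_RATE;

	collect_metrics_and_reset_counters(cpu_id);
	return 0;
}
```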
bpfassets/libbpf/src/kepler.bpf.c
Outdated
```diff
 }

-	if (prev_state == TASK_RUNNING) {
+	if (prev_state | TASK_RUNNING) {
```
We must only collect metrics if the task was running on the CPU. We want to gather CPU metrics (hardware counters and CPU time) exclusively when the task was actively running on the CPU.
This is important because if task1 goes from idle to running (in sched_switch) and we collect metrics from CPU1 while task3 was actually the task running on that CPU, we will mistakenly attribute task3's metrics to task1. To ensure accuracy, metrics collection should occur only for tasks that were running on the CPU.
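One detail worth noting when reading this diff: in the kernel headers TASK_RUNNING is 0, so the two conditions behave very differently. A small stand-alone illustration (the state values mirror include/linux/sched.h):

```c
#include <stdio.h>

#define TASK_RUNNING        0x0000  /* note: zero */
#define TASK_INTERRUPTIBLE  0x0001

int main(void)
{
	unsigned int prev_state = TASK_INTERRUPTIBLE;

	/* Equality: true only when the outgoing task was actually
	 * in the running state. */
	if (prev_state == TASK_RUNNING)
		printf("== : running\n");       /* not printed here */

	/* OR-ing with 0 is a no-op: this reduces to `if (prev_state)`,
	 * which is true for every state *except* TASK_RUNNING. */
	if (prev_state | TASK_RUNNING)
		printf("|  : not running\n");   /* printed here */

	return 0;
}
```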
Force-pushed from e8bd1d0 to 0f7dfa1
@rootfs I'm with @marceloamaral on this one.
@dave-tucker @marceloamaral @sunya-ch First, we should take eBPF changes with care. The current power models built by the model server don't include the process-state check; if we change the way eBPF counts CPU time, then all the models must be re-trained to use the same CPU accounting method. Second, I am not convinced that checking the last process state is necessary, since processes can change states during runtime and we have no way to tell whether the last state is the only state the process was in.
@dave-tucker I agree we should only count on-CPU tasks. But on-CPU doesn't mean the task state is TASK_RUNNING; see, e.g., the Linux Kernel Development book.
@marceloamaral @dave-tucker For reference, this is how on/off-CPU distribution is counted in bcc. It is consistent with what we did before: the process state is never used in CPU-time accounting.
I counter you with the same tool, but ported to libbpf... in the same repo at the same revision 🤯
If you are looking at tasks in isolation this is true.
@dave-tucker what about this transition:

```mermaid
stateDiagram
    direction LR
    Fork() --> TASK_RUNNING
    TASK_RUNNING --> TASK_INTERRUPTIBLE
    TASK_INTERRUPTIBLE --> tp/sched_switch()
```

By the time tp/sched_switch() checks prev->state, it sees only TASK_INTERRUPTIBLE and therefore ignores all the CPU time spent while in TASK_RUNNING.
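For reference, this is the classic voluntary-sleep idiom in kernel code that produces exactly this transition (a schematic of the in-kernel pattern, not Kepler code; `condition_we_wait_for` is a hypothetical wakeup condition):

```c
/* The task marks itself TASK_INTERRUPTIBLE *before* calling
 * schedule(), so by the time the sched_switch tracepoint fires,
 * prev->state is already TASK_INTERRUPTIBLE even though the task
 * consumed CPU as TASK_RUNNING right up to this call. */
set_current_state(TASK_INTERRUPTIBLE);
while (!condition_we_wait_for) {
	schedule();                     /* sched_switch fires here */
	set_current_state(TASK_INTERRUPTIBLE);
}
__set_current_state(TASK_RUNNING);
```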
Lines 201:215 in kepler.bpf.c can also be deleted:

```c
/* CO-RE compatibility shim: newer kernels renamed task_struct::state
 * to task_struct::__state, so probe at load time which field exists. */
struct task_struct___o {
	unsigned int state;
};

struct task_struct___x {
	unsigned int __state;
};

static unsigned int get_task_state(void *task)
{
	struct task_struct___x *t = task;

	if (bpf_core_field_exists(t->__state))
		return t->__state;
	return ((struct task_struct___o *)task)->state;
}
```
For those following along, I ran a bpftrace script tracing sched_switch and printing each outgoing task's state. In the specific case that @rootfs mentioned, what I'm observing on my system is that when a task gets marked as sleeping, the CPU then goes idle (TGID 0 with TASK_RUNNING is used to represent the CPU being idle). That sleeping process then wakes up on the same CPU a little while later... but then goes straight back to sleep... until eventually it gets swapped out for another task. Without understanding the intricacies of the Linux scheduler more deeply, I'm not really sure what the "right" answer is. 🤔
Per the 07/02 community meeting: task state transitions are not tracked during a context switch, so tracking CPU states in sched_switch is not feasible. A future enhancement will include IRQ features in the CPU models.
Force-pushed from 847704d to b851c10
Signed-off-by: Huamin Chen <[email protected]>
A process can be in one of several states; running is only one of them, yet most user-space processes spend their CPU time in the interruptible state.
A process can voluntarily change its state from running to interruptible, so by the time the sched_switch tracepoint fires, the state may no longer be running.
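A minimal sketch of what that implies for the accounting condition (the state flag values mirror the kernel's; `should_account` is an illustrative helper, not Kepler's code):

```c
#define TASK_RUNNING          0x0000
#define TASK_INTERRUPTIBLE    0x0001
#define TASK_UNINTERRUPTIBLE  0x0002

/* Account the outgoing task's CPU time if it was running, or if it
 * voluntarily moved to (un)interruptible sleep just before the
 * switch, rather than testing for TASK_RUNNING alone. */
static int should_account(unsigned int prev_state)
{
	return prev_state == TASK_RUNNING ||
	       (prev_state & (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE));
}
```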